### Bayesian estimator

In estimation theory and decision theory, a **Bayes estimator** or a **Bayes action** is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the **posterior expected loss**). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is Maximum a posteriori estimation.

## Definition

Suppose an unknown parameter θ is known to have a prior distribution $\pi$. Let $\delta = \delta(x)$ be an estimator of θ (based on some measurements *x*), and let $L(\theta,\delta)$ be a loss function, such as squared error. The **Bayes risk** of $\delta$ is defined as $E_\pi\{L(\theta,\delta)\}$, where the expectation is taken over the probability distribution of $\theta$: this defines the risk function as a function of $\delta$. An estimator $\delta$ is said to be a *Bayes estimator* if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss $E\{L(\theta,\delta) \mid x\}$ *for each x* also minimizes the Bayes risk and therefore is a Bayes estimator.^{[1]}
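
The equivalence between minimizing the Bayes risk and minimizing the posterior expected loss for each *x* can be checked by brute force on a small discrete problem. The sketch below is only an illustration: the two-valued parameter, the prior, the likelihood table and the 0-1 loss are all hypothetical choices, and the Bayes risk is computed as the expected loss over the joint distribution of θ and *x*.

```python
from itertools import product

# Hypothetical toy problem: theta and x both take values in {0, 1}.
prior = {0: 0.7, 1: 0.3}                 # pi(theta)
likelihood = {0: {0: 0.8, 1: 0.2},       # p(x | theta), indexed as likelihood[theta][x]
              1: {0: 0.1, 1: 0.9}}

def loss(theta, action):
    """0-1 loss: 1 for a wrong guess, 0 for a correct one."""
    return 0.0 if theta == action else 1.0

def bayes_risk(rule):
    """Expected loss of a rule x -> action over the joint distribution of (theta, x)."""
    return sum(prior[t] * likelihood[t][x] * loss(t, rule[x])
               for t in (0, 1) for x in (0, 1))

def posterior_expected_loss(action, x):
    """E[L(theta, action) | x], with the posterior obtained from Bayes' theorem."""
    joint = {t: prior[t] * likelihood[t][x] for t in (0, 1)}
    evidence = sum(joint.values())
    return sum(joint[t] / evidence * loss(t, action) for t in (0, 1))

# Rule built by minimizing the posterior expected loss separately for each x.
pointwise_rule = {x: min((0, 1), key=lambda a: posterior_expected_loss(a, x))
                  for x in (0, 1)}

# Brute force: evaluate the Bayes risk of every deterministic rule.
all_rules = [dict(zip((0, 1), actions)) for actions in product((0, 1), repeat=2)]
best_rule = min(all_rules, key=bayes_risk)
```

In this toy problem the rule obtained pointwise from the posterior coincides with the rule found by searching over all four deterministic rules.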

If the prior is improper then an estimator which minimizes the posterior expected loss *for each x* is called a **generalized Bayes estimator**.^{[2]}

## Examples

### Minimum mean square error estimation

The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by

- $\mathrm{MSE} = E\left[ (\widehat{\theta}(x) - \theta)^2 \right],$

where the expectation is taken over the joint distribution of $\theta$ and $x$.

#### Posterior mean

Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution,

- $\widehat{\theta}(x) = E[\theta \mid x]=\int \theta \, \pi(\theta \mid x)\,d\theta.$

This is known as the *minimum mean square error* (MMSE) estimator. The posterior expected loss, in this case, is the posterior variance, and the Bayes risk is its average over $x$.
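
As a quick numerical check, the sketch below takes a hypothetical discretized posterior (four support points with made-up probabilities) and confirms that the posterior mean is the candidate with the smallest posterior expected squared loss, and that this smallest value is the posterior variance.

```python
# Hypothetical discretized posterior: support points and their probabilities.
thetas = [0.0, 1.0, 2.0, 3.0]
posterior = [0.1, 0.2, 0.3, 0.4]

def posterior_mean(thetas, probs):
    """Mean of a discrete posterior distribution."""
    return sum(t * p for t, p in zip(thetas, probs))

def expected_squared_loss(estimate, thetas, probs):
    """Posterior expected squared-error loss of a candidate estimate."""
    return sum(p * (t - estimate) ** 2 for t, p in zip(thetas, probs))

mmse = posterior_mean(thetas, posterior)
posterior_variance = expected_squared_loss(mmse, thetas, posterior)

# Scan a grid of candidate estimates and keep the one with the smallest loss.
candidates = [i / 100 for i in range(301)]
best = min(candidates, key=lambda c: expected_squared_loss(c, thetas, posterior))
```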

### Bayes estimators for conjugate priors

If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.
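
A minimal sketch of sequential estimation with a conjugate prior, assuming normal observations with known variance and a normal prior. The posterior-variance formula $\sigma^2\tau^2/(\sigma^2+\tau^2)$ used here is the standard normal-normal result rather than something stated in this article, and the function names are illustrative only.

```python
def normal_update(mu, tau2, x, sigma2):
    """One conjugate update: prior N(mu, tau2), observation x ~ N(theta, sigma2)."""
    post_mean = (sigma2 * mu + tau2 * x) / (sigma2 + tau2)
    post_var = sigma2 * tau2 / (sigma2 + tau2)
    return post_mean, post_var

def sequential_estimate(mu, tau2, observations, sigma2):
    """Feed observations one at a time; each posterior becomes the next prior."""
    for x in observations:
        mu, tau2 = normal_update(mu, tau2, x, sigma2)
    return mu, tau2

def batch_estimate(mu, tau2, observations, sigma2):
    """Posterior from all observations at once, written in terms of precisions."""
    n = len(observations)
    precision = 1.0 / tau2 + n / sigma2
    mean = (mu / tau2 + sum(observations) / sigma2) / precision
    return mean, 1.0 / precision
```

Because the posterior stays in the normal family, the sequential and batch computations agree, and each step only has to carry two numbers forward.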

Following are some examples of conjugate priors.

- If $x|\theta$ is normal, $x|\theta \sim N(\theta,\sigma^2)$, and the prior is normal, $\theta \sim N(\mu,\tau^2)$, then the posterior is also normal and the Bayes estimator under MSE is given by

- $\widehat{\theta}(x)=\frac{\sigma^{2}}{\sigma^{2}+\tau^{2}}\mu+\frac{\tau^{2}}{\sigma^{2}+\tau^{2}}x.$

- If $x_1,\ldots,x_n$ are iid Poisson random variables $x_i|\theta \sim P(\theta)$, and if the prior is Gamma distributed $\theta \sim G(a,b)$, then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by

- $\widehat{\theta}(X)=\frac{n\overline{X}+a}{n+\frac{1}{b}}.$

- If $x_1,\ldots,x_n$ are iid uniformly distributed $x_i|\theta \sim U(0,\theta)$, and if the prior is Pareto distributed $\theta \sim Pa(\theta_0,a)$, then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by

- $\widehat{\theta}(X)=\frac{(a+n)\max{(\theta_0,x_1,\ldots,x_n)}}{a+n-1}.$
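
The three closed-form estimators above translate directly into code. The sketch below is a plain transcription of the formulas; the function names are illustrative, and the Gamma prior is assumed to be parametrized by shape $a$ and scale $b$, which is the reading under which the Poisson formula above holds.

```python
def normal_bayes_estimate(x, sigma2, mu, tau2):
    """x | theta ~ N(theta, sigma2) with prior theta ~ N(mu, tau2)."""
    return (sigma2 * mu + tau2 * x) / (sigma2 + tau2)

def poisson_gamma_bayes_estimate(xs, a, b):
    """x_i | theta ~ Poisson(theta) with prior theta ~ Gamma(shape a, scale b)."""
    n = len(xs)
    return (sum(xs) + a) / (n + 1.0 / b)   # n * mean(xs) equals sum(xs)

def uniform_pareto_bayes_estimate(xs, theta0, a):
    """x_i | theta ~ U(0, theta) with prior theta ~ Pareto(scale theta0, shape a)."""
    n = len(xs)
    return (a + n) * max([theta0, *xs]) / (a + n - 1)
```

For instance, `normal_bayes_estimate(2.0, 1.0, 0.0, 1.0)` weights the prior mean and the observation equally because the two variances are equal.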

### Alternative risk functions

Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by $F$.

#### Posterior median and other quantiles

- A "linear" loss function, with $a>0$, which yields the posterior median as the Bayes estimate:

- $L(\theta,\widehat{\theta}) = a|\theta-\widehat{\theta}|$
- $F(\widehat{\theta}(x) \mid X) = \tfrac{1}{2}.$

- Another "linear" loss function, which assigns different "weights" $a,b>0$ to underestimation and overestimation. It yields a quantile from the posterior distribution, and is a generalization of the previous loss function:

- $L(\theta,\widehat{\theta}) = \begin{cases} a|\theta-\widehat{\theta}|, & \text{for }\theta-\widehat{\theta} \ge 0 \\ b|\theta-\widehat{\theta}|, & \text{for }\theta-\widehat{\theta} < 0 \end{cases}$
- $F(\widehat{\theta}(x) \mid X) = \frac{a}{a+b}.$
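
The link between the linear losses and posterior quantiles can be illustrated with a set of posterior draws. In the sketch below the draws (the integers 1 to 99) are purely hypothetical, and candidate estimates are restricted to the draw values, which is enough because the average loss is piecewise linear and convex in the estimate.

```python
def asymmetric_linear_loss(theta, estimate, a, b):
    """a*|theta - estimate| if theta >= estimate, b*|theta - estimate| otherwise."""
    diff = theta - estimate
    return a * diff if diff >= 0 else -b * diff

def linear_loss_estimate(draws, a, b):
    """Draw value minimizing the average asymmetric linear loss over the draws."""
    def expected_loss(c):
        return sum(asymmetric_linear_loss(t, c, a, b) for t in draws) / len(draws)
    return min(sorted(set(draws)), key=expected_loss)

def empirical_quantile(draws, q):
    """Smallest draw whose empirical CDF value is at least q."""
    ordered = sorted(draws)
    for i, t in enumerate(ordered, start=1):
        if i / len(ordered) >= q:
            return t
    return ordered[-1]

draws = list(range(1, 100))                           # hypothetical posterior draws
median_est = linear_loss_estimate(draws, a=1, b=1)    # symmetric case
quantile_est = linear_loss_estimate(draws, a=1, b=3)  # targets the a/(a+b) quantile
```

With equal weights the minimizer is the median of the draws; with $a=1$, $b=3$ it moves to the 0.25 quantile, matching $a/(a+b)$.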

#### Posterior mode

- The following loss function is trickier: it yields either the posterior mode, or a point close to it depending on the curvature and properties of the posterior distribution. Small values of the parameter $K>0$ are recommended, in order to use the mode as an approximation ($L>0$):

- $L(\theta,\widehat{\theta}) = \begin{cases} 0, & \text{for }|\theta-\widehat{\theta}| < K \\ L, & \text{for }|\theta-\widehat{\theta}| \ge K. \end{cases}$
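
Minimizing the expected value of this loss amounts to choosing the estimate whose window of half-width $K$ captures the most posterior mass. The sketch below illustrates this on a grid, under assumed choices: the skewed posterior is a hypothetical unnormalized density proportional to $\theta^2 e^{-\theta}$, and the window is measured in whole grid steps rather than by an exact $K$.

```python
import math

def interval_loss_estimate(density, half_width):
    """Index of the grid point whose window of +/- half_width steps holds the most mass."""
    n = len(density)

    def mass_near(j):
        lo, hi = max(0, j - half_width), min(n, j + half_width + 1)
        return sum(density[lo:hi])

    return max(range(n), key=mass_near)

# Hypothetical skewed posterior on a grid with spacing 0.01 (unnormalized).
grid = [i / 100 for i in range(1, 1001)]
density = [t ** 2 * math.exp(-t) for t in grid]

mode = grid[max(range(len(grid)), key=lambda j: density[j])]
small_window = grid[interval_loss_estimate(density, half_width=5)]    # K near 0.05
large_window = grid[interval_loss_estimate(density, half_width=200)]  # K near 2
```

With the narrow window the estimate sits at the mode of the grid density, while the wide window pulls it towards the heavier right tail, which is why small $K$ is recommended when the mode is the target.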

Other loss functions can be conceived, although the mean squared error is the most widely used and validated.

## Generalized Bayes estimators

The prior distribution $\pi$ has thus far been assumed to be a true probability distribution, in that

- $\int \pi(\theta)\, d\theta = 1.$

However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set, **R**, of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function $\pi(\theta) = 1$, but this would not be a proper probability distribution since it has infinite mass,

- $\int \pi(\theta)\,d\theta=\infty.$

Such measures $\pi(\theta)$, which are not probability distributions, are referred to as improper priors.

The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution

- $\pi(\theta|x) = \frac{p(x|\theta)\, \pi(\theta)}{\int p(x|\theta)\, \pi(\theta)\, d\theta}.$

This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss

- $\int L(\theta,a)\,\pi(\theta|x)\,d\theta$

is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a **generalized Bayes estimator**.^{[2]}

### Example

A typical example is estimation of a location parameter with a loss function of the type $L(a-\theta)$. Here $\theta$ is a location parameter, i.e., $p(x|\theta) = f(x-\theta)$.

It is common to use the improper prior $\pi(\theta)=1$ in this case, especially when no other more subjective information is available. This yields

- $\pi(\theta|x) = \frac{p(x|\theta)\, \pi(\theta)}{p(x)} = \frac{f(x-\theta)}{p(x)}$

so the posterior expected loss equals

- $E[L(a-\theta) \mid x] = \int L(a-\theta)\, \pi(\theta|x)\, d\theta = \frac{1}{p(x)} \int L(a-\theta)\, f(x-\theta)\, d\theta.$

The generalized Bayes estimator is the value $a(x)$ that minimizes this expression for each $x$. This is equivalent to minimizing

- $\int L(a-\theta)\, f(x-\theta)\, d\theta$ for each $x.$ (1)

In this case it can be shown that the generalized Bayes estimator has the form $x+a_0$, for some constant $a_0$. To see this, let $a_0$ be the value minimizing (1) when $x=0$. Then, given a different value $x_1$, we must minimize

- $\int L(a-\theta)\, f(x_1-\theta)\, d\theta = \int L(a-x_1-\theta')\, f(-\theta')\, d\theta'.$ (2)

This is identical to (1) with $x=0$, except that $a$ has been replaced by $a-x_1$. Thus, the minimizing value satisfies $a-x_1 = a_0$, so that the optimal estimator has the form

- $a(x) = a_0 + x.$
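
The shift structure $a(x) = a_0 + x$ can be observed numerically. The sketch below makes several assumptions for illustration: the noise density $f$ is taken to be a standard exponential, the loss is squared error, the integral (1) is approximated by a midpoint Riemann sum, and the minimization is a plain grid search. Under these assumptions the flat-prior posterior mean is $x - 1$, so $a_0 = -1$.

```python
import math

STEP = 0.02  # grid spacing shared by the integration and the search

def noise_density(u):
    """Hypothetical density f: standard exponential, zero for negative arguments."""
    return math.exp(-u) if u >= 0 else 0.0

def loss(d):
    """Squared-error loss written as a function of a - theta."""
    return d * d

def objective(a, x, width=20.0):
    """Midpoint-rule approximation of the integral of L(a - theta) f(x - theta) d theta."""
    total = 0.0
    for i in range(round(width / STEP)):
        theta = x - (i + 0.5) * STEP   # f(x - theta) vanishes for theta > x
        total += loss(a - theta) * noise_density(x - theta) * STEP
    return total

def generalized_bayes(x):
    """Grid search for the action a(x) minimizing the objective near x."""
    candidates = [x + k * STEP for k in range(-150, 151)]
    return min(candidates, key=lambda a: objective(a, x))

a0 = generalized_bayes(0.0)
```

Evaluating `generalized_bayes` at other values of `x` returns `x + a0` up to the grid resolution, in line with the derivation above.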

## Empirical Bayes estimators

A Bayes estimator derived through the empirical Bayes method is called an *empirical Bayes estimator*. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.

There are parametric and non-parametric approaches to empirical Bayes estimation. Parametric empirical Bayes is usually preferable since it is more applicable and more accurate on small amounts of data.^{[3]}

### Example

The following is a simple example of parametric empirical Bayes estimation. Given past observations $x_1,\ldots,x_n$ having conditional distribution $f(x_i|\theta_i)$, one is interested in estimating $\theta_{n+1}$ based on $x_{n+1}$. Assume that the $\theta_i$'s have a common prior $\pi$ which depends on unknown parameters. For example, suppose that $\pi$ is normal with unknown mean $\mu_\pi$ and variance $\sigma_\pi^{2}.$ We can then use the past observations to determine the mean and variance of $\pi$ in the following way.

First, we estimate the mean $\mu_m$ and variance $\sigma_m^{2}$ of the marginal distribution of $x_1, \ldots, x_n$ using the maximum likelihood approach:

- $\widehat{\mu}_m=\frac{1}{n}\sum{x_i},$
- $\widehat{\sigma}_m^{2}=\frac{1}{n}\sum{(x_i-\widehat{\mu}_m)^{2}}.$

Next, we use the relation

- $\mu_m=E_\pi[\mu_f(\theta)],$
- $\sigma_m^{2}=E_\pi[\sigma_f^{2}(\theta)]+E_\pi[(\mu_f(\theta)-\mu_m)^{2}],$

where $\mu_f(\theta)$ and $\sigma_f^{2}(\theta)$ are the mean and variance of the conditional distribution $f(x_i|\theta_i)$, which are assumed to be known. In particular, suppose that $\mu_f(\theta) = \theta$ and that $\sigma_f^{2}(\theta) = K$; we then have

- $\mu_\pi=\mu_m,$
- $\sigma_\pi^{2}=\sigma_m^{2}-\sigma_f^{2}=\sigma_m^{2}-K.$

Finally, we obtain the estimated moments of the prior,

- $\widehat{\mu}_\pi=\widehat{\mu}_m,$
- $\widehat{\sigma}_\pi^{2}=\widehat{\sigma}_m^{2}-K.$

For example, if $x_i|\theta_i \sim N(\theta_i,1)$, and if we assume a normal prior (which is a conjugate prior in this case), we conclude that $\theta_{n+1}\sim N(\widehat{\mu}_\pi,\widehat{\sigma}_\pi^{2})$, from which the Bayes estimator of $\theta_{n+1}$ based on $x_{n+1}$ can be calculated.
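
The steps of this example can be sketched as follows. The function names are illustrative, the final step plugs the estimated prior moments into the normal-normal posterior mean, and the clamping of a negative variance estimate to zero is an added safeguard for small samples rather than part of the procedure described above.

```python
def empirical_bayes_prior(xs, K):
    """Estimate the prior mean and variance from past observations xs."""
    n = len(xs)
    mu_m = sum(xs) / n                                # ML estimate of the marginal mean
    sigma2_m = sum((x - mu_m) ** 2 for x in xs) / n   # ML estimate of the marginal variance
    mu_pi = mu_m
    sigma2_pi = max(sigma2_m - K, 0.0)                # assumption: clamp at zero
    return mu_pi, sigma2_pi

def empirical_bayes_estimate(x_new, xs, K=1.0):
    """Posterior mean of theta_{n+1} given x_{n+1}, with x | theta ~ N(theta, K)."""
    mu_pi, sigma2_pi = empirical_bayes_prior(xs, K)
    return (K * mu_pi + sigma2_pi * x_new) / (K + sigma2_pi)
```

For past observations `[1.0, 3.0, 5.0, 7.0]` with $K=1$, the estimated prior has mean 4 and variance 4, so a new observation of 9 is shrunk towards 4 and the estimate is 8.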

## Properties

### Admissibility

Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.

- If a Bayes rule is unique then it is admissible.^{[4]} For example, as stated above, under mean squared error (MSE) the Bayes rule is unique and therefore admissible.
- If θ belongs to a discrete set, then all Bayes rules are admissible.
- If θ belongs to a continuous (non-discrete) set, and if the risk function R(θ,δ) is continuous in θ for every δ, then all Bayes rules are admissible.

By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimators" section above) is inadmissible when the dimension $p$ of θ satisfies $p>2$; this is known as Stein's phenomenon.

### Asymptotic efficiency

Let θ be an unknown random variable, and suppose that $x_1,x_2,\ldots$ are iid samples with density $f(x_i|\theta)$. Let $\delta_n = \delta_n(x_1,\ldots,x_n)$ be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of $\delta_n$ for large *n*.

To this end, it is customary to regard θ as a deterministic parameter whose true value is $\theta_0$. Under specific conditions,^{[5]} for large samples (large values of *n*), the posterior density of θ is approximately normal. In other words, for large *n*, the effect of the prior probability on the posterior is negligible. Moreover, if δ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:

- $\sqrt{n}(\delta_n - \theta_0) \to N\left(0 , \frac{1}{I(\theta_0)}\right),$

where $I(\theta_0)$ is the Fisher information of $\theta_0$.
It follows that the Bayes estimator $\delta_n$ under MSE is asymptotically efficient.

Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.

Consider the estimator of θ based on a binomial sample $x \sim b(\theta,n)$, where θ denotes the probability of success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution $B(a,b)$, the posterior distribution is known to be $B(a+x,b+n-x)$. Thus, the Bayes estimator under MSE is

- $\delta_n(x)=E[\theta|x]=\frac{a+x}{a+b+n}.$

The MLE in this case is $x/n$, and so we get

- $\delta_n(x)=\frac{a+b}{a+b+n}E[\theta]+\frac{n}{a+b+n}\delta_{MLE}.$

The last equation implies that, for *n* → ∞, the Bayes estimator (in the described problem) is close to the MLE.
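
A short sketch of this comparison, with illustrative function names and a hypothetical $B(2,2)$ prior. It evaluates the Bayes estimator both directly and as the weighted average of the prior mean and the MLE, and tracks the gap to the MLE as *n* grows with the success fraction held at 0.3.

```python
def beta_binomial_estimate(x, n, a, b):
    """Posterior mean of theta for x successes in n trials under a Beta(a, b) prior."""
    return (a + x) / (a + b + n)

def weighted_form(x, n, a, b):
    """The same estimator written as a weighted average of prior mean and MLE."""
    prior_mean = a / (a + b)
    mle = x / n
    weight = (a + b) / (a + b + n)
    return weight * prior_mean + (1 - weight) * mle

# Gap between the Bayes estimate and the MLE (0.3) for growing sample sizes.
gaps = [abs(beta_binomial_estimate(3 * n // 10, n, 2, 2) - 0.3)
        for n in (10, 100, 1000, 10000)]
```

In this setup the gap equals $0.8/(4+n)$, so it shrinks roughly in proportion to $1/n$.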

On the other hand, when *n* is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that $a=b$; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as $a+b$ bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with $B(a,b)$ exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation $d$ which gives the weight of prior information equal to $1/(4d^2)-1$ bits of new information."

## Practical example of Bayes estimators

The Internet Movie Database has used a formula for calculating and comparing the ratings of films by its users, including their Top Rated 250 Titles, which is claimed to give "a true Bayesian estimate":^{[6]}

- $W = \frac{Rv + Cm}{v+m}$

where:

- $W$ = weighted rating
- $R$ = average for the movie as a number from 0 to 10 (mean) = (Rating)
- $v$ = number of votes for the movie = (votes)
- $m$ = minimum votes required to be listed in the Top 250 (currently 25000)
- $C$ = the mean vote across the whole report (currently 7.1)

As the number of ratings $v$ grows far beyond $m$, the weighted rating $W$ approaches the film's straight average $R$; as $v$ approaches zero, $W$ approaches $C$, the mean rating across all films. In simpler terms, films with very few votes receive a rating weighted towards the average across all films, while films with many votes receive a rating weighted towards their own average. IMDb's use of this estimate ensures that a film with only a few hundred ratings, all at 10, would not rank above "The Godfather", for example, with a 9.2 average from over 500,000 ratings.
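
The formula is easy to evaluate directly. In the sketch below the function name is illustrative and the default values of `m` and `C` are the figures quoted above, which may since have changed.

```python
def weighted_rating(R, v, m=25000, C=7.1):
    """Weighted rating W = (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)

few_votes = weighted_rating(10.0, 300)       # 300 perfect ratings
many_votes = weighted_rating(9.2, 500000)    # 9.2 average from 500,000 ratings
```

With these inputs the film with 300 perfect ratings ends up only slightly above $C$, while the film with 500,000 ratings stays close to its own average, so the latter ranks higher.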

## See also

- Admissible decision rule
- Recursive Bayesian estimation
- Empirical Bayes method
- Conjugate prior
- Generalized expected utility

## External links

- Bayesian estimation on cnx.org