In statistics, Bayesian inference is a method of inference in which Bayes' rule is used to update the probability estimate for a hypothesis as additional evidence is acquired. Bayesian updating is an important technique throughout statistics, and especially in mathematical statistics. For some cases, exhibiting a Bayesian derivation for a statistical method automatically ensures that the method works as well as any competing method.
Bayesian updating is especially important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a range of fields including science, engineering, philosophy, medicine and law.
In the philosophy of decision theory, Bayesian inference is closely related to discussions of subjective probability, often called "Bayesian probability". Bayesian probability provides a rational method for updating beliefs.
Introduction to Bayes' rule
Main article: Bayes' rule
Formal
Bayesian inference derives the posterior probability as a consequence of two antecedents, a prior probability and a "likelihood function" derived from a probability model for the data to be observed. Bayesian inference computes the posterior probability according to Bayes' rule:
 $P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$
where
 $\mid$ means "given".
 $H$ stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, from which one chooses the most probable.
 the evidence $E$ corresponds to new data that were not used in computing the prior probability.
 $P(H)$, the prior probability, is the probability of $H$ before $E$ is observed. This indicates one's previous estimate of the probability that a hypothesis is true, before gaining the current evidence.
 $P(H \mid E)$, the posterior probability, is the probability of $H$ given $E$, i.e., after $E$ is observed. This tells us what we want to know: the probability of a hypothesis given the observed evidence.
 $P(E \mid H)$, the probability of observing $E$ given $H$, is also known as the likelihood. It indicates the compatibility of the evidence with the given hypothesis.
 $P(E)$ is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered. (This can be seen by the fact that the hypothesis $H$ does not appear anywhere in the symbol, unlike for all the other factors.) This means that this factor does not enter into determining the relative probabilities of different hypotheses.
Note that the only factors that affect the value of $P(H \mid E)$ for different values of $H$ are $P(H)$ and $P(E \mid H)$, which both appear in the numerator; hence the posterior probability is proportional to both. In words:
 (more exactly) The posterior probability of a hypothesis is determined by a combination of the inherent likeliness of a hypothesis (the prior) and the compatibility of the observed evidence with the hypothesis (the likelihood).
 (more concisely) Posterior is proportional to likelihood times prior.
Note that Bayes' rule can also be written as follows:
 $P(H \mid E) = \frac{P(E \mid H)}{P(E)} \cdot P(H)$
where the factor $\frac{P(E \mid H)}{P(E)}$ represents the impact of $E$ on the probability of $H$.
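The rule above can be sketched for a finite set of competing hypotheses. This is a minimal illustration rather than a reference implementation: the function name `bayes_posterior` and the numbers are assumptions chosen only for the example.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(H | E) for each hypothesis H.

    priors:      dict mapping hypothesis name -> P(H)
    likelihoods: dict mapping hypothesis name -> P(E | H)
    """
    # Numerator of Bayes' rule for each hypothesis: P(E | H) * P(H).
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Marginal likelihood P(E): the same normalizing constant for every H.
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("P(E) is zero: the evidence is impossible under every hypothesis")
    return {h: joint[h] / evidence for h in joint}


# Hypothetical numbers, chosen only for illustration.
priors = {"H1": 0.7, "H2": 0.3}
likelihoods = {"H1": 0.2, "H2": 0.9}
posterior = bayes_posterior(priors, likelihoods)
print(posterior)
```

Here the evidence is far more compatible with `H2`, so `H2` ends up more probable than `H1` even though its prior was lower.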
Informal
Rationally, Bayes' rule makes a great deal of sense. If the evidence does not match up with a hypothesis, one should reject the hypothesis. But if a hypothesis is extremely unlikely a priori, one should also reject it, even if the evidence does appear to match up.
For example, imagine that I have various hypotheses about the nature of a newborn baby of a friend, including:
 $H_1$: the baby is a brown-haired boy.
 $H_2$: the baby is a blond-haired girl.
 $H_3$: the baby is a dog.
Then consider two scenarios:
 I'm presented with evidence in the form of a picture of a blond-haired baby girl. I find this evidence supports $H_2$ and opposes $H_1$ and $H_3$.
 I'm presented with evidence in the form of a picture of a baby dog. Although this evidence, treated in isolation, supports $H_3$, my prior belief in this hypothesis (that a human can give birth to a dog) is extremely small, so the posterior probability is nevertheless small.
The critical point about Bayesian inference, then, is that it provides a principled way of combining new evidence with prior beliefs, through the application of Bayes' rule. (Contrast this with frequentist inference, which relies only on the evidence as a whole, with no reference to prior beliefs.) Furthermore, Bayes' rule can be applied iteratively: after observing some evidence, the resulting posterior probability can then be treated as a prior probability, and a new posterior probability computed from new evidence. This allows for Bayesian principles to be applied to various kinds of evidence, whether viewed all at once or over time. This procedure is termed "Bayesian updating".
Bayesian updating
Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered "rational".
Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote^{[1]} "And neither the Dutch book argument, nor any other in the personalist arsenal of proofs of the probability axioms, entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."
Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics" following the publication of
Formal description of Bayesian inference
Definitions
 $x$, a data point in general. This may in fact be a vector of values.
 $\theta$, the parameter of the data point's distribution, i.e., $x \sim p(x \mid \theta)$. This may in fact be a vector of parameters.
 $\alpha$, the hyperparameter of the parameter, i.e., $\theta \sim p(\theta \mid \alpha)$. This may in fact be a vector of hyperparameters.
 $\mathbf{X}$, a set of $n$ observed data points, i.e., $x_1, \ldots, x_n$.
 $\tilde{x}$, a new data point whose distribution is to be predicted.
Bayesian inference
 The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. $p(\theta \mid \alpha)$.
 The prior distribution might not be easily determined. In that case, one option is to use the Jeffreys prior to obtain a posterior distribution, which can then be updated with newer observations.
 The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p(\mathbf{X} \mid \theta)$. This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written $\operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)$.
 The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $p(\mathbf{X} \mid \alpha) = \int_{\theta} p(\mathbf{X} \mid \theta) p(\theta \mid \alpha) \operatorname{d}\!\theta$.
 The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:
 $p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\mathbf{X} \mid \theta) p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta) p(\theta \mid \alpha)$
Note that this is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
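When the integral in the marginal likelihood has no convenient closed form, "likelihood times prior, over evidence" can be approximated numerically. The following is a rough sketch on an evenly spaced grid, assuming Bernoulli observations and a uniform prior on $[0, 1]$; the model, the data and the name `grid_posterior` are illustrative assumptions, not part of the definitions above.

```python
def grid_posterior(data, prior, grid):
    """Approximate the posterior density p(theta | X) on a grid for Bernoulli data."""
    unnormalized = []
    for theta, prior_density in zip(grid, prior):
        # Likelihood p(X | theta) of independent 0/1 observations.
        likelihood = 1.0
        for x in data:
            likelihood *= theta if x == 1 else (1.0 - theta)
        unnormalized.append(likelihood * prior_density)
    step = grid[1] - grid[0]
    # Riemann-sum approximation of the marginal likelihood p(X).
    evidence = sum(unnormalized) * step
    return [u / evidence for u in unnormalized]


n_points = 1001
grid = [i / (n_points - 1) for i in range(n_points)]
prior = [1.0] * n_points           # uniform prior density on [0, 1]
data = [1, 0, 1, 1, 0, 1, 1, 1]    # hypothetical data: 6 successes in 8 trials
posterior = grid_posterior(data, prior, grid)
theta_peak = grid[posterior.index(max(posterior))]
print(theta_peak)
```

With a uniform prior the posterior is proportional to the likelihood, so its peak sits at the observed success fraction.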
Bayesian prediction
 The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:
 $p(\tilde{x} \mid \mathbf{X}, \alpha) = \int_{\theta} p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X}, \alpha) \operatorname{d}\!\theta$
 The prior predictive distribution is the distribution of a new data point, marginalized over the prior:
 $p(\tilde{x} \mid \alpha) = \int_{\theta} p(\tilde{x} \mid \theta) p(\theta \mid \alpha) \operatorname{d}\!\theta$
Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.
(In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics, when constructed from a normal distribution with unknown mean and variance, are constructed using a Student's t-distribution. This correctly estimates the variance, due to the fact that (1) the average of normally distributed random variables is also normally distributed; (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly, or at least to an arbitrary level of precision when numerical methods are used.)
Note that both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, and hence the prior and posterior distributions come from the same family, it can easily be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
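The contrast between a predictive distribution and a plug-in point estimate can be seen in a small conjugate case. The sketch below assumes a Bernoulli model with a Beta prior, one standard conjugate pair chosen for illustration; the function names and the data are hypothetical.

```python
from fractions import Fraction


def posterior_predictive_success(alpha, beta, successes, failures):
    """P(next observation = 1 | X) for a Bernoulli model with a Beta(alpha, beta) prior."""
    # Conjugate update: the posterior is Beta(alpha + successes, beta + failures).
    post_alpha = alpha + successes
    post_beta = beta + failures
    # Integrating theta against a Beta density gives the mean of that density.
    return Fraction(post_alpha, post_alpha + post_beta)


def plug_in_success(successes, failures):
    """Plug the maximum-likelihood estimate of theta into the Bernoulli distribution."""
    return Fraction(successes, successes + failures)


# Hypothetical data: three successes and no failures, with a uniform Beta(1, 1) prior.
predictive = posterior_predictive_success(1, 1, successes=3, failures=0)
plug_in = plug_in_success(successes=3, failures=0)
# The prior predictive has the same form, using the prior's own hyperparameters.
prior_predictive = posterior_predictive_success(1, 1, successes=0, failures=0)
print(predictive, plug_in, prior_predictive)
```

The plug-in estimate declares a fourth success certain after only three trials, while the posterior predictive retains the remaining uncertainty about the parameter.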
Inference over exclusive and exhaustive possibilities
If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.
General formulation
Suppose a process is generating independent and identically distributed events $E_n$, but the probability distribution is unknown. Let the event space $\Omega$ represent the current state of belief for this process. Each model is represented by event $M_m$. The conditional probabilities $P(E_n \mid M_m)$ are specified to define the models. $P(M_m)$ is the degree of belief in $M_m$. Before the first inference step, $\{P(M_m)\}$ is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.
Suppose that the process is observed to generate $E \in \{E_n\}$. For each $M \in \{M_m\}$, the prior $P(M)$ is updated to the posterior $P(M \mid E)$. From Bayes' theorem:^{[3]}
 $P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m) P(M_m)} \cdot P(M)$
Upon observation of further evidence, this procedure may be repeated.
Multiple observations
For a set of independent and identically distributed observations $\mathbf{E} = \{e_1, \dots, e_n\}$, it may be shown that repeated application of the above is equivalent to
 $P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m) P(M_m)} \cdot P(M)$
where
 $P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).$
This may be used to optimize practical calculations.
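The equivalence between repeated updating and a single update with the product likelihood can be checked numerically. The two coin models and the observation sequence below are hypothetical, chosen only to keep the sketch small.

```python
from math import isclose, prod


def update(prior, likelihood_of_event):
    """One application of Bayes' theorem over a finite set of models."""
    joint = {m: likelihood_of_event[m] * prior[m] for m in prior}
    total = sum(joint.values())
    return {m: joint[m] / total for m in joint}


# Hypothetical models of a coin: P(heads | M) for each model M.
p_heads = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}
observations = ["H", "H", "T", "H"]


def likelihood(event):
    """P(event | M) for every model M."""
    return {m: (p if event == "H" else 1.0 - p) for m, p in p_heads.items()}


# Sequential: the posterior after each observation becomes the next prior.
sequential = dict(prior)
for e in observations:
    sequential = update(sequential, likelihood(e))

# Batch: a single update with P(E | M) = product over k of P(e_k | M).
batch_likelihood = {m: prod(likelihood(e)[m] for e in observations) for m in p_heads}
batch = update(prior, batch_likelihood)

agree = all(isclose(sequential[m], batch[m]) for m in prior)
print(sequential, batch, agree)
```

In practice the batch form needs only one normalization, which is the saving the text refers to.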
Parametric formulation
By parametrizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is however equally applicable to discrete distributions.
Let the vector $\mathbf{\theta}$ span the parameter space. Let the initial prior distribution over $\mathbf{\theta}$ be $p(\mathbf{\theta} \mid \mathbf{\alpha})$, where $\mathbf{\alpha}$ is a set of parameters to the prior itself, or hyperparameters. Let $\mathbf{E} = \{e_1, \dots, e_n\}$ be a set of independent and identically distributed event observations, where all $e_i$ are distributed as $p(e \mid \mathbf{\theta})$ for some $\mathbf{\theta}$. Bayes' theorem is applied to find the posterior distribution over $\mathbf{\theta}$:
 $$
\begin{align}
p(\mathbf{\theta} \mid \mathbf{E},\mathbf{\alpha}) &= \frac{p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha})}{p(\mathbf{E} \mid \mathbf{\alpha})} \cdot p(\mathbf{\theta} \mid \mathbf{\alpha}) \\
&= \frac{p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha})}{\int_\mathbf{\theta} p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha}) p(\mathbf{\theta} \mid \mathbf{\alpha}) \, d\mathbf{\theta}} \cdot p(\mathbf{\theta} \mid \mathbf{\alpha})
\end{align}
$$
where
 $p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha}) = \prod_k p(e_k \mid \mathbf{\theta})$
Mathematical properties
Interpretation of factor
$\frac{P(E \mid M)}{P(E)} > 1 \Rightarrow P(E \mid M) > P(E)$. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, $\frac{P(E \mid M)}{P(E)} = 1 \Rightarrow P(E \mid M) = P(E)$. That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.
Cromwell's rule
If $P(M) = 0$, then $P(M \mid E) = 0$. If $P(M) = 1$, then $P(M \mid E) = 1$. This can be interpreted to mean that hard convictions are insensitive to counterevidence.
The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not $M$" in place of "$M$", yielding "if $1 - P(M) = 0$, then $1 - P(M \mid E) = 0$", from which the result immediately follows.
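A short numerical check illustrates both statements. The likelihood values are hypothetical, and the helper names `update` and `repeated_update` are assumptions made for this sketch.

```python
def update(prior_m, lik_m, lik_not_m):
    """One Bayesian update of P(M), given P(E | M) and P(E | not M)."""
    evidence = lik_m * prior_m + lik_not_m * (1.0 - prior_m)
    return lik_m * prior_m / evidence


def repeated_update(prior_m, lik_m, lik_not_m, n):
    """Apply the same update n times, feeding each posterior back in as the prior."""
    belief = prior_m
    for _ in range(n):
        belief = update(belief, lik_m, lik_not_m)
    return belief


# Hypothetical likelihoods: each observation is 1000 times more probable
# under one side than under the other.
never_believed = repeated_update(0.0, lik_m=0.9, lik_not_m=0.0009, n=5)
always_believed = repeated_update(1.0, lik_m=0.0009, lik_not_m=0.9, n=5)
nearly_excluded = repeated_update(1e-6, lik_m=0.9, lik_not_m=0.0009, n=5)
print(never_believed, always_believed, nearly_excluded)
```

A prior of exactly 0 or 1 does not move under five strongly contrary observations, whereas a prior of one in a million is overturned by the same evidence.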
Asymptotic behaviour of posterior
Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein–von Mises theorem gives that, in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable under consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who established in two seminal research papers in 1963 and 1965 when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein–von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later, in the 1980s and 1990s, Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.^{[4]} To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.
Conjugate priors
In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
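As one concrete instance of a closed-form conjugate update, the sketch below assumes normally distributed data with known variance and a normal prior on the mean. This pairing is a standard textbook case picked for illustration; the function name, the prior settings and the data are hypothetical.

```python
def normal_known_variance_update(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) of a normal mean with known observation variance.

    Prior: mean ~ Normal(prior_mean, prior_var).
    Data:  each x ~ Normal(mean, noise_var), independently.
    """
    n = len(data)
    # Precisions (inverse variances) add under this conjugate pair.
    posterior_precision = 1.0 / prior_var + n / noise_var
    posterior_var = 1.0 / posterior_precision
    # The posterior mean is a precision-weighted blend of the prior mean and the data.
    posterior_mean = posterior_var * (prior_mean / prior_var + sum(data) / noise_var)
    return posterior_mean, posterior_var


data = [4.8, 5.1, 5.3, 4.9]  # hypothetical measurements
mean, var = normal_known_variance_update(
    prior_mean=0.0, prior_var=100.0, data=data, noise_var=1.0
)
print(mean, var)
```

The posterior is again normal, so the whole update reduces to arithmetic on two hyperparameters, with no integral to evaluate.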
Estimates of parameters and predictions
It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.
For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.^{[5]}
If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.
 $\tilde\theta = \operatorname{E}[\theta] = \int_\theta \theta \, p(\theta \mid \mathbf{X},\alpha) \, d\theta$
Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:
 $\{\theta_{\text{MAP}}\} \subset \arg\max_\theta p(\theta \mid \mathbf{X},\alpha).$
There are examples where no maximum is attained, in which case the set of MAP estimates is empty.
There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics").
The posterior predictive distribution of a new observation $\tilde{x}$ (that is independent of previous observations) is determined by
 $p(\tilde{x} \mid \mathbf{X},\alpha) = \int_\theta p(\tilde{x},\theta \mid \mathbf{X},\alpha) \, d\theta = \int_\theta p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) \, d\theta.$
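The three point estimates discussed in this section can differ noticeably for a skewed posterior. The following sketch computes them from a density tabulated on an evenly spaced grid; the example density (proportional to $\theta^6 (1-\theta)^2$) and the name `point_estimates` are assumptions for illustration.

```python
def point_estimates(grid, density):
    """Posterior mean, median and MAP from a density tabulated on an even grid."""
    step = grid[1] - grid[0]
    weights = [d * step for d in density]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so the weights sum to 1

    # Posterior mean: expectation of theta under the discretized posterior.
    mean = sum(t * w for t, w in zip(grid, weights))

    # Posterior median: first grid point where the cumulative probability reaches 1/2.
    cumulative = 0.0
    median = grid[-1]
    for t, w in zip(grid, weights):
        cumulative += w
        if cumulative >= 0.5:
            median = t
            break

    # MAP estimate: grid point with the highest density.
    map_estimate = grid[density.index(max(density))]
    return mean, median, map_estimate


grid = [i / 1000 for i in range(1001)]
# Hypothetical unnormalized posterior, proportional to theta^6 * (1 - theta)^2.
density = [t**6 * (1 - t) ** 2 for t in grid]
mean, median, map_estimate = point_estimates(grid, density)
print(mean, median, map_estimate)
```

For this left-skewed density the mean falls below the median, which in turn falls below the mode, so the choice of estimator matters.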
Examples
Probability of a hypothesis
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let $H_1$ correspond to bowl #1, and $H_2$ to bowl #2.
It is given that the bowls are identical from Fred's point of view, thus $P(H_1) = P(H_2)$, and the two must add up to 1, so both are equal to 0.5.
The event $E$ is the observation of a plain cookie. From the contents of the bowls, we know that $P(E \mid H_1) = 30/40 = 0.75$ and $P(E \mid H_2) = 20/40 = 0.5$. Bayes' formula then yields
 $\begin{align} P(H_1 \mid E) &= \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1)\;+\;P(E \mid H_2)\,P(H_2)} \\ &= \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\ &= 0.6 \end{align}$
Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, $P(H_1)$, which was 0.5. After observing the cookie, we must revise the probability to $P(H_1 \mid E)$, which is 0.6.
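The same arithmetic can be reproduced with exact fractions, using only the cookie counts given in the example. The variable names are illustrative.

```python
from fractions import Fraction

# Fred picks each bowl with equal probability.
prior = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}
# Probability of drawing a plain cookie from each bowl.
p_plain = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# Marginal probability of a plain cookie, summed over both bowls.
evidence = sum(p_plain[bowl] * prior[bowl] for bowl in prior)
posterior_bowl1 = p_plain["bowl1"] * prior["bowl1"] / evidence
print(posterior_bowl1)  # 3/5
```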
Making a prediction
An archaeologist is working at a site thought to be from the medieval period, from the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?
The degree of belief in the continuous variable $C$ (century) is to be calculated, with the discrete set of events $\{GD, G\bar D, \bar G D, \bar G \bar D\}$ as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,
 $P(E=GD \mid C=c) = (0.01 + 0.16(c-11))(0.5 - 0.09(c-11))$
 $P(E=G\bar D \mid C=c) = (0.01 + 0.16(c-11))(0.5 + 0.09(c-11))$
 $P(E=\bar G D \mid C=c) = (0.99 - 0.16(c-11))(0.5 - 0.09(c-11))$
 $P(E=\bar G \bar D \mid C=c) = (0.99 - 0.16(c-11))(0.5 + 0.09(c-11))$
Assume a uniform prior of $f_C(c) = 0.2$, and that trials are independent and identically distributed. When a new fragment of type $e$ is discovered, Bayes' theorem is applied to update the degree of belief for each $c$:
$f_C(c \mid E=e) = \frac{P(E=e \mid C=c)}{P(E=e)} f_C(c) = \frac{P(E=e \mid C=c)}{\int_{11}^{16} P(E=e \mid C=c) f_C(c)\,dc} f_C(c)$
A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or $c=15.2$. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. Note that the Bernstein–von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events $\{GD, G\bar D, \bar G D, \bar G \bar D\}$ is finite (see above section on asymptotic behaviour of the posterior).
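A simulation of this kind might be sketched as follows. The linear likelihoods, the uniform prior of 0.2 and the value $c = 15.2$ come from the example; the grid resolution, the random seed and all function names are assumptions, and because the fragments are drawn at random the resulting percentages will not match the figures quoted above exactly.

```python
import random


def p_glazed(c):
    """Probability that a fragment is glazed, linear in the century c."""
    return 0.01 + 0.16 * (c - 11)


def p_decorated(c):
    """Probability that a fragment is decorated, linear in the century c."""
    return 0.5 - 0.09 * (c - 11)


def likelihood(fragment, c):
    """P(E = fragment | C = c); fragment is a (glazed, decorated) pair of booleans."""
    glazed, decorated = fragment
    pg = p_glazed(c) if glazed else 1.0 - p_glazed(c)
    pd = p_decorated(c) if decorated else 1.0 - p_decorated(c)
    return pg * pd  # glaze and decoration are assumed independent


n_cells = 500                                          # grid resolution (assumption)
dc = 5.0 / n_cells
grid = [11 + (i + 0.5) * dc for i in range(n_cells)]   # cell midpoints on [11, 16]
density = [0.2] * n_cells                              # uniform prior f_C(c) = 0.2

rng = random.Random(0)                                 # arbitrary seed (assumption)
true_c = 15.2
for _ in range(50):
    # Draw one fragment from the "true" century, then update every grid cell.
    fragment = (rng.random() < p_glazed(true_c), rng.random() < p_decorated(true_c))
    unnormalized = [likelihood(fragment, c) * f for c, f in zip(grid, density)]
    evidence = sum(unnormalized) * dc                  # approximates the integral over c
    density = [u / evidence for u in unnormalized]


def mass(lo, hi):
    """Posterior probability that lo <= c < hi."""
    return sum(f for c, f in zip(grid, density) if lo <= c < hi) * dc


for century in range(11, 16):
    print(century, round(mass(century, century + 1), 3))
```

Each pass through the loop is one application of the update formula, with the integral in the denominator replaced by a sum over grid cells.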
In frequentist statistics and decision theory
A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.^{[6]}
Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.^{[7]} For example:
 "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."^{[6]}
 "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."^{[8]}
 "In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."^{[9]}
 "A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"^{[10]}
 "An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."^{[11]}
Model selection
Applications
Computer applications
Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms such as Gibbs sampling and other Metropolis–Hastings algorithm schemes.^{[12]} Recently Bayesian inference has gained popularity amongst the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.
As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying email spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, and Mozilla. Spam classification is treated in more detail in the article on the naive Bayes classifier.
Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor.^{[13]}
Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.^{[14]}^{[15]}
In the courtroom
Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.^{[16]}^{[17]}^{[18]} Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.
If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.^{[19]} For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
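The odds and logarithmic forms mentioned above can be sketched as follows. The prior of 1/1000 is the one from the example; the likelihood ratios are invented for illustration, and treating the pieces of evidence as independent is an assumption of this sketch.

```python
import math


def posterior_probability(prior_prob, likelihood_ratios):
    """Combine a prior with pieces of evidence by adding log-odds.

    Each ratio is P(evidence | guilty) / P(evidence | innocent); the pieces
    of evidence are assumed independent given guilt or innocence.
    """
    log_odds = math.log(prior_prob / (1.0 - prior_prob))
    for ratio in likelihood_ratios:
        # Multiplying the odds by a ratio is the same as adding its logarithm.
        log_odds += math.log(ratio)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


prior = 1 / 1000             # uniform prior over 1,000 possible culprits
ratios = [100.0, 50.0, 0.5]  # hypothetical likelihood ratios, one per piece of evidence
result = posterior_probability(prior, ratios)
print(round(result, 3))
```

A ratio above 1 counts towards guilt and a ratio below 1 counts against it; in log form each piece of evidence simply adds to or subtracts from a running total.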
The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."
Gardner-Medwin^{[20]} argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:
 A The known facts and testimony could have arisen if the defendant is guilty
 B The known facts and testimony could have arisen if the defendant is innocent
 C The defendant is guilty.
Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.
Bayesian epistemology
Bayesian epistemology is an epistemological movement that uses techniques of Bayesian inference as a means of justifying the rules of inductive logic.
Karl Popper and David Miller have rejected the alleged rationality of Bayesianism, i.e. using Bayes' rule to make epistemological inferences:^{[21]} it is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.
Bayes and Bayesian inference
The problem considered by Bayes in Proposition 9 of his essay, "An Essay towards solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter $a$ (the success rate) of the binomial distribution.
What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter $a$. That is, one can compute probabilities not only for experimental outcomes but also for the parameter that governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that $p$ and $q$ are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter $a$ depend on a random event, he cleverly sidesteps a philosophical quagmire of which he was most likely not even aware.
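The billiard-table setup can be sketched numerically. The code below is an illustration in modern terms, not a transcription of Bayes' own argument: it assumes the first ball's position $a$ is uniform on (0, 1), treats each later ball falling on a chosen side of it as a "success" with probability $a$, and uses the standard result that after $k$ successes in $n$ throws the posterior density of $a$ is $(n+1)\binom{n}{k}a^k(1-a)^{n-k}$. All function names are hypothetical.

```python
import math
import random


def posterior_density(a, k, n):
    """Posterior density of a after k successes in n throws, uniform prior."""
    return (n + 1) * math.comb(n, k) * a**k * (1 - a) ** (n - k)


def posterior_interval_probability(lo, hi, k, n, steps=10_000):
    """P(lo <= a <= hi | k successes in n throws), by the midpoint rule."""
    width = (hi - lo) / steps
    total = sum(
        posterior_density(lo + (i + 0.5) * width, k, n) for i in range(steps)
    )
    return total * width


def simulate_billiard(lo, hi, k, n, trials=100_000, seed=0):
    """Monte Carlo version of the billiard table.

    Throw the first ball (position a), then n more balls; keep only the
    tables on which exactly k balls fall on the 'success' side, and report
    how often a lies in [lo, hi] among them.
    """
    rng = random.Random(seed)
    matched = 0
    inside = 0
    for _ in range(trials):
        a = rng.random()
        successes = sum(rng.random() < a for _ in range(n))
        if successes == k:
            matched += 1
            if lo <= a <= hi:
                inside += 1
    return inside / matched


if __name__ == "__main__":
    # Hypothetical data: 7 successes in 10 throws.
    print(posterior_interval_probability(0.5, 0.9, k=7, n=10))
    print(simulate_billiard(0.5, 0.9, k=7, n=10))
```

The two printed values should agree up to simulation noise, which shows the point made above: because $a$ is itself the outcome of a random throw, the probability statement about the parameter can be checked by counting outcomes of a repeatable experiment.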
History
The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However, it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence.^{[23]} Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes^{[24]}). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.^{[24]}
In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends only on the model assumed, the data analyzed,^{[25]} and the method of assigning the prior, which differs from one objective Bayesian to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.
In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and to an increasing interest in nonstandard, complex applications.^{[26]} Despite the growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.^{[27]} Nonetheless, Bayesian methods are widely accepted and used, for example in the field of machine learning.^{[28]}
References
 Aster, Richard; Borchers, Brian, and Thurber, Clifford (2012). Parameter Estimation and Inverse Problems, Second Edition, Elsevier. ISBN 0123850487, ISBN 9780123850485

 Box, G. E. P. and Tiao, G. C. (1973) Bayesian Inference in Statistical Analysis, Wiley, ISBN 0471574287




Further reading
Elementary
The following books are listed in ascending order of probabilistic sophistication:
 Stone, JV (2013). Chapter 1 of book "Bayes’ Rule: A Tutorial Introduction", University of Sheffield, Psychology.



 Bolstad, William M. (2007) Introduction to Bayesian Statistics: Second Edition, John Wiley ISBN 0471270202
 Updated classic textbook. Bayesian theory clearly presented.
 Lee, Peter M. Bayesian Statistics: An Introduction. Fourth Edition (2012), John Wiley ISBN 9781118332573


Intermediate or advanced


 DeGroot, Morris H., Optimal Statistical Decisions. Wiley Classics Library. 2004. (Originally published (1970) by McGraw-Hill.) ISBN 047168029X.

 Jaynes, E. T. (1998) .
 O'Hagan, A. and Forster, J. (2003) Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, New York. ISBN 0340529229.

 Glenn Shafer and Pearl, Judea, eds. (1988) Probabilistic Reasoning in Intelligent Systems, San Mateo, CA: Morgan Kaufmann.
External links
 Bayesian Statistics from Scholarpedia.
 Introduction to Bayesian probability from Queen Mary University of London
 Mathematical Notes on Bayesian Statistics and Markov Chain Monte Carlo
 Tom Griffiths
 A. Hajek and S. Hartmann: Bayesian Epistemology, in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93–106.
 S. Hartmann and J. Sprenger: Bayesian Epistemology, in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609–620.
 : "Inductive Logic"
 Bayesian Confirmation Theory
 What Is Bayesian Learning?