### Significance test

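Statistical significance can refer to two separate notions: the p-value, in the sense of Fisher, and the type I error rate α of a test design, in the sense of Neyman–Pearson.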

A fixed number, most often 0.05, is referred to as a significance level or level of significance. Such a number may be used either in the first sense, as a cutoff mark for p-values (each p-value is calculated from the data), or in the second sense as a desired parameter in the test design (α depends only on the test design, and is not calculated from observed data).

These two notions reflect distinct aspects of statistical analysis and measure different quantities which cannot be compared. However, they are often conflated. In the first approach p is often compared to 0.05 ($p < 0.05$ is checked), and in the second approach α is often set to 0.05 ($\alpha = 0.05$), so combining these equations yields "$p < \alpha$", which is not a meaningful comparison. Due to this confusion, the notation α is sometimes used for a cutoff value of p even when the Neyman–Pearson approach is not being used. This confusion is particularly rampant in social and biological sciences, as opposed to engineering where the term false alarm rate is popularly used to denote the type I error rate.

In this article, "statistical significance" is used in the sense of p-value (Fisher). See statistical hypothesis testing for further discussion.

## Statistical significance in the sense of Fisher

### Motivation

If $X$ is the observed data and $H$ is the hypothesis under consideration, then Fisher's statistical significance is given by the conditional probability $Pr\left(X|H\right),$ which gives the probability of the observation if the hypothesis is assumed to be correct. A statistical hypothesis is always expressed as a probability distribution that is assumed to govern the observed data. The higher the value of this conditional probability, $Pr\left(X|H\right),$ the higher our confidence that the data can be explained by the hypothesis. Similarly, a smaller value of this conditional probability means that the chance of the data being explained by the hypothesis is smaller, which leads to one of two conclusions: either (1) a very rare event has occurred, if we assume the hypothesis to be true, or (2) the hypothesis may not explain the observation adequately, and an alternative hypothesis might be needed to explain the observed data. If the conditional probability is small enough, we say that the result is significant enough to prompt us to reconsider the hypothesis. When used in statistics, the word significant does not mean important or meaningful, as it does in everyday speech: with sufficient data, a statistically significant result may be very small in magnitude.

For example, tossing a coin 3 times and obtaining 3 heads would not be considered an extreme result. However, tossing a coin 10 times and finding that all 10 tosses land heads would be considered an extreme result. Let us suppose that our hypothesis, $H,$ is that the coin is fair, i.e., that the probability of landing heads is $p=1/2$. From this hypothesis, it follows that the probability that we get all heads in 10 tosses is

$Pr\left(\text{10 heads in 10 tosses}\,|\,p=1/2\right) = \left( \tfrac{1}{2} \right)^{10} \approx 0.00098$

which is rare. The result may therefore be considered statistically significant evidence that our hypothesis cannot explain the observed data and that the coin is not fair.
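The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `prob_all_heads` is hypothetical, and it assumes independent tosses with a fixed probability of heads.

```python
from fractions import Fraction


def prob_all_heads(n_tosses: int, p_head: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability that every one of n_tosses independent tosses lands heads.

    Hypothetical helper for illustration; p_head is the assumed
    probability of heads under the hypothesis H (1/2 for a fair coin).
    """
    return p_head ** n_tosses


print(float(prob_all_heads(3)))   # 0.125
print(float(prob_all_heads(10)))  # 0.0009765625
```

Three heads in three tosses happens one time in eight under the fair-coin hypothesis, which is why it is not regarded as extreme, whereas ten heads in ten tosses happens about once in a thousand trials.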

Every experimental observation is subject to random error. In statistical testing, a result is deemed statistically significant if it is so extreme (in the absence of external variables that would influence the outcome of the test) that such a result would be expected only in rare circumstances, given that the hypothesis is assumed to be true. Hence the result provides enough statistical evidence to reject the hypothesis. Usually, a small but arbitrary threshold $\alpha$ is set beforehand such that if $Pr\left(X|H\right) \leq \alpha,$ then the hypothesis $H$ is rejected. The value of $\alpha$ is often referred to as the significance level. The choice of $\alpha$ depends on the consensus of the research community and can vary from one field to another.

### Relation with p-value

If $X$ is a continuous random variable and we observe an instance $x$, then $Pr\left(X=x|H\right)=0.$ The definition must therefore be changed to accommodate continuous random variables. Usually, $X$ is not the raw observations but a test statistic, that is, a scalar function of all the observations. The p-value is then defined as the probability, under the assumption of hypothesis $H$, of obtaining a result equal to or more extreme than what was actually observed. Depending on the question being asked, "more extreme than what was actually observed" can mean $\{ X \geq x \}$ (right tail event), $\{ X \leq x \}$ (left tail event), or the less probable of $\{ X \leq x \}$ and $\{ X \geq x \}$ (double tail event). Thus the test of significance as given by the p-value is

• $Pr\left(X \geq x |H\right)$ for right tail event,
• $Pr\left(X \leq x |H\right)$ for left tail event,
• $2\min\left(Pr\left(X \leq x |H\right),Pr\left(X \geq x |H\right)\right)$ for double tail event.

The hypothesis $H$ is rejected if the probability for the chosen tail event is less than or equal to the level of significance $\alpha$.
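The three tail probabilities can be sketched in Python for a simple discrete test statistic. As an assumption made purely for illustration, the statistic here is the number of heads $x$ in $n$ tosses, which is binomially distributed under $H$; the function names are hypothetical, and the cap of the double tail value at 1 is an added safeguard that is not part of the formula above.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k | H) for a binomial count with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def tail_p_values(x: int, n: int, p: float = 0.5) -> dict:
    """Right, left and double tail p-values for an observed count x."""
    right = sum(binom_pmf(k, n, p) for k in range(x, n + 1))  # Pr(X >= x | H)
    left = sum(binom_pmf(k, n, p) for k in range(0, x + 1))   # Pr(X <= x | H)
    # 2 * min(left, right) can exceed 1 for a discrete statistic, so cap it.
    two_sided = min(1.0, 2 * min(left, right))
    return {"right": right, "left": left, "two_sided": two_sided}


# Example: 9 heads observed in 10 tosses of a coin hypothesized to be fair.
p_values = tail_p_values(x=9, n=10)
alpha = 0.05
print(p_values)
print("reject H (double tail):", p_values["two_sided"] <= alpha)
```

With 9 heads in 10 tosses the right tail p-value is 11/1024, roughly 0.011, and the double tail value is twice that, so $H$ would be rejected at the 5% level under either choice.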

The test statistic follows a distribution determined by the function used to define that test statistic. When the data are hypothesized to follow the normal distribution, different null hypothesis tests have been developed depending on the nature of the test statistic and thus on the distribution hypothesized for it. Some such tests are the z-test for the normal distribution, the t-test for Student's t-distribution, and the F-test for the F-distribution. When the data do not follow a normal distribution, it can still be possible to approximate the distribution of these test statistics by a normal distribution by invoking the central limit theorem.
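As a minimal sketch of one such test, the following Python code computes a double tail p-value for a one-sample z-test. It assumes that the population standard deviation `sigma` is known and that the sample mean is normally distributed under $H$; the function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test_two_sided(sample, mu0: float, sigma: float):
    """One-sample z-test of H: population mean equals mu0, with known sigma.

    Returns the z statistic and the double tail p-value.
    """
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    # Double tail event: a statistic at least as far from 0 as |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = z_test_two_sided([12.0, 12.0, 12.0, 12.0], mu0=10.0, sigma=2.0)
print(z, p)  # z is 2.0; p is roughly 0.0455
```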

### Null hypothesis

Here the rejection of hypothesis $H$ does not entail the acceptance of an alternative hypothesis, as it does in Neyman–Pearson hypothesis testing. The only hypothesis $H$ in this test is usually referred to as the null hypothesis. Since no alternative hypothesis is formulated in this test, it may seem meaningless to refer to $H$ as the null hypothesis, at least in the sense of Neyman–Pearson, where the word "null" is used merely as a label for one of the many contending hypotheses. Nonetheless, due to considerations apart from statistics, it is standard practice to refer to the only hypothesis in the Fisherian test as the null hypothesis, meaning that the experiment will produce a null result, that is, nothing out of the ordinary. In an experimental setting, the null effect can be studied using a "control group". Often the intention of an experiment is to invalidate the null hypothesis, so as to conclude that the experiment has discovered something out of the ordinary. What exactly is meant by a null result depends on the particular field of study and needs to be rigorously specified in statistical language prior to the analysis of the experimental data. The calculated statistical significance of a result is in principle only valid if the hypothesis was specified before any data were examined. If, instead, the hypothesis was specified after some of the data were examined, and specifically tuned to match the direction in which the early data appeared to point, the calculation would overestimate statistical significance.
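The effect of tuning the hypothesis to the data can be illustrated with an exact calculation. The sketch below considers one simple form of tuning, chosen here as an assumption for illustration: a fair coin is tossed $n$ times, and the single tail test is either fixed in advance (right tail) or pointed in whichever direction the data happen to lean. All function names are hypothetical.

```python
from math import comb


def pmf(k: int, n: int) -> float:
    """Pr(X = k) for the number of heads in n tosses of a fair coin."""
    return comb(n, k) / 2**n


def right_tail(x: int, n: int) -> float:
    return sum(pmf(k, n) for k in range(x, n + 1))


def left_tail(x: int, n: int) -> float:
    return sum(pmf(k, n) for k in range(0, x + 1))


def false_rejection_rates(n: int, alpha: float):
    """Probability of rejecting a true fair-coin hypothesis under two procedures.

    prespecified: right tail test chosen before seeing the data.
    post_hoc: tail chosen after the data, in the direction they point.
    """
    prespecified = 0.0
    post_hoc = 0.0
    for x in range(n + 1):
        weight = pmf(x, n)
        if right_tail(x, n) <= alpha:
            prespecified += weight
        if min(left_tail(x, n), right_tail(x, n)) <= alpha:
            post_hoc += weight
    return prespecified, post_hoc


pre, post = false_rejection_rates(n=20, alpha=0.05)
print(pre, post)  # the post hoc rate is twice the prespecified rate
```

Because the fair-coin distribution is symmetric, choosing the tail after looking at the data doubles the chance of rejecting a true hypothesis, even though each reported p-value is computed exactly as it would have been for a prespecified test.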

### Sample size

Researchers focusing solely on whether individual test results are significant or not may miss important response patterns which individually fall under the threshold set for tests of significance. Therefore along with tests of significance, it is preferable to examine effect-size statistics, which describe how large the effect is and the uncertainty around that estimate, so that the practical importance of the effect may be gauged by the reader.

## History

The phrase test of significance was coined by Ronald Fisher.[1] The term significance, used in a statistical sense, dates back to 1885.[2]

## Use in practice

Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than or equal to the significance level,[3] the null hypothesis is rejected at that level. Such results are informally referred to as 'statistically significant (at the p = 0.05 level, etc.)'. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence", a 0.001 level of statistical significance is being stated. The lower the significance level chosen, the stronger the evidence required. The choice of significance level is somewhat arbitrary, but for many applications, a level of 5% is chosen by convention.[4][5]
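The rejection rule can be written out as a short Python sketch that reports at which of the popular levels a given p-value leads to rejection. The names `LEVELS` and `significant_at` are hypothetical and used only for illustration.

```python
# Popular significance levels listed in the text, from least to most strict.
LEVELS = (0.1, 0.05, 0.01, 0.005, 0.001)


def significant_at(p_value: float, levels=LEVELS) -> list:
    """Return the levels at which the null hypothesis is rejected (p <= level)."""
    return [level for level in levels if p_value <= level]


print(significant_at(0.003))  # [0.1, 0.05, 0.01, 0.005]
print(significant_at(0.2))    # []
```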

In some situations it is convenient to express the complementary statistical significance (so 0.95 instead of 0.05), which corresponds to a quantile of the test statistic. In general, when interpreting a stated significance, one must be careful to note what, precisely, is being tested statistically.

Different levels of cutoff trade off countervailing effects. Lower levels – such as 0.01 instead of 0.05 – are stricter, and increase confidence in the determination of significance, but run an increased risk of failing to reject a false null hypothesis. Evaluation of a given p-value of data requires a degree of judgment, and rather than a strict cutoff, one may instead simply consider lower p-values as more significant.
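This trade-off can be made concrete with a small exact calculation. The sketch below assumes a right tail test of the fair-coin hypothesis based on the number of heads in $n$ tosses, and a coin whose true probability of heads is `true_p`; these choices and all function names are hypothetical and serve only as an illustration.

```python
from math import comb


def upper_tail(x: int, n: int, p: float) -> float:
    """Pr(X >= x) for a binomial count with n trials and success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def critical_count(n: int, alpha: float) -> int:
    """Smallest x whose right tail p-value under the fair-coin hypothesis is <= alpha."""
    for x in range(n + 1):
        if upper_tail(x, n, 0.5) <= alpha:
            return x
    return n + 1  # no outcome is extreme enough to reject


def miss_rate(n: int, alpha: float, true_p: float) -> float:
    """Probability of failing to reject the fair-coin hypothesis when it is false."""
    return 1.0 - upper_tail(critical_count(n, alpha), n, true_p)


for alpha in (0.05, 0.01):
    print(alpha, critical_count(20, alpha), miss_rate(20, alpha, true_p=0.7))
```

For 20 tosses, moving from a 0.05 to a 0.01 level raises the number of heads needed for rejection from 15 to 16, so a coin that is in fact biased towards heads escapes detection more often.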

## In terms of σ (sigma)

In some fields, for example nuclear and particle physics, it is common to express statistical significance in units of the standard deviation σ of a normal distribution. A statistical significance of "$n\sigma$" can be converted into a p-value by use of the cumulative distribution function Φ of the standard normal distribution, through the relation:

$\!p = 2 \left(1 - \Phi \left(n\right)\right),$ (this formula varies depending on whether a one-tailed or a two-tailed test is appropriate)

or via use of the error function:

$p = 1 - \operatorname{erf}\left(n/\sqrt{2}\right) .$

Tabulated values of these functions are often found in statistics textbooks: see standard normal table. The use of σ implicitly assumes a normal distribution of measurement values. For example, if a theory predicts that a parameter has a value of, say, 109 ± 3, and the parameter is measured to be 100, then one might report the measurement as a "3σ deviation" from the theoretical prediction. In terms of p-value, this statement is equivalent to saying that "assuming the theory is true, the probability of obtaining a result at least this far from the prediction by coincidence is 0.27%" (since 1 − erf(3/√2) = 0.0027) (again depending on whether a one-tailed test or two-tailed test is appropriate).
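The conversion from $n\sigma$ to a p-value can be sketched in Python with the standard library. The function name `sigma_to_p` is hypothetical; the one-tailed branch assumes that the deviation lies in the direction specified in advance, and the complementary error function `erfc` is used in place of `1 - erf` only because it keeps more precision for large $n$.

```python
from math import erfc, sqrt


def sigma_to_p(n_sigma: float, two_tailed: bool = True) -> float:
    """Convert a significance of n_sigma standard deviations into a p-value.

    Two-tailed: p = 1 - erf(n / sqrt(2)) = 2 * (1 - Phi(n)).
    One-tailed: p = 1 - Phi(n), half of the two-tailed value.
    """
    two_tailed_p = erfc(n_sigma / sqrt(2))  # erfc(z) equals 1 - erf(z)
    return two_tailed_p if two_tailed else two_tailed_p / 2


# Example from the text: prediction 109 +/- 3, measured value 100.
n_sigma = abs(100 - 109) / 3
print(n_sigma)                                # 3.0
print(sigma_to_p(n_sigma))                    # roughly 0.0027
print(sigma_to_p(n_sigma, two_tailed=False))  # roughly 0.00135
```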

Fixed significance levels such as those mentioned above may be regarded as useful in exploratory data analyses. However, modern practice is to quote the p-value explicitly, where the outcome of a test is essentially the final outcome of an experiment or other study. And, importantly, it should be stated whether the p-value is judged significant. This allows transferring the maximum information from a summary of the study into meta-analyses.

## Pitfalls and criticism

The scientific literature contains extensive discussion of the concept of statistical significance and in particular of its potential misuse and abuse.

## Signal–noise ratio conceptualisation of significance

Statistical significance can be considered the confidence one has in a given result. In a comparison study, it is dependent on the relative difference between the groups compared, the amount of measurement and the noise associated with the measurement. In other words, the confidence one has in a given result being non-random (i.e., it is not a consequence of chance) depends on the signal-to-noise ratio (SNR) and the sample size.

Expressed mathematically, the confidence that a result is not by random chance is given by the following formula by Sackett:[6]

$\mathrm{confidence} = \frac{\mathrm{signal}}{\mathrm{noise}} \times \sqrt{\mathrm{sample\ size}}.$

For clarity, the above formula is presented in tabular form below.

Dependence of confidence with noise, signal and sample size (tabular form)

| Parameter | Parameter increases | Parameter decreases |
|---|---|---|
| Noise | Confidence decreases | Confidence increases |
| Signal | Confidence increases | Confidence decreases |
| Sample size | Confidence increases | Confidence decreases |

In words, the confidence is high if the noise is low and/or the sample size is large and/or the effect size (signal) is large. The confidence of a result (and its associated confidence interval) is not dependent on effect size alone. If the sample size is large and the noise is low, a small effect size can be measured with great confidence. Whether a small effect size is considered important is dependent on the context of the events compared.
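The dependence described above can be tried out numerically. The following Python sketch simply evaluates the signal-to-noise formula; the function name `sackett_confidence` and the input values are hypothetical, and the result is a relative score rather than a probability.

```python
from math import sqrt


def sackett_confidence(signal: float, noise: float, sample_size: int) -> float:
    """Evaluate confidence = (signal / noise) * sqrt(sample size)."""
    if noise <= 0:
        raise ValueError("noise must be positive")
    if sample_size < 0:
        raise ValueError("sample size must be non-negative")
    return signal / noise * sqrt(sample_size)


base = sackett_confidence(signal=2.0, noise=4.0, sample_size=16)
print(base)                                  # 2.0
print(sackett_confidence(2.0, 4.0, 64))      # 4.0: four times the sample size doubles it
print(sackett_confidence(2.0, 8.0, 16))      # 1.0: twice the noise halves it
```

Because the sample size enters through a square root, halving the noise has the same effect on the score as quadrupling the sample size.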

In medicine, small effect sizes (reflected by small increases of risk) are often considered clinically relevant and are frequently used to guide treatment decisions if there is great confidence in them. Whether a given treatment is considered a worthy endeavour is dependent on the risks, benefits and costs.

## Does order of procedure affect statistical significance?

Order refers to which comes first: the test data or the specification of the hypotheses to be tested. When the hypotheses come first the test is "prospective", and when the data come first the test is "retrospective". Traditionally, prospective tests have been required.[7][8] However, there is a well-known, generally accepted hypothesis test in which the data preceded the hypotheses.[9] In that study the statistical significance was calculated the same as it would have been had the hypotheses preceded the data. A retrospective significance test can be used to separate promising and unpromising treatments, but a prospective test is required to justify scientific conclusions. "The reasoning behind statistical significance works well if you decide what effect you are seeking, design an experiment or sample to search for it, and use a test of significance to weigh the evidence that you get."[10] (p. 465) "You cannot legitimately test a hypothesis on the same data that first suggested that hypothesis."[10] (p. 466) A related question in the use of statistics in the physical sciences is whether probability theory applies to the known past in the same way that it applies to the unknown future. Although these questions have been discussed,[11] there are few references in this area of statistics. It hardly seems reasonable to accord the same status to a hypothesis that explains the results of an experiment after the results are known as to a hypothesis that predicts the results of an experiment before they are known. This is because it is well known that predicting an event before it occurs is more difficult than explaining it after it occurs.