“This it seems to me is characteristic of most judgments involving physical processes — the law of the excluded middle is not a valid description of our actual physical experience — there has to be a third category of doubtful, in addition to positive or negative.”

P. W. Bridgman, The Nature of Physical Theory

1) The multiple testing issue

With the emergence of titanic biological datasets generated by the latest omics technologies, the problem of correction of the significance threshold (the conventional \alpha = 0.05) for multiple testing has required a careful revision and update. The simplest way to correct for multiple testing is the Bonferroni correction (a derivation is provided in par. 5): in this method, the threshold for statistical significance is given by


where the denominator represents the number of independent variables. Now, in the case of genetic testing, how should we set the value of m? Let’s say that we are comparing two populations of individuals (one with a disease, the other made up by healthy subjects) and we have genotyped N variants in each one of them. Once we have associated a p value to every single variant, we must face the problem of correcting it for multiple testing by using Eq. 1. What should we do? Well, it is simple, we put m\,=\,N, right? Wrong!

2) Linkage disequilibrium

In general, variants belonging to the same chromosome are not inherited in an independent fashion: we know that whole chunks of chromosomes of about 50 kb pass from one generation to the other. These blocks are called haplotypes and there has been a considerable effort in classifying them for various human populations in the last 20 years. A map of human haplotypes is now available (HapMap) and it allows for the calculation of the real number of independent variants (often called tag SNPs) on the human Genome, and for other pivotal purposes I won’t write here about, like for instance imputation (a technique that dramatically reduces the number of variants we must genotype in order to scan the genome of a human being) (The International HapMap Consortium 2005), (The International HapMap Consortium, 2009). Visit the page of the International HapMap Consortium here. For us, what is important here is the fact that the knowledge of the haplotypes gives us a measure of the linkage disequilibrium between any possible pair of variants on the same chromosome. But what is linkage disequilibrium? Let’s say that we have two loci L_1 and L_2 on the same chromosome. L_1 has the two alleles A and a, while L_2 presents the two alleles B and b of frequencies f(A),\,f(a),\,f(B),\,f(b), respectively. Let’s define now the random variable X whose value is one if L_1 is occupied by A, zero otherwise; similarly, let’s define the random variable Y such that it is one if L_2 is occupied by B, zero otherwise. We can say that both these variables have a Bernoulli distribution with expectations given by E[X]\,=\,f(A) and E[Y]\,=\,f(B), respectively. If we now define f(A,\,B) the frequency of the haplotype AB, and if we similarly define f(A,\,b),\,f(a,\,b),\,f(a,B), we can then write the following relationships:

2)\;\;\begin{cases} f(Ab)\,=\,f(A)\,-\,f(aB)\\f(aB)\,=\,f(B)\,-\,f(AB)\\f(ab)\,=\,1\,-\,f(B)\,-\,f(A)\,+\,f(AB)\\f(AB)\,+\,f(Ab)\,+\,f(aB)\,+\,f(ab)\,=\,1\\f(A)\,+\,f(a)\,=\,f(B)\,+\,f(b)\,=\,1 \end{cases}\

where we have considered that f(A,\,B)\,=\,E[XY], f(A,\,b)\,=\,E[X(1-Y)]\,= E[X]\,-\,E[XY]\,.... That said, the covariance between X,\,Y is given by:

3)\;\; Cov[X,\,Y]\,=\,E[X\,Y]\,-\,E[X]E[Y]\,=

\,=\, f(AB)\,-\,f(A)f(B)\,=\,f(Ab)\,-\,f(A)f(b)\,=


Or also (with some algebraic manipulation, by substituting the relationship of Eq. 2):


That said, one definition of Linkage disequilibrium is just


Figure 1. Left: the values admitted for d are those that fall within the shell; in particular, for f(aB) = f(Ab) = 0 (coupling) we have that d assumes the values of the curve coloured blue, while for f(AB) = f(ab) = 0 (repulsion) then d describes the curve in orange. These two curves are plotted in Figure 2, too.

This definition presents a problem, namely the fact that when we have a complete dependence between L_1 and L_2, its value varies in ]0,0.25] (see Figure 1, Figure 2, and paragraph 6 for a detailed discussion) and this does not seem to have a particular biological meaning. It would be more interesting to have a definition such that Linkage disequilibrium assumes a unique value (for example 1) when we have complete dependence between the two loci. Note that we can accomplish this by considering the correlation coefficient between X\,=\,Y instead, which is given by:

6)\;\;\rho_{XY}\,=\,\frac{Cov[X,\,Y]}{\sqrt {Var(X)Var(Y)}}\,=\frac{d}{\sqrt {Var(X)Var(Y)}},

Thus, in the present article we consider the following definition for linkage disequilibrium:


where we have considered that Var[X]\,=\,E[X](1-E[X])\,=\,f(A)f(a) and Var[Y]\,=\,E[Y](1-E[Y])\,=\,f(B)f(b). For further details on this topic, go to paragraph 6. See also (Reginald D Smith 2020).

Figure 2. For f(aB) = f(Ab) = 0 (coupling) we have that d assumes the values of the curve coloured blue, while for f(AB) = f(ab) = 0 (repulsion) then d describes the curve in orange.

3) The search for the number of independent loci

Consider now that the number of loci in linkage equilibrium one with the other is the value m in Eq. 1, the one we are interested in. From an operational point of view, how can we determine if two loci are in linkage equilibrium so that they can be considered independent one from the other? We can randomly build the genome of 2,000 human beings with the only restrain that haplotypes must be respected (this is where the HapMap comes into play), we then randomly assign them to two populations and we calculate for each locus the p value for the comparison of these two populations. Let’s say that locus L_i has a p value p_i. Since these two groups are identical by construction, it must be p\,>\,\frac{\alpha}{m} or, in other words:


If we repeat this inequality for each locus, we have a lower bound for m. That number is the number of independent loci we were searching for (Pe’er, I. et al. 2008).

4) Experimental results

I discuss here the calculation of m performed in a recent study, the best I have been able to find on this topic (Fadista, J. et al. 2016). The strength of this study is that they calculated m for two kinds of sequencing techniques: whole-genome sequencing (WGS) and whole-exome sequencing (WES). Not only that, but they also provided the results for different European and non-European populations, for different definitions of linkage disequilibrium, and for different thresholds for minor allele frequency (MAF). These results are plotted in their paper and they are also available in xlsx format (here, even though I must say that they omitted part of raw data on different ethnicities). Nevertheless, I have elaborated them in my own way (the script that draws the diagrams is available below) by using a different arrangement of the data and by providing a fancy polynomial interpolation, by cubic splines, as to have smooth and elegant surfaces and curves. In Figure 3 I plotted m in function of the threshold for the minor allele frequency (MAF) and for the linkage disequilibrium, in the case of whole-genome sequencing (WGS). In that plot, it is assumed that MAF is the lower bound for the minor allele frequency while the axis ‘Linkage disequilibrium’ is the upper limit for the value \rho^2 in Eq. 7. Figure 4 is exactly the same diagram, but in this case, it has been built by considering data from whole-exome sequencing (WES). Some observations: the number of independent variants is bigger in WGS than it is in WES (which is due to the fact that in the former case we are considering more genetic material than in the latter one); when we reduce MAF, m increases (reducing MAF means considering more variants, so again, this leads to a high number of independent variants); m decreases with Linkage disequilibrium (the threshold for considering two variants as independent one from the other). This last point deserves a further explanation: in this context, when we say that LD is, let’s say, 0.8 it means that we consider two variants to be independent when their linkage disequilibrium (calculated as in Eq. 7) is below 0.8; as we reduce this value, we define a more stringent criterion for independence between pair of variants, thus the number m must decrease.

Figure 3. The number of independent variants in whole-genome sequencing, in function of the lower limit for minor allele frequency and of the upper limit for Linkage disequilibrium.
Figure 4. The number of independent variants in whole-exome sequencing, in function of the lower limit for minor allele frequency and of the upper limit for Linkage disequilibrium.
\geq\,0.05 782,3616.39\cdot10^{-8} 7.19
\geq\,0.01 1,580,6503.16\cdot10^{-8} 7.50
\geq\,0.005 2,093,8872.39\cdot10^{-8} 7.62
\geq\,0.001 4,126,1251.21\cdot10^{-8} 7.92
Table 1. This table gives the values m, p in function of the lower limit for minor allele frequency, in the case of whole-genome sequencing. It is assumed that \rho^2\,\leq\,0.8 .

If we consider the upper limit for \rho^2 to be 0.8, we obtain Table 1 and Table 2, where both m and corresponding Bonferroni corrections are collected, for various MAF thresholds.

\geq\,0.05 39,3831.27\cdot10^{-6} 5.90
\geq\,0.01 69,2657.22\cdot10^{-7} 6.14
\geq\,0.005 86,9275.75\cdot10^{-7} 6.24
\geq\,0.001 161,8333.09\cdot10^{-7} 6.51
Table 2. This table gives the values m, p in function of the lower limit for minor allele frequency, in the case of whole-exome sequencing. It is assumed that \rho^2\,\leq\,0.8 .

In Figures 5, 6 you can find these same data after interpolation, arranged as a curve where p is expressed in function of m and MAF.

Figure 5. The corrected threshold for statistical significance in function of the number of independent variants and of the lower limit for minor allele frequency, for WGS.
Figure 6. The corrected threshold for statistical significance in function of the number of independent variants and of the lower limit for minor allele frequency, for WES.

5) How to derive the Bonferroni correction

If we have a threshold \alpha' for statistical significance and we perform m tests, then the probability that one test is positive by chance is P\,<\,\alpha' and the probability that one test is negative by chance is 1\,-\,P\,>\,1\,-\,\alpha'. This also means that the probability that all the m tests are negative is (1\,-\,P)^m\,>\,(1\,-\,\alpha')^m. In conclusion, the probability that at least one test is positive pureley by chance is


We want this probability to be lower than \alpha\,=\,0.05 , which leads to the following inequality:


This value \alpha' is the corrected threshold for statistical significance, accorrding to the Bonferroni correction. But this is not the correction in Eq. 1, you might note. Yes, that is right. But let’s consider the well known expansion by series:

11)\;\;(1\,-\,\alpha)^{\frac{1}{m}}\,=\,\sum_{k\,=\,0}^{\infty}(-\,\alpha)^k\begin{pmatrix} \frac{1}{m}\\k \end{pmatrix}

If we take only the first two addends of the sum and if we substitute them in the expression already found for \alpha', we obtain Eq. 1. The error (its absolute value) that we make when we consider Eq. 1 instead of Eq. 8 is the one plotted in Figure 7, in function of m, for \alpha\,=\,0.05 .

Figure 7. The absolute value of 1\,-\,\sqrt[m]{1\,-\,\alpha}\,-\,\frac{\alpha}{m} in function of m for \alpha\,=\,0.05

6) Mathematical notes on linkage disequilibrium

By substituting the fourth relationship of Eq. 2 in Eq. 4, we obtain the surface



which is a quadratic form in the space of the frequencies of the haplotypes ab, Ab, aB. More precisely, and even though this is of no importance for us, this is a hyperbolic cylinder (as the reader can easily verify by studying its associated matrix). If we now ask that f(Ab)\,\in\,[0,1] and f(AB)\,\in\,[0,1] for f(aB)\,\in\,[0,1], f(ab)\,\in\,[0,1], we find the relationships

\begin{cases} d\,\geq\,-f(aB)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\leq\,f(ab)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\geq\,-\,f(aB)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\d\,\leq\,f(ab)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\f(aB)\,\in\,[0,1],\,f(ab)\,\in\,[0,1]\end{cases}

Taken togheter, these inequations say that d can assume only the values included within the elegant surface in Figure 1. In particular, when f(Ab)\,=\,f(aB)\,=\,0 (a condition of perfect disequilibrium which goes under the name of coupling), then d assumes the values of the curve coloured blue, while when f(AB)\,=\,f(ab)\,=\,0 (repulsion) then d assumes the value on the curve coloured orange. These two courves are plotted in Figure 2 too. Note that f(Ab)\,=\,0\,\Rightarrow\,f(aB)\,=\,0; moreover, f(AB)\,=\,0\,\Rightarrow\,f(ab)\,=\,0 (this can be proved easily by using Eq. 2).

Note that the application of the method of the Lagrange multipliers for restrained optimization to Eq. 4, with the restrains defined by the fourth relationship of Eq. 2, would not lead to any conclusion in this case, because it is equal to the search for the extremes of a function whose gradient is never zero.

7) Supplementary Material

The script that plots Figures 1 and 2 can be downloaded here.

The script that plots Figures 3, 4, and 5 can be downloaded here.

The script that plots Figure 7 can be downloaded here.

Data from (Fadista, J. et al. 2016) can be downloaded here.

The equations of this blog post were written using \LaTeX (see this article)


2 thoughts on “P-value threshold for genetic studies

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s