“This it seems to me is characteristic of most judgments involving physical processes — the law of the excluded middle is not a valid description of our actual physical experience — there has to be a third category of doubtful, in addition to positive or negative.”

P. W. Bridgman, The Nature of Physical Theory

1) The multiple testing issue

With the emergence of titanic biological datasets generated by the latest omics technologies, the problem of correction of the significance threshold (the conventional $\alpha = 0.05$) for multiple testing has required a careful revision and update. The simplest way to correct for multiple testing is the Bonferroni correction (a derivation is provided in par. 5): in this method, the threshold for statistical significance is given by

$1)\;\;p\,<\,\frac{\alpha}{m}$

where the denominator represents the number of independent variables. Now, in the case of genetic testing, how should we set the value of $m$? Let’s say that we are comparing two populations of individuals (one with a disease, the other made up by healthy subjects) and we have genotyped N variants in each one of them. Once we have associated a p value to every single variant, we must face the problem of correcting it for multiple testing by using Eq. 1. What should we do? Well, it is simple, we put $m\,=\,N$, right? Wrong!

In general, variants belonging to the same chromosome are not inherited in an independent fashion: we know that whole chunks of chromosomes of about 50 kb pass from one generation to the other. These blocks are called haplotypes and there has been a considerable effort in classifying them for various human populations in the last 20 years. A map of human haplotypes is now available (HapMap) and it allows for the calculation of the real number of independent variants (often called tag SNPs) on the human Genome, and for other pivotal purposes I won’t write here about, like for instance imputation (a technique that dramatically reduces the number of variants we must genotype in order to scan the genome of a human being) (The International HapMap Consortium 2005), (The International HapMap Consortium, 2009). Visit the page of the International HapMap Consortium here. For us, what is important here is the fact that the knowledge of the haplotypes gives us a measure of the linkage disequilibrium between any possible pair of variants on the same chromosome. But what is linkage disequilibrium? Let’s say that we have two loci $L_1$ and $L_2$ on the same chromosome. $L_1$ has the two alleles $A$ and $a$, while $L_2$ presents the two alleles $B$ and $b$ of frequencies $f(A),\,f(a),\,f(B),\,f(b)$, respectively. Let’s define now the random variable $X$ whose value is one if $L_1$ is occupied by $A$, zero otherwise; similarly, let’s define the random variable $Y$ such that it is one if $L_2$ is occupied by $B$, zero otherwise. We can say that both these variables have a Bernoulli distribution with expectations given by $E[X]\,=\,f(A)$ and $E[Y]\,=\,f(B)$, respectively. If we now define $f(A,\,B)$ the frequency of the haplotype $AB$, and if we similarly define $f(A,\,b),\,f(a,\,b),\,f(a,B)$, we can then write the following relationships:

$2)\;\;\begin{cases} f(Ab)\,=\,f(A)\,-\,f(aB)\\f(aB)\,=\,f(B)\,-\,f(AB)\\f(ab)\,=\,1\,-\,f(B)\,-\,f(A)\,+\,f(AB)\\f(AB)\,+\,f(Ab)\,+\,f(aB)\,+\,f(ab)\,=\,1\\f(A)\,+\,f(a)\,=\,f(B)\,+\,f(b)\,=\,1 \end{cases}\$

where we have considered that $f(A,\,B)\,=\,E[XY]$, $f(A,\,b)\,=\,E[X(1-Y)]\,=$ $E[X]\,-\,E[XY]\,...$. That said, the covariance between $X,\,Y$ is given by:

$3)\;\; Cov[X,\,Y]\,=\,E[X\,Y]\,-\,E[X]E[Y]\,=$

$\,=\, f(AB)\,-\,f(A)f(B)\,=\,f(Ab)\,-\,f(A)f(b)\,=$

$=\,f(aB)\,-\,f(a)f(B)\,=\,f(ab)\,-\,f(a)f(b)$

Or also (with some algebraic manipulation, by substituting the relationship of Eq. 2):

$4)\;\;Cov[X,\,Y]\,=\,f(AB)f(ab)\,-\,f(Ab)f(aB)$

That said, one definition of Linkage disequilibrium is just

$5)\;\;d\,=\,Cov[X,\,Y]$

This definition presents a problem, namely the fact that when we have a complete dependence between $L_1$ and $L_2$, its value varies in $]0,0.25]$ (see Figure 1, Figure 2, and paragraph 6 for a detailed discussion) and this does not seem to have a particular biological meaning. It would be more interesting to have a definition such that Linkage disequilibrium assumes a unique value (for example 1) when we have complete dependence between the two loci. Note that we can accomplish this by considering the correlation coefficient between $X\,=\,Y$ instead, which is given by:

$6)\;\;\rho_{XY}\,=\,\frac{Cov[X,\,Y]}{\sqrt {Var(X)Var(Y)}}\,=\frac{d}{\sqrt {Var(X)Var(Y)}},$

Thus, in the present article we consider the following definition for linkage disequilibrium:

$7)\;\;linkage\,disequilibrium\,=\,\rho^2\,=\,\frac{d^2}{f(A)f(a)f(B)f(b)}$

where we have considered that $Var[X]\,=\,E[X](1-E[X])\,=\,f(A)f(a)$ and $Var[Y]\,=\,E[Y](1-E[Y])\,=\,f(B)f(b)$. For further details on this topic, go to paragraph 6. See also (Reginald D Smith 2020).

3) The search for the number of independent loci

Consider now that the number of loci in linkage equilibrium one with the other is the value m in Eq. 1, the one we are interested in. From an operational point of view, how can we determine if two loci are in linkage equilibrium so that they can be considered independent one from the other? We can randomly build the genome of 2,000 human beings with the only restrain that haplotypes must be respected (this is where the HapMap comes into play), we then randomly assign them to two populations and we calculate for each locus the p value for the comparison of these two populations. Let’s say that locus $L_i$ has a p value $p_i$. Since these two groups are identical by construction, it must be $p\,>\,\frac{\alpha}{m}$ or, in other words:

$8)\;\;m\,>\,\frac{\alpha}{p_i}$

If we repeat this inequality for each locus, we have a lower bound for m. That number is the number of independent loci we were searching for (Pe’er, I. et al. 2008).

4) Experimental results

I discuss here the calculation of m performed in a recent study, the best I have been able to find on this topic (Fadista, J. et al. 2016). The strength of this study is that they calculated m for two kinds of sequencing techniques: whole-genome sequencing (WGS) and whole-exome sequencing (WES). Not only that, but they also provided the results for different European and non-European populations, for different definitions of linkage disequilibrium, and for different thresholds for minor allele frequency (MAF). These results are plotted in their paper and they are also available in xlsx format (here, even though I must say that they omitted part of raw data on different ethnicities). Nevertheless, I have elaborated them in my own way (the script that draws the diagrams is available below) by using a different arrangement of the data and by providing a fancy polynomial interpolation, by cubic splines, as to have smooth and elegant surfaces and curves. In Figure 3 I plotted m in function of the threshold for the minor allele frequency (MAF) and for the linkage disequilibrium, in the case of whole-genome sequencing (WGS). In that plot, it is assumed that MAF is the lower bound for the minor allele frequency while the axis ‘Linkage disequilibrium’ is the upper limit for the value $\rho^2$ in Eq. 7. Figure 4 is exactly the same diagram, but in this case, it has been built by considering data from whole-exome sequencing (WES). Some observations: the number of independent variants is bigger in WGS than it is in WES (which is due to the fact that in the former case we are considering more genetic material than in the latter one); when we reduce MAF, m increases (reducing MAF means considering more variants, so again, this leads to a high number of independent variants); m decreases with Linkage disequilibrium (the threshold for considering two variants as independent one from the other). This last point deserves a further explanation: in this context, when we say that LD is, let’s say, 0.8 it means that we consider two variants to be independent when their linkage disequilibrium (calculated as in Eq. 7) is below 0.8; as we reduce this value, we define a more stringent criterion for independence between pair of variants, thus the number m must decrease.

If we consider the upper limit for $\rho^2$ to be 0.8, we obtain Table 1 and Table 2, where both m and corresponding Bonferroni corrections are collected, for various MAF thresholds.

In Figures 5, 6 you can find these same data after interpolation, arranged as a curve where p is expressed in function of m and MAF.

5) How to derive the Bonferroni correction

If we have a threshold $\alpha'$ for statistical significance and we perform m tests, then the probability that one test is positive by chance is $P\,<\,\alpha'$ and the probability that one test is negative by chance is $1\,-\,P\,>\,1\,-\,\alpha'$. This also means that the probability that all the m tests are negative is $(1\,-\,P)^m\,>\,(1\,-\,\alpha')^m$. In conclusion, the probability that at least one test is positive pureley by chance is

$9)\;\;p\,=\,1\,-\,(1\,-\,P)^m\,<\,1\,-\,(1\,-\,\alpha')^m$

We want this probability to be lower than $\alpha\,=\,0.05$, which leads to the following inequality:

$10)\;\;\alpha'\,<\,1\,-\,\sqrt[m]{1\,-\,\alpha}$

This value $\alpha'$ is the corrected threshold for statistical significance, accorrding to the Bonferroni correction. But this is not the correction in Eq. 1, you might note. Yes, that is right. But let’s consider the well known expansion by series:

$11)\;\;(1\,-\,\alpha)^{\frac{1}{m}}\,=\,\sum_{k\,=\,0}^{\infty}(-\,\alpha)^k\begin{pmatrix} \frac{1}{m}\\k \end{pmatrix}$

If we take only the first two addends of the sum and if we substitute them in the expression already found for $\alpha'$, we obtain Eq. 1. The error (its absolute value) that we make when we consider Eq. 1 instead of Eq. 8 is the one plotted in Figure 7, in function of m, for $\alpha\,=\,0.05$.

6) Mathematical notes on linkage disequilibrium

By substituting the fourth relationship of Eq. 2 in Eq. 4, we obtain the surface

$f(ab)\,-\,f(ab)f(Ab)\,-\,f(ab)f(aB)\,-$

$-\,f^2(ab)\,-\,f(Ab)f(aB)\,-\,d\,=\,0$

which is a quadratic form in the space of the frequencies of the haplotypes ab, Ab, aB. More precisely, and even though this is of no importance for us, this is a hyperbolic cylinder (as the reader can easily verify by studying its associated matrix). If we now ask that $f(Ab)\,\in\,[0,1]$ and $f(AB)\,\in\,[0,1]$ for $f(aB)\,\in\,[0,1]$, $f(ab)\,\in\,[0,1]$, we find the relationships

$\begin{cases} d\,\geq\,-f(aB)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\leq\,f(ab)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\geq\,-\,f(aB)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\d\,\leq\,f(ab)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\f(aB)\,\in\,[0,1],\,f(ab)\,\in\,[0,1]\end{cases}$

Taken togheter, these inequations say that d can assume only the values included within the elegant surface in Figure 1. In particular, when $f(Ab)\,=\,f(aB)\,=\,0$ (a condition of perfect disequilibrium which goes under the name of coupling), then d assumes the values of the curve coloured blue, while when $f(AB)\,=\,f(ab)\,=\,0$ (repulsion) then d assumes the value on the curve coloured orange. These two courves are plotted in Figure 2 too. Note that $f(Ab)\,=\,0\,\Rightarrow\,f(aB)\,=\,0$; moreover, $f(AB)\,=\,0\,\Rightarrow\,f(ab)\,=\,0$ (this can be proved easily by using Eq. 2).

Note that the application of the method of the Lagrange multipliers for restrained optimization to Eq. 4, with the restrains defined by the fourth relationship of Eq. 2, would not lead to any conclusion in this case, because it is equal to the search for the extremes of a function whose gradient is never zero.

7) Supplementary Material

The script that plots Figures 1 and 2 can be downloaded here.

The script that plots Figures 3, 4, and 5 can be downloaded here.

The equations of this blog post were written using $\LaTeX$ (see this article)