# An update on the UK Biobank: fine-mapping

The man who does not make his own tools does not make his own sculpture.

Irving Stone, The Agony and The Ecstasy

An analysis of the UK Biobank database carried out by Neale’s Lab pointed out a significant correlation between a region of chromosome 13 and females with self-reported Chronic Fatigue Syndrome (see this blog post). This result was further analyzed in a subsequent paper in which it was suggested that the causal variant within this region may be rs7337312 (position 13:41353297:G:A on reference genome GRCh37, forward strand) (Dibble J. et al. 2020). It is included in a regulatory region of gene SLC25A15, which encodes mitochondrial ornithine transporter I, and the potential biological effects of this variant have been discussed in the aforementioned study. I made an attempt at my own discussion of this variant (here).

It is not unusual for a GWAS study to associate a trait with a long sequence of variants belonging to the same chromosome and in a close physical relationship one with the other, due to high linkage disequilibrium often present between variants close to one another (see this blog post). Several methods have been developed in the last years to further refine the association and find the variant that causes the disease and drives the association of its nearby SNPs. These methods are based on both purely mathematical considerations and on the analysis of the known consequences of the variants, and we refer to them by the evocative name of fine-mapping (Schaid DJ et al. 2018).

I tried to develop my own method and software for a purely mathematical approach to fine-mapping and I applied it to the region associated with ME/CFS in females, and I found variant rs11147812 (13:41404706:T:C on h19) as the most likely causal one. The detailed discussion of the method I followed and the scripts I wrote are available here for download. It might be noted that this variant belongs to a genetic entity (pseudogene TPTE2P5) that harbors variants that have been strongly associated with the following hematic parameters (see GWAS catalog).

The alternative allele of rs11147812 has been reported to increase (effect size > 0) the expression of gene SLC25A15 in fat (see this page) with a p value of $2.85\cdot10^{-20}$. Since the alternative variant is less frequent in patients than in controls, the prediction is that we have a reduced expression of SLC25A15 in fat, in female CFS patients.

Note. The header image shows the convergence of the linear method used by my script for fine-mapping the region on chromosome 13. The detailed discussion of the method I followed and the scripts I wrote are available here for download.

# P-value threshold for genetic studies

“This it seems to me is characteristic of most judgments involving physical processes — the law of the excluded middle is not a valid description of our actual physical experience — there has to be a third category of doubtful, in addition to positive or negative.”

P. W. Bridgman, The Nature of Physical Theory

1) The multiple testing issue

With the emergence of titanic biological datasets generated by the latest omics technologies, the problem of correction of the significance threshold (the conventional $\alpha = 0.05$) for multiple testing has required a careful revision and update. The simplest way to correct for multiple testing is the Bonferroni correction (a derivation is provided in par. 5): in this method, the threshold for statistical significance is given by

$1)\;\;p\,<\,\frac{\alpha}{m}$

where the denominator represents the number of independent variables. Now, in the case of genetic testing, how should we set the value of $m$? Let’s say that we are comparing two populations of individuals (one with a disease, the other made up by healthy subjects) and we have genotyped N variants in each one of them. Once we have associated a p value to every single variant, we must face the problem of correcting it for multiple testing by using Eq. 1. What should we do? Well, it is simple, we put $m\,=\,N$, right? Wrong!

In general, variants belonging to the same chromosome are not inherited in an independent fashion: we know that whole chunks of chromosomes of about 50 kb pass from one generation to the other. These blocks are called haplotypes and there has been a considerable effort in classifying them for various human populations in the last 20 years. A map of human haplotypes is now available (HapMap) and it allows for the calculation of the real number of independent variants (often called tag SNPs) on the human Genome, and for other pivotal purposes I won’t write here about, like for instance imputation (a technique that dramatically reduces the number of variants we must genotype in order to scan the genome of a human being) (The International HapMap Consortium 2005), (The International HapMap Consortium, 2009). Visit the page of the International HapMap Consortium here. For us, what is important here is the fact that the knowledge of the haplotypes gives us a measure of the linkage disequilibrium between any possible pair of variants on the same chromosome. But what is linkage disequilibrium? Let’s say that we have two loci $L_1$ and $L_2$ on the same chromosome. $L_1$ has the two alleles $A$ and $a$, while $L_2$ presents the two alleles $B$ and $b$ of frequencies $f(A),\,f(a),\,f(B),\,f(b)$, respectively. Let’s define now the random variable $X$ whose value is one if $L_1$ is occupied by $A$, zero otherwise; similarly, let’s define the random variable $Y$ such that it is one if $L_2$ is occupied by $B$, zero otherwise. We can say that both these variables have a Bernoulli distribution with expectations given by $E[X]\,=\,f(A)$ and $E[Y]\,=\,f(B)$, respectively. If we now define $f(A,\,B)$ the frequency of the haplotype $AB$, and if we similarly define $f(A,\,b),\,f(a,\,b),\,f(a,B)$, we can then write the following relationships:

$2)\;\;\begin{cases} f(Ab)\,=\,f(A)\,-\,f(aB)\\f(aB)\,=\,f(B)\,-\,f(AB)\\f(ab)\,=\,1\,-\,f(B)\,-\,f(A)\,+\,f(AB)\\f(AB)\,+\,f(Ab)\,+\,f(aB)\,+\,f(ab)\,=\,1\\f(A)\,+\,f(a)\,=\,f(B)\,+\,f(b)\,=\,1 \end{cases}\$

where we have considered that $f(A,\,B)\,=\,E[XY]$, $f(A,\,b)\,=\,E[X(1-Y)]\,=$ $E[X]\,-\,E[XY]\,...$. That said, the covariance between $X,\,Y$ is given by:

$3)\;\; Cov[X,\,Y]\,=\,E[X\,Y]\,-\,E[X]E[Y]\,=$

$\,=\, f(AB)\,-\,f(A)f(B)\,=\,f(Ab)\,-\,f(A)f(b)\,=$

$=\,f(aB)\,-\,f(a)f(B)\,=\,f(ab)\,-\,f(a)f(b)$

Or also (with some algebraic manipulation, by substituting the relationship of Eq. 2):

$4)\;\;Cov[X,\,Y]\,=\,f(AB)f(ab)\,-\,f(Ab)f(aB)$

That said, one definition of Linkage disequilibrium is just

$5)\;\;d\,=\,Cov[X,\,Y]$

This definition presents a problem, namely the fact that when we have a complete dependence between $L_1$ and $L_2$, its value varies in $]0,0.25]$ (see Figure 1, Figure 2, and paragraph 6 for a detailed discussion) and this does not seem to have a particular biological meaning. It would be more interesting to have a definition such that Linkage disequilibrium assumes a unique value (for example 1) when we have complete dependence between the two loci. Note that we can accomplish this by considering the correlation coefficient between $X\,=\,Y$ instead, which is given by:

$6)\;\;\rho_{XY}\,=\,\frac{Cov[X,\,Y]}{\sqrt {Var(X)Var(Y)}}\,=\frac{d}{\sqrt {Var(X)Var(Y)}},$

Thus, in the present article we consider the following definition for linkage disequilibrium:

$7)\;\;linkage\,disequilibrium\,=\,\rho^2\,=\,\frac{d^2}{f(A)f(a)f(B)f(b)}$

where we have considered that $Var[X]\,=\,E[X](1-E[X])\,=\,f(A)f(a)$ and $Var[Y]\,=\,E[Y](1-E[Y])\,=\,f(B)f(b)$. For further details on this topic, go to paragraph 6. See also (Reginald D Smith 2020).

3) The search for the number of independent loci

Consider now that the number of loci in linkage equilibrium one with the other is the value m in Eq. 1, the one we are interested in. From an operational point of view, how can we determine if two loci are in linkage equilibrium so that they can be considered independent one from the other? We can randomly build the genome of 2,000 human beings with the only restrain that haplotypes must be respected (this is where the HapMap comes into play), we then randomly assign them to two populations and we calculate for each locus the p value for the comparison of these two populations. Let’s say that locus $L_i$ has a p value $p_i$. Since these two groups are identical by construction, it must be $p\,>\,\frac{\alpha}{m}$ or, in other words:

$8)\;\;m\,>\,\frac{\alpha}{p_i}$

If we repeat this inequality for each locus, we have a lower bound for m. That number is the number of independent loci we were searching for (Pe’er, I. et al. 2008).

4) Experimental results

I discuss here the calculation of m performed in a recent study, the best I have been able to find on this topic (Fadista, J. et al. 2016). The strength of this study is that they calculated m for two kinds of sequencing techniques: whole-genome sequencing (WGS) and whole-exome sequencing (WES). Not only that, but they also provided the results for different European and non-European populations, for different definitions of linkage disequilibrium, and for different thresholds for minor allele frequency (MAF). These results are plotted in their paper and they are also available in xlsx format (here, even though I must say that they omitted part of raw data on different ethnicities). Nevertheless, I have elaborated them in my own way (the script that draws the diagrams is available below) by using a different arrangement of the data and by providing a fancy polynomial interpolation, by cubic splines, as to have smooth and elegant surfaces and curves. In Figure 3 I plotted m in function of the threshold for the minor allele frequency (MAF) and for the linkage disequilibrium, in the case of whole-genome sequencing (WGS). In that plot, it is assumed that MAF is the lower bound for the minor allele frequency while the axis ‘Linkage disequilibrium’ is the upper limit for the value $\rho^2$ in Eq. 7. Figure 4 is exactly the same diagram, but in this case, it has been built by considering data from whole-exome sequencing (WES). Some observations: the number of independent variants is bigger in WGS than it is in WES (which is due to the fact that in the former case we are considering more genetic material than in the latter one); when we reduce MAF, m increases (reducing MAF means considering more variants, so again, this leads to a high number of independent variants); m decreases with Linkage disequilibrium (the threshold for considering two variants as independent one from the other). This last point deserves a further explanation: in this context, when we say that LD is, let’s say, 0.8 it means that we consider two variants to be independent when their linkage disequilibrium (calculated as in Eq. 7) is below 0.8; as we reduce this value, we define a more stringent criterion for independence between pair of variants, thus the number m must decrease.

If we consider the upper limit for $\rho^2$ to be 0.8, we obtain Table 1 and Table 2, where both m and corresponding Bonferroni corrections are collected, for various MAF thresholds.

In Figures 5, 6 you can find these same data after interpolation, arranged as a curve where p is expressed in function of m and MAF.

5) How to derive the Bonferroni correction

If we have a threshold $\alpha'$ for statistical significance and we perform m tests, then the probability that one test is positive by chance is $P\,<\,\alpha'$ and the probability that one test is negative by chance is $1\,-\,P\,>\,1\,-\,\alpha'$. This also means that the probability that all the m tests are negative is $(1\,-\,P)^m\,>\,(1\,-\,\alpha')^m$. In conclusion, the probability that at least one test is positive pureley by chance is

$9)\;\;p\,=\,1\,-\,(1\,-\,P)^m\,<\,1\,-\,(1\,-\,\alpha')^m$

We want this probability to be lower than $\alpha\,=\,0.05$, which leads to the following inequality:

$10)\;\;\alpha'\,<\,1\,-\,\sqrt[m]{1\,-\,\alpha}$

This value $\alpha'$ is the corrected threshold for statistical significance, accorrding to the Bonferroni correction. But this is not the correction in Eq. 1, you might note. Yes, that is right. But let’s consider the well known expansion by series:

$11)\;\;(1\,-\,\alpha)^{\frac{1}{m}}\,=\,\sum_{k\,=\,0}^{\infty}(-\,\alpha)^k\begin{pmatrix} \frac{1}{m}\\k \end{pmatrix}$

If we take only the first two addends of the sum and if we substitute them in the expression already found for $\alpha'$, we obtain Eq. 1. The error (its absolute value) that we make when we consider Eq. 1 instead of Eq. 8 is the one plotted in Figure 7, in function of m, for $\alpha\,=\,0.05$.

6) Mathematical notes on linkage disequilibrium

By substituting the fourth relationship of Eq. 2 in Eq. 4, we obtain the surface

$f(ab)\,-\,f(ab)f(Ab)\,-\,f(ab)f(aB)\,-$

$-\,f^2(ab)\,-\,f(Ab)f(aB)\,-\,d\,=\,0$

which is a quadratic form in the space of the frequencies of the haplotypes ab, Ab, aB. More precisely, and even though this is of no importance for us, this is a hyperbolic cylinder (as the reader can easily verify by studying its associated matrix). If we now ask that $f(Ab)\,\in\,[0,1]$ and $f(AB)\,\in\,[0,1]$ for $f(aB)\,\in\,[0,1]$, $f(ab)\,\in\,[0,1]$, we find the relationships

$\begin{cases} d\,\geq\,-f(aB)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\leq\,f(ab)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\geq\,-\,f(aB)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\d\,\leq\,f(ab)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\f(aB)\,\in\,[0,1],\,f(ab)\,\in\,[0,1]\end{cases}$

Taken togheter, these inequations say that d can assume only the values included within the elegant surface in Figure 1. In particular, when $f(Ab)\,=\,f(aB)\,=\,0$ (a condition of perfect disequilibrium which goes under the name of coupling), then d assumes the values of the curve coloured blue, while when $f(AB)\,=\,f(ab)\,=\,0$ (repulsion) then d assumes the value on the curve coloured orange. These two courves are plotted in Figure 2 too. Note that $f(Ab)\,=\,0\,\Rightarrow\,f(aB)\,=\,0$; moreover, $f(AB)\,=\,0\,\Rightarrow\,f(ab)\,=\,0$ (this can be proved easily by using Eq. 2).

Note that the application of the method of the Lagrange multipliers for restrained optimization to Eq. 4, with the restrains defined by the fourth relationship of Eq. 2, would not lead to any conclusion in this case, because it is equal to the search for the extremes of a function whose gradient is never zero.

7) Supplementary Material

The script that plots Figures 1 and 2 can be downloaded here.

The script that plots Figures 3, 4, and 5 can be downloaded here.

The equations of this blog post were written using $\LaTeX$ (see this article)

# A second glance at the UK Biobank Database

Featured

Note. This blog post should be read on a laptop. With smaller screens, some of the tables cannot be correctly formatted.

Introduction: what has been found

The UK Biobank is a large-scale biomedical database that includes genetic data on 500,000 UK citizens (Sudlow, C et al. 2015). Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) is among the phenotypes reported by participants. The total number of ME/CFS patients included is 1208 females and 451 males, with control groups of more than 150 k individuals each. For each individual 805,426 variants (both SNPs and short InDels) were genotyped. From them, up to 30 million variant can be imputed using haplotype maps. This is the biggest GWAS study on ME/CFS patients ever made, so far. Many research groups have analyzed this huge quantity of data and the conclusion, according to (Dibble J. et al. 2020), is that the only variants to be significantly different between patients and controls are included in gene SLC25A15, which encodes mitochondrial ornithine transporter I, particularly variant rs7337312 (position13:41353297:G:A on reference genome GRCh37, forward strand). But this difference is statistically significant only when we consider female patients vs female controls (Table 1). For a detailed discussion of this result, the best reference is the already mentioned (Dibble J. et al. 2020). I wrote an analysis myself on this same result and on its possible implications, available here. My focus, while writing that document, was also understanding the architecture of the metadata generated by using the UK Biobank, by Neale’s Lab (R), one of the research groups that performed the statistical analyses. It was not obvious, at the beginning, how I could derive for each variant genotyped a table like Table 1. In fact, it is the result of the elaboration of five files, four of them of considerable size. Once I realized how to read and manipulate these files, I manually derived the data I was interested in, at first. Then I decide to write a code that could do the work in an automated fashion (see below).

What has not been found

Now, even if we are certainly disappointed by the fact that the positive result of these analyses does not add much to what we already knew about ME/CFS, it is important to stress that negative results are also relevant. Why? Because we can rule out the huge amount of genetic associations made through the years between fatigue in general, and CFS in particular, and genetic variants (see for instance the encyclopedic review Whang T. et al. 2017). As an example, I will focus on the variants of TNF alpha gene collected in Table 2. In 2006, an Italian group found that the minor allele of variant rs1799724 (TNF-857) was significantly more frequent in 54 ME/CFS patients than in 224 controls ( Carlo-Stella N. et al. 2006). In 2020, a group from Germany reported a statistically significant difference for the same variant, but only when ME/CFS patients without infectious triggered onset were considered (Steiner S. et al. 2020). Two other variants of gene TNF alpha were associated with morning fatigue among oncology outpatients with breast, prostate, lung, or brain cancer ( Dhruva A. et al. 2015). Are these variants different between the 1659 ME/CFS patients and the about 300 k controls in the UK Biobank database?

To answer this question I have selected a suitable region of chromosome 6 (ch6:31,539,768 – ch6:31,546,495) that includes not only all the variants in Table 2, but the whole TNF gene. I then analyzed this region with a code I wrote in GNU Octave, version 6.3.0. This code generates three .csv files (one for each of the following groups: females only, males only, both sexes) with information on each one of the variants genotyped within the region mentioned (see Supplementary Material). An extract from these files, with only the variants collected in Table 2, can be seen in Table 3, Table 4, and Table 5. None of these variants reaches the statistically significant threshold, which for this kind of studies is considered to be p<5×10^(-8), for variants with a minor allele frequency above 5% (even lower for less common variants) (Fadista J et al. 2016).

Another way to conveniently visualize the differences between patients and controls (if any) is to plot p values in function of the position of the variants they refer to. We actually plot the log10 of 1/p or, in other words, -log10(p). This kind of diagram is known as Manhattan plot and in Figure 1 you see the manhattan plot for region ch6:31,539,768 – ch6:31,546,495, in the case of females + males. As you can see, none of the variants has a p value below 0.01. In Figure 2 you find the same plot, this time for females only (left) and males only (right).

Conclusions

In conclusion, these codes of mine provide as output the frequency of both the minor and the major allele among 1659 ME/CFS patients, stratified by sex, for about 15 million variants (including SNPs and InDels). They also provide the p value (beta test) for the comparison of these patients with about 300 k controls, along with general information on the variants (such as position, minor allele frequency among 500 k UK citizens, predicted consequences, etc). These codes do so by analyzing the metadata generated by Neale’s Lab, a research group that worked on the UK Biobank dataset. In this blog post, I have shown how to use these data to test previously published genetic associations between ME/CFS and gene TNF alpha.

Supplementary material

The files that I used as input for my codes are the following ones (click on them to download):

20002_1482.gwas.imputed_v3.both_sexes_TNF

20002_1482.gwas.imputed_v3.females_TNF

20002_1482.gwas.imputed_v3.males_TNF

The output generated by TNF.m (see below) consists in the following files (click on them to download) which were used to build tables from 3 to 5:

TNF_total_both_sexes

TNF_total_females

TNF_total_males

Details on these files can be found in this PDF. The code that generates the plots is pretty simple and can be downloaded here. The code that generates the three output files from the four input ones is the one that follows.

% file name = TNF
% it generates the file TNF_tot by using the package IO
% see https://octave.sourceforge.io/io/ for a description
clear all
close all
%-------------------------------------------------------------------------------
% Data source
%-------------------------------------------------------------------------------
% The collection of the analysis generated by Neale’s Lab with the UK Biobank
% data is available here:
% At lines 9 and 10 you can download a collection of the phenotypes and of some
% of the attributes of the relative sample, for females and males respectively
% (phenotypes.female.tsv, phenotypes.male.tsv). We are interested in phenotype 20002_1482
% (Non-cancer illness code, self-reported: chronic fatigue syndrome). In line 11
% you find a file that contains the details on each variant (variants.tsv). You
% must unzip the files before opening them. I have used File viewer plus for reading
% this file (Excel doesn’t seem to open it).
%-------------------------------------------------------------------------------
% Length and position of TNF
%-------------------------------------------------------------------------------
% The position of TNF goes from ch6:31,543,342 to ch6:31,546,113 on the reference
% genome GRCh37 (alias h19), according to NCBI (https://www.ncbi.nlm.nih.gov/gene/7124).
% The two variants we are interested in has a position ch6:31,542,482 (rs1799724, TNF-857)
% and ch6:31,543,031 (rs1800629), on GRCh37. So we must include them, along with
% the whole TNF gene. Since rs1799724 falls also in gene LTA, I include this gene
% as well (LTA goes from ch6:31,539,876 to ch6:31,542,101). Then a suitable interval
% in this dataset is ch6:31,539,768 to ch6:31,546,495.
%-------------------------------------------------------------------------------
% numbers of cases and controls
%-------------------------------------------------------------------------------
n_cases_f = 1208
n_controls_f = 192945
n_cases_m = 451
n_controls_m = 166537
n_cases_b = n_cases_f + n_cases_m
n_controls_b = n_controls_f + n_controls_m
%-------------------------------------------------------------------------------
% name of the gene
%-------------------------------------------------------------------------------
gene = 'TNF';
%-------------------------------------------------------------------------------
% number of variants = upper limit for range - 1
%-------------------------------------------------------------------------------
file_name = strcat('variants_',gene,'.xlsx')
[filetype, sh_names, fformat, nmranges] = xlsfinfo (file_name);
strn = char(sh_names(1,2));
strn = substr(strn,5) % upper limit for range as a string
n = str2num(strn) % upper limit for range as a number
%-------------------------------------------------------------------------------
%-------------------------------------------------------------------------------
file_name = strcat('variants_',gene,'.xlsx')
sheet = 'Sheet1'
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant = text;
% reading and storing the chromosome
range = strcat('B2:B',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
chr = num;
% reading and storing the position on chromosome
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pos = int32(num);
% reading and storing the ref allele
range = strcat('D2:D',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ref = text;
% reading and storing the alt allele
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
alt = text;
% reading and storing the rsid
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
rsid = raw;
% reading and storing the consequence
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
consequence = text;
% reading and storing the consequence category
range = strcat('I2:I',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
consequence_category = text;
% reading and storing the Alternate allele count (calculated using hardcall genotypes)
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC = int32(num);
% reading and storing the Alternate allele frequency (calculated using hardcall genotypes)
range = strcat('M2:M',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AF = double(num);
% reading and storing the Minor allele (equal to ref allele when AF > 0.5, otherwise equal to alt allele)
range = strcat('N2:N',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_allele = text;
% reading and storing the Minor allele frequency (calculated using hardcall genotypes)
range = strcat('O2:O',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF = double(num);
% reading and storing the Hardy-Weinberg p-value
range = strcat('P2:P',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
p_hwe = double(num);
% reading and storing the Number of samples with defined genotype at this variant
range = strcat('Q2:Q',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_called = int32(num);
% reading and storing the Number of samples without a defined genotype at this variant
range = strcat('R2:R',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_not_called = int32(num);
% testing by writing some of the data
i = 2
variant (i)
chr (i)
pos (i)
ref (i)
alt (i)
rsid (i)
consequence (i)
consequence_category (i)
AC (i)
AF (i)
minor_allele (i)
minor_AF (i)
p_hwe (i)
n_called (i)
n_not_called(i)
%-------------------------------------------------------------------------------
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.females_',gene,'.xlsx')
sheet = 'Sheet1'
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_f = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_f = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_f = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_f = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_f = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_f = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_f = double(num);
% testing by writing some of the data
i = 2
variant_f(i)
minor_AF_f(i)
low_confidence_variant_f(i)
n_complete_samples_f(i)
AC_f(i)
ytx_f(i)
pval_f(i)
%-------------------------------------------------------------------------------
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.males_',gene,'.xlsx')
sheet = 'Sheet1'
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_m = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_m = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_m = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_m = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_m = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_m = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_m = double(num);
% testing by writing some of the data
i = 2
variant_m(i)
minor_AF_m(i)
low_confidence_variant_m(i)
n_complete_samples_m(i)
AC_m(i)
ytx_m(i)
pval_m(i)
%-------------------------------------------------------------------------------
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.both_sexes_',gene,'.xlsx')
sheet = 'Sheet1'
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_b = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_b = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_b = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_b = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_b = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_b = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_b = double(num);
% testing by writing some of the data
i = 2
variant_b(i)
minor_AF_b(i)
low_confidence_variant_b(i)
n_complete_samples_b(i)
AC_b(i)
ytx_b(i)
pval_b(i)
%-------------------------------------------------------------------------------
% checking for wrong alignments between these 4 files
%-------------------------------------------------------------------------------
error = 0.;
for i=1:n-1
a = variant(i);
b = variant_f(i);
c = variant_m(i);
d = variant_b(i);
% function strcmp (string_1, strin_2) returns 1 if the two strings are identical
% it resturns 0 if the two strings have at least a different character in the same position
test = strcmp(a,b);
if ( (test==0) );
error = error +1;
endif
test = strcmp(a,c);
if ( (test==0) );
error = error +1;
endif
test = strcmp(a,d);
if ( (test==0) );
error = error +1;
endif
endfor
error
%-------------------------------------------------------------------------------
% defining major and minor alleles among all the variants
%-------------------------------------------------------------------------------
for i=1:n-1
frequency = AF(i); %frequency of the alternative allele
if ( frequency <= 0.5 )
minor_allele2(i)=alt(i);
major_allele(i)=ref(i);
else
minor_allele2(i)=ref(i);
major_allele(i)=alt(i);
endif
endfor
%-------------------------------------------------------------------------------
% defining alternate allele frequencies in case groups
%-------------------------------------------------------------------------------
for i=1:n-1
alt_AF_f(i) = ytx_f(i)/(2n_cases_f); alt_AF_m(i) = ytx_m(i)/(2n_cases_m);
alt_AF_b(i) = ytx_b(i)/(2n_cases_b); endfor
%-------------------------------------------------------------------------------
% defining alternate allele frequencies in control groups %-------------------------------------------------------------------------------
for i=1:n-1 alt_AF_f_controls(i) = ( AC_f(i)- ytx_f(i) )/(2n_controls_f);
alt_AF_m_controls(i) = ( AC_m(i)- ytx_m(i) )/(2n_controls_m); alt_AF_b_controls(i) = ( AC_b(i)- ytx_b(i) )/(2n_controls_b);
endfor
%-------------------------------------------------------------------------------
% defining reference allele frequencies in case groups
%-------------------------------------------------------------------------------
for i=1:n-1
ref_AF_f(i) = 1-alt_AF_f(i);
ref_AF_m(i) = 1-alt_AF_m(i);
ref_AF_b(i) = 1-alt_AF_b(i);
endfor
%-------------------------------------------------------------------------------
% defining reference allele frequencies in control groups
%-------------------------------------------------------------------------------
for i=1:n-1
ref_AF_f_controls(i) = 1-alt_AF_f_controls(i);
ref_AF_m_controls(i) = 1-alt_AF_m_controls(i);
ref_AF_b_controls(i) = 1-alt_AF_b_controls(i);
endfor
%-------------------------------------------------------------------------------
% checking for errors by comparing minor_allele with minor_allele2
%-------------------------------------------------------------------------------
error2 = 0.;
for i=1:n-1
a = minor_allele(i);
b = minor_allele2(i);
% function strcmp (string_1, strin_2) returns 1 if the two strings are identical
% it resturns 0 if the two strings have at least a different character in the same position
test = strcmp(a,b);
if ( (test==0) );
error2 = error2 +1;
endif
endfor
error2
%-------------------------------------------------------------------------------
% writing of GENE_total_females.csv
%-------------------------------------------------------------------------------
columns = cell(n,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_f{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_f(i);
columns{i+1,15} = alt_AF_f(i);
columns{i+1,16} = alt_AF_f_controls(i);
columns{i+1,17} = ref_AF_f(i);
columns{i+1,18} = ref_AF_f_controls(i);
columns{i+1,19} = n_cases_f;
columns{i+1,20} = n_controls_f;
endfor
file_name = strcat(gene,'_total_females.csv')
cell2csv (file_name, columns," ")
%-------------------------------------------------------------------------------
% writing of GENE_total_males.csv
%-------------------------------------------------------------------------------
columns = cell(n,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_m{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_m(i);
columns{i+1,15} = alt_AF_m(i);
columns{i+1,16} = alt_AF_m_controls(i);
columns{i+1,17} = ref_AF_m(i);
columns{i+1,18} = ref_AF_m_controls(i);
columns{i+1,19} = n_cases_m;
columns{i+1,20} = n_controls_m;
endfor
file_name = strcat(gene,'_total_males.csv')
cell2csv (file_name, columns," ")
%-------------------------------------------------------------------------------
% writing of CLYBL_total_both_sexes.csv
%-------------------------------------------------------------------------------
columns = cell(1523,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_b{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_b(i);
columns{i+1,15} = alt_AF_b(i);
columns{i+1,16} = alt_AF_b_controls(i);
columns{i+1,17} = ref_AF_b(i);
columns{i+1,18} = ref_AF_b_controls(i);
columns{i+1,19} = n_cases_b;
columns{i+1,20} = n_controls_b;
endfor
file_name = strcat(gene,'_total_both_sexes.csv')
cell2csv (file_name, columns," ")

# Is Carboxypeptidase N deficiency a contributing factor in ME/CFS and POTS?

Abstract

We present an attempt at exome analysis in two ME/CFS patients. Pt. 1 presents a mild form of carboxypeptidase N (CPN1) deficiency (a missense in exon 3) while Pt. 2 revealed two rare intronic variants in the same gene. CPN1 is an enzyme that inactivates kinins and complement proteins split products (such as C4a, a known anaphylatoxin). Therefore, CPN1 deficiency could explain C4a increase after exercise and mast cell abnormalities previously reported in ME/CFS. It could also explain the high prevalence of POTS in ME/CFS since kinins are vasodilators.

Introduction

Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a debilitating disease characterized by cognitive deficits, fatigue, orthostatic intolerance with symptoms exacerbated after exertion (IOM, 2015). This disease has no known cause but several abnormalities have been observed in energy metabolism (Tomas C. and Newton J. 2018), immune system and gut flora (Blomberg J. et al. 2018), brain (Zeineh MM. et al. 2014). In this population of patients, several abnormalities have been found to be triggered by exercise, such as abnormal aerobic performance (Snell C. et al. 2013), enhanced gene expression of specific receptors (White AT. et al. 2012), abnormal gut flora translocation (Shukla SK et al. 2015) and failure in blood clearance of complement protein 4 split product A (C4a) (Sorensen B et al. 2003). An increase in C4a is part of the human physiologic response to physical exercise, but these levels return to baseline within 30 minutes to 2 hours (Dufaux B et al. 1991) while in ME/CFS there is a peak in serum C4a six hours after exertion. A possible explanation for slow C4a inactivation could be a problem in carboxypeptidase N (CPN1), an enzyme involved in the inactivation of C3a, C4a, C5a. CPN1 is required for kinins inactivation too, such as bradykinin, kalladin (Hugli T. 1978), (Plummer TH et Hurwitz MY 1978), that are vasodilators. We report on the case of a ME/CFS patient (Pt. 1) with a missense variant in CPN1 gene that is linked to reduced function of the enzyme and of another ME/CFS patient (Pt. 2) with rare variants in introns 1 and 6 of the same gene with uncertain significance (table 1, figure 1).

Materials and Methods

Whole exome sequencing (WES) has been performed on cells from the saliva of two ME/CFS patients, with an average 100X coverage (Dante Labs). The first search for pathogenic variants and insertions/deletions was performed with the software EVE, provided by Sequencing.com. A further refinement of the search was conducted by manual insertion of these SNPs in VarSome. The search for possible unknown pathogenic variants within the gene for CPN1 has been performed using Integrative Genome Viewer (IGV), an opensource tool for genetic data analysis.

Results

Results from the analysis of the two exomes performed with EVE and refined with VarSome are collected in table 2 (Pt. 1) and table 3 (Pt. 2).

Pt. 2 is carrier of a mitochondrial disease (table 3, first line): a missense in gene for medium-chain acyl-CoA dehydrogenase (MCAD) which leads to mild functional impairment of the enzyme involved in the oxidation of fatty acids (44% residual activity) (Koster KL. et al. 2014).

Pt. 2 is also homozygous for a variation in gene arylsulfatase A (ARSA) that is linked to a residual activity of only 10% of normal (Gomez-Ospina N. 2010). Arylsulfatase A deficiency (also known as metachromatic leukodystrophy or MLD) is a disorder of impaired breakdown of sulfatides (cerebroside sulfate or 3-0-sulfo-galactosylceramide), sulfate-containing lipids that occur throughout the body and are found in greatest abundance in nervous tissue, kidneys, and testes. Sulfatides are critical constituents in the nervous system, where they comprise approximately 5% of the myelin lipids. Sulfatide accumulation in the nervous system eventually leads to myelin breakdown (leukodystrophy) and a progressive neurologic disorder (Von Figura et al 2001). Nevertheless, this genotype does not cause MLD, and this benign condition of reduced ARSA activity is called ARSA pseudodeficiency. There are about 4 homozygotes in 1000 persons among non-Finnish Europeans (VarSome)

Pt. 1 is a carrier of a missense in gene CPN1 (table 2, first line) which leads to a loss of more than 60% of activity, according to a study on a single patient (Mathews KP. et al. 1980), (Cao H. et Hegele RA. 2003). The study of gene CPN1 in both patients (using IGV) has led to the identification of two rare variants (frequency less than 0.002) in intron 1 and 6 of one allele from Pt. 2 (table 1, figure 1). In MCAD no other damaging variations have been identified in these two patients by direct inspection with IGV (data not shown).

Discussion

Whole exome sequencing (WES) is a technique that aims at the sequencing of the fraction of our genome that encodes for proteins: about 30 million base pairs (1% of the all the human DNA) divided into about 20 thousand genes (Ng SB et al. 2009). It has become increasingly clear that the use of WES can positively improve the rate of diagnosis and decrease the time needed for a definitive diagnosis in patients with rare genetic diseases (Sawyer SL et al. 2016). WES also positively impacts the ability to discover new pathogenic variants in known disease genes (Polychronakos C. et Seng KC. 2011) and the discovery of completely new disease genes (Boycott KM 2013). ME/CFS seems to have a genetic component: a US study found clear evidence of familial clustering and elevated risk for the disease among relatives of ME/CFS cases (Albright F et al. 2011) and several SNPs in various genes have been reported as more prevalent in ME/CFS patients versus healthy controls (Wang T et al. 2017). And yet, no studies that analyzed whole exomes of ME/CFS patients have been published, to my knowledge.

In this study, we searched for known genetic diseases in the exomes of two ME/CFS patients who fit the IOM criteria for SEID (IOM, 2015), with postural orthostatic tachycardia syndrome (POTS) identified by positive tilt table test. We detected a missense variant in CPN1 (rs61751507) in Pt. 1 (heterozygosis) that has been associated to a loss of activity of the enzyme of at least 60% in a previous study (Mathews KP. et al. 1980), (Cao H. et Hegele RA. 2003). We then found that, although Pt. 2 was not a carrier of this SNP, she had two rare SNPs in intron 1 (rs188667294) and 6 (rs113386068) of gene CPN1 (both present in less than 1/500 alleles, table 1, figure 1). These intronic variations have not been studied, to our knowledge, so their pathogenicity can’t be excluded at present. Variations in introns can be damaging just as missense and nonsense mutations in exons; suffice to say that the main known pathogenic SNP of gene CPN1 is a substitution in intron 1 (Cao H. et Hegele RA. 2003).

Carboxypeptidase N (CPN1) is an enzyme involved in the inactivation of C3a, C4a, C5a, and of kinins (bradykinin, kalladin) (Hugli T. 1978), (Plummer TH et Hurwitz MY 1978). In ME/CFS the physiologic increase in blood of C4a (the split product of the complement protein C4) after exercise is significantly more pronounced than in healthy controls as if there was a defect in C4a inactivation (Sorensen B et al. 2003). Such a defect could very well be a loss of function in CPN1, as found in Pt 1. Moreover, CPN1 is involved in inactivation of bradykinin, which is known to induce vasodilatation (Siltari A. et al. 2016), therefore CPN1 deficiency could play a role in POTS and in orthostatic intolerance in general. Both patients have a tilt table test positive for POTS. C4a has been recently considered to play a causal role in the cognitive deficit of schizophrenia, because of its role in synapsis pruning (Sekar, A et al, 2016); therefore a failure in its inactivation could be implicated in the incapacitating cognitive defects lamented by ME/CFS patients.

Only two patients with CPN1 deficiency have been reported so far in medical literature (Mathews KP. et al. 1980), (Willemse Jl et al. 2008), and the enzymatic defect has been associated to angioedema that most often involved the face and tongue, urticaria, and hay fever and asthma precipitated by exercise. This clinical presentation could be due, at least in part, to mast cell activation: in fact, C4a is a known anaphylatoxin that induces mast cells degranulation and release of histamine (Erdei A. et al. 2004). That said, we can observe that even if the clinical presentation of the only two known cases of CPN1 deficiency doesn’t fit the clinical picture of ME/CFS, mast cell activation syndrome (MCAS) has some commonalities with ME/CFS (Theoharides, TC et al. 2005), and mast cell abnormalities have been reported among ME/CFS patients (Nguyen T. et al. 2016). So we can’t exclude that activation of mast cells by a failure in C4a inactivation may lead to ME/CFS symptoms. The role of exercise as a trigger for symptoms in CPN1 deficiency is also highly suggestive because this is a pathognomonic feature of ME/CFS.

Conclusion

CPN1 deficiency is present (even if in a mild form) in Pt. 1, while Pt. 2 presents two rare intronic variants whose pathogenic role can’t be excluded. CPN1 deficiency could explain the abnormal increase of C4a after exercise and might be a contributing factor to post-exertional malaise and cognitive symptoms in ME/CFS. A search for pathogenetic SNPs in gene CPN1 among ME/CFS patients would clarify the role (if any) of this gene.

Acknowledgments. I would like to thank Chiara Scarpellini for her careful collection of annotations for each of the 2 hundred or so variants found by EVE within the exomes of Pt. 1 and Pt. 2 (table 2 and table 3).