The man who does not make his own tools does not make his own sculpture.
Irving Stone, The Agony and The Ecstasy
An analysis of the UK Biobank database carried out by Neale’s Lab pointed out a significant correlation between a region of chromosome 13 and females with self-reported Chronic Fatigue Syndrome (see this blog post). This result was further analyzed in a subsequent paper in which it was suggested that the causal variant within this region may be rs7337312 (position 13:41353297:G:A on reference genome GRCh37, forward strand) (Dibble J. et al. 2020). It is included in a regulatory region of gene SLC25A15, which encodes mitochondrial ornithine transporter I, and the potential biological effects of this variant have been discussed in the aforementioned study. I made an attempt at my own discussion of this variant (here).
It is not unusual for a GWAS study to associate a trait with a long sequence of variants belonging to the same chromosome and in a close physical relationship one with the other, due to high linkage disequilibrium often present between variants close to one another (see this blog post). Several methods have been developed in the last years to further refine the association and find the variant that causes the disease and drives the association of its nearby SNPs. These methods are based on both purely mathematical considerations and on the analysis of the known consequences of the variants, and we refer to them by the evocative name of fine-mapping (Schaid DJ et al. 2018).
I tried to develop my own method and software for a purely mathematical approach to fine-mapping and I applied it to the region associated with ME/CFS in females, and I found variant rs11147812 (13:41404706:T:C on h19) as the most likely causal one. The detailed discussion of the method I followed and the scripts I wrote are available here for download. It might be noted that this variant belongs to a genetic entity (pseudogene TPTE2P5) that harbors variants that have been strongly associated with the following hematic parameters (see GWAS catalog).
The alternative allele of rs11147812 has been reported to increase (effect size > 0) the expression of gene SLC25A15 in fat (see this page) with a p value of . Since the alternative variant is less frequent in patients than in controls, the prediction is that we have a reduced expression of SLC25A15 in fat, in female CFS patients.
Table. Significant associations between traits and variants of TPTE2P5.
Note. The header image shows the convergence of the linear method used by my script for fine-mapping the region on chromosome 13. The detailed discussion of the method I followed and the scripts I wrote are available here for download.
We have here a paradox: what theory is it that is not correct unless it leaves open the possibility that it may be incorrect? Answer: the theory of probability.
P.W. Bridgman, The Nature of the Physical Theory
While reasoning on one of the biological problems I have been spending most of my time on, I encounterd a random variable that was the ratio between two random variables with a Beta distribution. Let’s say that and . Then, if we define the random variable , what is its density? What about its cumulative distribution function? And what can we say about its moments? I answered all these questions in this paper. The density, in particualr, is given by:
On the exact same subject there is a 1999 paper I am currently comparing to mine (Pham-Gia 1999). While writing my own paper I did not check for other works on the same topic, but if I had had to bet I would have said that this calculation had been performed a century ago; so I was astonished when I discovered that this density was first published only 20 years ago! This is further proof of how recent the application of calculus to statistics is.
PS. I have just checked the density by Thu Pham-Gia: it is equivalent to the one I found, even though they appear different at a first glance. I have added a paragraph to my paper to show the equivalence between these two densities. It is also worth mentioning that in addition, I calculated the cumulative distribution function, the expectation, the second-order moment, and the variance, while Pham-Gia presented the density along with related subjects and applications, in his 1999 paper.
“This it seems to me is characteristic of most judgments involving physical processes — the law of the excluded middle is not a valid description of our actual physical experience — there has to be a third category of doubtful, in addition to positive or negative.”
P. W. Bridgman, The Nature of Physical Theory
1) The multiple testing issue
With the emergence of titanic biological datasets generated by the latest omics technologies, the problem of correction of the significance threshold (the conventional ) for multiple testing has required a careful revision and update. The simplest way to correct for multiple testing is the Bonferroni correction (a derivation is provided in par. 5): in this method, the threshold for statistical significance is given by
where the denominator represents the number of independent variables. Now, in the case of genetic testing, how should we set the value of ? Let’s say that we are comparing two populations of individuals (one with a disease, the other made up by healthy subjects) and we have genotyped N variants in each one of them. Once we have associated a p value to every single variant, we must face the problem of correcting it for multiple testing by using Eq. 1. What should we do? Well, it is simple, we put , right? Wrong!
2) Linkage disequilibrium
In general, variants belonging to the same chromosome are not inherited in an independent fashion: we know that whole chunks of chromosomes of about 50 kb pass from one generation to the other. These blocks are called haplotypes and there has been a considerable effort in classifying them for various human populations in the last 20 years. A map of human haplotypes is now available (HapMap) and it allows for the calculation of the real number of independent variants (often called tag SNPs) on the human Genome, and for other pivotal purposes I won’t write here about, like for instance imputation (a technique that dramatically reduces the number of variants we must genotype in order to scan the genome of a human being) (The International HapMap Consortium 2005), (The International HapMap Consortium, 2009). Visit the page of the International HapMap Consortium here. For us, what is important here is the fact that the knowledge of the haplotypes gives us a measure of the linkage disequilibrium between any possible pair of variants on the same chromosome. But what is linkage disequilibrium? Let’s say that we have two loci and on the same chromosome. has the two alleles and , while presents the two alleles and of frequencies , respectively. Let’s define now the random variable whose value is one if is occupied by , zero otherwise; similarly, let’s define the random variable such that it is one if is occupied by , zero otherwise. We can say that both these variables have a Bernoulli distribution with expectations given by and , respectively. If we now define the frequency of the haplotype , and if we similarly define , we can then write the following relationships:
where we have considered that , . That said, the covariance between is given by:
Or also (with some algebraic manipulation, by substituting the relationship of Eq. 2):
That said, one definition of Linkage disequilibrium is just
Figure 1. Left: the values admitted for d are those that fall within the shell; in particular, for f(aB) = f(Ab) = 0 (coupling) we have that d assumes the values of the curve coloured blue, while for f(AB) = f(ab) = 0 (repulsion) then d describes the curve in orange. These two curves are plotted in Figure 2, too.
This definition presents a problem, namely the fact that when we have a complete dependence between and , its value varies in (see Figure 1, Figure 2, and paragraph 6 for a detailed discussion) and this does not seem to have a particular biological meaning. It would be more interesting to have a definition such that Linkage disequilibrium assumes a unique value (for example 1) when we have complete dependence between the two loci. Note that we can accomplish this by considering the correlation coefficient between instead, which is given by:
Thus, in the present article we consider the following definition for linkage disequilibrium:
where we have considered that and . For further details on this topic, go to paragraph 6. See also (Reginald D Smith 2020).
Figure 2. For f(aB) = f(Ab) = 0 (coupling) we have that d assumes the values of the curve coloured blue, while for f(AB) = f(ab) = 0 (repulsion) then d describes the curve in orange.
3)The search for the number of independent loci
Consider now that the number of loci in linkage equilibrium one with the other is the value m in Eq. 1, the one we are interested in. From an operational point of view, how can we determine if two loci are in linkage equilibrium so that they can be considered independent one from the other? We can randomly build the genome of 2,000 human beings with the only restrain that haplotypes must be respected (this is where the HapMap comes into play), we then randomly assign them to two populations and we calculate for each locus the p value for the comparison of these two populations. Let’s say that locus has a p value . Since these two groups are identical by construction, it must be or, in other words:
If we repeat this inequality for each locus, we have a lower bound for m. That number is the number of independent loci we were searching for (Pe’er, I. et al. 2008).
4)Experimental results
I discuss here the calculation of m performed in a recent study, the best I have been able to find on this topic (Fadista, J. et al. 2016). The strength of this study is that they calculated m for two kinds of sequencing techniques: whole-genome sequencing (WGS) and whole-exome sequencing (WES). Not only that, but they also provided the results for different European and non-European populations, for different definitions of linkage disequilibrium, and for different thresholds for minor allele frequency (MAF). These results are plotted in their paper and they are also available in xlsx format (here, even though I must say that they omitted part of raw data on different ethnicities). Nevertheless, I have elaborated them in my own way (the script that draws the diagrams is available below) by using a different arrangement of the data and by providing a fancy polynomial interpolation, by cubic splines, as to have smooth and elegant surfaces and curves. In Figure 3 I plotted m in function of the threshold for the minor allele frequency (MAF) and for the linkage disequilibrium, in the case of whole-genome sequencing (WGS). In that plot, it is assumed that MAF is the lower bound for the minor allele frequency while the axis ‘Linkage disequilibrium’ is the upper limit for the value in Eq. 7. Figure 4 is exactly the same diagram, but in this case, it has been built by considering data from whole-exome sequencing (WES). Some observations: the number of independent variants is bigger in WGS than it is in WES (which is due to the fact that in the former case we are considering more genetic material than in the latter one); when we reduce MAF, m increases (reducing MAF means considering more variants, so again, this leads to a high number of independent variants); m decreases with Linkage disequilibrium (the threshold for considering two variants as independent one from the other). This last point deserves a further explanation: in this context, when we say that LD is, let’s say, 0.8 it means that we consider two variants to be independent when their linkage disequilibrium (calculated as in Eq. 7) is below 0.8; as we reduce this value, we define a more stringent criterion for independence between pair of variants, thus the number m must decrease.
Figure 3. The number of independent variants in whole-genome sequencing, in function of the lower limit for minor allele frequency and of the upper limit for Linkage disequilibrium.
Figure 4. The number of independent variants in whole-exome sequencing, in function of the lower limit for minor allele frequency and of the upper limit for Linkage disequilibrium.
MAF
m
p
782,361
7.19
1,580,650
7.50
2,093,887
7.62
4,126,125
7.92
Table 1. This table gives the values m, p in function of the lower limit for minor allele frequency, in the case of whole-genome sequencing. It is assumed that .
If we consider the upper limit for to be 0.8, we obtain Table 1 and Table 2, where both m and corresponding Bonferroni corrections are collected, for various MAF thresholds.
MAF
m
p
39,383
5.90
69,265
6.14
86,927
6.24
161,833
6.51
Table 2. This table gives the values m, p in function of the lower limit for minor allele frequency, in the case of whole-exome sequencing. It is assumed that .
In Figures 5, 6 you can find these same data after interpolation, arranged as a curve where p is expressed in function of m and MAF.
Figure 5. The corrected threshold for statistical significance in function of the number of independent variants and of the lower limit for minor allele frequency, for WGS.
Figure 6. The corrected threshold for statistical significance in function of the number of independent variants and of the lower limit for minor allele frequency, for WES.
5)How to derive the Bonferroni correction
If we have a threshold for statistical significance and we perform m tests, then the probability that one test is positive by chance is and the probability that one test is negative by chance is . This also means that the probability that all the m tests are negative is . In conclusion, the probability that at least one test is positive pureley by chance is
We want this probability to be lower than , which leads to the following inequality:
This value is the corrected threshold for statistical significance, accorrding to the Bonferroni correction. But this is not the correction in Eq. 1, you might note. Yes, that is right. But let’s consider the well known expansion by series:
If we take only the first two addends of the sum and if we substitute them in the expression already found for , we obtain Eq. 1. The error (its absolute value) that we make when we consider Eq. 1 instead of Eq. 8 is the one plotted in Figure 7, in function of m, for .
Figure 7. The absolute value of in function of m for
6)Mathematical notes on linkage disequilibrium
By substituting the fourth relationship of Eq. 2 in Eq. 4, we obtain the surface
which is a quadratic form in the space of the frequencies of the haplotypes ab, Ab, aB. More precisely, and even though this is of no importance for us, this is a hyperbolic cylinder (as the reader can easily verify by studying its associated matrix). If we now ask that and for , , we find the relationships
Taken togheter, these inequations say that d can assume only the values included within the elegant surface in Figure 1. In particular, when (a condition of perfect disequilibrium which goes under the name of coupling), then d assumes the values of the curve coloured blue, while when (repulsion) then d assumes the value on the curve coloured orange. These two courves are plotted in Figure 2 too. Note that ; moreover, (this can be proved easily by using Eq. 2).
Note that the application of the method of the Lagrange multipliers for restrained optimization to Eq. 4, with the restrains defined by the fourth relationship of Eq. 2, would not lead to any conclusion in this case, because it is equal to the search for the extremes of a function whose gradient is never zero.
7) Supplementary Material
The script that plots Figures 1 and 2 can be downloaded here.
The script that plots Figures 3, 4, and 5 can be downloaded here.
The script that plots Figure 7 can be downloaded here.
Philosophy is written in a great book that is continually open before our eyes (that is the Universe), but it cannot be understood unless one first learns to decipher the language and the characters in which it is written. It is written in a mathematical language, and the characters are triangles, circles, and other geometric figures, without which it is impossible for us to understand a word of it; without them it is a vain wandering through a dark labyrinth.
Galileo Galilei, The Assayer
The geometry of the backbone (or main chain) of a peptide is completely described by a sequence of dihedral angles (also known as torsion angles): the angle Φ is the angle around the chemical bond between the alpha carbon of one amino acidand the azote of the same amino acid; the angle Ψ is the angle around the axis of the bond between the alpha carbon and the other carbon (I call it beta carbon, but this is a personal notation) of the same amino acid. This definition is incomplete, of course, because as we all know, a dihedral angle is defined not only by an axis but also by two planes that intersect in that axis. Not only that, these two angles have a sign, so we must specify a positive arrow of rotation. This can be done in several ways. The problem is that the most efficient way would be by just drawing the planes the torsion angles are defined by. Now, through the years I discovered that this drawing is usually lacking in books, leading to some confusion, particularly in those who study biochemistry without a particular interest in a quantitative description of molecular structures. In Figure 1 you find the graphical description of Φ and Ψ in a well-know book of biochemistry. Do you see the planes the angles are defined by? In Figure 2, two further examples of illustrations that are not suited at completely describing the two dihedral angles, from a two other well-know books, one about bioinformatics, the other one a booklet about protein structures. The second one is the best one, but you still have to mentally merge two drawings (FIG. 5 and FIG. 6) to get the full picture. I hope I won’t be sued…
Figure 1. Dihedral angles, from Principles of Biochemistry, by Voet D, Voet J, and Charlotte WP.
Figure 2. Two illustrations for the description of the torsion angles. The one on the left is from Bioinformatics, Sequence and Genome Analysis, by David W Mount. On the right: The Anatomy and Taxonomy of Protein Structures, by Jane S Richardson.
Now, it is possible that I have just been unlucky with the books of my personal library. Or it may be that the illustration with the three planes we need to correctly define Φ and Ψ (we just need to add a plane to the amide planes), would not be clearly readable. I tried years ago to draw the planes for Ψ (the illustration is in this blog post) and I have now completed it with the other torsion angle, by drawing a tripeptide (Figure 3). It is just a handmade drawing, but I think it serves the scope. This is what I call the lost illustration.
Figure 3. A peptide of three amino acids. The amide planes are shaded grey. A further plane has been added to fully characterize the torsion angles Φ and Ψ. Pencil and pen on paper, by Paolo Maccallini. For the signs of the two angles, a clockwise rotation is considered positive. So, in this case, we have Φ of about -90° and Ψ of about 150°.
Note. This blog post should be read on a laptop. With smaller screens, some of the tables cannot be correctly formatted.
Introduction: what has been found
The UK Biobank is a large-scale biomedical database that includes genetic data on 500,000 UK citizens (Sudlow, C et al. 2015). Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) is among the phenotypes reported by participants. The total number of ME/CFS patients included is 1208 females and 451 males, with control groups of more than 150 k individuals each. For each individual 805,426 variants (both SNPs and short InDels) were genotyped. From them, up to 30 million variant can be imputed using haplotype maps. This is the biggest GWAS study on ME/CFS patients ever made, so far. Many research groups have analyzed this huge quantity of data and the conclusion, according to (Dibble J. et al. 2020), is that the only variants to be significantly different between patients and controls are included in gene SLC25A15, which encodes mitochondrial ornithine transporter I, particularly variant rs7337312 (position13:41353297:G:A on reference genome GRCh37, forward strand). But this difference is statistically significant only when we consider female patients vs female controls (Table 1). For a detailed discussion of this result, the best reference is the already mentioned (Dibble J. et al. 2020). I wrote an analysis myself on this same result and on its possible implications, available here. My focus, while writing that document, was also understanding the architecture of the metadata generated by using the UK Biobank, by Neale’s Lab (R), one of the research groups that performed the statistical analyses. It was not obvious, at the beginning, how I could derive for each variant genotyped a table like Table 1. In fact, it is the result of the elaboration of five files, four of them of considerable size. Once I realized how to read and manipulate these files, I manually derived the data I was interested in, at first. Then I decide to write a code that could do the work in an automated fashion (see below).
Table 1. Frequencies and allele counts for variant rs7337312 (gene SLC25A15 ) among 1208 CFS female patients, 451 CFS male patients, 192945 female controls, 166537 male controls, total UK Biobank sample, and general population (global genome AD). P values for patients vs controls have been calculated for both sexes. I derived this table from metadata by Neale’s Lab (R).
What has not been found
Now, even if we are certainly disappointed by the fact that the positive result of these analyses does not add much to what we already knew about ME/CFS, it is important to stress that negative results are also relevant. Why? Because we can rule out the huge amount of genetic associations made through the years between fatigue in general, and CFS in particular, and genetic variants (see for instance the encyclopedic review Whang T. et al. 2017). As an example, I will focus on the variants of TNF alpha gene collected in Table 2. In 2006, an Italian group found that the minor allele of variant rs1799724 (TNF-857) was significantly more frequent in 54 ME/CFS patients than in 224 controls ( Carlo-Stella N. et al. 2006). In 2020, a group from Germany reported a statistically significant difference for the same variant, but only when ME/CFS patients without infectious triggered onset were considered (Steiner S. et al. 2020). Two other variants of gene TNF alpha were associated with morning fatigue among oncology outpatients with breast, prostate, lung, or brain cancer ( Dhruva A. et al. 2015). Are these variants different between the 1659 ME/CFS patients and the about 300 k controls in the UK Biobank database?
Table 2. Variants on gene TNF associated with ME/CFS, with ME/CFS without Infectious Triggered Onset (ITO), and with morning fatigue in oncology outpatients with breast, prostate, lung, or brain cancer.
To answer this question I have selected a suitable region of chromosome 6 (ch6:31,539,768 – ch6:31,546,495) that includes not only all the variants in Table 2, but the whole TNF gene. I then analyzed this region with a code I wrote in GNU Octave, version 6.3.0. This code generates three .csv files (one for each of the following groups: females only, males only, both sexes) with information on each one of the variants genotyped within the region mentioned (see Supplementary Material). An extract from these files, with only the variants collected in Table 2, can be seen in Table 3, Table 4, and Table 5. None of these variants reaches the statistically significant threshold, which for this kind of studies is considered to be p<5×10^(-8), for variants with a minor allele frequency above 5% (even lower for less common variants) (Fadista J et al. 2016).
rsid
min
ref
alt
min_AF
p val
alt cases
alt contr
1799724
T
C
T
0.072
0.150
0.077
0.072
1800629
A
G
A
0.197
0.706
0.195
0.197
3093662
G
A
G
0.076
0.590
0.073
0.076
Table 3. Frequencies of the same variants of Table 2 from Neale’s Lab Metadata. This is the comparison between 1656 ME/CFS patients (females + males) and 359482 controls (females + males). min = minor allele; ref = reference allele; alt = alternate allele; min_AF = minor allele frequency. alt cases = frequency of alternative allele in cases; alt contr = frequency of alternate allele in controls. Elaboration made by the code TNF.m I wrote in Octave (see supplementary material).
rsid
min
ref
alt
min_AF
p val
alt cases
alt contr
1799724
T
C
T
0.072
0.167
0.078
0.071
1800629
A
G
A
0.197
0.857
0.198
0.196
3093662
G
A
G
0.076
0.603
0.073
0.076
Table 4. As in Table 3, but in this case the comparison is between 1208 female ME/CFS patients and 192945 female controls. min = minor allele; ref = reference allele; alt = alternate allele; min_AF = minor allele frequency. alt cases = frequency of alternative allele in cases; alt contr = frequency of alternate allele in controls. Elaboration made by the code TNF.m I wrote in Octave (see supplementary material).
rsid
min
ref
alt
min_AF
p val
alt cases
alt contr
1799724
T
C
T
0.072
0.609
0.075
0.072
1800629
A
G
A
0.197
0.314
0.186
0.199
3093662
G
A
G
0.076
0.856
0.074
0.075
Table 5. As in Table 3 and Table 4, but in this case the comparison is between 451 male ME/CFS patients and 166537 male controls. min = minor allele; ref = reference allele; alt = alternate allele; min_AF = minor allele frequency. alt cases = frequency of alternative allele in cases; alt contr = frequency of alternate allele in controls. Elaboration made by the code TNF.m I wrote in Octave (see supplementary material).
Another way to conveniently visualize the differences between patients and controls (if any) is to plot p values in function of the position of the variants they refer to. We actually plot the log10 of 1/p or, in other words, -log10(p). This kind of diagram is known as Manhattan plot and in Figure 1 you see the manhattan plot for region ch6:31,539,768 – ch6:31,546,495, in the case of females + males. As you can see, none of the variants has a p value below 0.01. In Figure 2 you find the same plot, this time for females only (left) and males only (right).
Figure 1. Manhattan plot for ch6:31,539,768 – ch6:31,546,495, in the case of female + male CFS patients vs controls. Red dot is rs1799724, blue dot is rs1800629, yellow dot is rs3093662. None of these three variants reaches a p value below 10^-1. Plotted by manhattan_plot.m (see supplementary material).Figure 2. Manhattan plot for ch6:31,539,768 – ch6:31,546,495, in the case of female CFS patients vs female controls (left) and male CFS patients vs male controls (right). Red dot is rs1799724, blue dot is rs1800629, yellow dot is rs3093662. None of these three variants reaches a p value below 10^-1.
Conclusions
In conclusion, these codes of mine provide as output the frequency of both the minor and the major allele among 1659 ME/CFS patients, stratified by sex, for about 15 million variants (including SNPs and InDels). They also provide the p value (beta test) for the comparison of these patients with about 300 k controls, along with general information on the variants (such as position, minor allele frequency among 500 k UK citizens, predicted consequences, etc). These codes do so by analyzing the metadata generated by Neale’s Lab, a research group that worked on the UK Biobank dataset. In this blog post, I have shown how to use these data to test previously published genetic associations between ME/CFS and gene TNF alpha.
Supplementary material
The files that I used as input for my codes are the following ones (click on them to download):
Details on these files can be found in this PDF. The code that generates the plots is pretty simple and can be downloaded here. The code that generates the three output files from the four input ones is the one that follows.
% file name = TNF
% date of creation = 25/08/2021
% it generates the file TNF_tot by using the package IO
% see https://octave.sourceforge.io/io/ for a description
clear all
close all
pkg load io
%-------------------------------------------------------------------------------
% Data source
%-------------------------------------------------------------------------------
% The collection of the analysis generated by Neale’s Lab with the UK Biobank
% data is available here:
% https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU/edit?ts=5b5f17db#gid=178908679
% In line 2 of that file, you can download an explanation of the labels (README).
% At lines 9 and 10 you can download a collection of the phenotypes and of some
% of the attributes of the relative sample, for females and males respectively
% (phenotypes.female.tsv, phenotypes.male.tsv). We are interested in phenotype 20002_1482
% (Non-cancer illness code, self-reported: chronic fatigue syndrome). In line 11
% you find a file that contains the details on each variant (variants.tsv). You
% must unzip the files before opening them. I have used File viewer plus for reading
% this file (Excel doesn’t seem to open it).
%-------------------------------------------------------------------------------
% Length and position of TNF
%-------------------------------------------------------------------------------
% The position of TNF goes from ch6:31,543,342 to ch6:31,546,113 on the reference
% genome GRCh37 (alias h19), according to NCBI (https://www.ncbi.nlm.nih.gov/gene/7124).
% The two variants we are interested in has a position ch6:31,542,482 (rs1799724, TNF-857)
% and ch6:31,543,031 (rs1800629), on GRCh37. So we must include them, along with
% the whole TNF gene. Since rs1799724 falls also in gene LTA, I include this gene
% as well (LTA goes from ch6:31,539,876 to ch6:31,542,101). Then a suitable interval
% in this dataset is ch6:31,539,768 to ch6:31,546,495.
%-------------------------------------------------------------------------------
% numbers of cases and controls
%-------------------------------------------------------------------------------
n_cases_f = 1208
n_controls_f = 192945
n_cases_m = 451
n_controls_m = 166537
n_cases_b = n_cases_f + n_cases_m
n_controls_b = n_controls_f + n_controls_m
%-------------------------------------------------------------------------------
% name of the gene
%-------------------------------------------------------------------------------
gene = 'TNF';
%-------------------------------------------------------------------------------
% number of variants = upper limit for range - 1
%-------------------------------------------------------------------------------
file_name = strcat('variants_',gene,'.xlsx')
[filetype, sh_names, fformat, nmranges] = xlsfinfo (file_name);
strn = char(sh_names(1,2));
strn = substr(strn,5) % upper limit for range as a string
n = str2num(strn) % upper limit for range as a number
%-------------------------------------------------------------------------------
% reading and storing variants_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('variants_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant = text;
% reading and storing the chromosome
range = strcat('B2:B',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
chr = num;
% reading and storing the position on chromosome
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pos = int32(num);
% reading and storing the ref allele
range = strcat('D2:D',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ref = text;
% reading and storing the alt allele
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
alt = text;
% reading and storing the rsid
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
rsid = raw;
% reading and storing the consequence
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
consequence = text;
% reading and storing the consequence category
range = strcat('I2:I',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
consequence_category = text;
% reading and storing the Alternate allele count (calculated using hardcall genotypes)
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC = int32(num);
% reading and storing the Alternate allele frequency (calculated using hardcall genotypes)
range = strcat('M2:M',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AF = double(num);
% reading and storing the Minor allele (equal to ref allele when AF > 0.5, otherwise equal to alt allele)
range = strcat('N2:N',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_allele = text;
% reading and storing the Minor allele frequency (calculated using hardcall genotypes)
range = strcat('O2:O',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF = double(num);
% reading and storing the Hardy-Weinberg p-value
range = strcat('P2:P',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
p_hwe = double(num);
% reading and storing the Number of samples with defined genotype at this variant
range = strcat('Q2:Q',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_called = int32(num);
% reading and storing the Number of samples without a defined genotype at this variant
range = strcat('R2:R',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_not_called = int32(num);
% testing by writing some of the data
i = 2
variant (i)
chr (i)
pos (i)
ref (i)
alt (i)
rsid (i)
consequence (i)
consequence_category (i)
AC (i)
AF (i)
minor_allele (i)
minor_AF (i)
p_hwe (i)
n_called (i)
n_not_called(i)
%-------------------------------------------------------------------------------
% reading and storing 20002_1482.gwas.imputed_v3.females_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.females_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_f = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_f = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_f = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_f = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_f = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_f = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_f = double(num);
% testing by writing some of the data
i = 2
variant_f(i)
minor_AF_f(i)
low_confidence_variant_f(i)
n_complete_samples_f(i)
AC_f(i)
ytx_f(i)
pval_f(i)
%-------------------------------------------------------------------------------
% reading and storing 20002_1482.gwas.imputed_v3.males_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.males_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_m = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_m = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_m = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_m = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_m = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_m = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_m = double(num);
% testing by writing some of the data
i = 2
variant_m(i)
minor_AF_m(i)
low_confidence_variant_m(i)
n_complete_samples_m(i)
AC_m(i)
ytx_m(i)
pval_m(i)
%-------------------------------------------------------------------------------
% reading and storing 20002_1482.gwas.imputed_v3.both_sexes_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.both_sexes_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_b = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_b = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_b = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_b = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_b = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_b = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_b = double(num);
% testing by writing some of the data
i = 2
variant_b(i)
minor_AF_b(i)
low_confidence_variant_b(i)
n_complete_samples_b(i)
AC_b(i)
ytx_b(i)
pval_b(i)
%-------------------------------------------------------------------------------
% checking for wrong alignments between these 4 files
%-------------------------------------------------------------------------------
error = 0.;
for i=1:n-1
a = variant(i);
b = variant_f(i);
c = variant_m(i);
d = variant_b(i);
% function strcmp (string_1, strin_2) returns 1 if the two strings are identical
% it resturns 0 if the two strings have at least a different character in the same position
test = strcmp(a,b);
if ( (test==0) );
error = error +1;
endif
test = strcmp(a,c);
if ( (test==0) );
error = error +1;
endif
test = strcmp(a,d);
if ( (test==0) );
error = error +1;
endif
endfor
error
%-------------------------------------------------------------------------------
% defining major and minor alleles among all the variants
%-------------------------------------------------------------------------------
for i=1:n-1
frequency = AF(i); %frequency of the alternative allele
if ( frequency <= 0.5 )
minor_allele2(i)=alt(i);
major_allele(i)=ref(i);
else
minor_allele2(i)=ref(i);
major_allele(i)=alt(i);
endif
endfor
%-------------------------------------------------------------------------------
% defining alternate allele frequencies in case groups
%-------------------------------------------------------------------------------
for i=1:n-1
alt_AF_f(i) = ytx_f(i)/(2n_cases_f); alt_AF_m(i) = ytx_m(i)/(2n_cases_m);
alt_AF_b(i) = ytx_b(i)/(2n_cases_b); endfor%------------------------------------------------------------------------------- % defining alternate allele frequencies in control groups %------------------------------------------------------------------------------- for i=1:n-1 alt_AF_f_controls(i) = ( AC_f(i)- ytx_f(i) )/(2n_controls_f);
alt_AF_m_controls(i) = ( AC_m(i)- ytx_m(i) )/(2n_controls_m); alt_AF_b_controls(i) = ( AC_b(i)- ytx_b(i) )/(2n_controls_b);
endfor
%-------------------------------------------------------------------------------
% defining reference allele frequencies in case groups
%-------------------------------------------------------------------------------
for i=1:n-1
ref_AF_f(i) = 1-alt_AF_f(i);
ref_AF_m(i) = 1-alt_AF_m(i);
ref_AF_b(i) = 1-alt_AF_b(i);
endfor
%-------------------------------------------------------------------------------
% defining reference allele frequencies in control groups
%-------------------------------------------------------------------------------
for i=1:n-1
ref_AF_f_controls(i) = 1-alt_AF_f_controls(i);
ref_AF_m_controls(i) = 1-alt_AF_m_controls(i);
ref_AF_b_controls(i) = 1-alt_AF_b_controls(i);
endfor
%-------------------------------------------------------------------------------
% checking for errors by comparing minor_allele with minor_allele2
%-------------------------------------------------------------------------------
error2 = 0.;
for i=1:n-1
a = minor_allele(i);
b = minor_allele2(i);
% function strcmp (string_1, strin_2) returns 1 if the two strings are identical
% it resturns 0 if the two strings have at least a different character in the same position
test = strcmp(a,b);
if ( (test==0) );
error2 = error2 +1;
endif
endfor
error2
%-------------------------------------------------------------------------------
% writing of GENE_total_females.csv
%-------------------------------------------------------------------------------
columns = cell(n,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_f{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_f(i);
columns{i+1,15} = alt_AF_f(i);
columns{i+1,16} = alt_AF_f_controls(i);
columns{i+1,17} = ref_AF_f(i);
columns{i+1,18} = ref_AF_f_controls(i);
columns{i+1,19} = n_cases_f;
columns{i+1,20} = n_controls_f;
endfor
file_name = strcat(gene,'_total_females.csv')
cell2csv (file_name, columns," ")
%-------------------------------------------------------------------------------
% writing of GENE_total_males.csv
%-------------------------------------------------------------------------------
columns = cell(n,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_m{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_m(i);
columns{i+1,15} = alt_AF_m(i);
columns{i+1,16} = alt_AF_m_controls(i);
columns{i+1,17} = ref_AF_m(i);
columns{i+1,18} = ref_AF_m_controls(i);
columns{i+1,19} = n_cases_m;
columns{i+1,20} = n_controls_m;
endfor
file_name = strcat(gene,'_total_males.csv')
cell2csv (file_name, columns," ")
%-------------------------------------------------------------------------------
% writing of CLYBL_total_both_sexes.csv
%-------------------------------------------------------------------------------
columns = cell(1523,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_b{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_b(i);
columns{i+1,15} = alt_AF_b(i);
columns{i+1,16} = alt_AF_b_controls(i);
columns{i+1,17} = ref_AF_b(i);
columns{i+1,18} = ref_AF_b_controls(i);
columns{i+1,19} = n_cases_b;
columns{i+1,20} = n_controls_b;
endfor
file_name = strcat(gene,'_total_both_sexes.csv')
cell2csv (file_name, columns," ")
Some days ago it came to my attention the fact that there is a religious group that considers the presence of propositions within the Old Testament that seem to describe events that have happened after the propositions were written as proof of the divine origin of such a text. Not only that: to my understanding, they seem to infer from those successful prophecies that also the other predictions present in the Bible must turn out to happen somewhere in the future.
Now, in these statements, there are several problems. One can note, for instance, that successful scientific theories like Newtonian gravitation continuously predict with great accuracy the future, and yet this doesn’t seem to be generally considered proof that Isaac Newton had a divine nature. On the other hand, the fact that his gravitational law predicts the orbits of the Earth, of Mars, of the modules of the Apollo missions, etc., doesn’t prevent the same theory to fail when it comes to the orbit of Mercury. So, this single counter-example might be considered enough to prove that the first paragraph of this article contains false statements.
But then there is another problem. How do I evaluate whether a prophecy is true or false? Despite the obvious difficulties due to the fact that this kind of prediction is expressed in a human language (translated from one tongue into another) and not in mathematical language and as such ambiguous by its very nature, one can still try – I reasoned – to apply the standard method used in science. In other words, one can calculate the probability of the event under the null hypothesis (that, in this case, is: the prophecy is true by chance) and see what kind of number he gets.
Well, I discovered right away that this calculation had in fact been attempted by Peter Stoner, a math teacher at Pasadena City College (R), and it can be found in his book, Science Speaks (freely available here). Let’s consider one of those prophecies, the one that seems to be particularly close to the heart of Marina, the Jehovah’s Witness who is currently persecuting me, the prophecy of Babylon:
And Babylon … shall never be inhabited, neither shall it be dwelt in from generation to generation: neither shall the Arabian pitch tent there; neither shall the shepherds make their fold there. But wild beasts of the desert shall lie there; and their houses shall be full of doleful creatures.
(Isa. 13:19-21 – written 712 B.C.)
And they shall not take of thee a stone for a corner, nor a stone for foundations; but thou shalt be desolate forever, saith the Lord … Neither doth any son of man pass thereby.
(Jer. 51:26,43 – written 600 B.C.)
Now, in his book, Peter Stoner translates this prophecy in the following seven events or propositions:
: Babylon shall be destroyed. : It shall never be reinhabited (once destroyed). : The Arabs shall not pitch their tents there (once destroyed). : There shall be no sheepfolds there (once destroyed). : Wild beasts shall occupy the ruins (once destroyed). : The stones shall not be taken away for other buildings (once destroyed). : Men shall not pass by the ruins (once destroyed).
I have added the expression within brackets, to enhance clarity. The probabilities of the single events (on the right) are those proposed by Peter Stoner, and I am not going to discuss them. Now, if we indicate the whole prophecy with E, according to Stoner the probability of the event under the null hypothesis is
This is a small probability for the event under the null hypothesis, so we must accept that a power beyond human comprehension has… But let’s read again the prophecy: it seems articulated but, in fact, it only says that the place where Babylon once raised will be basically abandoned by human beings. In other words, it seems reasonable to say that for . If that is the case, then we have
A further observation is that if the place is desolated with no human beings () then it is reasonable to assume that it becomes the reign of wild animals. In other words: . Not only that, I would guess that it is safe to assume that is about one. Then we have found
In other words, the mistake that leads Stoner to a completely wrong value for is the fact that he considered the events as independent one from the other, while this is obviously not the case.
Now, is this the probability of the event under the null hypothesis? Well, it depends, because this is a case in which we have a prediction that has been got from a very long book with thousands of propositions, some of which look very much like predictions. Now, of course, when one picks a prophecy among all these propositions, he might be unconsciously tempted to pick the one that looks more like a fulfilled prophecy. In other words, we have to check for multiple comparisons in this case. So, let us consider that we have a number of N propositions similar to the one about Babylon. The probability that at least one of these propositions is true purely by chance is
The function is plotted by the following code and as you can see, for N >4 we have that the null hypothesis becomes very likely. In other words, if we pick this prophecy among 5 other similar sentences, its resemblance to reality is just due to chance. In the figure below the red line indicates a probability of 0.05. A probability equal to or higher than 0.05 is associated with null hypotheses that must be considered true.
Figure. The probability that at least one among N prophecies is true purely by chance, for increasing values of N. The red line indicates a probability of 0.05 above which the null hypothesis must be considered true.
It should be added that the fact that the prophecy of Babylon has to be considered true is highly questionable. The reader can do his own research on that, for example starting from the notes collected in this text.
% file name = Babylon % date of creation = 20/03/2021 % it calculates the probability that N prophecies are true by chance clear all % array for p(N) N = 30; for i=1:N p(i) = 1-(99/100)^i; p2(i) = 0.05; endfor % plotting plot ([1:N],p(:),’-k’, “linewidth”, 2) hold on plot ([1:N],p2(:),’-r’, “linewidth”, 2) xlabel(‘number of prophecies’); ylabel(‘probability of the null hypothesis’); grid minor grid on
A more detailed discussion of these same topics is now available here.
Introduction
Let’s consider the relative humidity we have right now in Venice, Italy: 97% with a temperature of 8°C and a pressure of 755 mmHg. Pretty humid, right? What about a warm place, let’s say Buenos Aires, in Argentina? Right now they have a relative humidity of 55%, at a temperature of 23° and a pressure of 760 mmHg. Now, which place is more humid, between these two? In other words: in which of these two places does the same volume of air contain the biggest amount of water? Are you sure you know the answer?
Some definitions and a formula
Relative humidity is defined as follows
where is the partial pressure of water vapor in the air and is the saturation vapor pressure with respect to water (the vapor pressure in the air in equilibrium with the surface of the water). We then have
Here and in what follows we express pressure in hPa, with , while is the temperature in °C and is the temperature in , with . Absolute humidity is simply given by
where is the mass of vapor and is the total volume occupied by the mixture (Ref, chapter 4).
Physics of gasses
From the law of perfect gasses and considering that, according to Dalton’s law, in a moisture, the partial pressure of each gas is the pressure it would have if it occupied the whole volume (Ref), we can also write
with . Then, by substituting and in , we have
At the end of this post, you find a simple code in Octave that calculates this formula and the plot.
The answer
If we apply now the equation for to the environmental parameters relative to Venice, we find that one of air contains 8 grams of water; if we repeat the same calculation for Buenos Aires we find that the same volume of air contains 11 grams of water. In other words, today Venice is less humid than Buenos Aires in the same period if we refer to the content of water in the air.
In the diagram below you can see the plot of absolute humidity as a function of the temperature, for three values of relative humidity. So, for instance, 20 of water corresponds to an RH of 90% at a temperature of 24°C and to an RH of only 50% at a temperature of 35°C. The reason for this, beyond the mathematical formulae, is obvious: when the temperature of the air decreases, the ability of the moisture to retain water as vapor decreases accordingly.
Conclusion
Relative humidity is relevant for the subjective perception of heat (the more the relative humidity, the more difficult it is sweating for human beings, then the higher it is the perception of heat since we use sweating for cooling down). But if we are interested in the interaction of air and our lungs, for instance, relative humidity might be completely misleading and absolute humidity is likely the parameter to be considered (R).
The script
% file name = absolute_humidity % date of creation = 07/02/2021 % it calculates the absolute humidity given the relative humidity and the temperature clear all % constant for water vapour (J/kg*K) Rv = 461.4; % relative humidity (%), temperature (°C), pressure (mmHg) RH = 55 t = 23 pHg = 760 % conversion to Pa (1 mmHg = 133.322365 Pa) p = pHg*133.322 % conversion to hPa=100Pa p = p/100; Rv = 4.614; % absolute humidity (kg/m^3) AH = (RH/(100*Rv*(273.15+t)))*(1.0016+3.15*10^(-6)*p-0.074/p)*6.112*e^(17.62*t/(243.12+t)) % we now fix the RH and plot AH for variuos values of t RH = 50 pHg = 760 p = pHg*133.322; p = p/100; for i=1:30 t2(i)=i+5; AH2(i) = 1000*(RH/(100*Rv*(273.15+t2(i))))*(1.0016+3.15*10^(-6)*p-0.074/p)*6.112*e^(17.62*t2(i)/(243.12+t2(i))); endfor plot (t2, AH2,’-k’,’LineWidth’,3) grid on grid minor hold on RH = 70 for i=1:30 t2(i)=i+5; AH2(i) = 1000*(RH/(100*Rv*(273.15+t2(i))))*(1.0016+3.15*10^(-6)*p-0.074/p)*6.112*e^(17.62*t2(i)/(243.12+t2(i))); endfor plot (t2, AH2,’-r’,’LineWidth’,3) hold on RH = 90 for i=1:30 t2(i)=i+5; AH2(i) = 1000*(RH/(100*Rv*(273.15+t2(i))))*(1.0016+3.15*10^(-6)*p-0.074/p)*6.112*e^(17.62*t2(i)/(243.12+t2(i))); endfor plot (t2, AH2,’-b’,’LineWidth’,3) xlabel (‘temperature (°C)’) ylabel (‘absolute humidity (g/{m^3})’) legend (‘AH for RH = 50%’,’AH for RH = 70%’,’AH for RH = 90%’, ‘location’, ‘NORTHWEST’ )
The equations of this blog post were written using.
Some days ago, David Systrom offered an overview of his work on cardiopulmonary testing in ME/CFS during a virtual meeting hosted by the Massachusetts ME/CFS & FM Association and the Open Medicine Foundation. In this blog post, I present an introduction to the experimental setting used for Systrom’s work (paragraph 1), a brief presentation of his previous findings (paragraph 2), and an explanation of his more recent discoveries in his cohort of patients (paragraph 3). In paragraph 4 you’ll find a note on how to support his research.
1. Invasive Cardiopulmonary Exercise Testing
It is a test that allows for the determination of pulmonary, cardiac, and metabolic parameters in response to physical exertion of increasing workload. It is, mutatis mutandis, the human equivalent of an engine test stand. A stationary bike with a mechanical resistance that increases by 10 to 50 Watts for minute is usually employed for assessing the patient in a upright position, but a recumbent bike can also be used in some instances. Distinguishing between these two different settings might be of pivotal relevance in ME/CFS and POTS. I shall now briefly describe some of the measurements that can be collected during invasive cardiopulmonary exercise testing (iCPET) and their biological meaning. For a more accurate and in-depth account, please refer to (Maron BA et al. 2013), (Oldham WM et al. 2016). I have used these papers as the main reference for this paragraph, unless otherwise specified.
Gas exchange. A face mask collects the gasses exchanged by the patient during the experiment and allows for monitoring of both oxygen uptake per unit of time (named ) and carbon dioxide output (), measured in mL/min. Gas exchange is particularly useful for the determination of the anaerobic threshold (AT), i.e. the point in time at which the diagram of in function of displays an abrupt increase in its derivative: at this workload, the patient starts relying more on her anaerobic energy metabolism (glycolysis, for the most part) with a build-up of lactic acid in tissues and blood (see Figure 1).
Figure 1. Diagram of in function of . The point in which there is a change in the derivative with respect to is called “anaerobic threshold” (AT). AT is highlighted with a vertical line in this picture. This diagram is from an actual CPET of a patient.
Oxygen uptake for unit of time at AT (called max) can be considered an integrated function of patient’s muscular, pulmonary, and cardiac efficiency during exercise. It is abnormal when its value is below 80% of what predicted according to patient’s age, sex, and height. Importantly, according to some studies there might be no difference in max between ME/CFS patients and healthy controls, unless the exercise test is repeated a day after the first measure: in this case the value max for patients is significantly lower than for controls (VanNess JM et al. 2007), (Snell CR and al. 2013).
Another measure derived from the assessing of gas exchange is minute ventilation (VE, measured in L/min) which represents the total volume of gas expired per minute. The link between VE and is as follows:
Maximum voluntary ventilation (MVV) is the maximum volume of air that is voluntarily expired at rest. During incremental exercise, a healthy person should be able to maintain her VE at a value ∼0.7 MVV and it is assumed that if the ratio VE/MVV is above 0.7, then the patient has a pulmonary mechanical limit during exercise. If VE is normal, then an early AT suggests an inefficient transport of oxygen from the atmosphere to muscles, not due to pulmonary mechanics, thus linked to either pulmonary vascular abnormalities or muscular/mitochondrial abnormalities. It is suggested that an abnormally high derivative of the diagram of VE in function of and/or a high ratio VE/ at AT (these are measures of how efficiently the system gets rid of ) are an indicator of poor pulmonary vascular function.
Respiratory exchange ratio (RER) is a measure of the effort that the patient puts into the exercise. It is measured as follows:
and an RER>1.05 indicates a sufficient level of effort. In this case the test can be considered valid.
Arterial catheters. A sensor is placed just outside the right ventricle (pulmonary artery, Figure 2) and another one is placed in the radial artery: they allow for measures of intracardiac hemodynamics and arterial blood gas data, respectively. By using this setting, it is possible to indirectly estimate cardiac output (Qt) by using Fick equation:
where the is measured by the radial artery catheter and the venous one is measured by the one in the pulmonary artery (ml/L). An estimation for an individual’s predicted maximum Qt (L/min) can be obtained by dividing her predicted max by the normal maximum value of during exercise, which is 149 mL/L:
If during iCPET the measured Qt max is below 80% of the predicted maximum cardiac output (as measured above), associated with reduced max, then a cardiac abnormality might be suspected. Stroke volume (SV), defined as the volume of blood ejected by the left ventricle per beat, can be obtained from the Qt according to the following equation:
where HR stands for heart rate. One obvious measure from the pulmonary catheter is the mean pulmonary artery pressure (mPAP). The right atrial pressure (RAP) is the blood pressure at the level of the right atrium. Pulmonary capillary wedge pressure (PCWP) is an estimation for the left atrial pressure. It is obtained by the pulmonary catheter. The mean arterial pressure (MAP) is the pressure measured by the radial artery catheter and it is a proxy for the pressure in the left atrium. RAP, mPAP, and PCWP are measured by the pulmonary catheter (the line in red) which from the right atrium goes through the tricuspid valve, enters the right ventricle, and then goes along the initial part of the pulmonary artery (figure 2).
Figure 2. Right atrial pressure (RAP) is the pressure of the right atrium, mean pulmonary arterial pressure (mPAP) is the pressure of the right ventricle, pulmonary capillary wedge pressure (PCWP) is an estimation of the pressure of the left atrium. Mean arterial pressure gives a measure of the pressure of the left ventricle. RAP, mPAP, and PCWP are measured by the pulmonary catheter (the line in red) which from the right atrium goes through the tricuspid valve, enters the right ventricle, and then goes across the initial part of the pulmonary artery (R).
Derived parameters. As seen, Qt (cardiac output) is derived from actual direct measures collected by this experimental setting, by using a simple mathematical model (Fick equation). Another derived parameter is pulmonary vascular resistance (PVR) which is obtained using the particular solution of the Navier-Stokes equations (the dynamic equation for Newtonian fluids) that fits the geometry of a pipe with a circular section. This solution is called the Poiseuille flow, and it states that the difference in pressure between the extremities of a pipe with a circular cross-section A and a length L is given by
where is a mechanical property of the fluid (called dynamic viscosity) and Q is the blood flow (Maccallini P. 2007). As the reader can recognize, this formula has a close resemblance with Ohm’s law, with P analogous to the electric potential, Q analogous to the current, and analogous to the resistance. In the case of PVR, Q is given by Qt while . Then we have:
where the numeric coefficient is due to the fact that PVR is usually measured in and 1 dyne is Newton while 1 mmHg is 1333 N/m².
2. Preload failure
A subset of patients with exercise intolerance presents with preload-dependent limitations to cardiac output. This phenotype is called preload failure (PLF) and is defined as follows: RAP max < 8 mmHg, Qt and max <80% predicted, with normal mPAP (<25 mmHg) and normal PVR (<120 ) (Maron BA et al. 2013). This condition seems prevalent in ME/CFS and POTS. Some of these patients have a positive cutaneous biopsy for small-fiber polyneuropathy (SFPN), even though there seems to be no correlation between hemodynamic parameters and the severity of SFPN. Intolerance to exercise in PLF seems to improve after pyridostigmine administration, mainly through potentiation of oxygen extraction in the periphery. A possible explanation for PLF in non-SFPN patients might be a more proximal lesion in the autonomic nervous system (Urbina MF et al. 2018), (Joseph P. et al. 2019). In particular, 72% of PLF patients fits the IOM criteria for ME/CFS and 27% meets the criteria for POTS. Among ME/CFS patients, 44% has a positive skin biopsy for SFPN. One possible cause for damage to the nervous system (both in the periphery and centrally) might be TNF-related apoptosis-inducing ligand (TRAIL) which has been linked to fatigue after radiation therapy; TRAIL increases during iCPET among ME/CFS patients (see video below).
3. Latest updates from David Systrom
During the Massachusetts ME/CFS & FM Association and Open Medicine Foundation Fall 2020 Event on Zoom, David Systrom reported on the results of iCPET in a set of ME/CFS patients. The max is lower in patients vs controls (figure 3, up). As mentioned before, max is an index that includes contributions from cardiac performances, pulmonary efficiency, and oxygen extraction rate in the periphery. In other words, a low max gives us no explanation on why it is low. This finding seems to be due to different reasons in different patients even though the common denominator among all ME/CFS patients of this cohort is a low pressure in the right atrium during upright exercise (low RAP, figure 3, left). But then, if we look at the slope of Qt in function of (figure 3, right) we find three different phenotypes. Those with a high slope are defined “high flow” (in red in figure 3). Then we have a group with a normal flow (green) and a group with a low flow (blue). If we look then at the ability to extract oxygen by muscles (figure 3, below) expressed by the ratio
we can see that the high flow patients reach the lowest score. In summary, all ME/CFS patients of this cohort present with poor max and preload failure. A subgroup, the high flow phenotype, has poor oxygen extraction capacity at the level of skeletal muscles.
Figure 3.The results presented by David Systrom are here displayed around a schematic representation of the circulatory system. is a global measure of the efficiency of the circulatory system. CO, which stands for cardiac output (indicated Qt in this blog post) is related to the output of the left half of the heart. RAP is the pressure of the right atrium.By Paolo Maccallini.
Now the problem is: what is the reason for the preload failure? And in the high flow phenotype, why the muscles can’t properly extract oxygen from blood? As mentioned, about 44% of ME/CFS patients in this cohort has SFPN but there is no correlation between the density of small-fibers in the skin biopsies and the hemodynamic parameters. Eleven patients with poor oxygen extraction (high flow) had their muscle biopsy tested for mitochondrial function (figure 4) and all but one presented a reduction in the activity of citrate synthase (fourth column): this is the enzyme that catalyzes the last/first step of Krebs cycle and it is considered a global biomarker for mitochondrial function. Some patients also have defects in one or more steps of the electron transport chain (fifth column) associated with genetic alterations (sixth column). Another problem in high flow patients might be a dysfunctional vasculature at the interface between the vascular system and skeletal muscles (but this might be true for the brain too), rather than poor mitochondrial function.
Figure 4.Eleven patients with high flow (poor oxygen extraction) underwent a muscle biopsy. Mitochondrial function has been assessed in these samples and all the patients but one presented a reduced activity for the enzyme citrate synthase (4th column). Defects in the oxygen transport chain and in the mitochondrial chromosome have also been documented in 4 of them (column 5th and column 6th).
The use of an acetylcholinesterase inhibitor (pyridostigmine) improved the ability to extract oxygen in the high flow group, without improving cardiac output, as measured with a CPET, after one year of continuous use of the drug. This might be due to better regulation of blood flow in the periphery. This paragraph is an overview of the following video:
4. Funding
The trial on the use of pyridostigmine in ME/CFS at the Brigham & Women’s Hospital by Dr. David Systrom is funded by the Open Medicine Foundation (R). This work is extremely important, as you have seen, both for developing diagnostic tools and for finding treatments for specific subgroups of patients. Please, consider a donation to the Open Medicine Foundation to speed up this research. See how to donate.
The equations of this blog post were written using (see this article).
Each mathematical formula I have used in my blog till now was an image: I wrote them in a Word document using the equation editor by Microsoft (that I like very much), then I captured the screen and saved it as an image, and then I uploaded them in my blog.
In the past, I tried to find a way to write them as text, without succeeding. But it is in fact possible, by using the LaTeX syntax and a few HTML commands.
These are the first “true” mathematical formulae present in this blog. I will use this page to test the LaTeX syntax for mathematical expressions (a handbook is available here). In the table that follows, I present a series of relevant examples of LaTeX syntax. A useful resource for testing out LaTeX instructions is this website.
My ME/CFS improves during summer, in the period of the year that goes from May/June to the end of September. I don’t know why. I have several hypotheses. One possible reason for the improvement in summer is an interaction between the light from the Sun and some parts of my physiology, the immune system for instance. We know that ME/CFS tends to have an oscillating course in most of the patients (Chu L. et al. 2019), but the presence of a seasonal pattern in this patient population has not been investigated so far, to my knowledge. And yet, if you ask directly to patients, many of them say that they feel better in summer. Unfortunately, we don’t have scientific data on that, this is an area worth investigating with some carefully done survey.
Seasonal variation of the immune system
The immune system has a high degree of variation for several reasons (Brodin P et Davis MM 2017). In particular, there are studies about the seasonal fluctuations in the expression of some crucial genes of the immune response (Dopico XC et al. 2014).
How does this regulation happen? Different mechanisms are possible, some of them might be related to changes in the light we receive from the Sun as the Earth rotates around it. We know that the length of the day has an effect on innate immunity: the more the hours of light, the lower the power of the innate immune system (Pierre K. et al. 2016). We also know that ultraviolet radiation, particularly UVB, is an agonist for the aryl hydrocarbon receptor (AhR) (Navid F. et al. 2013). This receptor seems to reduce the expression of the major histocompatibility complex II (MHC II) in dendriticcells (DCs), thus reducing their antigen-presenting activity (Rothhammer V. et Quintana F.J. 2019). UVB might be able to reach dendriticcells when they are circulating near the skin, during summer, thus inhibiting their antigen-presenting activity. Infrared radiation, on the other hand, seems to have an effect on energy metabolism: in Fall we lose a significant amount of infrared radiation in a wavelength range (0.7-1.0 nm) that is known to have an effect on mitochondrial activity (Nguyen L.M. et al. 2013) and it might perhaps have an indirect effect on immunity too.
As further proof of seasonal fluctuation in immunity, some immunological diseases have this kind of seasonality: Rheumatoid arthritis (Nagamine R. et al. 2014) and Rheumatic fever (Coelho Mota C.C. et al. 2010) are two examples. Moreover, the prevalence of Multiple Sclerosis is directly proportional to the latitude (Simpson S. et al. 2011). We also know that there is seasonal fluctuation in serum autoantibodies (Luong T.H. et al. 2001).
Of course, sunlight might be just one of the variables into play. The other aspect I am considering is the seasonal distribution of some common pathogens. Streptococcus, Enteroviruses and Fungi of the genus Penicillium are known to have a seasonal distribution with a peak in Fall and/or Winter (Ana S.G. et al. 2006), (Salort-Pons M et al. 2018), (Coelho Mota C.C. et al. 2010). Common influenza has this pattern too. Rheumatic fever, a disease due to an abnormal immune response to Streptococcus, has its flares in Fall because Streptococcus is more common in that period of the year (Coelho Mota C.C. et al. 2010). Even the composition of the gut microbiota has a seasonal pattern (Koliada A. et al. 2020). I am currently investigating my immunosignature, measured with an array of 150.000 random peptides, to see if I can find some relevant pathogen in my case. You can find this study here.
(A few months after I wrote these notes a pivotal study has been published on these same topics, avalilablehere).
An experiment
I moved from Rome (Italy) to Rosario (Argentina) at the beginning of January. I was very sick and I steadily improved after about 40 days. I became a less severe ME/CFS patients and I could work several hours a day and care for myself, granted that I did not exceed with aerobic exercise. At the end of March, I started deteriorating as it usually happens at the end of September, when I am in Rome. In order to study this phenomenon, I have built a complete model of solar radiation at sea level, which considers the inclination of sunrays in function of the latitude and of the day of the year. It takes into account the effect of the atmosphere (both diffusion and absorption) and the eccentricity of the orbit (Maccallini P. 2019). If you look at the figure below (a byproduct of my mathematical model) you can see that when I started deteriorating in Rosario, the power of sunrays at noon in that city was still as high as it is in Rome during the summer solstice (this is due to the fact that the Earth is closer to the Sun in this period and to the fact that Rosario is closer to the Equator than Rome is).
So I have to discard the original idea that the power within the infrared range, or the ultraviolet radiation, or the visible one is responsible for my improvement in summer. If I still have to consider that sunlight has something to do with my improvement, I must conclude that it is the length of the day the relevant parameter: I may need more than 12 hours of light to feel better. Why? Because the longer the day, the lower the strength of the innate immunity. This is now my working hypothesis and I will start from the following mathematical model to pursue this research: (Pierre K. et al. 2016).
I will also use full-spectrum lamps early in the morning and in the evening to reproduce a 15 hours day, so to dampen down my innate immune system in a safe, drug-free way. I have to reproduce a day of 15 hours and see what happens. In the figure below the hours of the day at dawn and at dusk and the length of the day for Rome, for each day of the year (this is also a plot from my model).
What follows is the script I have coded in order to plot the first figure of this post. More details on this model of solar radiation are here: (Maccallini P. 2019). As a further note, I would like to acknowledge that I started pursuing this avenue in the summer of 2009: I was building the mathematical model of solar radiation (see figure below, made in 2009) but as the summer finished, I turned into a statue and I had to stop working on it. When I improved, about a year later I started working on the systematic analysis of the mechanical equilibrium of planar structures (it is a chapter of this book). I am proud of that analysis, but it has not been very useful for my health…
% file name = sun emissive power sea level Rosario vs Roma
% date of creation = 07/11/2019
% sun emissive power per unit area, per unit wavelength at sea level
clear all
% three parameters of the orbit
A = 6.69*( 10^(-9) ); % 1/km
B = 1.12*( 10^(-10) ); % 1/km
delta = pi*313/730;
% the two parameters of Plunk's law
C_1 = 3.7415*( 10^(-16) ); % W*m^2
C_2 = 1.4388*( 10^(-2) ); % mK
% Stefan-Boltzmann parameter ( W/( (m^2)*(K^4) ) )
SB = 5.670*( 10^(-8) );
% radius of the photosphere (m)
R_S = 696*(10^6); % m
% temperature of the photosphere (K)
T_S = 5875;
% conversion of units of measurments
N = 20; % dots for the equator
R = 3.8; % radius of the orbit
ro_E = 1.3; % radius of the earth
lambda_Rosario = -32*pi/180; % latitude of Rosario (radiants)
lambda_Roma = 41*pi/180; % latitude of Rome (radiants)
delta = 23.45*pi/180; % tilt angle
% the array of theta
theta(1) = 0; % winter solstice (21/22 December)
i_ws = 1;
day = 2*pi/365;
days = [1:1:366];
for i = 2:366
theta(i) = theta(i-1) + day;
if ( abs( theta(i) - (pi/2) ) <= day )
i_se = i; % spring equinox (20 March)
endif
if ( abs( theta(i) - pi ) <= day )
i_ss = i; % summer solstice (20/21 June)
endif
if ( abs( theta(i) - (3*pi/2) ) <= day )
i_ae = i; % autumn equinox (22/23 September)
endif
endfor
% the array of the radius (m)
for i=1:1:366
o_omega (i) = (10^3)/[ A + ( B*sin(theta(i) + delta ) ) ]; % m
endfor
% the array of the wavelength in micron
N = 471;
L(1) = 0.3;
L(N) = 5.0;
delta_L = ( L(N) - L(1) )/(N-1);
for j = 2:N-1
L (j) = L(j-1) + delta_L;
endfor
% the array of beta*L
load beta_int.mat S
% the array of L in metres
L_m = L*( 10^(-6) );
% angle psi
psi(1) = 0;
minute = pi/(12*60);
for i = 2:(24*60)+1
psi(i) = psi(i-1) + minute;
endfor
% -----------------------------------------------------------------------------
% Rosario
lambda = lambda_Rosario
% angle between n and r at noon in Rosario
for i= [i_ws, i_ae, i_ss, i_se]
for j=1:(24*60) + 1
% scalar product between n and r
scalar_p(j) = [cos(lambda)*sin(psi(j))*cos(delta) + sin(lambda)*sin(delta)]*( -cos(theta(i)) )+ [(-1)*cos(lambda)*cos(psi(j))]*( -sin(theta(i)) );
endfor
% value of psi at noon
for j=1:(24*60) + 1
if ( ( scalar_p(j) ) == ( max( scalar_p ) ) )
j_noon = j;
psi_noon (i) = psi(j);
endif
endfor
% angle between n and r at noon
cos_gamma (i) = scalar_p(j_noon);
endfor
% the array of the emissive power (W/(m^2)*micron) in Rosario
for i = i_se:i_se
for j=1:N
num = C_1*( (R_S)^2 );
den = ( (L_m(j)^5)*( (e^(C_2/( L_m(j)*T_S ))) - 1)*( (o_omega(i))^2 ) )*10^6;
power(j,i) = ( num/den )*( e^(-S(j)/cos_gamma (i)) );
endfor
% plotting
plot (L (1:N), power(1:N,i), '-r', "linewidth", 2)
xlabel('wavelenght ({\mu})');
ylabel('W/m^{2}{\mu}');
axis ([0.3,5,0,1500])
grid on
endfor
hold on
% -----------------------------------------------------------------------------
% Rome
lambda = lambda_Roma
% angle between n and r at noon in Rosario
for i= [i_ws, i_ae, i_ss, i_se]
for j=1:(24*60) + 1
% scalar product between n and r
scalar_p(j) = [cos(lambda)*sin(psi(j))*cos(delta) + sin(lambda)*sin(delta)]*( -cos(theta(i)) )+ [(-1)*cos(lambda)*cos(psi(j))]*( -sin(theta(i)) );
endfor
% value of psi at noon
for j=1:(24*60) + 1
if ( ( scalar_p(j) ) == ( max( scalar_p ) ) )
j_noon = j;
psi_noon (i) = psi(j);
endif
endfor
% angle between n and r at noon
cos_gamma (i) = scalar_p(j_noon);
endfor
% the array of the emissive power (W/(m^2)*micron) in Rosario
for i = [i_ae, i_ss]
for j=1:N
num = C_1*( (R_S)^2 );
den = ( (L_m(j)^5)*( (e^(C_2/( L_m(j)*T_S ))) - 1)*( (o_omega(i))^2 ) )*10^6;
power(j,i) = ( num/den )*( e^(-S(j)/cos_gamma (i)) );
endfor
endfor
hold on
plot (L (1:N), power(1:N,i_ae), '-k', "linewidth", 2)
plot (L (1:N), power(1:N,i_ss), '--k', "linewidth", 2)
legend ('spring equinox in Rosario', 'autumn equinox in Rome', 'summer solstice in Rome', "location",'NORTHEAST')
hold on
plot ([0.4,0.4], [0,1500], '--k', "linewidth", 1)
plot ([0.7,0.7], [0,1500], '--k', "linewidth", 1)
In this document, I propose a distribution for the number of infected subjects in Italy during the outbreak of Coronavirus 19. To do that I find a logistic curve for the number of deaths due to this virus, in Italy. I also use a density of probability for the fatality rate and one for the number of days from infection to death. These functions have been built using recently published statistical data on the Chinese outbreak.
Brain normalization is the rigid rotation and/or deformation of a 3D brain scan so that it can match a template brain. This is a necessary step for the analysis of brain magnetic resonance data (whether it is morphological or functional) as well as of brain PET data. It allows introducing a set of spatial coordinates such that each triplet (x,y,z) identifies the same anatomical region both in the brain we are studying and in the template (Mandal P.K. et al. 2012). So, for instance, once the normalization has been performed on an fMRI of the brain of a patient and on a set of fMRIs from a suited control group, we can compare the BOLD activation of each anatomical region of the patient’s brain with the activation of the corresponding anatomical region of the control group.
Mathematical notes
This part can be skipped, it is not necessary to read these passages for the understanding of the following paragraphs. If we assume that P is a point of the brain before normalization and we call S(P) its position after the normalization, we can consider the vectorial function:
which gives the new position of each point of the brain, after normalization. If P_0 is a point of the brain before the normalization, then we can write:
and remembering the expression of the differential of a vectorial function, we have
With a few passages we can write:
From the above formula, we realize that in order to define the configuration of the brain after normalization, we have to define, for each point P, a set of 12 parameters. Of these parameters, 6 describe the rigid movement and can be considered the same for each point. The other 6 (the coefficients of the matrix) are those that describe the change of shape and size and, in general, they are different for each point. The job of SPM is to calculate these parameters.
Brain atlases
There are several templates (also called brain atlases) that have been developed across the years. The very first one was published in 1957 (Talairach J. et al. 1957). The same researcher then participated in building one of the most used brain atlas ever, based on a single female brain, published in 1988 (Talairach J. et Tournoux P. 1988). Another widely used template is the so-called MNI-152. It was built adapting the 3D MRI brain scans of 152 healthy individuals to the Talairach and Tournoux template. The adaptation was achieved using both a rigid roto-translation and a deformation (Maintz J.B. et Viergever M.A. 1988). The second step is required for overcoming the problem of differences in brain shape and dimension that we encounter within the human population.
Limitations
Available brain atlases have some limitations. One of them being the fact that despite diseased brains are the most widely studied, they are also the most difficult to register to a template built from healthy individuals, because of usually marked differences in shape and/or size. This is true for instance for brains of patients with Alzheimer’s disease (Mandal P.K. et al. 2012). Another important limitation is that registration algorithms perform poorly for the brainstem (particularly for pons and medulla) (Napadow V. et al. 2006). This might have represented a problem for the study of diseases where a possible involvement of the brainstem is suspected, like for instance ME/CFS (VanElzakker M. et al. 2019).
SPM12
The SPM software package is one of the most widely used instruments for the analysis of functional brain imaging data (web page). It is freely available for download, but it requires that you have a MatLab copy in your computer. Those who don’t have a MatLab license can request and install a standalone version of SPM12 by following the instructions of this page.
Importing a DICOM file
Once you have installed SPM12 in your computer, the first step in order to register a brain is to convert the format the series of images are written in, to a format that SPM12 can read. MRI images are usually in .dcm format (DICOM) while SPM12 reads .nii files. In order to do that, click DICOM import (figure below, on the left, red ellipse), then click on DICOM files (on the right, red ellipse), then select your .dcm file and click DONE (below, red ellipse). If you then click DISPLAY (blue ellipse, left) you will see your MRI scan in another window (see next paragraph). A video tutorial on these operations is available here.
Setting the origin
To start the normalization process, it is highly recommended to set manually the origin of the coordinates. If this is done properly, the registration will not only take less time but, even more importantly, the chances of a successful normalization will increase. The origin is set at the level of the anterior commissure (figure below). To find this anatomical structure, you can follow this video. Once you have put the cross on the right place in the sagittal, coronal and axial windows, just click SET ORIGIN (red ellipse) and then save your work clicking REORIENT (blue ellipse).
Normalization estimate
In this step, SPM12 calculates the set of distortions that have to be applied to the source brain to adapt it to the template in MNI space. On the main menu select NORMALIZE (ESTIMATE) (figure, on the left, red). This will open the batch editor where you are asked to load the subject you want to apply normalization to (figure, right, red). You have also a set of estimation options (blue), that we leave as they are. Then you click the RUN button, the arrow on the top of the batch editor.
At this point, your PC will perform a set of calculations that will require a few minutes. At the end of this process, a new .nii file will be saved in the spm12 folder. This is the set of distortions that will allow your subject’s brain to be registered to the template.
Normalization writing
Now click on NORMALIZE (WRITE) on the main menu. The batch editor will then ask you for the deformation field, which is the file generated in the previous step, and for the images to write, which is the scan of your subject (figure below). Select them, then press the RUN button on the batch editor. A new .nii file will be written in the spm12 folder. This is the normalized brain!
In the next figure, you have the normalized brain on the left and the initial scan of the same subject on the right. As you can see, there is an overall change of shape.
Anatomical areas
Now that we have normalized our brain in the MNI space, we can easily find anatomical regions within its sections. We can, for instance, load the normalized brain with MRIcron and overlay a template with Brodmann’s areas highlighted in various colours (figure below).