An update on the UK Biobank: fine-mapping

An update on the UK Biobank: fine-mapping

The man who does not make his own tools does not make his own sculpture.

Irving Stone, The Agony and The Ecstasy

An analysis of the UK Biobank database carried out by Neale’s Lab pointed out a significant correlation between a region of chromosome 13 and females with self-reported Chronic Fatigue Syndrome (see this blog post). This result was further analyzed in a subsequent paper in which it was suggested that the causal variant within this region may be rs7337312 (position 13:41353297:G:A on reference genome GRCh37, forward strand) (Dibble J. et al. 2020). It is included in a regulatory region of gene SLC25A15, which encodes mitochondrial ornithine transporter I, and the potential biological effects of this variant have been discussed in the aforementioned study. I made an attempt at my own discussion of this variant (here).

It is not unusual for a GWAS study to associate a trait with a long sequence of variants belonging to the same chromosome and in a close physical relationship one with the other, due to high linkage disequilibrium often present between variants close to one another (see this blog post). Several methods have been developed in the last years to further refine the association and find the variant that causes the disease and drives the association of its nearby SNPs. These methods are based on both purely mathematical considerations and on the analysis of the known consequences of the variants, and we refer to them by the evocative name of fine-mapping (Schaid DJ et al. 2018).

I tried to develop my own method and software for a purely mathematical approach to fine-mapping and I applied it to the region associated with ME/CFS in females, and I found variant rs11147812 (13:41404706:T:C on h19) as the most likely causal one. The detailed discussion of the method I followed and the scripts I wrote are available here for download. It might be noted that this variant belongs to a genetic entity (pseudogene TPTE2P5) that harbors variants that have been strongly associated with the following hematic parameters (see GWAS catalog).

Diastolic blood pressure (reduction)Evangelou E et al. 2018
Blood volume occupied by platelets (increase)Astle WG et al. 2016
Basophil count (decrease)Masahiro K et al. 2018
Table. Significant associations between traits and variants of TPTE2P5.

The alternative allele of this variant has been reported to increase (effect size > 0) the expression of gene SLC25A15 in fat (see this page) with a p value of 2.85×10^(-20). Since the alternative variant is less frequent in patients than in controls, the prediction is that we have a reduced expression of SLC25A15 in fat, in female CFS patients.

Note. The header image shows the convergence of the linear method used by my script for fine-mapping the region on chromosome 13.

Asmara

Asmara

Non ho la mia età, lo sai. Ma non perché abbia vissuto meno degli anni sui documenti: la verità è che gli anni li ho tutti, tutti quelli della Storia.

Vestivo la musica con le ossa cave, nei Balcani: intagliavo le asole ai femori d’orso, cercando le note che vibrano dentro le note, d’istinto pesando le perle della serie trigonometrica di quel Jean Baptiste che mi sarebbe stato benevolo commilitone, sotto Napoleone, cinquecentocinquanta secoli dopo. Tutta la Terra era un bosco, che si diceva nero e infinito, come lo spazio per voi; i mari, galassie. Migravamo da costa a costa, inventando i monologhi dei miei flauti, e spesso era foresta per vite intere. Le colonne d’aria che avevo imparato ad ammaestrare coprivano i conati coraggiosi delle prossime puerpere e tacevano ai vagiti, piangevano quando nessuno danzava più, ci facevano incrociare in quella solitudine senza nomi, illuminavano la notte delle conifere come una torcia nell’universo. Cantavano la felicità che vi ha generati.

Ad Ur fui fabbro dei Caldei e poi, generazioni seguenti, e con un altro nome, sacerdote. Era una città piena di sole che abbacinava, dove le divinità ungulate emulavano i simulacri speculari, assorti nelle loro sfilate sul candore dei muri. Ho detto città, ma per noi allora quella parola significava solo l’ambizione ingenua di inquadrare l’universo in qualche decina di strade, templi, granai, e uffici. Da loro coltivai l’arte nuova della scrittura.

Cinquecento anni dopo, mezzo millennio di solite meraviglie e di dolori sempre diversi, fui in Egitto, a raccogliere esempi del teorema di Pitagora: gli egizi non concepivano dimostrazioni generali, Euclide doveva ancora nascere e io sarei stato poi illuminato e distrutto dall’ingegno superiore della sua vita breve. I figli di Rha erano collezionisti di casi particolari di ogni enunciato, e così, con un singolo teorema, si garantivano scoperte per l’eternità. Lì fui scriba e architetto, di nome in nome, poi scultore, con l’amico Thutmose.

Fu sotto lo sguardo indifferente e lontano del dio cane, mezzo uomo, che vidi Asmara la prima volta, tra tutta quella umanità che si spostava come la polvere che diventa stelle nel raggio di una finestra, come i girini che si rincorrono nel gioco infantile. Era un giorno di mercato, un mercato d’oriente, commercio dei colori delle cose inutili, così fondamentali per vivere, di gioie e drammi, di uomini che svendono la loro miseria, di esercizi di astuzia, di cose da grandi che volano sopra la testa dei piccoli tenuti per mano, di ammiccamenti che ti farebbero parlare a lungo, se dovessi spiegare, di falsi ingegnosi e originali. E quel giorno, che è un punto di una successione infinita di gesti distratti e prosaico quotidiano, mi è presente più degli altri, anche ora. Anche allora, quando modellai la carne, sul cuore di calcite sedimentata per tutta la mia vita, del volto di Nefertiti, Nefertiti fu Asmara che Asmara non era già più. Odiai quel busto e non lo finii, perché amai l’originale. Eppure, tremila anni dopo, percorsi l’Europa in fiamme, fino al centro dell’inferno, per salvarlo dalla distruzione di Berlino.

Gli dèi nascono e poi diventano paura in frantumi, dimenticati nella sabbia come Anubi, si avvicendano; ma se Pitagora offrì all’eternità l’incastro durevole della ipotenusa coi cateti, io porterò Asmara – morta e dispersa da millenni – riflessa come allora su queste iridi di conifere e terra, per sempre.

Anatomy of a conflict

Putin recognizes two new independent republics on the east border of Ukraine so that he can enter Ukraine, without formally entering it (February 23rd, 2022). In this satiric illustration, Putin is stretching Ukraine’s borders with its own hands, making room for two new independent republics which are created with the only aim of allowing the Russian army to enter Ukraine from the east side and buy some time for setting up the armaments for the big invasion.

Ratio between two Random Variables: the case of the Beta Distribution

Ratio between two Random Variables: the case of the Beta Distribution

We have here a paradox: what theory is it that is not correct unless it leaves open the possibility that it may be incorrect? Answer: the theory of probability.

P.W. Bridgman, The Nature of the Physical Theory

While reasoning on one of the biological problems I have been spending most of my time on, I encounterd a random variable that was the ratio between two random variables with a Beta distribution. Let’s say that X_1\,\sim\,B(\alpha_1,\beta_1) and X_2\,\sim\,B(\alpha_2,\beta_2). Then, if we define the random variable Y\,=\,\frac{X_1}{X_2}, what is its density? What about its cumulative distribution function? And what can we say about its moments? I answered all these questions in this paper. The density, in particualr, is given by:

On the exact same subject there is a 1999 paper I am currently comparing to mine (Pham-Gia 1999). While writing my own paper I did not check for other works on the same topic, but if I had had to bet I would have said that this calculation had been performed a century ago; so I was astonished when I discovered that this density was first published only 20 years ago! This is further proof of how recent the application of calculus to statistics is.

PS. I have just checked the density by Thu Pham-Gia: it is equivalent to the one I found, even though they appear different at a first glance. I have added a paragraph to my paper to show the equivalence between these two densities. It is also worth mentioning that in addition, I calculated the cumulative distribution function, the expectation, the second-order moment, and the variance, while Pham-Gia presented the density along with related subjects and applications, in his 1999 paper.

An estimation of the proportion of asymptomatic Omicron infection

An estimation of the proportion of asymptomatic Omicron infection

Abstract

In this document, I compare data from two COVID vaccine trials carried out in South Africa (the Ubuntu trial and the Sisonke sub-study). Both trials were conducted on unvaccinated subjects, asymptomatic at the time of enrollment. Among them, some were found positive to SARS-CoV-2, but while the variant was identified as Delta in the Sisonke sub-study, the Omicron variant was reported in asymptomatic subjects of the Ubuntu trial. By considering the proportion of SARS-CoV-2 positive subjects in these two populations, the average number of daily confirmed cases, and the percentage of unvaccinated South African citizens at the respective times of enrollment of the two trials, I calculated the proportion of asymptomatic infections for Omicron, in function of the same parameter for the Delta variant. I found that if we accept a proportion of asymptomatic infections of 17% for Delta, then we conclude that about 60% of all Omicron cases are asymptomatic, in unvaccinated subjects. This points towards significantly lower pathogenicity for the Omicron variant when compared to the Delta variant.

Introduction

The B.1.1.529 (Omicron) SARS-CoV-2 variant was first identified in Botswana and South Africa and was classified by WHO as a variant of concern (VOC) on 26 November 2021 (WHO website). As of 15 December 2021, the Omicron variant had already spread in 77 countries (especially in the United Kingdom, South Africa, and the United States) (Thakur V, Ratho RK, 2021). Omicron is characterized by a large number of mutations and deletions that seem to give it increased antibody escape capacity and enhanced transmissibility (Miller NL et al 2021). Yet, if we look at the number of deaths in South Africa (R), we note that despite a rapid increase in confirmed cases (most Omicron), it doesn’t follow the usual consequent increase (with a phase shift of two weeks). This suggests that Omicron is less dangerous than Delta, also considering that previous infection with Delta does not seem to protect against infection with Omicron (Garret N et al 2021) and that only 30% of the population of South Africa received at least one dose of vaccine (R).

In what follows I present a quantitative method for the comparison between the effect of Omicron and Delta on unvaccinated subjects. In particular, I calculate the ratio between asymptomatic SARS-CoV-2+ subjects and total SARS-CoV-2+ subjects for Omicron (r_{o}) in function of the same ratio for Delta (r_{\delta}). We consider only subjects who did not receive any dose of vaccination against SARS-CoV-2. By considering only unvaccinated subjects, we eliminate the confounding effect of vaccination on the pathogenicity of Omicron and Delta. The aforementioned ratio might be considered one of the possible indexes of disease severity, while we wait for further data, according to the following assumption: the higher the number of asymptomatic carriers of a virus divided by the total number of carriers, the lower the pathogenicity of the virus. The derivation of the function r_{o}\,=\,r_{o}(r_{\delta}) is described in the Methods section.

Results

The function r_{o}\,=\,r_{o}(r_{\delta}) is plotted in Figure 1. It indicates the proportion of Omicron asymptomatic infections in function of the same proportion for Delta, in unvaccinated subjects. As you can see, the Omicron variant leads to a higher proportion of asymptomatic cases when compared to the Delta one. If we assume for Delta a value r_{\delta}\,=\,0.17 (Byambasuren O. et al. 2020), we have that r_{o}\,\sim\,0.6. The diagram has been plotted for two different choices of the ratio k_{o}/k_{\delta} (the meaning of this ratio is explained in the following paragraph), but as you can see, it doesn’t change much whether you consider one value or the other.

Figure 1. The ratio of asymptomatic SARS-CoV-2+ subjects and total SARS-CoV-2+ subjects for the Omicron variant expressed in function of the same ratio for the Delta variant. All subjects are considered unvaccinated.

Methods

Let N_{pos} be the number of unvaccinated individuals positive to SARS-CoV-2 at a given time and r the ratio between the unvaccinated individuals who are positive but are asymptomatic and N_{pos}. Then, the total number of the asymptomatic and unvaccinated carriers of the virus is N_{pos} r. If N_{tot} is the total number of unvaccinated individuals without symptoms in a certain geographical region, then the ratio between asymptomatic and unvaccinated carriers of the virus and asymptomatic and unvaccinated individuals (no matter their positivity to the virus) is \frac{N_{pos}r}{N_{tot}}. We are interested in expressing r in the case of the Omicron variant (let’s say r_{o}) in function of the same parameter in the case of the Delta variant (let’s say r_{\delta}). The following table collects the symbols we have just introduced. In what follows we will add a pedix o for the Omicron variant and a pedix \delta when we refer to the Delta variant.

N_{pos}The number of carriers of the virus who did not receive the vaccine.
N_{tot}The number of asymptomatic individuals who did not receive the vaccine. Their positivity to the virus is not specified.
rThe ratio between the asymptomatic carriers of the virus who are unvaccinated and N_{pos}.
N_{pos}rThe number of asymptomatic carriers of the virus who did not receive the vaccine.

Between December 2 and December 17, 2021, a total of 330 asymptomatic subjects from South Africa were enrolled in a phase III clinical trial (Ubuntu trial, PACTR202105817814362) to assess the relative efficacy of the COVID-19 vaccine mRNA-1273 (MODERNA). This population is exclusively made up of individuals who had not been previously exposed to any COVID-19 vaccine. This population included persons living with HIV (PLWH) with a median age of 39 years (18-76). Among them, 230 individuals were tested for SARS-CoV-2 by RT-PCR and 31% (71 subjects) turned out to be positive. Among them, 56 samples were successfully subjected to further investigations: all had S gene dropout, suggestive of Omicron infection (Garret N et al 2021). It is worth mentioning that positivity to Omicron does not correlate with previous exposure to SARS-CoV-2, in this population: in other words, previous exposure to other SARS-CoV-2 variants does not protect against Omicron. Positivity to Omicron does not correlate with CD4+ cell count in PLWH either, so it seems that HIV does not interfere with r, the value we are interested in. If we now consider that the average daily number of confirmed new cases in South Africa for the interval December 2-17 is 18745, we can write

N_{pos-o}\,=\,18745K_{o}\,+\,r_{o}N_{pos-o}\,\Rightarrow\,N_{pos-o}\,=\,\frac{18745K_{o}}{1\,-\,r_{o}}

where K_{o} is a multiplicative constant that accounts for all the possible errors in the measure of the positive symptomatic cases and also for the fact that we are here considering only unvaccinated subjects, while data available for confirmed cases does not distinguish between vaccinated and unvaccinated. Moreover, the number refers to average daily new cases, while we are in fact considering the number of total positive cases, at that point in time; so K_{o} is also a way to get the latter from the former (assuming linear proportionality). Therefore, we can write

\frac{N_{pos}r_{o}}{N_{tot-o}}\,=\,0.31\,\Rightarrow\,\frac{r_{o}}{1\,-\,r_{o}}\frac{18745K_{o}}{N_{tot-o}}\,=\,0.31

Another study carried out in South Africa between June and August 2021 during the Delta outbreak (Sisonke sub-study, NCT04838795) reported a percentage of asymptomatic carriers of 2.4% (39/1604) among unvaccinated subjects (Garret N et al 2021). The average number of daily reported cases, in this case, is 12181; hence we have

\frac{r_{\delta}}{1\,-\,r_{\delta}}\frac{12181K_{\delta}}{N_{tot-\delta}}\,=\,0.024

These last two equations give

\frac{(1\,-\,r_{\delta})r_{o}}{(1\,-\,r_{o})r_{\delta}}\frac{18745}{12181}\frac{K_{o}N_{tot-\delta}}{K_{\delta}N_{tot-o}}\,=\,\frac{0.31}{0.024}\Rightarrow

\Rightarrow\frac{(1\,-\,r_{\delta})r_{o}}{(1\,-\,r_{o})r_{\delta}}1.54\frac{K_{o}N_{tot-\delta}}{K_{\delta}N_{tot-o}}\,=\,12.92\Rightarrow

\Rightarrow r_{o}\,=\,\frac{12.92}{1.54\frac{1\,-\,r_{\delta}}{r_{\delta}}\frac{K_{o}N_{tot-\delta}}{K_{\delta}N_{tot-o}}\,+\,12.92}

We are here considering unvaccinated subjects, but while in the period December 2-17 (Omicron outbreak) the percentage of subjects who received at least one dose of vaccine was on average 30.65%, in the time frame June-August (Delta outbreak), the same percentage was only 10.65% (remember that in both cases we consider subjects from South Africa) (R). Then we must assume that N_{tot-o}\,=\,0.69N_{tot} and N_{tot-\delta}\,=\,0.89N_{tot}, therefore we have

\frac{N_{tot-\delta}}{N_{tot-o}}\,=\,\frac{0.89N_{tot}}{0.69N_{tot}}\,=\,1.29

This means that r_{o} is given by

r_{o}\,=\,\frac{12.92}{1.99\frac{1\,-\,r_{\delta}}{r_{\delta}}\frac{K_{o}}{K_{\delta}}\,+\,12.92}

Note that we are here considering only unvaccinated subjects, while the number of reported cases includes both vaccinated and unvaccinated individuals. This means that K_{o} can’t have the same value of K_{\delta}. One possible choice is to assume

\frac{K_{o}}{K_{\delta}}\,=\,\frac{k\cdot k_{o}}{k\cdot k_{\delta}}\,=\,\frac{k_{o}}{k_{\delta}}\,=\,\frac{100-30.65}{100-10.65}\,=\,0.78

where we assumed that k is the error (underestimation) made in measuring positive cases, which is the same in both cases since it depends on the efficiency of the health care system, on the behavior of the population, and other factors that do not change; while k_{\delta} and k_{o} are directly proportional to the percentage of individuals who are unvaccinated and allow for the calculation of the number of positive individuals with no prior vaccination from the number of positive individuals (no matter the vaccination status). Another possible choice is to assume that the confirmed cases are mostly unvaccinated and in this case we would simply have

\frac{K_{o}}{K_{\delta}}\,=\,1

All that said, we can now express r_{o} in function of r_{\delta} and we get the plot in Figure 1. The script in Octave used to plot the figure is the one that follows. The number of confirmed cases has been retrieved from this website, in particular from this CSV file: (download).

% file name = omicron
% date of creation = 2/01/2022
clear all
close all 
% the array of r_delta
steps = 100;
r_delta(1) = 0;                                                            
r_delta(steps) = 1;
inc = ( r_delta(100) - r_delta(1) )/(steps-1)
for i = 2:steps-1
  r_delta(i) = r_delta(i-1) + inc;
endfor  
% the array of r_omicron
for i=1:steps
  ratio_r = (1 - r_delta(i))/r_delta(i);
  ratio_k = 0.78;
  r_omicron (i) = 12.92/((1.99*ratio_r*ratio_k) + 12.92);
  ratio_k = 1.;
  r_omicron2 (i) = 12.92/((1.99*ratio_r*ratio_k) + 12.92);
endfor
% plotting
plot (r_delta, r_omicron, '-k', "linewidth", 1)
hold on
plot (r_delta, r_omicron2, '--k', "linewidth", 1)
ylabel('r_{o} = (asymptomatic AND positive)/positive (Omicron)','fontsize',15); 
xlabel('r_{\delta} = (asymptomatic AND positive)/positive (Delta)','fontsize',15);
axis equal;    
axis([0,r_delta(steps),0,r_omicron2(steps)])                             
grid on
grid minor
legend ('k_{0}/k_{\delta} = 0.78', 'k_{0}/k_{\delta}  = 1', 'location', "northwest", 'fontsize', 15); 

Limitations

The main limitation of this method is the fact that the characteristics of the two populations considered (the one from the Ubuntu trial and the one from the Sisonke sub-study) were not considered, so we can’t say whether these two populations are comparable with respect to variables such as age, comorbidities, and previous SARS-CoV-2 infection. We don’t know how well these two populations represent the general population either.

Further comments and conclusions

From the present analysis, the Omicron variant shows a higher relative number of asymptomatic cases, when compared to the Delta variant, in unvaccinated subjects. This points towards lower pathogenicity for the new variant.

In Italy, as of 31 December 2021, the prevalence of Omicron was only 20% (R), so its effect on the number of deaths and hospital resource use has yet to be appreciated. At present, with Delta being the most prevalent variant in our country, the large majority of deaths and intensive care occupation is seen among unvaccinated subjects who are accepted in intensive care units 6.5 times more than vaccinated subjects (Figure 2) and die 5.2 times more frequently (Figure 3). If Omicron is really less dangerous than Delta for unvaccinated subjects, we will see a progressive convergence between the yellow curve and the blue one in both Figure 2 and Figure 3. The plots below will be updated as new data become available.

Figure 2. The number of subjects admitted to intensive care unit, in Italy, for 100,000 unvaccinated (yellow), vaccinated (blue), boosted (green) subjects. You can select a specific age range from the bar at the top of the plot.


The equations of this blog post were written using \LaTeX (see this article)

International Day of People with Disabilities

International Day of People with Disabilities

In celebration of the International Day of People with Disabilities, I would like to mention one of the most iconic movie characters with a disability: Darth Vader. With both his legs and both his arms amputated, he can move only thanks to very sophisticated prosthetic limbs. He also requires constant medical care and would not survive long without his highly technological suit. As Obi-Wan once said about him, he is “more a machine now than a man, twisted and evil.”

The character of Bane, the main villain in Nolan’s “The Dark Knight Rises”, follows pretty much this same paradigm: he also has a tragic past, he underwent mutilations like Anakin Skywalker, and just like him, he hides them behind a mask, that has also the function of keeping “the pain at bay” (this expression, pronounced by a secondary character in the movie, recalls the name with a reversed assonance: pain at bay, Bane; nomen omen). But while Darth Vader is a representation of totalitarianism (note how he reproduces Mussolini’s gestures in his fists on hips pose, by the way), Bane is more an incarnation of modern terrorism and offers a quite thoughtful insight into the genesis of it. Interestingly, both these epiphanies of evil (“necessary evil”, Bane explains and Darth Vader would probably agree) are very menacing and powerful, despite their physical limitations: the source of their superhuman strength seems to be their monumental rage, continuously nourished by jealousy and by the pain that curses both their consumed souls and their mutilated bodies. It is this hopeless grudge the mysterious engine that makes them more than just ordinary men, it is thanks to it that they can overcome disability. But there is a price to pay, these movies seem to prove: you can feed yourself on this limitless energy only if you turn it into destruction.

The character of Alex Murphy in the 1987 movie Robocop seems also pertinent in this context. He has in common with Darth Vader and Bane the search for vengeance, and that impossible anger that sits on the grave of his grief; but he is the good guy, the hero. Another important difference is that Murphy not only suffered extremely bad physical injuries (he is resuscitated by prosthetics of the whole body: only his brain and some other tissues have been spared by the men who tried to kill him); he also had brain damage, a kind of very pervasive post-traumatic brain injury exacerbated by the very same procedure used to bring him back to life by integrating his nervous system with mechatronic technology. So, Murphy is in a constant struggle for regaining some of the humanity of his previous life. In that sense, he is the most miserable and suffering among these three examples of cinematic disabilities. As Edward Neumeier (the screenwriter) somewhere said: “He [Alex Murphy] will never go back, he is always going to be something different: he is neither a man nor a machine, he is something different; he’s his own creature, maybe”.

Disability does not necessarily make you a better person. On the contrary. When you meet a person with an important chronic health issue that precludes a normal life, consider that he might be consumed by anger and by hopelessly destructive sorrow. Especially, if not exclusively, when this mutilation has occurred at an early stage of his life.

Gallery

Gallery

A collection of some of my artworks, made during these last 20 years. I would like to write some notes on each one of them; I might do it in the future.

La búsqueda de Borges

La búsqueda de Borges

“A las dificultades intrínsecas debemos que añadir que Averroes, ignorante del siríaco y del griego, trabajaba sobre la traducción de una traducción.”

Jorge Louis Borges, La busca de Averroes

Tengo insomnio en este periodo: la noche no puedo dormir, si no hasta que la noche misma no está lista para el sueño. Y no es claro si me quedo despierto porque quiero leer o si leo para que el tiempo no me hable de cosas que no tengo ganas de escuchar: tengo la edad en la cual un hombre puede imaginar con desdén y realidad el mundo que sigue su aventura sin su presencia.

En la noche de ayer, en las últimas páginas del cuento “El Zahir”, de Jorge Luis Borges, me encontré con una frase que ya había sentido en otro lugar, en otro idioma. Estas son las palabras:

“… no hay hecho, por humilde que sea, que no implique la historia universal y su infinita concatenación de efectos y causas”

Yo creo que la fuente de inspiración para este cuento fue Edgar Allan Poe: el tema de la muerte de una mujer muy linda (“aunque no todas las efigies apoyaran incondicionadamente esa hipótesis”) que “buscaba lo absoluto, como Flaubert”; el entretenerse en la descripción de la corrupción de la cara de la muerta en el velorio, y la locura de un detalle (el Zahir del título) que absorberá toda entera la vida mental de el protagonista (Borges mismo), recuerdan muy de cerca al cuento “Berenice”, de Poe. Pero la frase no la leí en Poe, escritor que yo frecuentaba hace mucho tiempo, cuando leer de la muerte en los cuentos de él fue leer sobre un hecho teórico con que nunca me había encontrado todavía, sobre una hipótesis que olía a papel de una edición económica adornada por las ilustraciones de Alberto Martini.

“El Zahir” fue publicado la primera vez el julio de 1947, en la revista Los Anales de Buenos Aires. En el marzo del mismo año, Cesare Pavese termina la redacción de su último libro, “Dialoghi con Leucò”. El volumen fue publicado en el 1947 por Einaudi. Los diálogos reunidos en este libro todos están inspirados en el mito griego, pero no es un mero catálogo mitológico, como podría ser el Fabularum liber de Julius Hyginus: Teseo, Ulises, Diana son solo un pretexto, un catalizador de lo absoluto, que como tal no es ligado a ninguna época y encuentra expresión privilegiada en historias que nunca han sucedido y que siempre han ocurrido en la fantasía de todos los seres humanos, desde los padres de Homero hasta nosotros. En el penúltimo diálogo, Mnemosine le dice a Hesíodo:

“… non avete un istante, nemmeno il più futile, che non sgorghi dal silenzio delle origini”

Yo conjeturo que Borges pude procurarse el volumen de Pavese, el verano de 1947, ante de publicar su relato, e abbia voluto usare quella medesima frase al posto di un suo periodo che esprimeva la stessa idea, ma con minore economia di parole. Non posso provarlo, certo, e non so neanche se davvero i Dialoghi siano stati pubblicati prima del luglio del 1947. Tuttavia penso che Borges fosse in grado di leggere l’italiano, anche quello di Dante; a maggior ragione quello di Pavese. L’italiano di Pavese in fondo potrebbe essere proprio quello di uno straniero che si avventura nella scrittura di una lingua nuova: egli userà parole semplici, comuni, ma le disporrà in un ordine inconsueto per i madrelingua. La ricerca di Cesare Pavese lo portò con sicurezza a calcare i passi incerti del forestiero che inizia a usare l’italiano.

Io stesso ho meditato su quella citazione di Pavese, da quando la incontrai nella estate del 2020, perché mi ero imbattuto nella medesima idea (le idee si scoprono, almeno quanto non si inventino)(*), mentre scrivevo un racconto, nel 2006; e il mio di periodo è presente in questo sito, ma non lo scriverò qui, vicino alla soluzione di Pavese, perché non voglio essere io a rappresentare la dimostrazione pratica della differenza tra uno scrittore e un inutile grafomane; non oggi almeno: queste giornate hanno già le loro frustrazioni.

Ma Borges modifica l’originale, prolunga la ascissa del tempo, dalla genesi al futuro, passando per il quotidiano in cui Pavese si arresta. E sarebbe facile dire che nella prospettiva di Pavese, nel 1947, il futuro forse non esisteva già più e che è in questa frase che, grazie alla aggiunta complementare di Borges che denuncia l’assenza, ci svela, senza saperlo lui stesso, che sceglierà di lì a poco la scelta che fu di un personaggio del suo romanzo “Tra donne sole”: di chiudere la sua realtà soggettiva in una scatola di legno; di scendere dal mondo che continuerà la sua avventura senza di lui.

C’è un’altra differenza: Borges traduce Pavese in un linguaggio da fisico-matematico, invocando con un istinto generoso, la teoria del caos deterministico, ovvero la scienza dello studio asintotico delle soluzioni dei sistemi di equazioni differenziali non lineari, soluzioni le quali sono affette da ogni gesto, persino il più umile e futile (vedi ad esempio questo riferimento). Per altro Borges sembra estendere questi concetti, alludendo alle equazioni differenziali in modo forse ancora più esplicito, in un altro racconto (**), quello in cui il sacerdote Tzinacán realizza che:

“… en el lenguaje de un dios toda palabra enunciaría esa infinita concatenación de los hechos, y no de un modo implícito, sino explicito, y no de un modo progresivo, sino inmediato”

Credo che questa sia la migliore descrizione del teorema di esistenza e unicità di Cauchy che possa essere data senza usare nessun linguaggio matematico. Proprio questo talento scientifico, mai esplicitamente evocato, la caccia al tesoro alle invocazoni matematiche, credo sia una delle attrattive di Borges.

Sin embargo, otra posibilidad es que los dos encontraron una frase parecida in una tercera fuente (una obra che quizá, un día, yo también podría encontrar), y los ha seducido, como ha sido para el que escribe estas líneas, porque son las palabras entre cuyos espacios vacíos cabe una verdad providencial, descubierta quizás hace años.

¿Por qué escribí esto? Porque ahora no puedo hacer mi trabajo, porque he aprendido (quizás, no sé) a hacer saltos mortales para llenar mis días, “urgido por la fatalidad de hacer algo, de poblar de algùn modo el tiempo” (**). Porque un día sin matemáticas, sin dibujar o sin escribir es otra página de un diario sin recuerdos.


* Credo che si possa dire altrettanto per la matematica e per l’arte. In fondo entrambe sono sinonimi di idee. Lo scrive, nel 1940, G.H. Hardy nella sua famosa apologia: “I believe that mathematical reality lies outside us, that our function is to discover or observe it, and that the theorems which we prove, and which we describe grandiloquently as our ‘creations’ are simply our notes of our observations”. Analogamente P.W. Bridgman nel suo volumetto del 1936 (The Nature of Physical Theory): “Mathematics thus appears to be ultimately just as truly an empirical science as physics or chemistry, and the feeling that it is something essentially different arises only when we do not carry our analysis far enough”. Nell’arte, Irving Stone fa dire al suo Michelangelo: “The sculptor on the contrary had to see within the marble the form that it held”. Le idee perciò, che siano teoremi o forme (cioè geometria, quindi matematica), esistevano già prima di essere inventate. Dove si trovano? Non necessariamente nell’universo conosciuto. In questo le idee artistiche e matematiche si distinguono da quelle scientifiche: esistono nell’universo di tutti gli universi possibili.

** “La escritura del dios”, Borges

P-value threshold for genetic studies

P-value threshold for genetic studies

“This it seems to me is characteristic of most judgments involving physical processes — the law of the excluded middle is not a valid description of our actual physical experience — there has to be a third category of doubtful, in addition to positive or negative.”

P. W. Bridgman, The Nature of Physical Theory

1) The multiple testing issue

With the emergence of titanic biological datasets generated by the latest omics technologies, the problem of correction of the significance threshold (the conventional \alpha = 0.05) for multiple testing has required a careful revision and update. The simplest way to correct for multiple testing is the Bonferroni correction (a derivation is provided in par. 5): in this method, the threshold for statistical significance is given by

1)\;\;p\,<\,\frac{\alpha}{m}

where the denominator represents the number of independent variables. Now, in the case of genetic testing, how should we set the value of m? Let’s say that we are comparing two populations of individuals (one with a disease, the other made up by healthy subjects) and we have genotyped N variants in each one of them. Once we have associated a p value to every single variant, we must face the problem of correcting it for multiple testing by using Eq. 1. What should we do? Well, it is simple, we put m\,=\,N, right? Wrong!

2) Linkage disequilibrium

In general, variants belonging to the same chromosome are not inherited in an independent fashion: we know that whole chunks of chromosomes of about 50 kb pass from one generation to the other. These blocks are called haplotypes and there has been a considerable effort in classifying them for various human populations in the last 20 years. A map of human haplotypes is now available (HapMap) and it allows for the calculation of the real number of independent variants (often called tag SNPs) on the human Genome, and for other pivotal purposes I won’t write here about, like for instance imputation (a technique that dramatically reduces the number of variants we must genotype in order to scan the genome of a human being) (The International HapMap Consortium 2005), (The International HapMap Consortium, 2009). Visit the page of the International HapMap Consortium here. For us, what is important here is the fact that the knowledge of the haplotypes gives us a measure of the linkage disequilibrium between any possible pair of variants on the same chromosome. But what is linkage disequilibrium? Let’s say that we have two loci L_1 and L_2 on the same chromosome. L_1 has the two alleles A and a, while L_2 presents the two alleles B and b of frequencies f(A),\,f(a),\,f(B),\,f(b), respectively. Let’s define now the random variable X whose value is one if L_1 is occupied by A, zero otherwise; similarly, let’s define the random variable Y such that it is one if L_2 is occupied by B, zero otherwise. We can say that both these variables have a Bernoulli distribution with expectations given by E[X]\,=\,f(A) and E[Y]\,=\,f(B), respectively. If we now define f(A,\,B) the frequency of the haplotype AB, and if we similarly define f(A,\,b),\,f(a,\,b),\,f(a,B), we can then write the following relationships:

2)\;\;\begin{cases} f(Ab)\,=\,f(A)\,-\,f(aB)\\f(aB)\,=\,f(B)\,-\,f(AB)\\f(ab)\,=\,1\,-\,f(B)\,-\,f(A)\,+\,f(AB)\\f(AB)\,+\,f(Ab)\,+\,f(aB)\,+\,f(ab)\,=\,1\\f(A)\,+\,f(a)\,=\,f(B)\,+\,f(b)\,=\,1 \end{cases}\

where we have considered that f(A,\,B)\,=\,E[XY], f(A,\,b)\,=\,E[X(1-Y)]\,= E[X]\,-\,E[XY]\,.... That said, the covariance between X,\,Y is given by:

3)\;\; Cov[X,\,Y]\,=\,E[X\,Y]\,-\,E[X]E[Y]\,=

\,=\, f(AB)\,-\,f(A)f(B)\,=\,f(Ab)\,-\,f(A)f(b)\,=

=\,f(aB)\,-\,f(a)f(B)\,=\,f(ab)\,-\,f(a)f(b)

Or also (with some algebraic manipulation, by substituting the relationship of Eq. 2):

4)\;\;Cov[X,\,Y]\,=\,f(AB)f(ab)\,-\,f(Ab)f(aB)

That said, one definition of Linkage disequilibrium is just

5)\;\;d\,=\,Cov[X,\,Y]

Figure 1. Left: the values admitted for d are those that fall within the shell; in particular, for f(aB) = f(Ab) = 0 (coupling) we have that d assumes the values of the curve coloured blue, while for f(AB) = f(ab) = 0 (repulsion) then d describes the curve in orange. These two curves are plotted in Figure 2, too.

This definition presents a problem, namely the fact that when we have a complete dependence between L_1 and L_2, its value varies in ]0,0.25] (see Figure 1, Figure 2, and paragraph 6 for a detailed discussion) and this does not seem to have a particular biological meaning. It would be more interesting to have a definition such that Linkage disequilibrium assumes a unique value (for example 1) when we have complete dependence between the two loci. Note that we can accomplish this by considering the correlation coefficient between X\,=\,Y instead, which is given by:

6)\;\;\rho_{XY}\,=\,\frac{Cov[X,\,Y]}{\sqrt {Var(X)Var(Y)}}\,=\frac{d}{\sqrt {Var(X)Var(Y)}},

Thus, in the present article we consider the following definition for linkage disequilibrium:

7)\;\;linkage\,disequilibrium\,=\,\rho^2\,=\,\frac{d^2}{f(A)f(a)f(B)f(b)}

where we have considered that Var[X]\,=\,E[X](1-E[X])\,=\,f(A)f(a) and Var[Y]\,=\,E[Y](1-E[Y])\,=\,f(B)f(b). For further details on this topic, go to paragraph 6. See also (Reginald D Smith 2020).

Figure 2. For f(aB) = f(Ab) = 0 (coupling) we have that d assumes the values of the curve coloured blue, while for f(AB) = f(ab) = 0 (repulsion) then d describes the curve in orange.

3) The search for the number of independent loci

Consider now that the number of loci in linkage equilibrium one with the other is the value m in Eq. 1, the one we are interested in. From an operational point of view, how can we determine if two loci are in linkage equilibrium so that they can be considered independent one from the other? We can randomly build the genome of 2,000 human beings with the only restrain that haplotypes must be respected (this is where the HapMap comes into play), we then randomly assign them to two populations and we calculate for each locus the p value for the comparison of these two populations. Let’s say that locus L_i has a p value p_i. Since these two groups are identical by construction, it must be p\,>\,\frac{\alpha}{m} or, in other words:

8)\;\;m\,>\,\frac{\alpha}{p_i}

If we repeat this inequality for each locus, we have a lower bound for m. That number is the number of independent loci we were searching for (Pe’er, I. et al. 2008).

4) Experimental results

I discuss here the calculation of m performed in a recent study, the best I have been able to find on this topic (Fadista, J. et al. 2016). The strength of this study is that they calculated m for two kinds of sequencing techniques: whole-genome sequencing (WGS) and whole-exome sequencing (WES). Not only that, but they also provided the results for different European and non-European populations, for different definitions of linkage disequilibrium, and for different thresholds for minor allele frequency (MAF). These results are plotted in their paper and they are also available in xlsx format (here, even though I must say that they omitted part of raw data on different ethnicities). Nevertheless, I have elaborated them in my own way (the script that draws the diagrams is available below) by using a different arrangement of the data and by providing a fancy polynomial interpolation, by cubic splines, as to have smooth and elegant surfaces and curves. In Figure 3 I plotted m in function of the threshold for the minor allele frequency (MAF) and for the linkage disequilibrium, in the case of whole-genome sequencing (WGS). In that plot, it is assumed that MAF is the lower bound for the minor allele frequency while the axis ‘Linkage disequilibrium’ is the upper limit for the value \rho^2 in Eq. 7. Figure 4 is exactly the same diagram, but in this case, it has been built by considering data from whole-exome sequencing (WES). Some observations: the number of independent variants is bigger in WGS than it is in WES (which is due to the fact that in the former case we are considering more genetic material than in the latter one); when we reduce MAF, m increases (reducing MAF means considering more variants, so again, this leads to a high number of independent variants); m decreases with Linkage disequilibrium (the threshold for considering two variants as independent one from the other). This last point deserves a further explanation: in this context, when we say that LD is, let’s say, 0.8 it means that we consider two variants to be independent when their linkage disequilibrium (calculated as in Eq. 7) is below 0.8; as we reduce this value, we define a more stringent criterion for independence between pair of variants, thus the number m must decrease.

Figure 3. The number of independent variants in whole-genome sequencing, in function of the lower limit for minor allele frequency and of the upper limit for Linkage disequilibrium.
Figure 4. The number of independent variants in whole-exome sequencing, in function of the lower limit for minor allele frequency and of the upper limit for Linkage disequilibrium.
MAFmplog_{10}\frac{1}{p}
\geq\,0.05 782,3616.39\cdot10^{-8} 7.19
\geq\,0.01 1,580,6503.16\cdot10^{-8} 7.50
\geq\,0.005 2,093,8872.39\cdot10^{-8} 7.62
\geq\,0.001 4,126,1251.21\cdot10^{-8} 7.92
Table 1. This table gives the values m, p in function of the lower limit for minor allele frequency, in the case of whole-genome sequencing. It is assumed that \rho^2\,\leq\,0.8 .

If we consider the upper limit for \rho^2 to be 0.8, we obtain Table 1 and Table 2, where both m and corresponding Bonferroni corrections are collected, for various MAF thresholds.

MAFmplog_{10}\frac{1}{p}
\geq\,0.05 39,3831.27\cdot10^{-6} 5.90
\geq\,0.01 69,2657.22\cdot10^{-7} 6.14
\geq\,0.005 86,9275.75\cdot10^{-7} 6.24
\geq\,0.001 161,8333.09\cdot10^{-7} 6.51
Table 2. This table gives the values m, p in function of the lower limit for minor allele frequency, in the case of whole-exome sequencing. It is assumed that \rho^2\,\leq\,0.8 .

In Figures 5, 6 you can find these same data after interpolation, arranged as a curve where p is expressed in function of m and MAF.

Figure 5. The corrected threshold for statistical significance in function of the number of independent variants and of the lower limit for minor allele frequency, for WGS.
Figure 6. The corrected threshold for statistical significance in function of the number of independent variants and of the lower limit for minor allele frequency, for WES.

5) How to derive the Bonferroni correction

If we have a threshold \alpha' for statistical significance and we perform m tests, then the probability that one test is positive by chance is P\,<\,\alpha' and the probability that one test is negative by chance is 1\,-\,P\,>\,1\,-\,\alpha'. This also means that the probability that all the m tests are negative is (1\,-\,P)^m\,>\,(1\,-\,\alpha')^m. In conclusion, the probability that at least one test is positive pureley by chance is

9)\;\;p\,=\,1\,-\,(1\,-\,P)^m\,<\,1\,-\,(1\,-\,\alpha')^m

We want this probability to be lower than \alpha\,=\,0.05 , which leads to the following inequality:

10)\;\;\alpha'\,<\,1\,-\,\sqrt[m]{1\,-\,\alpha}

This value \alpha' is the corrected threshold for statistical significance, accorrding to the Bonferroni correction. But this is not the correction in Eq. 1, you might note. Yes, that is right. But let’s consider the well known expansion by series:

11)\;\;(1\,-\,\alpha)^{\frac{1}{m}}\,=\,\sum_{k\,=\,0}^{\infty}(-\,\alpha)^k\begin{pmatrix} \frac{1}{m}\\k \end{pmatrix}

If we take only the first two addends of the sum and if we substitute them in the expression already found for \alpha', we obtain Eq. 1. The error (its absolute value) that we make when we consider Eq. 1 instead of Eq. 8 is the one plotted in Figure 7, in function of m, for \alpha\,=\,0.05 .

Figure 7. The absolute value of 1\,-\,\sqrt[m]{1\,-\,\alpha}\,-\,\frac{\alpha}{m} in function of m for \alpha\,=\,0.05

6) Mathematical notes on linkage disequilibrium

By substituting the fourth relationship of Eq. 2 in Eq. 4, we obtain the surface

f(ab)\,-\,f(ab)f(Ab)\,-\,f(ab)f(aB)\,-

-\,f^2(ab)\,-\,f(Ab)f(aB)\,-\,d\,=\,0

which is a quadratic form in the space of the frequencies of the haplotypes ab, Ab, aB. More precisely, and even though this is of no importance for us, this is a hyperbolic cylinder (as the reader can easily verify by studying its associated matrix). If we now ask that f(Ab)\,\in\,[0,1] and f(AB)\,\in\,[0,1] for f(aB)\,\in\,[0,1], f(ab)\,\in\,[0,1], we find the relationships

\begin{cases} d\,\geq\,-f(aB)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\leq\,f(ab)\,+\,f(aB)f(ab)\,+\,f(aB)^2\\d\,\geq\,-\,f(aB)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\d\,\leq\,f(ab)\,-\,f(aB)f(ab)\,-\,f(ab)^2\\f(aB)\,\in\,[0,1],\,f(ab)\,\in\,[0,1]\end{cases}

Taken togheter, these inequations say that d can assume only the values included within the elegant surface in Figure 1. In particular, when f(Ab)\,=\,f(aB)\,=\,0 (a condition of perfect disequilibrium which goes under the name of coupling), then d assumes the values of the curve coloured blue, while when f(AB)\,=\,f(ab)\,=\,0 (repulsion) then d assumes the value on the curve coloured orange. These two courves are plotted in Figure 2 too. Note that f(Ab)\,=\,0\,\Rightarrow\,f(aB)\,=\,0; moreover, f(AB)\,=\,0\,\Rightarrow\,f(ab)\,=\,0 (this can be proved easily by using Eq. 2).

Note that the application of the method of the Lagrange multipliers for restrained optimization to Eq. 4, with the restrains defined by the fourth relationship of Eq. 2, would not lead to any conclusion in this case, because it is equal to the search for the extremes of a function whose gradient is never zero.

7) Supplementary Material

The script that plots Figures 1 and 2 can be downloaded here.

The script that plots Figures 3, 4, and 5 can be downloaded here.

The script that plots Figure 7 can be downloaded here.

Data from (Fadista, J. et al. 2016) can be downloaded here.


The equations of this blog post were written using \LaTeX (see this article)

La cura

La cura

“Nada puede occurir una sola vez, nada es preciosamente precario. Lo elegiaco, lo grave, lo ceremonial, no rigen para los Inmortales”

Jorge Luis Borges, El inmortal

Ho dovuto cercare l’esemplare. L’Europa non sostenne mai il peso dei brachiosauri, non poté nutrire quegli oceani silenziosi di platani necessari ad alimentare la passione prolifica dei plateosauri, né solitudini estese a sufficienza per dare respiro alle migrazioni scomposte e festose dei diplodochi. Per questo ho fatto sopralluoghi a nord del Nord America, prima, e mi sono spinto poi, qualche mese dopo, sui tavolati della Patagonia.

Cosa dice il manuale di farmacologia? “L’esploratore sfogli i sedimenti del tardo Mesozoico, a passi lenti ma regolari, come l’ibis che spera assorto nel riflesso musivo delle scaglie dorate, oltre la pellicola liquida sul Nilo. Come una lancetta il malcapitato passi in rassegna, di eone in eone, il Cretaceo pietrificato, l’album di famiglia della Terra, scrutando tra le pagine le pietre che trattengono l’eco della anatomia superba del sauropode.”

“Il sauropode abbia conosciuto la solitudine del pascolo, abbia vissuto la lussuria di una stagione lussureggiante di amori, conosca la sazietà e il dolore. Abbia ottanta milioni di anni.”

–Ho tanti milioni di anni!– così mi sembrò di sentire una voce, a metà febbraio, quando la mattina cerca il colorito della salute e i sogni ancora stentano ad assopirsi. Ma c’ero solo io che insultavo il freddo, in Patagonia, sulle tracce di Darwin. –Scaldami le ossa ancora una volta, rimettimi nella corrente del tempo che consuma la vita ad ogni giro della vite che gira intorno al Sole, anche solo per un’ora! Senza la minaccia senza assoluzione o amnistia della morte, esistere non ha senso. L’eternità, esauriti i gradienti che nutrono e accrescono la fame della entropia vorace, è una legge ingiusta–.

Un tronco mi parve all’inizio, abbattuto, screpolato, mezzo sommerso. Ma poi riconobbi un condilo e più in là una cresta iliaca. Stavo camminando sulle rovine di una cattedrale, un argentinosauro, e non me ne ero accorto!

In breve, questa è la storia. Poi come si prepara il fossile lo sai: lo macini, aggiungi cannella e brodo granulare, rabbocchi la ciotola di acqua, ne fai solleticare le molecole dalle microonde finché non prenda il colore dell’ocra. E speri che funzioni.

Hai visto mai i bimbi sorpresi dalla morte? Sono improbabili, tra le palpebre socchiuse guardano una cosa che non possono capire, come uno schiaffo violento della mamma, dalla quale hanno conosciuto solo abbracci. Rimangono con l’esclamazione di sorpresa muta del tulipano. Così è stato per me, morto quando dovevo iniziare.

Se guarissi, non mi fermerebbe nessuno, comincerei a correre senza cronometro, come Forrest Gump. Sono una persona avventurosa, in fondo; avventura fisica intendo, esplorazione. In Abruzzo non avevo otto anni che partii per la prima grande spedizione in solitaria. Un ramarro mi tagliò la strada, segnando il confine di Eracle, ma non mi lasciai spaventare dal profilo dei monti che cambiava e dal paesino che diventava più piccolo.

Un avventuriero che ha vissuto la sua vita in una stanza, senza più cervello. Leopardi in una sua lettera all’amico Ranieri diceva che la sua pazzia sarebbe stata quella di rimanere su una sedia, immobile. Io non capivo, a 18 anni, di cosa parlasse.

Poco dopo l’ho scoperto. E da allora cerco la cura.

The lost illustration

The lost illustration

Philosophy is written in a great book that is continually open before our eyes (that is the Universe), but it cannot be understood unless one first learns to decipher the language and the characters in which it is written. It is written in a mathematical language, and the characters are triangles, circles, and other geometric figures, without which it is impossible for us to understand a word of it; without them it is a vain wandering through a dark labyrinth.

Galileo Galilei, The Assayer

The geometry of the backbone (or main chain) of a peptide is completely described by a sequence of dihedral angles (also known as torsion angles): the angle Φ is the angle around the chemical bond between the alpha carbon of one amino acid and the azote of the same amino acid; the angle Ψ is the angle around the axis of the bond between the alpha carbon and the other carbon (I call it beta carbon, but this is a personal notation) of the same amino acid. This definition is incomplete, of course, because as we all know, a dihedral angle is defined not only by an axis but also by two planes that intersect in that axis. Not only that, these two angles have a sign, so we must specify a positive arrow of rotation. This can be done in several ways. The problem is that the most efficient way would be by just drawing the planes the torsion angles are defined by. Now, through the years I discovered that this drawing is usually lacking in books, leading to some confusion, particularly in those who study biochemistry without a particular interest in a quantitative description of molecular structures. In Figure 1 you find the graphical description of Φ and Ψ in a well-know book of biochemistry. Do you see the planes the angles are defined by? In Figure 2, two further examples of illustrations that are not suited at completely describing the two dihedral angles, from a two other well-know books, one about bioinformatics, the other one a booklet about protein structures. The second one is the best one, but you still have to mentally merge two drawings (FIG. 5 and FIG. 6) to get the full picture. I hope I won’t be sued…

Figure 1. Dihedral angles, from Principles of Biochemistry, by Voet D, Voet J, and Charlotte WP.

Figure 2. Two illustrations for the description of the torsion angles. The one on the left is from Bioinformatics, Sequence and Genome Analysis, by David W Mount. On the right: The Anatomy and Taxonomy of Protein Structures, by Jane S Richardson.

Now, it is possible that I have just been unlucky with the books of my personal library. Or it may be that the illustration with the three planes we need to correctly define Φ and Ψ (we just need to add a plane to the amide planes), would not be clearly readable. I tried years ago to draw the planes for Ψ (the illustration is in this blog post) and I have now completed it with the other torsion angle, by drawing a tripeptide (Figure 3). It is just a handmade drawing, but I think it serves the scope. This is what I call the lost illustration.

Figure 3. A peptide of three amino acids. The amide planes are shaded grey. A further plane has been added to fully characterize the torsion angles Φ and Ψ. Pencil and pen on paper, by Paolo Maccallini. For the signs of the two angles, a clockwise rotation is considered positive. So, in this case, we have Φ of about -90° and Ψ of about 150°.

A second glance at the UK Biobank Database

FeaturedA second glance at the UK Biobank Database

Note. This blog post should be read on a laptop. With smaller screens, some of the tables cannot be correctly formatted.

Introduction: what has been found

The UK Biobank is a large-scale biomedical database that includes genetic data on 500,000 UK citizens (Sudlow, C et al. 2015). Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) is among the phenotypes reported by participants. The total number of ME/CFS patients included is 1208 females and 451 males, with control groups of more than 150 k individuals each. For each individual 805,426 variants (both SNPs and short InDels) were genotyped. From them, up to 30 million variant can be imputed using haplotype maps. This is the biggest GWAS study on ME/CFS patients ever made, so far. Many research groups have analyzed this huge quantity of data and the conclusion, according to (Dibble J. et al. 2020), is that the only variants to be significantly different between patients and controls are included in gene SLC25A15, which encodes mitochondrial ornithine transporter I, particularly variant rs7337312 (position13:41353297:G:A on reference genome GRCh37, forward strand). But this difference is statistically significant only when we consider female patients vs female controls (Table 1). For a detailed discussion of this result, the best reference is the already mentioned (Dibble J. et al. 2020). I wrote an analysis myself on this same result and on its possible implications, available here. My focus, while writing that document, was also understanding the architecture of the metadata generated by using the UK Biobank, by Neale’s Lab (R), one of the research groups that performed the statistical analyses. It was not obvious, at the beginning, how I could derive for each variant genotyped a table like Table 1. In fact, it is the result of the elaboration of five files, four of them of considerable size. Once I realized how to read and manipulate these files, I manually derived the data I was interested in, at first. Then I decide to write a code that could do the work in an automated fashion (see below).

Table 1. Frequencies and allele counts for variant rs7337312 (gene SLC25A15 ) among 1208 CFS female patients, 451 CFS male patients, 192945 female controls, 166537 male controls, total UK Biobank sample, and general population (global genome AD). P values for patients vs controls have been calculated for both sexes. I derived this table from metadata by Neale’s Lab (R).

What has not been found

Now, even if we are certainly disappointed by the fact that the positive result of these analyses does not add much to what we already knew about ME/CFS, it is important to stress that negative results are also relevant. Why? Because we can rule out the huge amount of genetic associations made through the years between fatigue in general, and CFS in particular, and genetic variants (see for instance the encyclopedic review Whang T. et al. 2017). As an example, I will focus on the variants of TNF alpha gene collected in Table 2. In 2006, an Italian group found that the minor allele of variant rs1799724 (TNF-857) was significantly more frequent in 54 ME/CFS patients than in 224 controls ( Carlo-Stella N. et al. 2006). In 2020, a group from Germany reported a statistically significant difference for the same variant, but only when ME/CFS patients without infectious triggered onset were considered (Steiner S. et al. 2020). Two other variants of gene TNF alpha were associated with morning fatigue among oncology outpatients with breast, prostate, lung, or brain cancer ( Dhruva A. et al. 2015). Are these variants different between the 1659 ME/CFS patients and the about 300 k controls in the UK Biobank database?

VariantPosition on GRCh37 (fs) Disease p val Reference
rs1799724 ch6:31,542,482 ME/CFS
ME/CFS w/o ITO
0.0028
0.0430
Carlo-Stella N. et al. 2006
Steiner S. et al. 2020
rs1800629
rs3093662
ch6:31,543,031
ch6:31,544,189
Morning fatigue in oncology outpatients
with breast, prostate, lung, or brain cancer
0.0250
0.0040
Dhruva A. et al. 2015
Table 2. Variants on gene TNF associated with ME/CFS, with ME/CFS without Infectious Triggered Onset (ITO), and with morning fatigue in oncology outpatients with breast, prostate, lung, or brain cancer.

To answer this question I have selected a suitable region of chromosome 6 (ch6:31,539,768 – ch6:31,546,495) that includes not only all the variants in Table 2, but the whole TNF gene. I then analyzed this region with a code I wrote in GNU Octave, version 6.3.0. This code generates three .csv files (one for each of the following groups: females only, males only, both sexes) with information on each one of the variants genotyped within the region mentioned (see Supplementary Material). An extract from these files, with only the variants collected in Table 2, can be seen in Table 3, Table 4, and Table 5. None of these variants reaches the statistically significant threshold, which for this kind of studies is considered to be p<5×10^(-8), for variants with a minor allele frequency above 5% (even lower for less common variants) (Fadista J et al. 2016).

rsidminref altmin_AFp valalt
cases
alt
contr
1799724TCT0.0720.1500.0770.072
1800629AGA0.1970.7060.1950.197
3093662GAG0.0760.5900.0730.076
Table 3. Frequencies of the same variants of Table 2 from Neale’s Lab Metadata. This is the comparison between 1656 ME/CFS patients (females + males) and 359482 controls (females + males). min = minor allele; ref = reference allele; alt = alternate allele; min_AF = minor allele frequency. alt cases = frequency of alternative allele in cases; alt contr = frequency of alternate allele in controls. Elaboration made by the code TNF.m I wrote in Octave (see supplementary material).

rsidminrefaltmin_AFp valalt
cases
alt
contr
1799724TCT0.0720.1670.0780.071
1800629AGA0.1970.8570.1980.196
3093662GAG0.0760.6030.0730.076
Table 4. As in Table 3, but in this case the comparison is between 1208 female ME/CFS patients and 192945 female controls. min = minor allele; ref = reference allele; alt = alternate allele; min_AF = minor allele frequency. alt cases = frequency of alternative allele in cases; alt contr = frequency of alternate allele in controls. Elaboration made by the code TNF.m I wrote in Octave (see supplementary material).

rsidminrefaltmin_AFp valalt
cases
alt
contr
1799724TCT0.0720.6090.0750.072
1800629AGA0.1970.3140.1860.199
3093662GAG0.0760.8560.0740.075
Table 5. As in Table 3 and Table 4, but in this case the comparison is between 451 male ME/CFS patients and 166537 male controls. min = minor allele; ref = reference allele; alt = alternate allele; min_AF = minor allele frequency. alt cases = frequency of alternative allele in cases; alt contr = frequency of alternate allele in controls. Elaboration made by the code TNF.m I wrote in Octave (see supplementary material).

Another way to conveniently visualize the differences between patients and controls (if any) is to plot p values in function of the position of the variants they refer to. We actually plot the log10 of 1/p or, in other words, -log10(p). This kind of diagram is known as Manhattan plot and in Figure 1 you see the manhattan plot for region ch6:31,539,768 – ch6:31,546,495, in the case of females + males. As you can see, none of the variants has a p value below 0.01. In Figure 2 you find the same plot, this time for females only (left) and males only (right).

Figure 1. Manhattan plot for ch6:31,539,768 – ch6:31,546,495, in the case of female + male CFS patients vs controls. Red dot is rs1799724, blue dot is rs1800629, yellow dot is rs3093662. None of these three variants reaches a p value below 10^-1. Plotted by manhattan_plot.m (see supplementary material).
Figure 2. Manhattan plot for ch6:31,539,768 – ch6:31,546,495, in the case of female CFS patients vs female controls (left) and male CFS patients vs male controls (right). Red dot is rs1799724, blue dot is rs1800629, yellow dot is rs3093662. None of these three variants reaches a p value below 10^-1.

Conclusions

In conclusion, these codes of mine provide as output the frequency of both the minor and the major allele among 1659 ME/CFS patients, stratified by sex, for about 15 million variants (including SNPs and InDels). They also provide the p value (beta test) for the comparison of these patients with about 300 k controls, along with general information on the variants (such as position, minor allele frequency among 500 k UK citizens, predicted consequences, etc). These codes do so by analyzing the metadata generated by Neale’s Lab, a research group that worked on the UK Biobank dataset. In this blog post, I have shown how to use these data to test previously published genetic associations between ME/CFS and gene TNF alpha.

Supplementary material

The files that I used as input for my codes are the following ones (click on them to download):

variants_TNF

20002_1482.gwas.imputed_v3.both_sexes_TNF

20002_1482.gwas.imputed_v3.females_TNF

20002_1482.gwas.imputed_v3.males_TNF

The output generated by TNF.m (see below) consists in the following files (click on them to download) which were used to build tables from 3 to 5:

TNF_total_both_sexes

TNF_total_females

TNF_total_males

Details on these files can be found in this PDF. The code that generates the plots is pretty simple and can be downloaded here. The code that generates the three output files from the four input ones is the one that follows.

% file name = TNF
% date of creation = 25/08/2021
% it generates the file TNF_tot by using the package IO
% see https://octave.sourceforge.io/io/ for a description
clear all
close all
pkg load io
%-------------------------------------------------------------------------------
% Data source
%-------------------------------------------------------------------------------
% The collection of the analysis generated by Neale’s Lab with the UK Biobank
% data is available here:
% https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU/edit?ts=5b5f17db#gid=178908679
% In line 2 of that file, you can download an explanation of the labels (README).
% At lines 9 and 10 you can download a collection of the phenotypes and of some
% of the attributes of the relative sample, for females and males respectively
% (phenotypes.female.tsv, phenotypes.male.tsv). We are interested in phenotype 20002_1482
% (Non-cancer illness code, self-reported: chronic fatigue syndrome). In line 11
% you find a file that contains the details on each variant (variants.tsv). You
% must unzip the files before opening them. I have used File viewer plus for reading
% this file (Excel doesn’t seem to open it).
%-------------------------------------------------------------------------------
% Length and position of TNF
%-------------------------------------------------------------------------------
% The position of TNF goes from ch6:31,543,342 to ch6:31,546,113 on the reference
% genome GRCh37 (alias h19), according to NCBI (https://www.ncbi.nlm.nih.gov/gene/7124).
% The two variants we are interested in has a position ch6:31,542,482 (rs1799724, TNF-857)
% and ch6:31,543,031 (rs1800629), on GRCh37. So we must include them, along with
% the whole TNF gene. Since rs1799724 falls also in gene LTA, I include this gene
% as well (LTA goes from ch6:31,539,876 to ch6:31,542,101). Then a suitable interval
% in this dataset is ch6:31,539,768 to ch6:31,546,495.
%-------------------------------------------------------------------------------
% numbers of cases and controls
%-------------------------------------------------------------------------------
n_cases_f = 1208
n_controls_f = 192945
n_cases_m = 451
n_controls_m = 166537
n_cases_b = n_cases_f + n_cases_m
n_controls_b = n_controls_f + n_controls_m
%-------------------------------------------------------------------------------
% name of the gene
%-------------------------------------------------------------------------------
gene = 'TNF';
%-------------------------------------------------------------------------------
% number of variants = upper limit for range - 1
%-------------------------------------------------------------------------------
file_name = strcat('variants_',gene,'.xlsx')
[filetype, sh_names, fformat, nmranges] = xlsfinfo (file_name);
strn = char(sh_names(1,2));
strn = substr(strn,5) % upper limit for range as a string
n = str2num(strn) % upper limit for range as a number
%-------------------------------------------------------------------------------
% reading and storing variants_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('variants_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant = text;
% reading and storing the chromosome
range = strcat('B2:B',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
chr = num;
% reading and storing the position on chromosome
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pos = int32(num);
% reading and storing the ref allele
range = strcat('D2:D',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ref = text;
% reading and storing the alt allele
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
alt = text;
% reading and storing the rsid
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
rsid = raw;
% reading and storing the consequence
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
consequence = text;
% reading and storing the consequence category
range = strcat('I2:I',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
consequence_category = text;
% reading and storing the Alternate allele count (calculated using hardcall genotypes)
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC = int32(num);
% reading and storing the Alternate allele frequency (calculated using hardcall genotypes)
range = strcat('M2:M',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AF = double(num);
% reading and storing the Minor allele (equal to ref allele when AF > 0.5, otherwise equal to alt allele)
range = strcat('N2:N',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_allele = text;
% reading and storing the Minor allele frequency (calculated using hardcall genotypes)
range = strcat('O2:O',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF = double(num);
% reading and storing the Hardy-Weinberg p-value
range = strcat('P2:P',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
p_hwe = double(num);
% reading and storing the Number of samples with defined genotype at this variant
range = strcat('Q2:Q',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_called = int32(num);
% reading and storing the Number of samples without a defined genotype at this variant
range = strcat('R2:R',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_not_called = int32(num);
% testing by writing some of the data
i = 2
variant (i)
chr (i)
pos (i)
ref (i)
alt (i)
rsid (i)
consequence (i)
consequence_category (i)
AC (i)
AF (i)
minor_allele (i)
minor_AF (i)
p_hwe (i)
n_called (i)
n_not_called(i)
%-------------------------------------------------------------------------------
% reading and storing 20002_1482.gwas.imputed_v3.females_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.females_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_f = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_f = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_f = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_f = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_f = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_f = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_f = double(num);
% testing by writing some of the data
i = 2
variant_f(i)
minor_AF_f(i)
low_confidence_variant_f(i)
n_complete_samples_f(i)
AC_f(i)
ytx_f(i)
pval_f(i)
%-------------------------------------------------------------------------------
% reading and storing 20002_1482.gwas.imputed_v3.males_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.males_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_m = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_m = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_m = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_m = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_m = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_m = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_m = double(num);
% testing by writing some of the data
i = 2
variant_m(i)
minor_AF_m(i)
low_confidence_variant_m(i)
n_complete_samples_m(i)
AC_m(i)
ytx_m(i)
pval_m(i)
%-------------------------------------------------------------------------------
% reading and storing 20002_1482.gwas.imputed_v3.both_sexes_GENE.xlsx
%-------------------------------------------------------------------------------
file_name = strcat('20002_1482.gwas.imputed_v3.both_sexes_',gene,'.xlsx')
sheet = 'Sheet1'
% reading and storing chr:position:ref:alt
range = strcat('A2:A',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
variant_b = text;
% reading and storing the Minor allele frequency in n_complete_samples_f
range = strcat('C2:C',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
minor_AF_b = double(num);
% reading and storing the quality control
range = strcat('E2:E',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
low_confidence_variant_b = raw;
% reading and storing the numeber of individuala genotyped in cases + controls
range = strcat('F2:F',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
n_complete_samples_b = double(num);
% reading and storing the Alternate allele count within n_complete_samples_f
range = strcat('G2:G',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
AC_b = double(num);
% reading and storing the number of alternative alleles in n_cases_f
range = strcat('H2:H',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
ytx_b = double(num);
% reading and storing the p values between cases and controls
range = strcat('L2:L',strn)
[num,text,raw,limits] = xlsread (file_name, sheet, range);
pval_b = double(num);
% testing by writing some of the data
i = 2
variant_b(i)
minor_AF_b(i)
low_confidence_variant_b(i)
n_complete_samples_b(i)
AC_b(i)
ytx_b(i)
pval_b(i)
%-------------------------------------------------------------------------------
% checking for wrong alignments between these 4 files
%-------------------------------------------------------------------------------
error = 0.;
for i=1:n-1
a = variant(i);
b = variant_f(i);
c = variant_m(i);
d = variant_b(i);
% function strcmp (string_1, strin_2) returns 1 if the two strings are identical
% it resturns 0 if the two strings have at least a different character in the same position
test = strcmp(a,b);
if ( (test==0) );
error = error +1;
endif
test = strcmp(a,c);
if ( (test==0) );
error = error +1;
endif
test = strcmp(a,d);
if ( (test==0) );
error = error +1;
endif
endfor
error
%-------------------------------------------------------------------------------
% defining major and minor alleles among all the variants
%-------------------------------------------------------------------------------
for i=1:n-1
frequency = AF(i); %frequency of the alternative allele
if ( frequency <= 0.5 )
minor_allele2(i)=alt(i);
major_allele(i)=ref(i);
else
minor_allele2(i)=ref(i);
major_allele(i)=alt(i);
endif
endfor
%-------------------------------------------------------------------------------
% defining alternate allele frequencies in case groups
%-------------------------------------------------------------------------------
for i=1:n-1
alt_AF_f(i) = ytx_f(i)/(2n_cases_f); alt_AF_m(i) = ytx_m(i)/(2n_cases_m);
alt_AF_b(i) = ytx_b(i)/(2n_cases_b); endfor
%------------------------------------------------------------------------------- 
% defining alternate allele frequencies in control groups %------------------------------------------------------------------------------- 
for i=1:n-1 alt_AF_f_controls(i) = ( AC_f(i)- ytx_f(i) )/(2n_controls_f);
alt_AF_m_controls(i) = ( AC_m(i)- ytx_m(i) )/(2n_controls_m); alt_AF_b_controls(i) = ( AC_b(i)- ytx_b(i) )/(2n_controls_b);
endfor
%-------------------------------------------------------------------------------
% defining reference allele frequencies in case groups
%-------------------------------------------------------------------------------
for i=1:n-1
ref_AF_f(i) = 1-alt_AF_f(i);
ref_AF_m(i) = 1-alt_AF_m(i);
ref_AF_b(i) = 1-alt_AF_b(i);
endfor
%-------------------------------------------------------------------------------
% defining reference allele frequencies in control groups
%-------------------------------------------------------------------------------
for i=1:n-1
ref_AF_f_controls(i) = 1-alt_AF_f_controls(i);
ref_AF_m_controls(i) = 1-alt_AF_m_controls(i);
ref_AF_b_controls(i) = 1-alt_AF_b_controls(i);
endfor
%-------------------------------------------------------------------------------
% checking for errors by comparing minor_allele with minor_allele2
%-------------------------------------------------------------------------------
error2 = 0.;
for i=1:n-1
a = minor_allele(i);
b = minor_allele2(i);
% function strcmp (string_1, strin_2) returns 1 if the two strings are identical
% it resturns 0 if the two strings have at least a different character in the same position
test = strcmp(a,b);
if ( (test==0) );
error2 = error2 +1;
endif
endfor
error2
%-------------------------------------------------------------------------------
% writing of GENE_total_females.csv
%-------------------------------------------------------------------------------
columns = cell(n,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_f{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_f(i);
columns{i+1,15} = alt_AF_f(i);
columns{i+1,16} = alt_AF_f_controls(i);
columns{i+1,17} = ref_AF_f(i);
columns{i+1,18} = ref_AF_f_controls(i);
columns{i+1,19} = n_cases_f;
columns{i+1,20} = n_controls_f;
endfor
file_name = strcat(gene,'_total_females.csv')
cell2csv (file_name, columns," ")
%-------------------------------------------------------------------------------
% writing of GENE_total_males.csv
%-------------------------------------------------------------------------------
columns = cell(n,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_m{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_m(i);
columns{i+1,15} = alt_AF_m(i);
columns{i+1,16} = alt_AF_m_controls(i);
columns{i+1,17} = ref_AF_m(i);
columns{i+1,18} = ref_AF_m_controls(i);
columns{i+1,19} = n_cases_m;
columns{i+1,20} = n_controls_m;
endfor
file_name = strcat(gene,'_total_males.csv')
cell2csv (file_name, columns," ")
%-------------------------------------------------------------------------------
% writing of CLYBL_total_both_sexes.csv
%-------------------------------------------------------------------------------
columns = cell(1523,20);
columns{1,1} = "variant";
columns{1,2} = "chr";
columns{1,3} = "pos";
columns{1,4} = "rsid";
columns{1,5} = "minor_allele";
columns{1,6} = "major_allele";
columns{1,7} = "ref";
columns{1,8} = "alt";
columns{1,9} = "minor_AF";
columns{1,10} = "low_confidence_variant";
columns{1,11} = "p_hwe";
columns{1,12} = "consequence";
columns{1,13} = "consequence_category";
columns{1,14} = "pval";
columns{1,15} = "alt_AF_cases";
columns{1,16} = "alt_AF_controls";
columns{1,17} = "ref_AF_cases";
columns{1,18} = "ref_AF_controls";
columns{1,19} = "n_cases";
columns{1,20} = "n_controls";
for i=1:n-1
columns{i+1,1} = variant{i};
columns{i+1,2} = chr(i);
columns{i+1,3} = int32(pos(i));
columns{i+1,4} = rsid{i};
columns{i+1,5} = minor_allele{i};
columns{i+1,6} = major_allele{i};
columns{i+1,7} = ref{i};
columns{i+1,8} = alt{i};
columns{i+1,9} = minor_AF(i);
columns{i+1,10} = low_confidence_variant_b{i};
columns{i+1,11} = p_hwe(i);
columns{i+1,12} = consequence{i};
columns{i+1,13} = consequence_category{i};
columns{i+1,14} = pval_b(i);
columns{i+1,15} = alt_AF_b(i);
columns{i+1,16} = alt_AF_b_controls(i);
columns{i+1,17} = ref_AF_b(i);
columns{i+1,18} = ref_AF_b_controls(i);
columns{i+1,19} = n_cases_b;
columns{i+1,20} = n_controls_b;
endfor
file_name = strcat(gene,'_total_both_sexes.csv')
cell2csv (file_name, columns," ")