# Sum of independent exponential random variables with the same parameter

Introduction

Let X_1, X_2, …, X_m be independent random variables with an exponential distribution. We have already found the expression of the distribution of the random variable Y = X_1 + X_2 + …+ X_m when X_1, X_2, …, X_m have pairwise distinct parameters (here). The reader will easily recognize that the formula we found in that case has no meaning when the parameters are all equal to λ. I will derive here the law of Y in this circumstance. This can be done with a demonstration by induction, with no particular effort, but I will follow a longer proof. Why? Perhaps because I like pain? Yes, this might be true; but the main reason is that, to me,  this longer demonstration is quite interesting and it gives us the chance to introduce the following proposition. In this blog post, we will use some of the results from the previous one on the same topic and we will follow the same enumeration for propositions.

PROPOSITION 8 (sum of m independent random variables). Let X_1, X_2, … , X_m be independent random variables. The law of Y = X_1 + X_2 + … + X_m is given by:

Proof. We know that the thesis is true for m=2 (PROP. 1). Consider now that:

But we know that X_1, X_2, … , X_m are independent. So we have:

We now operate the substitution

And we obtain:

But this has to be true for any possible interval [a, b], which means that:

This is one of the equations in the thesis. The other ones can be derived in the same way ♦

Guessing the solution

I will solve the problem for m = 2, 3, 4 in order to have an idea of what the general formula might look like.

PROPOSITION 9 (m=2). Let X_1, X_2 be independent exponential random variables with the same parameter λ. The law of Y _1 = X_1 + X_2 is given by:

for y > 0. It is zero otherwise.

Proof. We just have to substitute f_X_1, f_X_2 in Prop. 1. We obtain:

And we find the thesis by solving the integral ♦

PROPOSITION 10 (m=3). Let X_1, X_2, X_3  be independent exponential random variables with the same parameter λ. The law of Y = X_1 + X_2 + X_3 is given by:

for y>0. It is zero otherwise.

Proof. If we define Y_1 = X_1 + X_2 and Y_2 = X_3, then we can say – thanks to Prop. 2 – that Y_1 and Y_2 are independent. This means that – according to Prop. 1 – we have

And the proof is concluded ♦

PROPOSITION 11 (m=4). Let X_1, X_2, X_3, X_4 be independent exponential random variables with the same parameter λ. The law of Y = X_1 + X_2 + X_3 + X_4 is given by:

Proof. We can demonstrate the thesis in the same way we did in Prop. 10. We can draw the same conclusion by directly applying Prop. 8:

The domain of integration is the tetrahedron in Figure 1. The generic point P within the tetrahedron has to belong to the segment AC that, in turns, belongs to the triangle shaded grey. This means that the domain of integration can be written [0, y]×[0, y – x_2]×[0, y – x_2 – x_3]. Thus, we calculate:

and the proof is concluded ♦

Proof

The reader has now likely guessed what the density of Y looks like when m is the generic integer number. But before we can rigorously demonstrate that formula, we need to calculate an integral.

PROPOSITION 12 (lemma). It is true that:

Proof. We have already found that the thesis is true when m = 4. Let us now assume that it is true for m – 1; if we write the thesis for the m – 2 variables x_3, x_4, … , x_m we have:

Now we put η = yx_2:

By integrating both members in [0, y] we obtain:

and the proof is concluded ♦

PROPOSITION 13. Let X_1, X_2, …, X_m be independent exponential random variables with the same parameter λ. The law of Y = X_1 + X_2 + … + X_m is given by:

Proof. By directly applying Prop. 8 we have:

But:

Thus, we can write:

The domain of integration can be written in a more proper way (as we did in Prop. 11) as follows:

But this is the integral calculated in Prop. 12, and the proof is concluded ♦

A numerical application

The reader might have recognized that the density of Y in Prop. 13 is in fact a Gamma distribution: it is the distribution Γ(m, λ), also known as Erlang’s distribution. There might even be a reader who perhaps remembers that I have discussed that distribution in a post of mine (here) and that I have coded a program that plots both the density and the distribution of such a law. In Figure 2 we see the density of Y (left) and the distribution function (right) for λ = 1.5 and for m that goes from 1 to 3 (note that m is indicated with α in that picture).

Annunci

# Sum of independent exponential random variables

For Chiara,

who once encouraged me

to boldly keep trying.

Introduction

For the last four months, I have experienced the worst level of my illness: I have been completely unable to think for most of the time. So I could do nothing but hanging in there, waiting for a miracle, passing from one medication to the other, well aware that this state could have lasted for years, with no reasonable hope of receiving help from anyone. This has been the quality of my life for most of the last two decades. Then, some days ago, the miracle happened again and I found myself thinking about a theorem I was working on in July. And once more, with a great effort, my mind, which is not so young anymore, started her slow process of recovery. I concluded this proof last night. This is only a poor thing but since it is not present in my books of statistics, I have decided to write it down in my blog, for those who might be interested.

I can now come back to my awkward studies, which span from statistics to computational immunology, from analysis of genetic data to mathematical modelling of bacterial growth. Desperately searching for a cure.

The problem

Let X_1, X_2, …, X_m be independent random variables with an exponential distribution with pairwise distinct parameters λ_1, λ_2, …, λ_m, respectively. Our problem is: what is the expression of the distribution of the random variable Y = X_1 + X_2 + …+ X_m? I faced the problem for m = 2, 3, 4. Then, when I was quite sure of the expression of the general formula of  f_Y (the distribution of Y) I made my attempt to prove it inductively. But before starting, we need to mention two preliminary results that I won’t demonstrate since you can find these proofs in any book of statistics.

PROPOSITION 1. Let X_1, X_2 be independent random variables. The distribution of Y = X_1 + X_2 is given by:

where f_X is the distribution of the random vector [X_1, X_2].

PROPOSITION 2. Let X_1, X_2, …, X_m be independent random variables. The two random variables Y _1 = X_1 + X_2 + …+ X_n and Y _2 = X_n+1 + X_n+2 + …+ X_m (with n<m) are independent.

DEFINITION 1. For those who might be wondering how the exponential distribution of a random variable X_i with a parameter λ_i looks like, I remind that it is given by:

Guessing the solution

As mentioned, I solved the problem for m = 2, 3, 4 in order to understand what the general formula for f_Y might have looked like.

PROPOSITION 3 (m = 2). Let X_1, X_2 be independent exponential random variables with distinct parameters λ_1, λ_2, respectively. The law of Y _1 = X_1 + X_2 is given by:

Proof. We just have to substitute f_X_1, f_X_2 in Prop. 1. We obtain:

And the demonstration is complete ♦

PROPOSITION 4 (m = 3). Let X_1, X_2, X_3  be independent exponential random variables with pairwise distinct parameters λ_1, λ_2, λ_3, respectively. The law of Y = X_1 + X_2 + X_3 is given by:

Proof. If we define Y_1 = X_1 + X_2 and Y_2 = X_3, then we can say – thanks to Prop. 2 – that Y_1 and Y_2 are independent. This means that – according to Prop. 1 – we have

The reader will now recognize that we know the expression of  f_Y_1 because of Prop. 3. So, we have:

For the first integral we find:

For the second one we have:

Hence, we find:

And the thesis is proved

PROPOSITION 5 (m = 4). Let X_1, X_2, X_3, X_4 be independent exponential random variables with pairwise distinct parameters λ_1, λ_2, λ_3, λ_4, respectively. The law of Y = X_1 + X_2 + X_3 + X_4 is given by:

for y>0, while it is zero otherwise.

Proof. Let’s consider the two random variables Y_1 = X_1 +  X_2, Y_2 = X_3 + X_4. Prop. 2 tells us that Y_1, Y_2 are independent. This means that – according to Prop. 1 – we can write:

The reader has likely already realized that we have the expressions of f_Y_1 and f_Y_2, thanks to Prop. 3. So we have:

For the four integrals we can easily calculate what follows:

Adding these four integrals together we obtain:

And this proves the thesis

We are now quite confident in saying that the expression of f_Y for the generic value of m is given by:

for y>0, while being zero otherwise. But we aim at a rigorous proof of this expression.

Proof

In order to carry out our final demonstration, we need to prove a property that is linked to the matrix named after Vandermonde, that the reader who has followed me till this point will likely remember from his studies of linear algebra. The determinant of the Vandermonde matrix is given by:

PROPOSITION 6 (lemma). The following relationship is true:

Proof. In the following lines, we calculate the determinant of the matrix below, with respect to the second line. In the end, we will use the expression of the determinant of the Vandermonde matrix, mentioned above:

But this determinant has to be zero since the matrix has two identical lines, which proves the thesis

PROPOSITION 7. Let X_1, X_2, … , X_m be independent exponential random variables with pairwise distinct parameters λ_1, λ_2, … , λ_m, respectively. The law of Y = X_1 + X_2 + … + X_m is given by:

for y > 0, while being zero otherwise.

Proof. We already know that the thesis is true for m = 2, 3, 4. We now admit that it is true for m-1 and we demonstrate that this implies that the thesis is true for (proof by induction). Let’s define the random variables Y_1 = X_1 + X_2 + … + X_m-1 and Y_2 = X_m. These two random variables are independent (Prop. 2) so – according to Prop. 1 – we have:

Now, f_Y_1 is the thesis for m-1 while f_Y_2 is the exponential distribution with parameter λ_m. So we have:

For the sum we have:

The sum within brackets can be written as follows:

So far, we have found the following relationship:

The last sum in this expression can be written as follows:

In a more concise way:

But it is also true that:

Which means that we have found:

So, for the thesis to be true it suffices to prove that:

But, as can be easly seen, we have

Hence, we just need to prove that

But we can write:

Moreover, we have:

So, we just need to prove that:

We note now that:

Moreover, we have

Therefore we have found:

But we also have that:

Which means that:

So we just have to prove that:

But this expression has been demonstrated in Prop. 6 and this proof is concluded ♦

References. A paper on this same topic has been written by Markus Bibinger and it is available here

# Integrals for Statistics

In the following four tables, I present a collection of integrals and integral functions that are of common use in Statistics. The values for the Beta function have been calculated by a code of mine that uses – as a recursive function – another code that I wrote for the calculation of the Gamma Function (see the related blog post). Below these tables, you find a representation of the Beta function, plotted by my code. I have demonstrated each one of these results and these calculations will be included in an upcoming book.

Donate

Consider supporting this website with a donation.

€1,00

# Bivariate normal distribution

In a previous blog post, I have discussed the case of normally distributed random variables. But what happens if we have a vector whose components are random variables with this kind of distribution? In particular, what is the expression of the joint probability density? I don’t know how to derive this formula from the marginal distributions, but I have been able to demonstrate, after several hours of work, the following proposition.

Proposition 1. The joint probability density in the case of a random vector whose two components follow a normal distribution is:

where σ_1 and σ_2 are the standard deviations, μ_1 and μ_2 are the expectations of the two random variables and ρ is the correlation coefficient between the two variables.

Proof. The marginal distribution of x_1 is given by

By operating the substitutions

we obtain

This demonstrates that x_1 is in fact a normally distributed random variable and that its expectation and standard deviation are σ_1 and μ_1, respectively. The same applies to x_2. In order to demonstrate that ρ is the correlation coefficient between x_1 and x_2 we have to calculate the covariance of f_X, which is given by E[X_1,X_2]-E[X_1]E[X_2]. The first addend is given by:

Let’s consider the inner integral and the following substitutions

We have:

The two integrals inside the root square are not the easiest ones, and I will not discuss them here. Just consider that they can be solved using the well known results:

We have

If we put these two integrals inside I, we obtain

The two inner integrals are two difficult ones, again. The first one can be solved using the same exponential integral used above:

For the other inner integral we have

The first and the second integrals can be solved using the table previously reported, while the third one is zero. At the end of all this work, we have:

And this proves that ρ is the correlation coefficient between x_1 and x_2. This is a complete demonstration of the expression of the joint probability density of a bivariate normal distribution ■

This formula can be extended to the case of m components.

Proposition 2. The joint probability density in the case of a random vector whose m components follow a normal distribution is:

where C is the covariance matrix, given by

Proof. This formula is easy to prove in the case of m=2, using proposition 1. I have no idea about how it could be proved when m>2 ■

I have written a code in Fortran (download) that asks you for expectations, standard deviations and correlation coefficient and plots the joint density of the bivariate normal distribution. I have written a code in Octave that does the same job (download). In the picture below you find the joint density for two standard normally distributed random variables with ρ=0 (left), ρ=0.7 (right, upper half) and ρ=-0.7 (right, lower half). These pictures have been plotted by the code in Fortran. The other image is a comparison between the two programs, for another bivariate normal distribution.

Donate

Consider supporting this website with a donation.

€1,00

# The Gamma distribution

1_Introduction

In what follows I will present a set of codes that I have written in Octave and in Fortran for the study of random variables that follow a gamma distribution. In another post, we will discuss the beta distribution.

2_The gamma distribution

A random variable is said to have a gamma distribution with parameters (α, λ) if its density is given by

with α, λ>0 and where Γ(α) is the gamma function, which is quite a pain in the lower back, as you will see. This random variable can be used to describe the probability that the duration of life of an entity (whether it is a mechanical component or a bacterium) depends on time. The gamma distribution allows to describe the following situations:

1. for α>1 the probability of failure of a mechanical component (or death of an organism) increases as the time passes by (increasing hazard rate);
2. for α=1 the probability of failure is constant (constant hazard rate);
3. for α<1 the failure is less likely as time passes by (decreasing hazard rate).

The hazard rate is the limit below:

where F_X(t) is the distribution function of the gamma distribution. I have not been able to find an analytic expression for the hazard rate of the gamma distribution, but I have been able to obtain it with the computer-aided calculation (see below). But in order to calculate F_X(t), we need to calculate the gamma function…

3_The gamma function

Γ(α) (the gamma function) is an integral function that cannot be calculated analytically:

It has to be calculated with numerical integration, but a problem is that Γ(α) is an indefinite integral. One way to face this problem is to stop the numerical integration when the addend is below a certain cut off (let’s say 0.0001). It is worth noting that an important feature of this function is that if you know its value in an interval such as [x, x+1], then you can derive Γ(α) for any other possible value of α. This is due to the recursive formula: Γ(α+1) = αΓ(α). That said, I have calculated Γ(α) in [1, 2] and then I have used these values to define Γ(α) in [0, 4] with the .exe file (written in Fortran) available here. Download it and put it in a folder. When you run it, it gives a .txt file with the values of Γ(α) in [0, 4] and two .mat files that can be used to build the plot of Γ(α) (see figure 1). Values of Γ(α) in [0, 2] are tabulated below, while those of you who like the very long list of numbers can find a huge table with values of the gamma function in [0, 4] at the end of this article.

4_The distribution function

The distribution function of the gamma density cannot be calculated analytically so, again, I had to make good use of my poor coding skills and I wrote a code in Octave (here, it uses the two .mat files generated by the code in Fortran mentioned above!) and one in Fortran (available here). In figure 2 (code in Octave) you can find the density and the distribution function for different values of α.

The code in Fortran, on the other hand, asks the user for the values of α and λ and gives a plot for the density and one for the distribution function. Moreover, it gives a .txt file with the values of the distribution function. Note that you can only place values for α that are below 25, otherwise you generate an overflow condition. I am not sure why you have overflow condition with a value of α that is below 34, but this is another story. Nevertheless, for those who are interested, Γ(n) = (n-1)! and the biggest factorial that can be handled by a computer in single precision mode (where 8 bits are used for the exponent) is 33!. So, we have a little mystery here… The reason why I have stuck with single precision is that DISLIN – the graphics library – seems to be unable to handle double precision numbers.  If you are interested in the gamma function, the density and the distribution function for α = 20 and λ=2, then the code in Fortran gives for Γ(20) a value of 1.216451E+17 and the two plots in figure 3.

5_The hazard rate

As mentioned in paragraph 2, the hazard rate of the gamma distribution cannot be easily studied. I forgot to mention that it is relatively easy to prove that it is constant for α=1. But what happens for α>1 and for α<1? In order to answer, I have written a code (download) that plots it and it seems to be possible to say that the hazard rate increases for α>1 and decreases for α<1. This means that the gamma distribution is suitable for modeling each of the three kinds of patterns for the probability of failure as the time passes by. In figure 4 you find an example of the hazard rate for α<1 (left) and an example for α>1 (right).

A table for the gamma function in [0.01, 3.99] is reported below. In each column, the value of α is indicated on the left while the corresponding value for Γ(α) is indicated on the right.

 0.0100| 99.3302 0.0200| 49.3945 0.0300| 32.7556 0.0400| 24.4404 0.0500| 19.4548 0.0600| 16.1337 0.0700| 13.7639 0.0800| 11.9888 0.0900| 10.6096 0.1000|  9.5079 0.1100|  8.6081 0.1200|  7.8592 0.1300|  7.2268 0.1400|  6.6857 0.1500|  6.2175 0.1600|  5.8089 0.1700|  5.4490 0.1800|  5.1300 0.1900|  4.8450 0.2000|  4.5893 0.2100|  4.3585 0.2200|  4.1492 0.2300|  3.9587 0.2400|  3.7845 0.2500|  3.6246 0.2600|  3.4776 0.2700|  3.3418 0.2800|  3.2161 0.2900|  3.0994 0.3000|  2.9909 0.3100|  2.8898 0.3200|  2.7952 0.3300|  2.7067 0.3400|  2.6237 0.3500|  2.5457 0.3600|  2.4723 0.3700|  2.4031 0.3800|  2.3378 0.3900|  2.2762 0.4000|  2.2178 0.4100|  2.1625 0.4200|  2.1101 0.4300|  2.0602 0.4400|  2.0129 0.4500|  1.9679 0.4600|  1.9249 0.4700|  1.8840 0.4800|  1.8451 0.4900|  1.8078 0.5000|  1.7722 0.5100|  1.7382 0.5200|  1.7056 0.5300|  1.6744 0.5400|  1.6446 0.5500|  1.6159 0.5600|  1.5884 0.5700|  1.5621 0.5800|  1.5367 0.5900|  1.5124 0.6000|  1.4890 0.6100|  1.4665 0.6200|  1.4449 0.6300|  1.4240 0.6400|  1.4040 0.6500|  1.3846 0.6600|  1.3660 0.6700|  1.3480 0.6800|  1.3307 0.6900|  1.3140 0.7000|  1.2979 0.7100|  1.2823 0.7200|  1.2673 0.7300|  1.2528 0.7400|  1.2388 0.7500|  1.2253 0.7600|  1.2122 0.7700|  1.1996 0.7800|  1.1873 0.7900|  1.1755 0.8000|  1.1641 0.8100|  1.1530 0.8200|  1.1424 0.8300|  1.1320 0.8400|  1.1220 0.8500|  1.1124 0.8600|  1.1030 0.8700|  1.0939 0.8800|  1.0852 0.8900|  1.0767 0.9000|  1.0685 0.9100|  1.0606 0.9200|  1.0529 0.9300|  1.0455 0.9400|  1.0383 0.9500|  1.0313 0.9600|  1.0246 0.9700|  1.0181 0.9800|  1.0118 0.9900|  1.0058 1.0000|  1.0000 1.0100|  0.9933 1.0200|  0.9879 1.0300|  0.9827 1.0400|  0.9776 1.0500|  0.9727 1.0600|  0.9680 1.0700|  0.9635 1.0800|  0.9591 1.0900|  0.9549 1.1000|  0.9508 1.1100|  0.9469 1.1200|  0.9431 1.1300|  0.9395 1.1400|  0.9360 1.1500|  0.9326 1.1600|  0.9294 1.1700|  0.9263 1.1800|  0.9234 1.1900|  0.9206 1.2000|  0.9179 1.2100|  0.9153 1.2200|  0.9128 1.2300|  0.9105 1.2400|  0.9083 1.2500|  0.9062 1.2600|  0.9042 1.2700|  0.9023 1.2800|  0.9005 1.2900|  0.8988 1.3000|  0.8973 1.3100|  0.8958 1.3200|  0.8945 1.3300|  0.8932 1.3400|  0.8920 1.3500|  0.8910 1.3600|  0.8900 1.3700|  0.8892 1.3800|  0.8884 1.3900|  0.8877 1.4000|  0.8871 1.4100|  0.8866 1.4200|  0.8862 1.4300|  0.8859 1.4400|  0.8857 1.4500|  0.8855 1.4600|  0.8855 1.4700|  0.8855 1.4800|  0.8856 1.4900|  0.8858 1.5000|  0.8861 1.5100|  0.8865 1.5200|  0.8869 1.5300|  0.8874 1.5400|  0.8881 1.5500|  0.8888 1.5600|  0.8895 1.5700|  0.8904 1.5800|  0.8913 1.5900|  0.8923 1.6000|  0.8934 1.6100|  0.8946 1.6200|  0.8958 1.6300|  0.8971 1.6400|  0.8985 1.6500|  0.9000 1.6600|  0.9016 1.6700|  0.9032 1.6800|  0.9049 1.6900|  0.9067 1.7000|  0.9085 1.7100|  0.9105 1.7200|  0.9125 1.7300|  0.9146 1.7400|  0.9167 1.7500|  0.9190 1.7600|  0.9213 1.7700|  0.9237 1.7800|  0.9261 1.7900|  0.9287 1.8000|  0.9313 1.8100|  0.9340 1.8200|  0.9367 1.8300|  0.9396 1.8400|  0.9425 1.8500|  0.9455 1.8600|  0.9486 1.8700|  0.9517 1.8800|  0.9550 1.8900|  0.9583 1.9000|  0.9617 1.9100|  0.9651 1.9200|  0.9687 1.9300|  0.9723 1.9400|  0.9760 1.9500|  0.9798 1.9600|  0.9836 1.9700|  0.9876 1.9800|  0.9916 1.9900|  0.9957 2.0000|  1.0000 2.0100|  1.0032 2.0200|  1.0076 2.0300|  1.0121 2.0400|  1.0167 2.0500|  1.0214 2.0600|  1.0261 2.0700|  1.0309 2.0800|  1.0358 2.0900|  1.0408 2.1000|  1.0459 2.1100|  1.0510 2.1200|  1.0563 2.1300|  1.0616 2.1400|  1.0670 2.1500|  1.0725 2.1600|  1.0781 2.1700|  1.0838 2.1800|  1.0896 2.1900|  1.0955 2.2000|  1.1014 2.2100|  1.1075 2.2200|  1.1136 2.2300|  1.1199 2.2400|  1.1263 2.2500|  1.1327 2.2600|  1.1393 2.2700|  1.1459 2.2800|  1.1526 2.2900|  1.1595 2.3000|  1.1665 2.3100|  1.1735 2.3200|  1.1807 2.3300|  1.1880 2.3400|  1.1953 2.3500|  1.2028 2.3600|  1.2104 2.3700|  1.2181 2.3800|  1.2260 2.3900|  1.2339 2.4000|  1.2420 2.4100|  1.2501 2.4200|  1.2584 2.4300|  1.2668 2.4400|  1.2754 2.4500|  1.2840 2.4600|  1.2928 2.4700|  1.3017 2.4800|  1.3107
 2.4900|  1.3199 2.5000|  1.3292 2.5100|  1.3386 2.5200|  1.3481 2.5300|  1.3578 2.5400|  1.3676 2.5500|  1.3776 2.5600|  1.3877 2.5700|  1.3979 2.5800|  1.4083 2.5900|  1.4188 2.6000|  1.4294 2.6100|  1.4403 2.6200|  1.4512 2.6300|  1.4623 2.6400|  1.4736 2.6500|  1.4850 2.6600|  1.4966 2.6700|  1.5083 2.6800|  1.5202 2.6900|  1.5323 2.7000|  1.5445 2.7100|  1.5569 2.7200|  1.5694 2.7300|  1.5822 2.7400|  1.5951 2.7500|  1.6082 2.7600|  1.6214 2.7700|  1.6349 2.7800|  1.6485 2.7900|  1.6623 2.8000|  1.6763 2.8100|  1.6905 2.8200|  1.7049 2.8300|  1.7194 2.8400|  1.7342 2.8500|  1.7492 2.8600|  1.7644 2.8700|  1.7797 2.8800|  1.7953 2.8900|  1.8111 2.9000|  1.8271 2.9100|  1.8434 2.9200|  1.8598 2.9300|  1.8765 2.9400|  1.8934 2.9500|  1.9106 2.9600|  1.9279 2.9700|  1.9455 2.9800|  1.9634 2.9900|  1.9815 3.0000|  2.0000 3.0100|  2.0165 3.0200|  2.0354 3.0300|  2.0547 3.0400|  2.0741 3.0500|  2.0938 3.0600|  2.1138 3.0700|  2.1340 3.0800|  2.1545 3.0900|  2.1753 3.1000|  2.1963 3.1100|  2.2177 3.1200|  2.2393 3.1300|  2.2612 3.1400|  2.2835 3.1500|  2.3059 3.1600|  2.3288 3.1700|  2.3519 3.1800|  2.3753 3.1900|  2.3991 3.2000|  2.4231 3.2100|  2.4476 3.2200|  2.4723 3.2300|  2.4974 3.2400|  2.5228 3.2500|  2.5486 3.2600|  2.5747 3.2700|  2.6012 3.2800|  2.6280 3.2900|  2.6552 3.3000|  2.6828 3.3100|  2.7109 3.3200|  2.7392 3.3300|  2.7680 3.3400|  2.7971 3.3500|  2.8266 3.3600|  2.8566 3.3700|  2.8870 3.3800|  2.9178 3.3900|  2.9490 3.4000|  2.9807 3.4100|  3.0128 3.4200|  3.0454 3.4300|  3.0784 3.4400|  3.1119 3.4500|  3.1459 3.4600|  3.1803 3.4700|  3.2152 3.4800|  3.2506 3.4900|  3.2865 3.5000|  3.3229 3.5100|  3.3598 3.5200|  3.3973 3.5300|  3.4352 3.5400|  3.4737 3.5500|  3.5128 3.5600|  3.5524 3.5700|  3.5926 3.5800|  3.6333 3.5900|  3.6746 3.6000|  3.7165 3.6100|  3.7591 3.6200|  3.8022 3.6300|  3.8459 3.6400|  3.8903 3.6500|  3.9353 3.6600|  3.9809 3.6700|  4.0272 3.6800|  4.0742 3.6900|  4.1218 3.7000|  4.1702 3.7100|  4.2192 3.7200|  4.2689 3.7300|  4.3194 3.7400|  4.3705 3.7500|  4.4225 3.7600|  4.4751 3.7700|  4.5286 3.7800|  4.5828 3.7900|  4.6378 3.8000|  4.6936 3.8100|  4.7503 3.8200|  4.8077 3.8300|  4.8660 3.8400|  4.9251 3.8500|  4.9852 3.8600|  5.0461 3.8700|  5.1079 3.8800|  5.1706 3.8900|  5.2342 3.9000|  5.2987 3.9100|  5.3642 3.9200|  5.4307 3.9300|  5.4982 3.9400|  5.5667 3.9500|  5.6361 3.9600|  5.7067 3.9700|  5.7782 3.9800|  5.8508 3.9900|  5.9245

# Normally distributed random variables

I am currently trying to learn as much as I can about basic statistics; and one of the reasons for that is the need to gain access to the tools useful to better understand statistical analyses of medical data that I find in all the research papers on ME/CFS and related topics, which I read almost on a daily basis. Hopefully, I should also be able, at the end of this path, to perform my own statistical analyses of biological data.

When I study a subject where it is required the use of computer-aided calculation, I usually code my own software, even if there are programs ready to use for that purpose. The main reason for that is that in order to write a program that really works, you have to reach a complete understanding of the argument. So, it is a test in some way: if your program works properly, then you have learned something new. The other reason is that I like this activity, even though I can’t perform it for most of the time, because of the cruel encephalopathy due to my mysterious disease.

To give you an example of this attitude that I have, when I decided to study molecular mimicry between B-cell epitopes from a mathematical point of view, I found several programs for peptides alignments; nevertheless I decided to write my own software, and I spent several months of pure joy (but also of profound frustration because I was often too sick to write even a single line of code), learning about substitution matrices, dynamic programming and performing my own in silico experiments of cross-reactivity between enolases (a glycolitic enzyme evolutionary conserved across species) from different organisms. This work was then used in a paper recently published in the scientific journal Medical Hypotheses (Maccallini P et al. 2018).  In figure 1 the flow diagram that I drew, while here and here you can download the main program and the subroutine, respectively, that I coded in Octave in order to perform comparisons between proteins. But this is another story… So, let’s come back to statistics.

Many measurements of physical or biological phenomena obey to what is known as normal distribution: indicated with x the value of a specific measure, we have that values near the mean (also known as expectation and indicated with the letter μ) are more frequently assumed by x. In particular, if we indicate f(t) a function (named density) such that f(t)dt represents the probability that x takes a value between t and t+dt, then f(t) is the well known bell-shaped curve in figure 2 (on the left). As you can see, the higher the standard deviation σ is, the lower the bell: this means that standard deviation represents a measure of how much the values taken by x are spread on the x-axis.

Figure 2 has been obtained with a code that I have written in Octave, which can be downloaded here. So, what is the probability that x takes a value below a? This probability is given by the integral of f(t) between -∞ and a. This function is named distribution function, Φ(x). In figure 2, on the right, Φ(x) is plotted for different values of expectation and standard deviation. I have calculated these integrals using the numerical method known as Simpson’s rule. For μ=0 and σ=1 we have a particular type of normal distribution, usually called standard normal distribution. Some values of its distribution function are collected in the table in figure 3, that I have derived with this code in Octave. A table like this is available in any book of statistics, but I found useful to calculate my own table, for the reasons we have already mentioned.

Now, what about the distribution function for a normal non-standard distribution? Of course, you can calculate Φ(x) for any given values of μ and σ using available tools (like, for instance, function normcdf(x, m, s) of Octave). But, as always, I have coded my own software (in Fortran in this case). You can download the executable file here. Put it in a folder, then run it: it will ask you to type the values of μ, σ and it will give as output two diagrams (one for f, the other for Φ) and a .txt file which contains several values of Φ that can be used for applications. For μ=2, σ=0.5 we obtain the diagrams in figure 4, and a table whose first lines are in figure 5.