# Calculating Probabilities for Normal and Non-Normal Distributions

Updated on May 2, 2018

## Introduction

We show different ways to calculate probabilities. The first example illustrates a case where the data sample is normal, and the second a case where the sample distribution is not normal.

Three tools are used:

1. Microsoft Excel: https://www.microsoft.com/en-us/store/d/excel-2016
2. Minitab: www.minitab.com
3. Universal Probability Calculator: http://www.dunamath.com/dunamath.aspx

## Example 1: Normal distribution

Description of the problem:

Assume you measured the commuting time from your house to the office 15 times and found that the average time is 53 minutes. You also wish to know the probability of taking up to 1 hour to get to the office.

## Solution using “Excel - Windows”:

On the Excel menu, choose Data -> Data Analysis -> Descriptive Statistics to get the table below:

Because Kurtosis and Skewness are close to zero, we can assume that the distribution is normal, or at least close to it. The sample size is 15, which is small, and we do not know the variance of the population. For these reasons it is appropriate to use a Student distribution.
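The skewness and kurtosis check can also be reproduced outside Excel. The sketch below is a minimal illustration using the bias-corrected sample formulas that, to the best of our knowledge, Excel's SKEW and KURT functions use. The function names `excel_skew` and `excel_kurt` and the simulated sample are assumptions made for illustration; they are not the article's data.

```python
import math
import random

def _mean_and_sd(values):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd

def excel_skew(values):
    """Bias-corrected sample skewness (assumed to match Excel's SKEW)."""
    n = len(values)
    mean, sd = _mean_and_sd(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)

def excel_kurt(values):
    """Bias-corrected excess kurtosis (assumed to match Excel's KURT)."""
    n = len(values)
    mean, sd = _mean_and_sd(values)
    fourth = sum(((v - mean) / sd) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical sample of 15 commuting times, simulated only for illustration
random.seed(1)
illustrative_sample = [random.gauss(50, 5) for _ in range(15)]
print(f"skewness = {excel_skew(illustrative_sample):.3f}")
print(f"kurtosis = {excel_kurt(illustrative_sample):.3f}")
```

With only 15 values both statistics vary a lot from sample to sample, so values near zero are a weak indication of normality rather than proof.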

We have 14 degrees of freedom, and t = (60-53.102)/8.280 = 0.833

Using the Excel command T.DIST(0.833,14,1), we get that the probability of a value up to 60 minutes is 79.06%.
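For readers without Excel, the same calculation can be sketched in Python. The block below uses only the standard library: it integrates the Student density numerically with Simpson's rule, which is our choice for this sketch and not how Excel computes T.DIST. The helper names `t_pdf` and `t_cdf` are hypothetical. If SciPy is available, `scipy.stats.t.cdf` should give the same number.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of the Student distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x), integrating the density from 0 to x with Simpson's rule."""
    if x == 0:
        return 0.5
    h = x / steps  # signed step, so negative x is handled as well
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

# Sample statistics quoted in the text
sample_mean, sample_sd, n = 53.102, 8.280, 15
t_value = (60 - sample_mean) / sample_sd
prob_up_to_60 = t_cdf(t_value, df=n - 1)
print(f"t = {t_value:.3f}, P(time <= 60) = {prob_up_to_60:.4f}")
```

The printed probability should land close to the 79.06% reported by Excel.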

## Solution using “Minitab”:

Initially we perform a goodness-of-fit test for a normal distribution. On Minitab: Stat-> Basic Statistics -> Normality Test. For Anderson-Darling and Kolmogorov-Smirnov we have the results below, both failing to reject the null hypothesis of normality (p-values not smaller than 0.05). Therefore, it is plausible to assume the distribution is normal.

Because the sample size is small and we do not know the variance of the population, we use the Student distribution. On Minitab: Calc -> Probability Distributions-> . Select “cumulative probability”, and in the field “input constant” enter 0.833 (the same value used for the Excel solution). By doing so, we get the answer (on the right) saying the probability of a value smaller than 60 minutes is 79.06%.

## Solution using the “Universal Probability Calculator (UPC)”:

The first step is shown in the following figure:

In the second step, copy and paste the values directly to the field:

After clicking on the “Calculate” button, a message is displayed saying that the probability is 81.9%.

## Discussion of the results

First, regarding the source of the data: we generated a population of 20000 values in Matlab using the expression (randn(20000,1) * 5 ) + 50 . From this population, we drew 15 values at random (our working sample).
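The data generation described above, and the true probability it implies, can be sketched in Python as follows. This is an illustrative counterpart of the Matlab expression, not the original script; the seed is arbitrary, so the sample drawn here will differ from the article's working sample.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(2018)  # arbitrary seed, used only to make the sketch reproducible

# Python counterpart of the Matlab expression (randn(20000,1) * 5) + 50
population = [random.gauss(50, 5) for _ in range(20000)]

# Draw 15 values at random as the working sample
sample = random.sample(population, 15)

# Exact probability of a value up to 60 for the generating distribution N(50, 5)
true_prob = NormalDist(mu=50, sigma=5).cdf(60)

# Share of the simulated population at or below 60 minutes
population_share = sum(v <= 60 for v in population) / len(population)

print(f"sample mean = {mean(sample):.2f}, sample sd = {stdev(sample):.2f}")
print(f"true P(time <= 60) = {true_prob:.4f}")
print(f"share of population <= 60 = {population_share:.4f}")
```

Because 60 lies two standard deviations above the population mean, the exact probability is about 97.7%, which is the reference value against which the three tools are compared.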

Summary of the results:

| UPC-Dunamath | Excel | Minitab | True value |
|---|---|---|---|
| 81.9% | 79.06% | 79.06% | 97.72% |

Excel and Minitab returned the same result because they both used the Student distribution with the same parameter. Both assumed a normal distribution, which is correct in this case because the population was generated from a normal distribution. However, the mean and standard deviation used in the calculation are far from the population values. For the sample, the mean and standard deviation are 53.1 and 8.28 respectively, while for the population they are 50 and 5. This explains the large error in the answer.

Another point is that even when Excel and/or Minitab are used correctly, the decision maker is likely to trust the result (79.06%) and decide on that basis. These tools give no information about the accuracy of the answer, and users are often not even aware that the result carries uncertainty.

The Universal Probability Calculator (UPC) returned a probability of 81.9%, slightly closer to the true value than the others, and it also reports a low confidence level of 64%, alerting the user to the uncertainty.

The UPC not only computes the probability in a simple and straightforward way, but also gives an estimate of the uncertainty present in the result. By doing so, it seems to be fairer to the decision maker. Anyone who wishes to have a smaller uncertainty needs to increase the size of the sample.

## Example 2: Non-normal distribution

Problem description:

A product engineer is studying the lifetime of a hard disc drive. In one experiment, the lifetime in hours of 10 discs is measured. The results are in the following table:

A) What is the probability of a disc lasting longer than 1900 hours?

## Solution using “Excel - Windows”:

Excel Menu: Data -> Data Analysis -> Descriptive Statistics to get the following table:

Because Kurtosis and Skewness are close to zero, it is plausible to assume the distribution is normal or approximately normal. The sample size is small and we do not know the variance of the population, so we use the Student distribution.

We have 9 degrees of freedom and t = (1900-1972.09)/55.30 = -1.304

The Excel command T.DIST(-1.304,9,1) returns 11.2%, the probability of a value up to 1900. The probability of a value greater than 1900 is the complement, 100% - 11.2% = 88.8%.

## Solution using “Minitab”:

Initially we perform a goodness-of-fit test for a normal distribution. On Minitab: Stat-> Basic Statistics -> Normality Test. For Anderson-Darling and Kolmogorov-Smirnov we have the results below, both failing to reject the null hypothesis of normality. Therefore, it is plausible to assume the distribution is normal.

Because the sample size is small and we do not know the variance of the population, we use the Student distribution. On Minitab: Calc -> Probability Distributions-> . Select “cumulative probability”, and in the field “input constant” enter -1.304 (the same value computed in Excel). The answer is in the figure below, where the probability is (100-11.2)=88.8%.

## Solution using the “Universal Probability Calculator (UPC)”:

Step 1 as follows:

Note that you could have selected ≥ instead, but because the data are continuous, the choice makes no difference.

Step 2 as follows, copying and pasting the data:

After clicking on “Calculate”, a message is displayed saying that the probability of a disc lasting longer than 1900 hours is 90.56%.

B) Regarding the probability you have just calculated, how sure are you?

Using UPC-Dunamath, a message is displayed saying:

“We are 68% confident that the true value is between 85.56% and 95.56%”.

It means we are 68% confident the true value falls within this interval. It also means that if you collect new samples and repeat the test many times, at least 68% of the probabilities will fall within the interval. Note that Excel and Minitab do not provide this information.

C) In order to improve the confidence level, you obtain the lifetime of another 30 discs and repeat the test together with the previous sample, for a total sample of 40 discs. What is the probability of a disc lasting longer than 1900 hours?

## Solution using “Excel - Windows”:

Excel menu: Data -> Data Analysis -> Descriptive Statistics:

Kurtosis and Skewness are NOT close to zero. They are not far from it either, but in this case it is safer not to assume the distribution is normal.

In Excel there is no straightforward method to deal with non-normal data. One alternative is to assume the data is not far from normal and use the Student distribution with t = (1900-1979.30)/110.29 = -0.719. The Excel command T.DIST(-0.719,39,1) gives the probability of a value up to 1900, and its complement, the probability of a value greater than 1900, is 76.2%.

Another alternative is to use an Empirical Distribution Function (EDF), as shown in the next step.

A table with the Empirical Distribution is shown as follows:

In the Empirical Distribution table, the first column holds the data sorted in ascending order. The second column holds, for each value, the number of values smaller than or equal to it (which coincides with the row number). The third column holds the second column divided by the sample size, giving a cumulative frequency. Finally, the fourth column holds the complement of the third column.

We want to calculate the probability of a value greater than 1900. In the table, the value 1900 falls between rows 6 and 7 (1899.61 and 1909.35), where the fourth column reads 85% (1 - 6/40) and 82.5% (1 - 7/40). We can therefore say that the probability is between about 82.5% and 85%. Note that there is no guarantee the true value is within this interval, but for non-normal data this is a simple method to get a notion of the probability.
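The bracketing read off the Empirical Distribution table can be sketched in Python. The function name `survival_bounds` is hypothetical, and because the article's 40 measurements are not reproduced here, the sample in the sketch is simulated from the Weibull parameters quoted in the discussion section. It is a stand-in, not the original data.

```python
import random

def survival_bounds(values, cutoff):
    """Bracket P(X > cutoff) using the empirical distribution of the sample."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of values at or below the cut-off (the row just below it in the table)
    rows_below = sum(v <= cutoff for v in ordered)
    upper = 1 - rows_below / n               # complement at the row below the cut-off
    lower = 1 - min(rows_below + 1, n) / n   # complement at the next row
    return lower, upper

# Simulated stand-in for the 40 disc lifetimes (not the article's sample)
random.seed(40)
illustrative_sample = [random.weibullvariate(2042.6, 25.8773) for _ in range(40)]

low, high = survival_bounds(illustrative_sample, 1900)
print(f"P(lifetime > 1900) is roughly between {low:.1%} and {high:.1%}")
```

For a sample of 40 with six values at or below 1900, as in the article, the function returns the same 82.5% and 85% bounds.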

## Solution using “Minitab”:

Initially we perform a goodness-of-fit test for a normal distribution. On Minitab: Stat-> Basic Statistics -> Normality Test. For Anderson-Darling the null hypothesis of normality is rejected. Therefore, it is not plausible to assume the distribution is normal.

Because the distribution is not normal, we need to estimate the type of the distribution. Minitab menu: Stat > Quality Tools > Individual Distribution Identification.

We get the table below with an Anderson-Darling test applied to different types of distribution. In general, all distributions with P smaller than 0.05 are immediately discarded. From the remaining ones, we pick the one with the greatest P value.

In our case, the first is “Johnson Transformation”, then “Box-Cox Transformation”, and after that, “Weibull”. Because the first two are transformations and not native distributions, and because there is no straightforward method to use them in Minitab, we pick the “Weibull” distribution.

Along with the previous table we also get the following table with the parameters of each distribution. In our case, for “Weibull”, there are 2 parameters: 22.30053 (shape) and 2027 (scale).

In the next step, on the Minitab menu: Calc -> Probability Distributions->Weibull. Select “cumulative probability”, type the 2 parameter values, and in the field “input constant” type the value 1900.

By doing so, we have the answer as follows:

We want the probability of having values greater than 1900, so we have (1-0.2098) = 0.7902 = 79.02%. Phew!!! Finally!
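The Weibull step can be checked with a few lines of Python, using the standard closed form of the Weibull cumulative distribution, F(x) = 1 - exp(-(x/scale)^shape). The function name `weibull_cdf` is ours. Since the scale is quoted in the text as a rounded 2027, the sketch may differ from Minitab's 0.2098 in the third decimal place.

```python
import math

def weibull_cdf(x: float, shape: float, scale: float) -> float:
    """Cumulative probability P(X <= x) for a two-parameter Weibull distribution."""
    return 1 - math.exp(-((x / scale) ** shape))

# Parameters reported by Minitab in the text (scale as quoted, i.e. rounded)
shape, scale = 22.30053, 2027

cdf_1900 = weibull_cdf(1900, shape, scale)
prob_longer = 1 - cdf_1900  # probability of lasting longer than 1900 hours

print(f"P(lifetime <= 1900) = {cdf_1900:.4f}")
print(f"P(lifetime > 1900)  = {prob_longer:.4f}")
```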

## Solution using the “Universal Probability Calculator (UPC)”:

Specify the cut-off point and paste the data sample as shown in the figures below:

Note that there are more values to the right of the field in the picture.

After clicking on “Calculate”, a message is displayed saying that the probability of a disc lasting longer than 1900 hours is 85.09%, with 79% confidence that the true value is between 80.09% and 90.09%.

## Discussion of the results:

First, regarding the source of the data: we generated 20000 values in Matlab using the function wblrnd(2042.6,25.8773,20000,1), giving a population with a Weibull distribution, mean 2000.3 and standard deviation 97.192. From that population, we drew our samples at random.
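As a cross-check on the reference value, the population can be reproduced approximately in Python. In this sketch `random.weibullvariate(alpha, beta)` takes the scale first and the shape second, matching the argument order of Matlab's `wblrnd(A, B, ...)`. The seed is arbitrary, so the simulated population is not the article's population, and its statistics will differ slightly from those quoted.

```python
import math
import random

scale, shape = 2042.6, 25.8773  # wblrnd(A, B, ...): scale A, then shape B

# Analytic values for the generating Weibull distribution
true_prob = math.exp(-((1900 / scale) ** shape))  # P(X > 1900)
theoretical_mean = scale * math.gamma(1 + 1 / shape)
theoretical_sd = scale * math.sqrt(
    math.gamma(1 + 2 / shape) - math.gamma(1 + 1 / shape) ** 2
)

# Simulated population of 20000 lifetimes (arbitrary seed)
random.seed(7)
population = [random.weibullvariate(scale, shape) for _ in range(20000)]
population_share = sum(v > 1900 for v in population) / len(population)

print(f"analytic P(lifetime > 1900) = {true_prob:.4f}")
print(f"theoretical mean = {theoretical_mean:.1f}, sd = {theoretical_sd:.2f}")
print(f"share of simulated population > 1900 = {population_share:.4f}")
```

The analytic probability comes out near 85.8%, close to the 85.82% used as the true value in the tables. The small difference is consistent with the article's figure having been measured on the simulated population itself, although the text does not say so.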

For the first sample (N=10):

| UPC-Dunamath | Excel | Minitab | True value |
|---|---|---|---|
| 90.56% | 88.8% | 88.8% | 85.82% |

For the extended sample (N=40):

| UPC-Dunamath | Excel (using Student Distribution) | Excel (empirical distribution) | Minitab | True value |
|---|---|---|---|---|
| 85.09% | 76.2% | [82.5% - 85%] | 79.02% | 85.82% |

In the first table (N=10), we see that Excel and Minitab returned the same result because they both used the Student distribution with the same . But note that, although the sample passed the goodness-of-fit test for normality, the correct distribution is Weibull.

The mean and standard deviation of the sample are 1972.09 and 55.30 respectively, while for the population they are 2000.3 and 97.192. Note that although we assumed the wrong distribution, the probability error is small. This might be just luck, for example a numerical coincidence influenced by the relation between mean and standard deviation.

For UPC, the probability is 90.56%, with 68% confidence that the true value is between 85.56% and 95.56%. The confidence is low due to the small sample which is an alert for the user, not provided by the other tools. Despite that, the true value is within the interval.

In the second table (N=40), we rejected the assumption of normality in both Excel and Minitab. For Excel, we proposed using the Empirical Distribution Function just to get an idea of the probability, obtaining a value between about 82.5% and 85%, which is plausible compared with the correct answer.

Using Minitab, after the hard work of identifying a suitable distribution type and its parameters and performing the calculation, the result was even worse than in the case with N=10 . This is inconvenient but possible, because we are using small samples: maybe the additional values are less representative of the population than the initial sample, or it is just a numerical coincidence.

For UPC, the probability is 85.09%, with 79% confidence that the true value is between 80.09% and 90.09%. Compared with N=10, the confidence level increased significantly due to the bigger sample size. The error is smaller than with Excel and Minitab, and the true value is within the estimated interval.

This example shows how complicated these analyses can become. It is complicated to calculate a value for the probability, and after that, you still do not know the uncertainty of the result. The Universal Probability Calculator (UPC) makes this calculation much easier, and also gives an estimate of the uncertainty involved. You do not need to worry about all the statistical assumptions and tricky details; everything is handled by the algorithm.

The Universal Probability Calculator can be accessed on the website: https://dunamath.com/homeUPC.aspx