
Calculating Probabilities for Normal and Non-Normal Distributions

Updated on May 2, 2018

Introduction

We show different ways to calculate probabilities. The first example covers a case where the data sample is normal, and the second a case where the sample distribution is not normal.

Three tools are used:

  1. Microsoft Excel: https://www.microsoft.com/en-us/store/d/excel-2016
  2. Minitab: www.minitab.com
  3. Universal Probability Calculator: http://www.dunamath.com/dunamath.aspx

Example 1: Normal distribution

Description of the problem:

Assume you measured the commuting time from your house to the office 15 times and found that the average time is 53 minutes. You now wish to know the odds of taking up to 1 hour to get to the office.

Solution using “Excel - Windows”:

On the Excel menu, go to Data -> Data Analysis -> Descriptive Statistics to get the table below:

Because Kurtosis and Skewness are close to zero, we can assume that the distribution is normal, or at least close to it. The sample size is 15 which is small. We also do not know the variance of the population. For all these reasons it is appropriate to use a Student distribution.
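For readers without Excel, the two shape statistics can be sketched in a few lines of Python. This is only an illustration: it assumes the bias-corrected sample formulas that Excel documents for its SKEW and KURT functions, and the commute times below are hypothetical values, not the article's sample.

```python
import math

def sample_skewness(data):
    """Bias-corrected sample skewness (same formula as Excel's SKEW)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

def sample_excess_kurtosis(data):
    """Bias-corrected excess kurtosis (same formula as Excel's KURT); needs n >= 4."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical commute times in minutes (illustrative only, NOT the article's data).
times = [48.0, 55.5, 61.2, 50.3, 44.9, 58.7, 52.1, 49.6]
print(f"skewness = {sample_skewness(times):.3f}")
print(f"excess kurtosis = {sample_excess_kurtosis(times):.3f}")
```

Values near zero for both statistics are what the text treats as a sign of approximate normality; with a sample this small they are only a rough indication.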

We have 14 degrees of freedom (15 - 1), and t = (60-53.102)/8.280 = 0.833

Using the Excel command T.DIST(0.833,14,1), we get that the probability of having a value up to 60 minutes is 79.06%.
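The same calculation can be reproduced outside Excel. The sketch below is a minimal illustration using only the Python standard library: it integrates the Student density numerically with Simpson's rule instead of calling a statistics package, and the function names t_pdf and t_cdf are our own, not part of any tool mentioned in this article.

```python
import math

def t_pdf(x, df):
    """Density of the Student distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """Cumulative probability P(T <= x), via Simpson's rule on [0, |x|]."""
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |x|
    return 0.5 + area if x > 0 else 0.5 - area

# Values taken from the descriptive statistics quoted in the text.
mean, std_dev, n, cutoff = 53.102, 8.280, 15, 60
t = (cutoff - mean) / std_dev
p_up_to_cutoff = t_cdf(t, n - 1)
print(f"t = {t:.3f}, P(time <= {cutoff}) = {p_up_to_cutoff:.4f}")
```

The printed probability should land close to the 79.06% that T.DIST reports. For a "greater than" question, take the complement 1 - t_cdf(t, df).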

Solution using “Minitab”:

Initially we perform a goodness-of-fit test for a normal distribution. On Minitab: Stat-> Basic Statistics -> Normality Test. For Anderson-Darling and Kolmogorov-Smirnov we have the results below; neither rejects the null hypothesis of normality (p-values not smaller than 0.05). Therefore, it is plausible to assume the distribution is normal.

Because the sample size is small and we do not know the variance of the population, we use the Student distribution. On Minitab: Calc -> Probability Distributions-> t. Select “cumulative probability”, and in the field “input constant” enter 0.833 (the same value previously used for the Excel solution). By doing so, we get the answer (on the right) saying the probability of having a value smaller than 60 minutes is 79.06%.

Solution using the “Universal Probability Calculator (UPC)”:

The first step is shown in the following figure:

In the second step, copy and paste the values directly to the field:

After clicking on the button “Calculate” a message is displayed saying that the odds are 81.9%.

Discussion of the results

Regarding the source of the data, a population of 20000 values was generated in Matlab with the expression (randn(20000,1) * 5 ) + 50, that is, a normal distribution with mean 50 and standard deviation 5. From this population, we drew 15 values at random (our working sample).

Summary of the results:

Tool              Probability
UPC-Dunamath      81.9%
Excel             79.06%
Minitab           79.06%
Correct answer    97.72%

Excel and Minitab returned the same result because both used the Student distribution with the same parameters (t = 0.833 and 14 degrees of freedom). Both assumed a normal distribution, which is correct in this case because the population was generated from one. However, the mean and standard deviation used in the calculation differ considerably from the population values: for the sample they are 53.1 and 8.28 respectively, while for the population they are 50 and 5. This explains the large error in the answer.

Another point is that, even when Excel and/or Minitab are used correctly, the decision maker is likely to trust the result (79.06%) and decide on that basis. These tools give no information about the accuracy of the answer, and users are often not even aware that the result carries uncertainty.

The Universal Probability Calculator (UPC) returned a probability of 81.9%, a little better than the others, and it also reports a low confidence level of 64%, alerting the user about it.

The UPC not only computes the probability in a very simple and straightforward way, but also gives an estimate of the uncertainty present in the result. By doing so, it seems to be fairer to the decision maker, who needs to increase the size of the sample to obtain a smaller uncertainty.
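The effect of the parameter error can be checked directly. The sketch below uses Python's standard statistics.NormalDist to compare the probability under the true population parameters (mean 50, standard deviation 5) with the probability under the sample estimates. It uses a plain normal distribution for both cases, which is a simplification of the Student calculation above.

```python
from statistics import NormalDist

# True population used to generate the data: randn(...) * 5 + 50.
population = NormalDist(mu=50, sigma=5)
# Normal distribution fitted with the sample estimates from the text.
sample_fit = NormalDist(mu=53.102, sigma=8.280)

p_true = population.cdf(60)           # probability of a commute up to 60 minutes
p_sample_normal = sample_fit.cdf(60)  # same question, using the sample estimates

print(f"population parameters: {p_true:.4f}")
print(f"sample estimates:      {p_sample_normal:.4f}")
```

The first value corresponds to the 97.72% listed as the correct answer (60 minutes is two standard deviations above the population mean). The second is close to the Student result, which shows that the gap comes mainly from the estimated mean and standard deviation rather than from the choice between the normal and Student distributions.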

Example 2: Non-normal distribution

Problem description:

A product engineer is studying the lifetime of a hard disc drive. In one experiment, the lifetime in hours of 10 discs is measured. The results are in the following table:

A) What is the probability of having a disc lasting longer than 1900 hours?

Solution using “Excel - Windows”:

Excel Menu: Data -> Data Analysis -> Descriptive Statistics to get the following table:

Because Kurtosis and Skewness are close to zero, it is plausible to assume the distribution is normal or approximately normal. The sample size is small and we do not know the variance of the population, therefore we decide to use Student distribution.

We have 9 degrees of freedom (10 - 1) and t = (1900-1972.09)/55.30 = -1.304

The Excel command T.DIST(-1.304,9,1) returns 11.2%, the probability of having a value up to 1900. The probability of having a value greater than 1900 is therefore (100-11.2) = 88.8%.

Solution using “Minitab”:

Initially we perform a goodness-of-fit test for a normal distribution. On Minitab: Stat-> Basic Statistics -> Normality Test. For Anderson-Darling and Kolmogorov-Smirnov we have the results below; neither rejects the null hypothesis of normality. Therefore, it is plausible to assume the distribution is normal.

Because the sample size is small and we do not know the variance of the population, we use the Student distribution. On Minitab: Calc -> Probability Distributions-> t. Select “cumulative probability”, and in the field “input constant” enter -1.304 (the same value computed in Excel). The answer is in the figure below, where the probability is (100-11.2)=88.8%.

Solution using the “Universal Probability Calculator (UPC)”:

Step 1 as follows:

Note that you could have selected ≥ , but because the data refers to continuous variables, that makes no difference.

Step 2 as follows, copying and pasting the data:

After clicking on “Calculate”, a message is displayed saying that the probability of having a disc lasting longer than 1900 hours is 90.56%.

B) Regarding the probability you have just calculated, how sure are you?

Using UPC-Dunamath, a message is displayed saying:

“We are 68% confident that the true value is between 85.56% and 95.56%”.

That is, we are 68% confident the true value falls within this interval. It also means that if you collect further samples to repeat the test, and keep doing that many times, at least 68% of the resulting probabilities will fall within the interval. Note that Excel and Minitab do not provide this information.

C) In order to improve the confidence level, you obtain the lifetimes of another 30 discs and repeat the test together with the previous sample, for a total sample of 40 discs. What is the probability of having a disc lasting longer than 1900 hours?

Solution using “Excel - Windows”:

Excel menu: Data -> Data Analysis -> Descriptive Statistics:

Kurtosis and Skewness are NOT close to zero, although not too far from it either. In this case, it is safer not to assume the distribution is normal.

In Excel there is no straightforward method to deal with non-normal data. One alternative is to assume the data is not far from normal and use the Student distribution, with t=(1900-1979.30)/110.29 = -0.719. The Excel command T.DIST(-0.719,39,1) gives the probability of a value up to 1900, and its complement results in 76.2%.

Another alternative is to use an Empirical Distribution Function (EDF), as shown in the next step.

A table with the Empirical Distribution is shown as follows:

In the Empirical Distribution table, the first column holds the data sorted in ascending order. The second column holds, for each value, the number of values smaller than or equal to it (which coincides with the row number). The third column holds the second column divided by the sample size, resulting in a cumulative frequency. Finally, the fourth column holds the complement of the third column.

We want to calculate the probability of having a value greater than 1900. In the table, the value 1900 falls between rows 6 and 7 (1899.61 and 1909.35), whose complements are 1 - 6/40 = 85% and 1 - 7/40 = 82.5%. It is therefore possible to say that the probability is between 82.5% and 85%. Note that there is no guarantee the true value is within this interval. But for non-normal data, this is a simple method to give a notion of the probability.
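The empirical table and the bracketing step can be sketched in Python as follows. This is a minimal illustration that assumes all values in the sample are distinct; the function names and the eight lifetimes are hypothetical and are not the article's data.

```python
def empirical_table(sample):
    """Rows of (value, rank, cumulative frequency, complement), sorted ascending."""
    n = len(sample)
    return [(v, i, i / n, 1 - i / n) for i, v in enumerate(sorted(sample), start=1)]

def survival_bracket(sample, cutoff):
    """Complements of the two rows that surround the cutoff, as (lower, upper)."""
    table = empirical_table(sample)
    below = [row for row in table if row[0] <= cutoff]
    above = [row for row in table if row[0] > cutoff]
    upper = below[-1][3] if below else 1.0  # last row at or below the cutoff
    lower = above[0][3] if above else 0.0   # first row above the cutoff
    return lower, upper

# Hypothetical lifetimes in hours (illustrative only, NOT the article's sample).
lifetimes = [1850, 1890, 1920, 1960, 1990, 2010, 2050, 2100]
for row in empirical_table(lifetimes):
    print(row)
lower, upper = survival_bracket(lifetimes, 1900)
print(f"P(lifetime > 1900) is between {lower:.1%} and {upper:.1%}")
```

Applied to the 40-disc sample, the same two rows (6 and 7) would give the 82.5% and 85% bounds quoted in the text.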

Solution using “Minitab”:

Initially we perform a goodness-of-fit test for a normal distribution. On Minitab: Stat-> Basic Statistics -> Normality Test. For Anderson-Darling the null hypothesis of normality is rejected. Therefore, it is not plausible to assume the distribution is normal.

Because the distribution is not normal, we need to estimate the type of the distribution. Minitab menu: Stat > Quality Tools > Individual Distribution Identification.

We get the table below with an Anderson-Darling test applied to different types of distribution. In general, all distributions with P smaller than 0.05 are immediately discarded. From the remaining ones, we choose the one with the greatest P value.

In our case, the first is “Johnson Transformation”, then “Box-Cox Transformation”, and after that, “Weibull”. Because the first two are transformations and not native distributions, and there is no straightforward method to use them in Minitab, we pick the “Weibull” distribution.

Along with the previous table we also get the following table with the parameters of each distribution. In our case, for “Weibull”, there are 2 parameters: 22.30053 (shape) and 2027 (scale).

In the next step, on the Minitab menu: Calc -> Probability Distributions->Weibull. Select “cumulative probability”, enter the 2 parameter values and, in the field “input constant”, type the value 1900.

By doing so, we have the answer as follows:

We want the probability of having values greater than 1900, so we have (1-0.2098) = 0.7902 = 79.02%. Phew!!! Finally!
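The Minitab step can be cross-checked with the closed-form Weibull cumulative distribution, F(x) = 1 - exp(-(x/scale)^shape). The sketch below uses the fitted parameters quoted above; the function name weibull_cdf is our own.

```python
import math

def weibull_cdf(x, shape, scale):
    """P(X <= x) for a two-parameter Weibull distribution."""
    return 1 - math.exp(-((x / scale) ** shape))

# Fitted parameters reported by Minitab in the text.
shape, scale = 22.30053, 2027
p_up_to_1900 = weibull_cdf(1900, shape, scale)
p_longer = 1 - p_up_to_1900

print(f"P(lifetime <= 1900) = {p_up_to_1900:.4f}")
print(f"P(lifetime >  1900) = {p_longer:.4f}")
```

The result should agree with the 79.02% above to within a fraction of a percentage point. Any small difference presumably comes from the scale parameter being shown rounded to 2027.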

Solution using the “Universal Probability Calculator (UPC)”:

Specify the cut-off point and paste the data sample as shown in the figures below:

Note there are more values on the right of the field in the picture.

After clicking on “Calculate”, a message is displayed saying that the probability of having a disc lasting longer than 1900 hours is 85.09%, with 79% confidence that the true value is between 80.09% and 90.09%.

Discussion of the results:

Regarding the source of the data, we generated 20000 values in Matlab with the function wblrnd(2042.6,25.8773,20000,1), producing a population with a Weibull distribution (scale 2042.6, shape 25.8773), mean 2000.3 and standard deviation 97.192. From that population, we drew our samples at random.
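The population figures can be approximated from the theoretical Weibull formulas, without simulating. The sketch below assumes Matlab's convention that the first argument of wblrnd is the scale and the second is the shape. Because the article's numbers describe one simulated population of 20000 values, the theoretical values are expected to be close to them but not identical.

```python
import math

# wblrnd(2042.6, 25.8773, ...): scale first, then shape.
scale, shape = 2042.6, 25.8773

# Theoretical mean and standard deviation of a Weibull distribution.
g1 = math.gamma(1 + 1 / shape)
g2 = math.gamma(1 + 2 / shape)
mean = scale * g1
std_dev = scale * math.sqrt(g2 - g1 ** 2)

# Survival function: probability that a disc lasts longer than 1900 hours.
p_longer_than_1900 = math.exp(-((1900 / scale) ** shape))

print(f"mean = {mean:.1f}, standard deviation = {std_dev:.1f}")
print(f"P(lifetime > 1900) = {p_longer_than_1900:.4f}")
```

These values should sit near the mean of 2000.3, the standard deviation of 97.192 and the 85.82% correct answer used in the comparison tables.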

For the first sample (N=10):

Tool              Probability
UPC-Dunamath      90.56%
Excel             88.8%
Minitab           88.8%
Correct answer    85.82%

For the extended sample (N=40):

Tool                                  Probability
UPC-Dunamath                          85.09%
Excel (using Student Distribution)    76.2%
Excel (empirical distribution)        [82.5% - 85%]
Minitab                               79.02%
Correct answer                        85.82%

In the first table (N=10), we see that Excel and Minitab returned the same result because both used the Student distribution with the same parameters (t = -1.304 and 9 degrees of freedom). But note that, although the sample passed the goodness-of-fit test for normality, the correct distribution is Weibull.

The mean and standard deviation of the sample are 1972.09 and 55.30 respectively, while for the population we have 2000.3 and 97.192. Note that although we assumed the wrong distribution, the probability error is small. This might be just luck, for example a numerical coincidence influenced by the relation between mean and standard deviation.

For UPC, the probability is 90.56%, with 68% confidence that the true value is between 85.56% and 95.56%. The confidence is low due to the small sample which is an alert for the user, not provided by the other tools. Despite that, the true value is within the interval.

In the second table (N=40), we rejected the assumption of normality in both Excel and Minitab. For Excel, we proposed using the Empirical Distribution Function just to get an idea of the probability, obtaining a value between 82.5% and 85%, which is plausible when compared with the correct answer.

Using Minitab, after the hard work of identifying a suitable distribution type and its parameters and performing the calculation, the result was even worse than in the case with N=10. This is inconvenient but possible, because we are using small samples: maybe the additional samples are less representative of the population than the initial sample, or it is just a numerical coincidence.

For UPC, the probability is 85.09%, with 79% confidence that the true value is between 80.09% and 90.09%. Compared with N=10, the confidence level increased significantly due to a bigger sample size. The error is smaller than Excel and Minitab, and the true value is within the estimated interval.

This example shows how complicated these analyses can become. It is complicated to calculate a value for the probability, and after that, you still do not know the uncertainty of the result. The Universal Probability Calculator (UPC) makes this calculation much easier, and also gives an estimate of the uncertainty involved. You do not need to worry about all the statistical assumptions and tricky details; everything is handled by the algorithm.

The Universal Probability Calculator can be accessed at: https://dunamath.com/homeUPC.aspx

© 2018 Douglas Miranda
