# Sample Statistics

Updated on October 15, 2014

## Statistical Estimators

Estimators are commonly judged on three properties: unbiasedness, efficiency, and consistency. An estimator is unbiased when the mean of its sampling distribution can be shown to equal the parameter being estimated; the sample mean, for example, is an unbiased estimator of the population mean (Statistics Explained, 2014). Efficiency concerns the reliability of an estimator: of two estimators computed from the same sample size, the more efficient one has the smaller standard error. When the population is normally distributed, the sample median is also an unbiased estimator of the population mean, but it is less efficient than the sample mean (Pindling, 2009). “A statistics is a consistent estimator of a parameter if its probability that it will be close to the parameter's true value approaches 1 with increasing sample size” (Pindling, 2009).
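Unbiasedness can be checked directly on a small case. The sketch below uses a made-up four-value population (an assumption for illustration only) and lists every possible sample of size 2 drawn with replacement. It then compares the average of the sample means with the population mean, and the average of the n - 1 sample variances with the population variance.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical tiny population, chosen only for illustration.
population = [2, 4, 6, 8]
n = 2  # sample size

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Mean of the sampling distribution of the sample mean.
mean_of_sample_means = mean(mean(s) for s in samples)

# Average of the (n - 1)-denominator sample variance over all samples.
mean_of_sample_variances = mean(variance(s) for s in samples)

print("population mean:", mean(population))
print("mean of sample means:", mean_of_sample_means)
print("population variance:", pvariance(population))
print("mean of sample variances:", mean_of_sample_variances)
```

Both pairs agree on this population, which is what "the mean of the sampling distribution equals the parameter" means in practice.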

What criteria must estimates based on sample statistics have? What two types of sample statistics fit these criteria? Explain why other sample statistics would not coincide with the criteria. Also discuss the relationship between the value assigned to alpha and its relationship to the confidence level in a research study.

Easy-to-understand explanation of the question: We are basically talking about statistics that use samples. This includes different kinds of hypothesis tests, including ANOVA and t tests. We will also discuss alpha levels (the cutoffs that p values are compared against) and confidence levels. Most studies use a 5% alpha level, but what does this really mean?

Two main types of statistics that use samples are ANOVA and t tests. ANOVA is an analysis of variance; an ANOVA test is used to compare the means of more than two samples (Explorable Psychology Experiments, 2014). For example, a researcher testing the effect of five different weight loss programs on women could recruit 20 women and split them into five groups of 4. Each group would then be assigned a different weight loss program, and the results would be recorded after a few weeks. The researcher could then use the ANOVA test to find out whether the effects of the weight loss programs differ significantly by comparing the weights of the 5 groups of 4 women each (Explorable Psychology Experiments, 2014). A t test is a “hypothesis-testing procedure in which the population variance is unknown; it compares t scores from a sample to a comparison distribution called a t distribution” (Aron, Coups, & Aron, 2013, p. 227). The t test is used for one or two samples; for more than two samples, the ANOVA is used.
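The weight loss example can be sketched in code. The scores below are invented purely for illustration (they are not from any study), and the group sizes are assumed equal. The F ratio is figured as the between-groups variance estimate divided by the within-groups variance estimate.

```python
from statistics import mean, variance

# Hypothetical weight loss (kg) for 5 programs with 4 women each.
groups = [
    [3.0, 4.0, 5.0, 4.0],
    [6.0, 7.0, 5.0, 6.0],
    [2.0, 3.0, 2.0, 1.0],
    [5.0, 5.0, 6.0, 4.0],
    [4.0, 3.0, 5.0, 4.0],
]

k = len(groups)       # number of groups
n = len(groups[0])    # scores per group (equal sizes assumed)

group_means = [mean(g) for g in groups]

# Within-groups estimate: average of the group variances.
ms_within = mean(variance(g) for g in groups)

# Between-groups estimate: variance of the group means times n.
ms_between = variance(group_means) * n

f_ratio = ms_between / ms_within
df_between = k - 1
df_within = k * (n - 1)

print(f"F({df_between}, {df_within}) = {f_ratio:.2f}")
```

The resulting F would then be compared with the cutoff from an F table for the same degrees of freedom.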

Four main types of hypotheses appear in hypothesis testing: the null hypothesis, the alternative hypothesis, simple hypotheses, and composite hypotheses. The null hypothesis “represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved” (Statistics Glossary, 2014). The end result of a test is always one of two decisions: reject H0 in favor of H1, or do not reject H0. The alternative hypothesis, referred to as H1, is a statement of what the test is set up to establish. For example, an alternative hypothesis could be that Tylenol is better than Advil, on average. A simple hypothesis specifies the population distribution completely. For example: “H0 X ~ Bi (100, 1/2), i.e. p is specified or H0: X ~ N (5, 20), i.e. µ and sigma^2 are specified” (Statistics Glossary, 2014). A composite hypothesis does not specify the population distribution completely; for example “X ~ Bi (100, p) and H1: p > 0.5 or X ~ N (0, sigma^2) and H1: sigma^2 unspecified” (Statistics Glossary, 2014).

An alpha level is the probability of a Type I error; a Type I error happens when the null hypothesis is rejected when it is in fact true. A confidence level is a measure of the reliability of an estimate: it is the proportion of confidence intervals (CIs), built the same way from repeated samples, that would contain the true parameter. The alpha level is found by subtracting the confidence level from 1. For example, if the confidence level is 90%, then the alpha level is 1 - .90 = .10, or 10%; if a two-tailed test is being used, the alpha level is divided between the two tails, in this case .10/2 = .05, or 5% in each tail. Most studies use a 5% alpha level, which corresponds to a confidence level of 95%.
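The arithmetic linking alpha and the confidence level is short enough to write out. This is a minimal sketch; the function names are made up for the example.

```python
def alpha_from_confidence(confidence_level: float) -> float:
    """Alpha implied by a confidence level (both given as proportions)."""
    return 1.0 - confidence_level


def per_tail_alpha(alpha: float, tails: int = 2) -> float:
    """Share of alpha placed in each tail of the test."""
    return alpha / tails


for level in (0.90, 0.95, 0.99):
    alpha = alpha_from_confidence(level)
    print(
        f"{level:.0%} confidence -> alpha = {alpha:.2f}, "
        f"two-tailed alpha per tail = {per_tail_alpha(alpha):.3f}"
    )
```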

## References

http://www.statisticshowto.com/what-is-an-alpha-level/

https://explorable.com/anova

Aron, A., Aron, E., & Coups, E. (2014). Statistics for Psychology. Pearson Education Inc.

2014.

http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html#h0

## Chapter 9: Introduction to the Analysis of Variance

Analysis of Variance

Testing variation among the means of several groups

ANOVA

One-way analysis of variance

Basic Logic of ANOVA

Null hypothesis

Several populations all have same mean

Do the means of the samples differ more than expected if the null hypothesis were true?

Analyze variances – hence ANOVA

Two different ways of estimating population variance

Estimating population variance from variation from within each sample

Within-groups estimate of the population variance

Not affected by whether the null hypothesis is true

Estimating population variance from variation between the means of the samples

Between-groups estimate of the population variance

When the null hypothesis is true

When the null hypothesis is not true

Sources of variation in within-groups and between-groups variance estimates

The F ratio

Ratio of the between-groups population variance estimate to the within-groups population variance estimate

The F distribution

The F table

Carrying out an ANOVA

Estimating the population variance from the variation of scores within each group

Estimating the population variance from the differences between group means

Estimate the variance of the distribution of means

Estimate the variance of the population of individual scores

Figuring the F ratio

An F distribution

The F table

Between-groups degrees of freedom

Within-groups degrees of freedom

Assumptions in ANOVA

Populations have equal variances

Planned Contrasts

Reject null hypothesis

Population means are not all the same

Planned contrasts

Within-groups population variance estimate

Between-groups population variance estimate

Use the two means of interest

Figure F in the usual way

Bonferroni procedure

Provides a more stringent significance level for each comparison

Post-Hoc Comparisons

Exploratory approach

Scheffé test

Figure the F in the usual way

Divide the F by the overall study’s dfBetween

Compare this to the overall study’s F cutoff

Effect Size for ANOVA

Proportion of variance accounted for (R²)

R² also known as η² (eta squared)

small R² = .01

medium R² = .06

large R² = .14
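As a sketch of how the proportion of variance accounted for can be recovered from a reported result, the snippet below uses the identity R² = (F × df between) / (F × df between + df within), which follows from R² = SS between / SS total. The function names are made up for the example, and the cutoffs are the small, medium, and large conventions listed above.

```python
def r_squared_from_f(f_ratio: float, df_between: int, df_within: int) -> float:
    """Proportion of variance accounted for, figured from F and its df."""
    numerator = f_ratio * df_between
    return numerator / (numerator + df_within)


def label_effect(r2: float) -> str:
    """Label an R-squared value using the .01 / .06 / .14 conventions."""
    if r2 >= 0.14:
        return "large"
    if r2 >= 0.06:
        return "medium"
    if r2 >= 0.01:
        return "small"
    return "below small"


# Example input in the style of a research-article report: F(3, 68) = 5.81.
r2 = r_squared_from_f(5.81, 3, 68)
print(f"R squared = {r2:.3f} ({label_effect(r2)})")
```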

Power for ANOVA (.05 significance level)

Approximate Sample Size Needed in Each Group for 80% Power (.05 significance level)

ANOVA in Research Articles

F(3, 68) = 5.81, p < .01

Means given in a table or in the text

Follow-up analyses

Planned comparisons

Using t tests

Controversies and Limitations

Omnibus test versus planned contrasts

Conduct specific planned contrasts to examine

Theoretical questions

Practical questions

Controversial approach

Reporting in Research Articles

The Structural Model

Flexible way of figuring the two population variance estimates

Handles situation when sample sizes in each group are not equal

Insight into underlying logic of ANOVA

Principles of the Structural Model

Dividing up the deviations

Deviation of a score from the grand mean

Deviation of the score from the mean of its group

Deviation of the mean of its group from the grand mean

Summing the squared deviations

From the sums of squared deviations to the population variance estimates
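The division of deviations can be checked numerically. The sketch below uses invented scores in three groups of unequal size (an assumption for illustration). It splits each score's deviation from the grand mean into a within-group part and a between-group part, sums the squares, and divides each sum by its degrees of freedom to get the two population variance estimates.

```python
# Hypothetical scores in three groups of unequal size (illustration only).
groups = {
    "A": [4.0, 6.0, 8.0],
    "B": [1.0, 3.0],
    "C": [7.0, 9.0, 11.0, 9.0],
}

scores = [x for g in groups.values() for x in g]
grand_mean = sum(scores) / len(scores)

ss_total = ss_within = ss_between = 0.0
for g in groups.values():
    group_mean = sum(g) / len(g)
    for x in g:
        ss_total += (x - grand_mean) ** 2             # score vs. grand mean
        ss_within += (x - group_mean) ** 2            # score vs. its group mean
        ss_between += (group_mean - grand_mean) ** 2  # group mean vs. grand mean

df_between = len(groups) - 1
df_within = len(scores) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(f"SS total = {ss_total:.3f}, SS within + SS between = {ss_within + ss_between:.3f}")
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}")
```

Note that nothing here is multiplied by a per-group sample size and no per-group variances are averaged, which is why this approach copes with unequal group sizes.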

Relation of the structural model approach to the previous approach

Within-groups variance estimate

Never figure the variance estimate for each group and average them

Between-groups variance estimate

Never multiply anything by the number of scores in each sample

Same ingredients for the F ratio

Previous approach

Emphasizes entire groups

Focuses directly on what contributes to the overall population variance estimates

Structural model

Emphasizes individual scores

Focuses directly on what contributes to the divisions of the deviations of scores from the grand mean

Using the Structural Model to Figure an ANOVA

Example analysis of variance table

Analysis of Variance Table

## Chapter 10: Factorial Analysis of Variance

Basic Logic of Factorial Designs and Interaction Effects

Factorial research design

Effect of two or more variables examined at same time

Efficient research design

Interaction effects

Occur when a combination of variables has a special effect

Relationship between one-way analysis of variance and two-way analysis of variance

Main effect

Cell

Cell mean

Marginal means

Recognizing and Interpreting Interaction Effects

Words

Interaction effect occurs when the effect of one variable depends on the level of another variable

Numbers

Graphically

Basic Logic of the Two-Way ANOVA

The three F ratios

Column main effect

Row main effect

Interaction effect

Logic of the F ratios for the row and column main effects

Logic of the F ratio for the interaction effect

Assumptions in Two-Way ANOVA

Populations have equal variances

Assumptions apply to the populations represented in each cell

Extensions and Special Cases of the Factorial ANOVA

Three-way and higher ANOVA designs

Repeated measures ANOVA

Controversies and Limitations

Dichotomizing numeric variables

Median split

Factorial ANOVA in Research Articles

A 3 X 2 ANOVA on the procedural satisfaction scale yielded a significant main effect for procedure F(1, 136) = 94.28, p < .01 and group belongingness F(2, 136) = 3.70, p < .03. More importantly, the interaction between procedure and group belongingness was also significant, F(2, 136) = 3.46, p < .04. Inclusion leads to stronger effects of voice (M = 4.86) as opposed to the no-voice condition (M = 1.89). The means in the exclusion condition were both lower than in the inclusion condition (voice M = 3.31 and no-voice M = 1.81).

Figuring a Two-Way ANOVA

Structural model for the two-way ANOVA

Each score’s deviation from the grand mean can be divided into

Score’s deviation from the mean of its cell

Deviation of the score’s row’s mean from the grand mean

Deviation of the score’s column’s mean from the grand mean

Remainder after other three deviations subtracted from overall deviation from grand mean

Sums of squares

Population variance estimates

F ratios

Degrees of freedom

ANOVA table for two-way ANOVA

Effect Size in Factorial ANOVA

Power for Studies Using 2 x 2 or 2 x 3 ANOVA (.05 significance level)

Approximate Sample Size Needed in Each Cell for 80% Power (.05 significance level)

## Multiple Comparison

I. Multiple Comparison

A. What is it?

When your ANOVA has more than 2 groups and it is significant, it only tells you that two or more groups are significantly different from one another, but not which ones.

- In order to determine which means are different, we have to compare different pairs of means to determine which ones are significantly different. We have to do this for multiple pairs (e.g. X1 vs X2, X1 vs X3, X2 vs X3), so it is called a Multiple Comparison.

B. Why do we do it?

When we want to test the differences between a large number of groups, we could just run a series of t tests and skip the ANOVA. However, each additional test adds to the Type I error rate.

- For example: if I compare 3 groups I will have K(K-1)/2 comparisons to make (3 in this case). If each comparison is tested at the p < .05 level, you end up with a final alpha (probability of Type I error) of .15, i.e., a 15% Type I error rate.

- This is called the Per Experiment (PE) Error Rate. It represents the number of Type I errors we expect to make when the Null Hypothesis (Ho) is true.

- Per Comparison (PC) Error Rate = alpha for each test

PE error rate = (# comparisons) * (PC)

(3) * (.05) = .15

- Familywise (FW) Error Rate = When we have a group of comparisons between means (Called a Family of Comparisons), then we can estimate the probability that we have at least 1 Type I error in our family of comparisons.

$FW = 1 - (1 - PC)^{c}$

FW for a variable with 3 groups = $1 - (1 - .05)^{3} = 1 - .8574 = .1426$

- PC < FW< PE
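The three error rates can be figured side by side. This is a minimal sketch; `error_rates` is a name made up for the example, and it assumes every pair of the K groups is compared.

```python
from math import comb


def error_rates(k: int, pc: float = 0.05):
    """Return (c, PE, FW) when all pairs among k groups are compared."""
    c = comb(k, 2)           # number of comparisons, K(K-1)/2
    pe = c * pc              # per experiment error rate
    fw = 1 - (1 - pc) ** c   # familywise error rate
    return c, pe, fw


for k in (3, 4, 5):
    c, pe, fw = error_rates(k)
    print(f"K = {k}: c = {c}, PC = .05, FW = {fw:.4f}, PE = {pe:.2f}")
```

For every K above 2 the printed values fall in the order PC < FW < PE.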

II. Planned Linear Contrasts (A priori)

A. What, When, & Why

1. What- Planned Linear Contrasts are comparisons that you plan on making between specific means or groups of means based on your hypotheses. They are tested using F statistics with df between = 1 and df error = N - K degrees of freedom, and each test is compared to the critical value of F at the p = .05 level.

2. When- Use them when you have specific hypotheses regarding the means of your groups. In general you should not use Linear Contrasts when the number of comparisons you are making exceeds the number of degrees of freedom you have between groups (i.e., K-1). When the number of contrasts exceeds K-1, the Bonferroni procedure is generally preferred.

3. Why- This test holds the FW error rate at .05, because we are not exceeding the Type I error rate for the overall Alpha.

B. Simple Comparisons - Comparing two means at a time.

For example: an ANOVA with 4 levels. 1 vs. 2, 2 vs. 3, 3 vs. 4.

C. Complex Comparisons - Comparing groups of means or a group of means to a single mean.

For example, in a One-Way ANOVA with 4 levels you may be interested in testing the differences between group 1 and group 3, group 1 and group 4, group 2 and group 3, and group 2 and group 4, but have no reason to believe that groups 1 and 2 would differ or that groups 3 and 4 would differ. In such a case you could compare the groups of groups:

(Grp 1 + Grp 2)/2 vs. (Grp 3 + Grp 4)/2
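A complex comparison like this one can be written as a set of contrast weights that sum to zero: 1/2, 1/2, -1/2, -1/2. The sketch below uses invented scores and the standard contrast F, (sum of weight × mean)² divided by MS within × sum of (weight² / n). That formula is an assumption here, since these notes do not spell one out. The test has df between = 1 and df error = N - K.

```python
from statistics import mean, variance

# Hypothetical scores for 4 groups (illustration only).
groups = [
    [3.0, 5.0, 4.0, 4.0],   # group 1
    [4.0, 6.0, 5.0, 5.0],   # group 2
    [7.0, 9.0, 8.0, 8.0],   # group 3
    [8.0, 10.0, 9.0, 9.0],  # group 4
]
# (Grp 1 + Grp 2)/2 vs. (Grp 3 + Grp 4)/2
weights = [0.5, 0.5, -0.5, -0.5]

means = [mean(g) for g in groups]
sizes = [len(g) for g in groups]
N, K = sum(sizes), len(groups)

# Pooled within-groups variance estimate (MS within).
ss_within = sum(variance(g) * (len(g) - 1) for g in groups)
df_error = N - K
ms_within = ss_within / df_error

# Value of the contrast and its F statistic (df between = 1).
contrast_value = sum(w * m for w, m in zip(weights, means))
f_contrast = contrast_value ** 2 / (
    ms_within * sum(w ** 2 / n for w, n in zip(weights, sizes))
)

print(f"contrast = {contrast_value:.2f}, F(1, {df_error}) = {f_contrast:.2f}")
```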

III. Post Hoc Analyses (A posteriori)

A. What, When & Why

1. What- Post Hoc tests allow us to make simple and complex comparisons between our group means when we do not have any a priori hypotheses about how the means should be related. Most of these tests use some form of the t statistic. The basic root of all the Post Hoc tests is:

Different tests use different degrees of freedom and different alpha levels for the critical value, depending on how they control the FW error rate.
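For a pairwise comparison of groups i and j, one widely used form of this t statistic is sketched below. It is an assumption added for illustration, since the notes do not show the formula. Here $MS_{error}$ is the within-groups variance estimate from the ANOVA, and $n_i$ and $n_j$ are the two group sizes.

```latex
$$
t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MS_{error}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}}
$$
```

Under this form, the individual tests differ only in the degrees of freedom and alpha level used to look up the critical value.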

2. When- Use them when you don’t have a priori hypotheses or when you find interesting but unexpected results. Also, Post Hoc tests can only be examined and interpreted when we have a significant overall F; otherwise it is inappropriate to examine Post Hoc tests.

3. Why- Because our hypotheses are not a priori and because we will probably test all possible mean combinations, c (c = K(K-1)/2), we greatly increase the likelihood of committing a Type I error as the number of groups increases. Post Hoc tests employ various means of controlling the FW and PE error rates.

NOTE: In SPSS, post hoc tests will not allow you to make complex comparisons. Only the Linear Contrasts options will allow this. However, you can make complex comparisons using post hoc tests by using hand calculations.

B. Choosing Tests

- Different Post Hoc tests use different methods to control FW and PE. Some tests are very conservative. Conservative tests go to great lengths to prevent the user from committing a Type I error. They use more stringent criteria for determining significance. Many of these tests become more and more stringent as the number of groups increases (directly limiting the FW and PE error rates). Although these tests buy you protection against Type I error, it comes at a cost. As the tests become more stringent, you lose Power (1 - β). More liberal tests buy you Power, but the cost is an increased chance of Type I error. There is no set rule for determining which test to use, but different researchers have offered some guidelines for choosing. Mostly it is an issue of pragmatics and whether the number of comparisons exceeds K-1.

C. Fisher’s LSD

- The Fisher LSD (Least Significant Difference) is basically the Post Hoc equivalent of a Linear Contrast.

- This test sets the alpha level per comparison. Alpha = .05 for every comparison. df = df error (i.e. df within).

- This test is the most liberal of all Post Hoc tests. The critical t for significance is unaffected by the number of groups.

- This test is appropriate when you have 3 means to compare. In general the alpha is held at .05 because of the criterion that you can’t look at LSDs unless the ANOVA is significant.

- This test is generally not considered appropriate if you have more than 3 means unless there is reason to believe that there is no more than one true Null Hypothesis hidden in the means.

D. Dunn’s (Bonferroni)

- Dunn’s t’ test can actually be applied to both Post Hoc and A Priori hypotheses. It does not require the overall ANOVA to be significant. It is sometimes referred to as the Bonferroni t because it uses the Bonferroni PE correction procedure in determining the critical value for significance.

- In general, this test should be used when the number of comparisons you are making exceeds the number of degrees of freedom you have between groups (i.e., K-1), even if your comparisons are a priori.

- This test sets alpha per experiment; Alpha = (.05)/c for every comparison. df = df error.

- c = number of comparisons = K(K-1)/2

- For example: if c = 4, then Alpha = .05/4 = .0125. Thus the PE = .05.

K = 2, c = 1, Alpha = .05

K = 3, c = 3, Alpha = .0167

K = 4, c = 6, Alpha = .00833

K = 5, c = 10, Alpha = .005
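The per-comparison alphas listed above can be reproduced with a few lines. This is a minimal sketch; `bonferroni_alpha` is a name made up for the example, and it assumes all K(K-1)/2 pairwise comparisons are made.

```python
from math import comb


def bonferroni_alpha(k: int, pe: float = 0.05):
    """Return (c, alpha per comparison) for all pairwise tests among k groups."""
    c = comb(k, 2)  # number of comparisons, K(K-1)/2
    return c, pe / c


for k in range(2, 6):
    c, alpha = bonferroni_alpha(k)
    print(f"K = {k}, c = {c}, Alpha = {alpha:.5f}")
```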

- When doing hand calculations you need to find the critical value from Dunn’s Table of critical values for t’ which simply accounts for the fact that regular t tables do not display critical values for fractions of alpha .05 (e.g., t critical @ Alpha .0125 = ?).

- This test is extremely conservative and rapidly reduces power as the number of comparisons being made increases.

E. Newman-Keuls

- Newman-Keuls is a step-down procedure that is not as conservative as Dunn’s t test. First, the means of the groups are ordered (ascending or descending), and then the largest and smallest means are tested for a significant difference. If those means are different, then test the smallest against the next largest, and so on, until you reach a test that is not significant. Once you reach that point, you can only test differences between means that exceed the difference between the means that were found to be non-significant.

For example, for a test with 5 means:

X5 > X1, p < .05. X4 = X1, p = ns. We can’t test the difference between X1 and X3, X1 and X2, or X2 and X3. We can test the difference between X2 and X5 if the difference between those means exceeds the difference between the means of X1 and X4.

The critical value of this test depends on the df error (N-K) and the number of steps between the means being compared (e.g. there are 5 steps between means 1 and 5, but only 2 steps between 1 and 2).

- This test sets alpha using a scaled down FW error rate: Alpha =

E.g. K = 5, c = 10

r = 1, Alpha = .05

r = 3, Alpha = .025

r = 6, Alpha = .00851

r = 10, Alpha = .00512
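The formula for this alpha does not appear above. As an assumption inferred only from the listed values (not a formula stated in these notes), Alpha = 1 - (1 - .05)^(1/r) reproduces the entries for r = 1, r = 6, and r = 10. For r = 3 it gives about .017, while the listed .025 matches what it gives for r = 2. The sketch below prints the values so they can be checked against the list.

```python
def scaled_alpha(r: int, fw: float = 0.05) -> float:
    """Assumed per-test alpha: 1 - (1 - FW)^(1/r). Inferred, not from the source."""
    return 1 - (1 - fw) ** (1 / r)


for r in (1, 2, 3, 6, 10):
    print(f"r = {r}, Alpha = {scaled_alpha(r):.5f}")
```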

- Newman-Keuls is perhaps one of the most common Post Hoc tests, but it is a rather controversial test. The major problem with this test is that when there is more than one true Null Hypothesis in a set of means, it inflates the FW error rate.

- In general we would use this when the number of comparisons we are making is larger than K-1 and we don’t want to be as conservative as the Dunn’s test is.

F. Tukey’s HSD

- Tukey HSD (Honestly Significant Difference) is essentially like the Newman-Keuls, but every test between means is compared to the critical value that is set for the test of the means that are furthest apart (rmax; e.g. if there are 5 means we use the critical value determined for the test of X1 and X5).

- This method corrects for the problem found in the Newman-Keuls, where the FW is inflated when there is more than one true Null Hypothesis in a set of means.

- This test buys protection against Type I error, but again at the cost of Power.

- This test sets alpha using the FW error rate: Alpha =

K = 2, rmax = 1, Alpha = .05

K = 3, rmax = 3, Alpha = .025

K = 4, rmax = 6, Alpha = .00851

K = 5, rmax = 10, Alpha = .00512

- This tends to be the most common and preferred test because it is very conservative with respect to Type I error when the Null Hypothesis is true. In general, HSD is preferred when you will make all the possible comparisons between a large set of means (six or more means).

G. Tukey’s WSD

- Tukey’s WSD (Wholly Significant Difference) is sometimes referred to as the Tukey’s b test. This test is a compromise between the Newman-Keuls and the more conservative HSD. Here the alpha for each test is the average of the Newman-Keuls alpha and the HSD alpha:

$\alpha_{WSD} = (\alpha_{NK} + \alpha_{r_{max}}) / 2$

E.g. K = 5, c = 10

r = 1, Alpha NK = .05, Alpha rmax = .00512 Alpha WSD = .02756

r = 3, Alpha NK = .025 Alpha rmax = .00512 Alpha WSD = .01506

r = 6, Alpha NK = .00851, Alpha rmax = .00512 Alpha WSD = .00682

r = 10, Alpha NK = .00512, Alpha rmax = .00512 Alpha WSD = .00512
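The WSD column is simply the average of the two alphas in each row, which can be confirmed with a short sketch. The input values are copied from the rows above; `wsd_alpha` is a name made up for the example.

```python
def wsd_alpha(alpha_nk: float, alpha_hsd: float) -> float:
    """Average of the Newman-Keuls alpha and the HSD (rmax) alpha."""
    return (alpha_nk + alpha_hsd) / 2


alpha_rmax = 0.00512  # HSD alpha for K = 5, taken from the rows above
nk_alphas = {1: 0.05, 3: 0.025, 6: 0.00851, 10: 0.00512}

for r, alpha_nk in nk_alphas.items():
    print(f"r = {r}, Alpha WSD = {wsd_alpha(alpha_nk, alpha_rmax):.6f}")
```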

- Thus the WSD is better than Newman-Keuls at preventing Type I error when more than one Null Hypothesis is true for your set of means, but it is not as complete as the HSD. However, with WSD you do not lose as much power as you do with the HSD.

- The WSD is best to use when you are making more than K-1 comparisons, you need more control of Type I error than Newman-Keuls provides, and you are testing fewer than K(K-1)/2 comparisons.

H. Scheffé

- The Scheffé test is designed to protect against a Type I error when all possible complex and simple comparisons are made. That is, we are not just looking at the possible combinations of comparisons between pairs of means. We are also looking at the possible combinations of comparisons between groups of means. Thus Scheffé is the most conservative of all tests.

- Because this test does give us the capacity to look at complex comparisons, it essentially uses the same statistic as the Linear Contrasts tests. However, Scheffé uses a different critical value (or at least it makes an adjustment to the critical value of F).

- Scheffé sets a more conservative F critical to create an effective FW error rate.

- First, for each comparison find F critical at Alpha = .10 (so we start off more liberal).

df btw = 1 df error = K-1

- Second, multiply the F critical by K-1 and use the product as the critical value for all comparisons (both simple and complex) in that family of means.

- This test has less power than the HSD when you are making pairwise (simple) comparisons, but it has more power than HSD when you are making complex comparisons.

- In general, only use this when you want to make many Post Hoc complex comparisons (e.g. more than K-1).
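The Scheffé adjustment can be written in two equivalent ways: multiply the F cutoff by K-1 and compare the comparison's F against the result, or divide the comparison's F by K-1 (the overall df between) and compare it against the unadjusted cutoff. The sketch below shows both. The cutoff and F values are placeholders invented for illustration, not values looked up in an F table.

```python
def scheffe_critical(f_cutoff: float, k: int) -> float:
    """Adjusted critical value: the F cutoff multiplied by K - 1."""
    return (k - 1) * f_cutoff


def scheffe_significant(f_obtained: float, f_cutoff: float, k: int) -> bool:
    """Equivalent check: divide the obtained F by K - 1, then compare to the cutoff."""
    return f_obtained / (k - 1) > f_cutoff


# Hypothetical inputs: 4 groups, a placeholder cutoff, and two comparison Fs.
k, f_cutoff = 4, 2.74
for f_obtained in (9.0, 8.0):
    by_division = scheffe_significant(f_obtained, f_cutoff, k)
    by_product = f_obtained > scheffe_critical(f_cutoff, k)
    print(f"F = {f_obtained}: significant = {by_division} (same by product: {by_product})")
```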

I. q, the studentized range statistic

The Newman-Keuls, HSD, and WSD all use the q statistic, which is based on the studentized range (q is often referred to as the studentized range statistic). When finding the critical q, you will need two pieces of information. First, you need the df. In the case of post hoc testing, use the df error (N-K) from the ANOVA. Second, you will need r, the number of steps between the means you are testing (e.g. there are 5 steps between means 1 and 5, but only 2 steps between 1 and 2).