Introduction
The Central Limit Theorem tells us that if the sample size is large, the distribution of sample means approaches the normal distribution. For distributions that are more skewed, a larger sample size is needed, since averaging over more observations lowers the impact of extreme values on the sample mean.
Skewness
Skewness is defined by the following formula $$ Sk = E\left(\left(\frac{X - \mu}{\sigma}\right)^3\right) = \frac{E((X - \mu)^3)}{\sigma^3} $$ Uniform distributions have a skewness of zero. Poisson distributions, however, have a skewness of $\lambda^{-\frac{1}{2}}$.
In this lab, we are interested in the sample size needed to obtain a distribution of sample means that is approximately normal.
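As a minimal sketch of the skewness formula above, the following R snippet compares the theoretical Poisson skewness $\lambda^{-\frac{1}{2}}$ with a moment-based estimate from simulated draws. The helper names `poisson_skewness` and `sample_skewness`, the seed, and the number of draws are assumptions made for illustration; they are not part of the lab's code.

```r
# Theoretical skewness of a Poisson(lambda) distribution: lambda^(-1/2).
poisson_skewness <- function(lambda) {
  1 / sqrt(lambda)
}

# Sample skewness from the moment definition E[((X - mu) / sigma)^3],
# using the population-style standard deviation (divide by N).
sample_skewness <- function(x) {
  mu <- mean(x)
  sigma <- sqrt(mean((x - mu)^2))
  mean(((x - mu) / sigma)^3)
}

set.seed(1)  # hypothetical seed, only for reproducibility
lambdas <- c(0.1, 0.5, 1, 5, 10, 50, 100)
theoretical <- poisson_skewness(lambdas)
empirical <- sapply(lambdas, function(l) sample_skewness(rpois(1e5, l)))
print(round(data.frame(lambda = lambdas, theoretical, empirical), 3))
```

The empirical column should land close to the theoretical one, with the largest skewness at the smallest $\lambda$.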
Shapiro-Wilk Test
In this lab, we test for normality using the Shapiro-Wilk test. The null hypothesis is that the data are normally distributed; the alternative hypothesis is that they are not. With a large number of observations, the test detects even small departures from normality and so tends to favor the alternative hypothesis. Given this sensitivity, we start with a small sample size $n$ and steadily increase it until the distribution of sample means has a p-value greater than 0.05 in the Shapiro-Wilk test.
At a significance level of 5%, such a result means there is no evidence that the distribution of sample means departs from the normal distribution.
We will use this test to look at the distribution of sample means of both the Uniform and Poisson distributions in this lab.
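The decision rule described above can be illustrated with a small sketch, assuming one sample drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The sample size of 1000, the seed, and the choice of the exponential distribution are assumptions for illustration only.

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

normal_sample <- rnorm(1000)   # drawn from a normal distribution
skewed_sample <- rexp(1000)    # strongly right-skewed (skewness 2)

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# Decision rule used in this lab: reject normality when p < 0.05.
cat("normal sample: p =", format(p_normal, digits = 3), "\n")
cat("skewed sample: p =", format(p_skewed, digits = 3), "\n")
```

The skewed sample is rejected decisively, while the p-value for the normal sample varies from run to run and falls below 0.05 about 5% of the time by construction.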
Properties of the distribution of sample means
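For samples of size $n$ from the Uniform distribution on $(0, 1)$, the distribution of sample means has a mean of $0.5$ and a standard deviation of $\frac{1}{\sqrt{12n}}$. For samples of size $n$ from the Poisson distribution, it has a mean of $\lambda$ and a standard deviation of $\sqrt{\frac{\lambda}{n}}$.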
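These standard deviations follow from the general rule for the mean of $n$ independent draws with population mean $\mu$ and standard deviation $\sigma$, sketched here for reference. The last line, on the skewness of the sample mean, is a standard result added for context and is not stated elsewhere in this lab.
$$ E(\bar{X}) = \mu, \qquad SD(\bar{X}) = \frac{\sigma}{\sqrt{n}} $$
$$ \text{Uniform}(0, 1): \ \sigma^2 = \frac{1}{12} \ \Rightarrow \ SD(\bar{X}) = \frac{1}{\sqrt{12n}} \qquad \text{Poisson}(\lambda): \ \sigma^2 = \lambda \ \Rightarrow \ SD(\bar{X}) = \sqrt{\frac{\lambda}{n}} $$
$$ Sk(\bar{X}) = \frac{Sk(X)}{\sqrt{n}}, \qquad \text{so for the Poisson distribution} \quad Sk(\bar{X}) = \frac{1}{\sqrt{n\lambda}} $$
Because the skewness of the sample mean shrinks only like $\frac{1}{\sqrt{n}}$, a smaller $\lambda$ should call for a larger $n$ before the sample means look normal.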
Methods
For the first part of the lab, we will sample means from a Uniform distribution and from a Poisson distribution with $\lambda = 1$, both with a sample size of $n = 5$.
Doing so shows us that the sample means from the Uniform distribution are roughly symmetric, while those from the Poisson distribution are highly skewed. This raises the question: what sample size $n$ do I need for the distribution of sample means from the Poisson distribution to be approximately normal?
Sampling the means
The Shapiro-Wilk test allows at most 5000 observations. Therefore, we draw $n$ observations from the Uniform or Poisson distribution, calculate their mean, and repeat that process 5000 times to obtain 5000 sample means.
Each mean is calculated as $$ Mean = \frac{\sum x_i}{n} $$ where $x_i$ is an observation drawn from the Uniform or Poisson distribution.
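The sampling step above can also be written without explicit loops. The following is a minimal sketch, assuming a hypothetical helper `sample_means` that takes any random-draw function `rdist`; it is an alternative to, not a copy of, the loop-based functions in the appendix.

```r
# Draw `reps` sample means, each computed from `n` observations.
# `rdist` is a hypothetical argument: any function that takes a count
# and returns that many random draws.
sample_means <- function(n, rdist, reps = 5000) {
  draws <- matrix(rdist(reps * n), nrow = reps, ncol = n)
  rowMeans(draws)  # one mean per row, i.e. per sample of size n
}

set.seed(123)  # hypothetical seed for reproducibility
unif_means <- sample_means(5, function(k) runif(k, 0, 1))
pois_means <- sample_means(5, function(k) rpois(k, lambda = 1))

length(unif_means)  # 5000, the Shapiro-Wilk maximum
```

Filling a 5000-by-$n$ matrix in one call and taking row means avoids the per-draw function calls, which matters most for the large values of $n$ needed at small $\lambda$.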
Iterating with the Shapiro-Wilk Test
A given sample size doesn’t always guarantee that the Shapiro-Wilk test will fail to reject normality, because each run draws new random samples. Therefore, it is useful to run the search multiple times so that we can take the 95th percentile of the sample sizes that first fail to reject the Shapiro-Wilk test.
The issue with this is that lower lambda values result in higher skewness, as described by the skewness formula. If a distribution has a high degree of skewness, then it will take a larger sample size $n$ to make the distribution of sample means approximately normal.
Searching up to large values of $n$ results in longer computation time. The code takes this into account by starting at a larger value of $n$ and/or incrementing $n$ by a larger step each iteration. Incrementing by a larger step decreases the precision, though that is the compromise I’m willing to make in order to achieve faster results.
Finding just the first value of $n$ whose sample means fail to reject the Shapiro-Wilk test doesn’t tell us much about the sample size needed for the distribution of sample means to be approximately normal. Instead, it is better to run this process many times, recording each time the first value of $n$ that satisfies this condition. That way we can look at the distribution of sample sizes required and return the 95th percentile.
Returning the 95th percentile tells us that 95% of the time, the first sample size to fail to reject the Shapiro-Wilk test was the returned value or lower. One must be careful, because this can be wrongly interpreted as the sample size needed to fail to reject the Shapiro-Wilk test 95% of the time. That interpretation requires additional mathematics outside the scope of this paper. The 95th percentile of the first sample size that failed to reject the Shapiro-Wilk test will, however, give us a good enough estimate of the sample size needed.
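The search-and-repeat procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the appendix code: the helpers `sample_means` and `first_passing_n`, the `max_n` safety cap, the seed, and the use of only 10 repetitions on the Uniform distribution are all choices made here to keep the example short and fast.

```r
# Draw `reps` sample means of size n from the sampler `rdist` (hypothetical helper).
sample_means <- function(n, rdist, reps = 5000) {
  rowMeans(matrix(rdist(reps * n), nrow = reps, ncol = n))
}

# Smallest n on the grid start, start + step, ... whose sample means
# fail to reject the Shapiro-Wilk test at level alpha.
# `max_n` is a hypothetical safety cap that the lab's code does not have.
first_passing_n <- function(rdist, start = 2, step = 1, alpha = 0.05, max_n = 5000) {
  n <- start
  while (n <= max_n) {
    if (shapiro.test(sample_means(n, rdist))$p.value >= alpha) {
      return(n)
    }
    n <- n + step
  }
  NA_integer_  # no passing n found below the cap
}

set.seed(2024)  # hypothetical seed for reproducibility
# Repeat the search a few times (the lab uses up to 500 repetitions).
ns <- replicate(10, first_passing_n(function(k) runif(k, 0, 1)))
n_95 <- ceiling(quantile(ns, 0.95, na.rm = TRUE))
print(ns)
print(n_95)
```

Raising `start` or `step` trades precision for speed in the same way as the starting value and increment described above.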
Plots
Once a value for $n$ is determined, we sample the means of the particular distribution (Uniform/Poisson) and create histograms and Q-Q plots for each of the parameters we’re interested in. We’re looking to verify that the histogram looks symmetric and that the points on the Q-Q plot fit closely to the Q-Q line, with the points at one end of the plot falling on the opposite side of the line from the points at the other end.
Results
Part I
Sampling the mean of the uniform distribution with $n = 5$ results in a mean of $\bar{x} = 0.498$ and standard deviation of $sd = 0.1288$. The histogram and Q-Q Plot can be seen in Figure I and Figure II respectively.
$\bar{x}$ isn’t far from the theoretical 0.5, and the standard deviation is also close to the theoretical $$ \frac{1}{\sqrt{12(5)}} \approx 0.129 $$ The histogram and Q-Q plot suggest that the data are approximately normal. Therefore we can conclude that a sample size of 5 is sufficient for the distribution of sample means from the Uniform distribution to be approximately normal.
Sampling the mean of the Poisson distribution with $n = 5$ and $\lambda = 1$ results in a mean of $\bar{x} = 0.9918$ and a standard deviation of $sd = 0.443$. The histogram and Q-Q Plot can be seen in Figures III and IV respectively.
$\bar{x}$ is not too far from the theoretical $\lambda = 1$, and the standard deviation is slightly below the theoretical $$ \sqrt{\frac{\lambda}{n}} = \sqrt{\frac{1}{5}} \approx 0.447 $$ The figures, however, show that the data do not appear normal. Therefore, we cannot conclude that 5 is a big enough sample size for the distribution of sample means from the Poisson distribution with $\lambda = 1$ to be approximately normal.
Part II
Running the algorithm described in the Methods section produced the following table:
$\lambda$ | Skewness | Sample Size Needed | Shapiro-Wilk P-Value | Average of Sample Means | Standard Deviation of Sample Means | Theoretical Standard Deviation of Sample Means |
---|---|---|---|---|---|---|
0.1 | 3.16228 | 2710 | 0.05778 | 0.099 | 0.0060 | 0.0061 |
0.5 | 1.41421 | 802 | 0.16840 | 0.499 | 0.0250 | 0.0249 |
1 | 1.00000 | 215 | 0.06479 | 1.000 | 0.0675 | 0.0682 |
5 | 0.44721 | 53 | 0.12550 | 4.997 | 0.3060 | 0.3071 |
10 | 0.31623 | 31 | 0.14120 | 9.999 | 0.5617 | 0.5679 |
50 | 0.14142 | 10 | 0.48440 | 50.03 | 2.2461 | 2.2361 |
100 | 0.10000 | 6 | 0.47230 | 100.0027 | 4.1245 | 4.0824 |
The skewness was derived from the formula in the first section, while the sample size was obtained by looking at the blue 0.95 quantile line in Figures XVIII-XXIV. The rest of the columns are obtained from the output of the R code function show_results.
Looking at the histograms and Q-Q plots produced by the algorithm, the distributions of sample means are all roughly symmetric. The sample means are also tightly clustered around the Q-Q line, showing that the normal distribution is a good fit. This allows us to be confident that using these values of $n$ as the sample size would result in the distribution of sample means of the Uniform or Poisson (with a given lambda) being approximately normal.
The averages of the sample means are all within 0.03 of the theoretical mean $\lambda$, a relative difference of at most 1%. The standard deviation of the sample means increases with $\lambda$, from 0.0060 at $\lambda = 0.1$ to 4.1245 at $\lambda = 100$, but in every row it stays close to the theoretical value $\sqrt{\frac{\lambda}{n}}$.
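The two theoretical columns of the table can be recomputed directly from $\lambda$ and the reported sample sizes. The sketch below does only that arithmetic; the data frame name `check` and its column names are assumptions for illustration.

```r
# Values of lambda and the sample sizes reported in the table above.
lambda <- c(0.1, 0.5, 1, 5, 10, 50, 100)
n      <- c(2710, 802, 215, 53, 31, 10, 6)

check <- data.frame(
  lambda         = lambda,
  n              = n,
  skewness       = lambda^(-1 / 2),   # skewness of a single Poisson draw
  theoretical_sd = sqrt(lambda / n)   # SD of the mean of n Poisson draws
)
print(round(check, 5))
```

Comparing `theoretical_sd` with the observed standard deviations in the table gives a quick check that each simulation behaved as expected.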
Conclusion
The table in the Results section shows that as the skewness increases, so does the sample size needed to make the distribution of sample means approximately normal. This shows the Central Limit Theorem in action: no matter the skewness, if you obtain a large enough sample, the distribution of sample means will be approximately normal.
These conclusions pave the way for more interesting applications such as hypothesis testing and confidence intervals.
Appendix
Figures
Figure I, Histogram of Sample Means coming from a Uniform Distribution with sample size of 5
Figure II, Q-Q Plot of Sample Means coming from a Uniform Distribution with sample size of 5
Figure III, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 1$ and sample size of 5
Figure IV, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 1$ and sample size of 5
Figure V, Histogram of Sample Means coming from Poisson Distribution with $\lambda = 0.1$ and sample size of 2710
Figure VI, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 0.1$ and sample size of 2710
Figure VII, Histogram of Sample Means coming from Poisson Distribution with $\lambda = 0.5$ and sample size of 516
Figure VII, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 0.5$ and sample size of 516
Figure VIII, Histogram of Sample Means coming from Poisson Distribution with $\lambda = 1$ and sample size of 215
Figure IX, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 1$ and sample size of 215
Figure X, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 5$ and sample size of 53
Figure XI, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 5$ and sample size of 53
Figure XII, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 10$ and sample size of 31
Figure XIII, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 10$ and sample size of 31
Figure XIV, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 50$ and sample size of 10
Figure XV, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 50$ and sample size of 10
Figure XVI, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 100$ and sample size of 6
Figure XVII, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 100$ and sample size of 6
Figure XVIII, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 0.1$
Figure XIX, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 0.5$
Figure XX, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 1$
Figure XXI, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 5$
Figure XXII, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 10$
Figure XXIII, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 50$
Figure XXIV, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 100$
R Code
rm(list=ls())
library(ggplot2)
# Draw 5000 sample means, each from n Uniform(0, 1) observations
sample_mean_uniform = function(n) {
  xbarsunif = numeric(5000)
  for (i in 1:5000) {
    sumunif = 0
    for (j in 1:n) {
      sumunif = sumunif + runif(1, 0, 1)
    }
    xbarsunif[i] = sumunif / n
  }
  xbarsunif
}
# Draw 5000 sample means, each from n Poisson(lambda) observations
sample_mean_poisson = function(n, lambda) {
  xbarspois = numeric(5000)
  for (i in 1:5000) {
    sumpois = 0
    for (j in 1:n) {
      sumpois = sumpois + rpois(1, lambda)
    }
    xbarspois[i] = sumpois / n
  }
  xbarspois
}
poisson_n_to_approx_normal = function(lambda) {
  print(paste("Looking at Lambda =", lambda))
  ns = c()
  # Speed up computation of lower lambda values by starting at a different sample size
  # and/or lowering the precision by increasing the delta sample size
  # and/or lowering the number of sample sizes we obtain from the shapiro loop
  increaseBy = 1
  iter = 3
  startingValue = 2
  if (lambda == 0.1) {
    startingValue = 2000
    iter = 3
    increaseBy = 50
  } else if (lambda == 0.5) {
    startingValue = 200
    iter = 5
    increaseBy = 10
  } else if (lambda == 1) {
    startingValue = 150
    iter = 25
  } else if (lambda == 5) {
    iter = 50
    startingValue = 10
  } else if (lambda == 10) {
    iter = 100
  } else {
    iter = 500
  }
  progressIter = 1
  for (i in 1:iter) {
    # Include a progress indicator for personal sanity
    if (i / iter > .1 * progressIter) {
      print(paste("Progress", i / iter * 100, "% complete"))
      progressIter = progressIter + 1
    }
    # Search upward from startingValue until the Shapiro-Wilk test fails to reject
    n = startingValue
    dist = sample_mean_poisson(n, lambda)
    p.value = shapiro.test(dist)$p.value
    while (p.value < 0.05) {
      n = n + increaseBy
      dist = sample_mean_poisson(n, lambda)
      p.value = shapiro.test(dist)$p.value
      # More sanity checks
      if (n %% 10 == 0) {
        print(paste("N =", n, " p.value =", p.value))
      }
    }
    ns = c(ns, n)
  }
  print(ggplot(data.frame(ns), aes(x = ns)) +
    geom_histogram(fill = 'white', color = 'black', bins = 10) +
    geom_vline(xintercept = ceiling(quantile(ns, .95)), col = '#0000AA') +
    ggtitle(paste("Histogram of N needed for Poisson distribution of lambda =", lambda)) +
    xlab("N") +
    ylab("Count") +
    theme_bw())
  # 95th percentile of the first sample size that failed to reject the Shapiro-Wilk test
  ceiling(quantile(ns, .95))
}
uniform_n_to_approx_normal = function() {
  ns = c()
  progressIter = 1
  for (i in 1:500) {
    # Include a progress indicator for personal sanity
    if (i / 500 > .1 * progressIter) {
      print(paste("Progress", i / 5, "% complete"))
      progressIter = progressIter + 1
    }
    # Search upward from n = 2 until the Shapiro-Wilk test fails to reject
    n = 2
    dist = sample_mean_uniform(n)
    p.value = shapiro.test(dist)$p.value
    while (p.value < 0.05) {
      n = n + 1
      dist = sample_mean_uniform(n)
      p.value = shapiro.test(dist)$p.value
      if (n %% 10 == 0) {
        print(paste("N =", n, " p.value =", p.value))
      }
    }
    ns = c(ns, n)
  }
  print(ggplot(data.frame(ns), aes(x = ns)) +
    geom_histogram(fill = 'white', color = 'black', bins = 10) +
    geom_vline(xintercept = ceiling(quantile(ns, .95)), col = '#0000AA') +
    ggtitle("Histogram of N needed for Uniform Distribution") +
    xlab("N") +
    ylab("Count") +
    theme_bw())
  # 95th percentile of the first sample size that failed to reject the Shapiro-Wilk test
  ceiling(quantile(ns, .95))
}
show_results = function(dist) {
  print(paste("The mean of the sample mean distribution is:", mean(dist)))
  print(paste("The standard deviation of the sample mean distribution is:", sd(dist)))
  print(shapiro.test(dist))
  print(ggplot(data.frame(dist), aes(x = dist)) +
    geom_histogram(fill = 'white', color = 'black', bins = 20) +
    ggtitle("Histogram of Sample Means") +
    xlab("Mean") +
    ylab("Count") +
    theme_bw())
  # qqnorm puts theoretical quantiles on the x-axis and sample quantiles on the y-axis
  qqnorm(dist, pch = 1, col = '#001155', main = "QQ Plot", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
  qqline(dist, col="#AA0000", lty=2)
}
## PART I
uniform_mean_dist = sample_mean_uniform(n = 5)
poisson_mean_dist = sample_mean_poisson(n = 5, lambda = 1)
show_results(uniform_mean_dist)
show_results(poisson_mean_dist)
## PART II
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 0.1");
n.01 = poisson_n_to_approx_normal(0.1)
show_results(sample_mean_poisson(n.01, 0.1))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 0.5");
n.05 = poisson_n_to_approx_normal(0.5)
show_results(sample_mean_poisson(n.05, 0.5))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 1");
n.1 = poisson_n_to_approx_normal(1)
show_results(sample_mean_poisson(n.1, 1))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 5");
n.5 = poisson_n_to_approx_normal(5)
show_results(sample_mean_poisson(n.5, 5))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 10");
n.10 = poisson_n_to_approx_normal(10)
show_results(sample_mean_poisson(n.10, 10))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 50");
n.50 = poisson_n_to_approx_normal(50)
show_results(sample_mean_poisson(n.50, 50))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 100");
n.100 = poisson_n_to_approx_normal(100)
show_results(sample_mean_poisson(n.100, 100))
print("Starting Algorithm to Find Sample Size Needed for the Uniform Distribution")
n.uniform = uniform_n_to_approx_normal()
show_results(sample_mean_uniform(n.uniform))