StatsInf_Simulation_IainR_Sep2020.Rmd

---
title: "Simulation to investigate the Central Limit Theorem"
author: "Iain Russell"
date: "04/09/2020"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Overview

This is an investigation of the Central Limit Theorem, hereafter CLT, using the exponential distribution. Simulations containing 40 exponentials are run 1000 times and the mean computed each time; the distribution of 1000 simulation means is compared with 1000 random exponentials to demonstrate the CLT.

A histogram of 1000 random exponentials drawn from the exponential distribution demonstrates exponential growth/decay with a rate of 0.2.  

```{r}
set.seed(3433)
hist(rexp(1000, rate=0.2))
```

## Simulations

Generate 1000 simulations of 40 random exponentials with lambda/rate 0.2, and compute the mean for each simulation.
Lambda (0.2) is the growth/decay rate of the exponential, n (40) is the number of exponentials in each simulation, nosims (1000) is the number of simulations, sims is the vector of 1000 simulated means.  

```{r}
lambda = 0.2
n = 40
nosims = 1000
sims = apply(matrix(rexp(nosims * n, rate=lambda), nosims), 1, mean)
```

The theoretical mean, mu, and standard deviation, sigma, are computed as 1/lambda and in accordance with the CLT are then used to shift and scale the distribution of sample means so that it is comparable with a standard normal distribution.

```{r}
mu = sigma = 1/lambda
se = sigma/sqrt(n)
sims_std = (sims - mu) / se
```

Successful use of the CLT is strongly dependent on the sample size, n (40 in our case). This may or may not be sufficient to ensure that the sample statistics (mean, variance) are representative of the population statistics. The larger that n is, the better the sample statistics approximate those of the population and the narrower the confidence interval for the population mean.

## Sample Mean versus Theoretical Mean

The sample mean of the simulations is approximately 5, the same as the theoretical mean, mu, of the population.

```{r}
mean(sims)
mu
```

A 95% confidence interval for the mean contains the theoretical mean mu (5), indicating that the sample mean is a good estimate of the population mean.

```{r}
mean(sims) + c(-1,1) * qnorm(.975) * se
```

## Sample Variance versus Theoretical Variance

The theoretical variance is sigma^2/n, where sigma is the standard deviation, or 1/lambda.

```{r}
sigma^2/n
```

The sample variance computed from the simulated data is very close to this.

```{r}
var(sims)
```

Therefore we may conclude that the variance of the sample distribution approximates well the theoretical variance.

## Distribution

A histogram of the simulated means showing that the distribution is approximately normal.

```{r}
hist(sims)
```
  
A histogram of the standardized simulated means showing that the distribution is approximately normal with a mean close to zero, and the majority of the data between quantiles -2,2.

```{r}
hist(sims_std)
```