- Let's talk about Statistics and its types.
- Data Types and Level of Measurement
- Moments of Business Decision
- 3.1 Measures of Central tendency
- 3.2 Mean
- 3.3 Median
- 3.4 Mode
- 3.5 Causes of a Multimodal Distribution
- 3.6 Mean vs Median
- 3.7 Variance
- 3.8 Range
- 3.9 Standard Deviation
- The Difference Between Standard Deviation and Average Deviation
- Skewness
- Kurtosis
- 6.1 What is Excess Kurtosis?
- 6.2 Mesokurtic
- 6.3 Leptokurtic
- 6.4 Platykurtic
- Percentile vs. Quartile vs. Quantile: What’s the Difference?
- BoxPlot
- Central Limit Theorem
- Sampling techniques
- 10.1 Sampling Frame
- 10.2 Sampling Frame vs. Sample Space
- 10.3 What is Sampling Distribution
- 10.4 Sample Design
- 10.5 Non-Random Sampling Methods
- 10.6 Random Sampling Methods
- Sampling Error
- Distribution
- Outliers
- 13.1 What are Outliers? 🤔
- 13.2 Why do they occur?
- 13.3 How do they affect?
- 13.4 Detecting and Handling Outliers
- Q-Q Plot
- Distribution Function
- Covariance
- Correlation
- Permutation and Combination
Statistics is broadly categorized into two types:
- Descriptive Statistics
- Inferential Statistics
Descriptive statistics utilizes numerical and graphical methods to look for patterns in a dataset, to summarize the information revealed in a dataset, and to present that information in a convenient form that individuals can use to make decisions (mean, median, mode, variance, standard deviation, and charts or probability distributions).
Basically, as part of descriptive Statistics, we measure the following:
- Frequency: no. of times a data point occurs
- Central tendency: the centrality of the data – mean, median, and mode
- Dispersion: the spread of the data – range, variance, and standard deviation
- The measure of position: percentiles and quantile ranks
In inferential statistics, we utilize sample data to make estimates, decisions, predictions, or other generalizations about a population, typically by running hypothesis tests to assess the assumptions made about the population parameters.
In simple terms, we interpret the meaning of the descriptive statistics by inferring them to the population. The main goal of inferential statistics is to make a conclusion about a population based off a sample of data from the population.
For example, suppose we are conducting a survey on the number of two-wheelers in a city. Assume the city has a total population of 5 lakh (500,000) people. We take a sample of 1,000 people, as it is impractical to run an analysis on the entire population data.
From the survey conducted, it is found that 800 people out of 1,000 (i.e., 80%) are using two-wheelers. So, we can infer these results to the population and conclude that about 4 lakh of the 5 lakh people are using two-wheelers.
There is structured and unstructured data. Then you have qualitative and quantitative data. Now let's explore two more data types (discrete and continuous) and help you understand the difference.
At a higher level, data is categorized into two types: Qualitative and Quantitative.
Qualitative data is non-numerical. Some of the examples are eye colour, car brand, city, etc.
On the other hand, Quantitative data is numerical, and it is again divided into Continuous and Discrete data.
Qualitative data in research works around the characteristics of the retrieved information and helps understand customer behaviour. This type of data helps run market analysis through genuine figures and create value from a service by implementing useful information. Applied smartly, qualitative data can drastically affect customer satisfaction.
On the other hand, quantitative data works with numerical values that can be measured, answering questions such as 'how much', 'how many', or 'how many times'. Quantitative data contains precise numerical values, so organizations can use these figures to gauge strong and weak performance and to predict future trends.
Note that you could assign numbers to the ordinal classes, so should the data then be called discrete rather than ordinal? The truth is that it is still ordinal. The reason is that even if the numbering is done, it doesn't convey the actual distances between the classes.
For instance, consider the grading system of a test. The grades can be A, B, C, D, E, and if we number them from the start they become 1, 2, 3, 4, 5. According to these numerical differences, the distance between an E grade and a D grade is the same as the distance between a D grade and a C grade, which is not accurate: a C grade is still acceptable compared to an E grade, but the equal numeric spacing declares the gaps to be the same.
You can apply the same reasoning to a survey form where user experience is recorded on a scale from very poor to very good. The differences between the various classes are not clear and therefore can't be quantified directly.
Continuous data: It can be represented in decimal format. Examples are height, weight, time, distance, etc.
Discrete data: It cannot be represented in decimal format. Examples are the number of laptops, number of students in a class.
Discrete data is again divided into Categorical and Count Data.
Categorical data: represent the type of data that can be divided into groups. Examples are age, sex, etc.
Count data: This data contains non-negative integers. Example: number of children a couple has.
In statistics, the level of measurement is a classification that describes the relationship between the values of a variable.
Numerical values that are integers or whole numbers are placed under this category. The number of speakers in a phone, the number of cameras, the number of cores in the processor, and the number of SIMs supported are some examples of the discrete data type.
Discrete data types in statistics cannot be measured, only counted, as the objects included in discrete data have fixed, whole values that cannot be broken into fractions. Discrete data is often presented through charts, including bar charts, pie charts, and tally charts.
Fractional numbers are considered continuous values. These can take the form of the operating frequency of a processor, the Android version of a phone, Wi-Fi frequency, the temperature of the cores, and so on.
Unlike discrete types of data in research, which have whole, fixed values, continuous data can be broken down into smaller pieces and can take any value. For example, volatile values such as temperature and the weight of a human are continuous. Continuous types of statistical data are represented using a graph that easily reflects value fluctuation by the highs and lows of the line over a certain period of time.
We have four fundamental levels of measurement. They are:
- Nominal Scale
- Ordinal Scale
- Interval Scale
- Ratio Scale
A. Nominal Scale: These are the set of values that don’t possess a natural ordering. Let’s understand this with some examples. This scale contains the least information since the data have names/labels only. It can be used for classification. We cannot perform mathematical operations on nominal data because there is no numerical value to the options (numbers associated with the names can only be used as tags). The nominal scale uses categories, so finding the median makes no sense. You could put the items in alphabetical order but even then, the middle item would have no meaning as a median. However, a mode (the most frequent item in the set) is possible. For example, if you were to survey a group of random people and ask them what the most romantic city in the World is, Venice or Paris might be the most common response (the mode). It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is another one where we can’t differentiate between male, female, or others.
Nominal data types in statistics are not quantifiable and cannot be measured through numerical units. Nominal types of statistical data are valuable while conducting qualitative research as it extends freedom of opinion to subjects.
Example: Which country do you belong to? India, Japan, Korea.
- Gender: Male, Female, Other.
- Hair Color: Brown, Black, Blonde, Red, Other.
- Type of living accommodation: House, Apartment, Trailer, Other.
- Genotype: Bb, bb, BB, bB.
- Religious preference: Buddhist, Mormon, Muslim, Jewish, Christian, Other.
B. Ordinal Scale:
These types of values have a natural ordering while maintaining their class of values. If we consider the size of a clothing brand then we can easily sort them according to their name tag in the order of small < medium < large. The grading system while marking candidates in a test can also be considered as an ordinal data type where A+ is definitely better than B grade.
These categories help us decide which encoding strategy can be applied to which type of data. Encoding qualitative data is important because machine learning models can't handle these values directly; they need to be converted to numerical types, as the models are mathematical in nature.
For the nominal data type, where there is no comparison among the categories, one-hot encoding can be applied (similar to binary coding, given the categories are few in number), and for the ordinal data type, label encoding can be applied, which is a form of integer encoding. A minimal sketch of both encodings follows.
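The exact encoding code is not given in the text; here is a minimal sketch of both ideas using pandas, with made-up column names and values for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Delhi", "Tokyo", "Seoul", "Delhi"],   # nominal: no natural order
    "grade": ["A", "C", "B", "A"],                    # ordinal: A > B > C
})

# One-hot encoding for the nominal column: each category becomes its own 0/1 column
one_hot = pd.get_dummies(df["city"], prefix="city")

# Label (integer) encoding for the ordinal column, preserving the known order
grade_order = {"C": 1, "B": 2, "A": 3}
df["grade_encoded"] = df["grade"].map(grade_order)

print(one_hot)
print(df[["grade", "grade_encoded"]])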
These scales are generally used to depict non-mathematical ideas such as frequency, satisfaction, happiness, a degree of pain, etc. It is quite straightforward to remember the implementation of this scale as ‘Ordinal’ sounds similar to ‘Order’, which is exactly the purpose of this scale.
Ordinal data is made up of ordinal variables. In other words, if you have a list that can be placed in “first, second, third…” order, you have ordinal data. It sounds simple, but there are a couple of elements that can be confusing:
- You don’t have to have the exact words “first, second, third….” Instead, you can have different rating scales, like “Hot, hotter, hottest” or “Agree, strongly agree, disagree.”
- You don’t know if the intervals between the values are equal. We know that a list of cardinal numbers like 1, 5, 10 have a set value between them (in this case, 5) but with ordinal data you just don’t know. For example, in a marathon you might have first, second and third place. But if you don’t know the exact finishing times, you don’t know what the interval between first and second, or second and third is.
C. Interval Scale: It is a numerical scale. The Interval scale has more information than the nominal, ordinal scales. Along with the order, we know the difference between the two variables (interval indicates the distance between two entities).
These scales are effective as they open doors for the statistical analysis of provided data. Mean, median, or mode can be used to calculate the central tendency on this scale. The only drawback of this scale is that there is no pre-decided starting point or true zero value.
Interval scale contains all the properties of the ordinal scale, in addition to which, it offers a calculation of the difference between variables. The main characteristic of this scale is the equidistant difference between objects.
For instance, consider a Celsius/Fahrenheit temperature scale –
- 80 degrees is always higher than 50 degrees and the difference between these two temperatures is the same as the difference between 70 degrees and 40 degrees.
- Also, the value of 0 is arbitrary because negative values of temperature do exist – which makes the Celsius/Fahrenheit temperature scale a classic example of an interval scale.
- Interval scale is often chosen in research cases where the difference between variables is a mandate, which can't be achieved using a nominal or ordinal scale. The interval scale quantifies the difference between two variables, whereas the other two scales are solely capable of associating qualitative values with variables. The mean and median values on an interval scale can be evaluated, unlike with the previous two scales.
- In statistics, interval scale is frequently used as a numerical value can not only be assigned to variables but calculation on the basis of those values can also be carried out.
D. Ratio Scale: The ratio scale has the most information about the data. Unlike the other three scales, the ratio scale can accommodate a true zero point. The ratio scale is simply said to be the combination of the Nominal, Ordinal, and Interval scales.
With the option of true zero, varied inferential, and descriptive analysis techniques can be applied to the variables. In addition to the fact that the ratio scale does everything that a nominal, ordinal, and interval scale can do, it can also establish the value of absolute zero. The best examples of ratio scales are weight and height. In market research, a ratio scale is used to calculate market share, annual sales, the price of an upcoming product, the number of consumers, etc.
- Ratio scale provides the most detailed information as researchers and statisticians can calculate the central tendency using statistical techniques such as mean, median, mode, and methods such as geometric mean, the coefficient of variation, or harmonic mean can also be used on this scale.
- Ratio scale accommodates the characteristic of three other variable measurement scales, i.e. labeling the variables, the significance of the order of variables, and a calculable difference between variables (which are usually equidistant).
- Because of the existence of true zero value, the ratio scale doesn’t have negative values.
- To decide when to use a ratio scale, the researcher must observe whether the variables have all the characteristics of an interval scale along with the presence of the absolute zero value.
- Mean, mode and median can be calculated using the ratio scale.
We have four moments of business decision that help us understand the data.
(It is also known as First Moment Business Decision)
Talks about the centrality of the data. To keep it simple, it is a part of descriptive statistical analysis where a single value at the centre represents the entire dataset.
It is the sum of all the data points divided by the total number of values in the data set. Mean cannot always be relied upon because it is influenced by outliers.
It is the middlemost value of a sorted/ordered dataset. If the size of the dataset is even, then the median is calculated by taking the average of the two middle values. In the presence of outliers, the mean cannot be relied upon as much as the median; the median gives a better representation of the information when outliers are present.
It is the most repeated value in the dataset. Data with a single mode is called unimodal, data with two modes is called bimodal, and data with more than two modes is called multimodal.
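As a quick illustration (with made-up numbers), Python's built-in statistics module computes all three measures of central tendency:

```python
import statistics as st

data = [2, 3, 3, 5, 7, 9, 9, 9, 11]   # made-up observations

print("Mean:  ", st.mean(data))    # sum of the values / number of values
print("Median:", st.median(data))  # middle value of the sorted data
print("Mode:  ", st.mode(data))    # most frequent value (9 here)
# st.multimode(data) would return every mode if the data were bimodal/multimodal
```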
- A multimodal distribution is a probability distribution with more than one peak, or “mode.”
- A distribution with one peak is called unimodal.
- A distribution with two peaks is called bimodal.
- A distribution with two or more peaks is multimodal.
- A bimodal distribution is also multimodal, as there are multiple peaks.
- A comb distribution is so-called because the distribution looks like a comb, with alternating high and low peaks. A comb shape can be caused by rounding off. For example, if you are measuring water height to the nearest 10 cm and your class width for the histogram is 5 cm, this could cause a comb shape.
An edge peak distribution is where there is an additional, out of place peak at the edge of the distribution. This usually means that you’ve plotted (or collected) your data incorrectly, unless you know for sure your data set has an expected set of outliers (i.e. a few extreme views on a survey).
A multimodal distribution is known as a Plateau Distribution when there are more than a few peaks close together.
A multimodal distribution in a sample is usually an indication that the distribution in the population is not normal. It can also indicate that your sample has several patterns of response or extreme views, preferences or attitudes.
When thinking about the cause of the multimodality, you may want to take a close look at your data; what may be going on is that two or more distributions are being graphed at the same time. This is opposed to a true multimodal distribution, where only one distribution is mapped. For example, plotting the test scores of two groups of students together, one of which studied (one peak) and one of which didn't (the other peak), produces two separate peaks.
The mean gives the average and, in some sense, the central tendency of the data. The median gives the middlemost value of the data; both describe the centre of the data. In the presence of outliers, the mean cannot be trusted as much as the median; the median represents the information better when outliers are present. If the mean is greater than the median, the distribution is right skewed, and if the mean is less than the median, the distribution is left skewed.
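A tiny sketch (with made-up salary figures) shows how a single outlier drags the mean but barely moves the median, producing the right-skew pattern described above:

```python
import statistics as st

salaries = [30, 32, 35, 38, 40]       # in thousands; made-up values
with_outlier = salaries + [500]       # one extreme earner added

print(st.mean(salaries), st.median(salaries))          # 35 and 35: no outlier, mean = median
print(st.mean(with_outlier), st.median(with_outlier))  # 112.5 vs 36.5: mean > median, right skewed
```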
It is the average squared distance of all the data points from their mean. The problem with variance is that the units also get squared.
It is the difference between the maximum and the minimum values of a dataset.
It is the square root of Variance. Helps in retrieving the original units.
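A small sketch (made-up data) computing the three spread measures just defined with NumPy:

```python
import numpy as np

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]   # made-up observations

data_range = max(data) - min(data)   # range = max - min
variance = np.var(data)              # population variance; use np.var(data, ddof=1) for the sample variance
std_dev = np.sqrt(variance)          # standard deviation: square root of variance, back in the original units

print(data_range, variance, std_dev)
```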
Two of the most popular ways to measure variability or volatility in a set of data are standard deviation and average deviation, also known as mean absolute deviation. Though the two measurements are similar, they are calculated differently and offer slightly different views of data.
Determining volatility—that is, deviation from the center—is important in finance, so professionals in accounting, investing, and economics should be familiar with both concepts.
- Standard deviation is the most common measure of variability and is frequently used to determine the volatility of financial instruments and investment returns.
- Standard deviation is considered the most appropriate measure of variability when using a population sample, when the mean is the best measure of center, and when the distribution of data is normal.
- Some argue that average deviation, or mean absolute deviation, is a better gauge of variability when there are distant outliers or the data is not well distributed.
Standard deviation is the most common measure of variability and is frequently used to determine the volatility of markets, financial instruments, and investment returns. To calculate the standard deviation:
- Find the mean, or average, of the data points by adding them and dividing the total by the number of data points.
- Subtract the mean from each data point and square the difference of each result.
- Find the mean of those squared differences and then the square root of the mean.
Squaring the differences between each point and the mean avoids the issue of negative differences for values below the mean, but it means the variance is no longer in the same unit of measure as the original data. Taking the square root means the standard deviation returns to the original unit of measure and is easier to interpret and use in further calculations.
The average deviation, or mean absolute deviation, is calculated similarly to standard deviation, but it uses absolute values instead of squares to circumvent the issue of negative differences between the data points and their means.
To calculate the average deviation:
- Calculate the mean of all data points.
- Calculate the difference between the mean and each data point.
- Calculate the average of the absolute values of those differences.
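Here is a minimal sketch of the two procedures described above, applied to a small made-up dataset:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)   # made-up data
mean = data.mean()

# Standard deviation: square the differences, average them, then take the square root
std_dev = np.sqrt(((data - mean) ** 2).mean())

# Average (mean absolute) deviation: average the absolute differences instead
avg_dev = np.abs(data - mean).mean()

print(std_dev)  # 2.0 for this data
print(avg_dev)  # 1.5 for this data
```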
Standard deviation is often used to measure the volatility of returns from investment funds or strategies. Higher volatility is generally associated with a higher risk of losses, so investors want to see higher returns from funds that generate higher volatility. For example, a stock index fund should have a relatively low standard deviation compared with a growth fund.
The average deviation, or mean absolute deviation, is considered the closest alternative to standard deviation. It is also used to gauge volatility in markets and financial instruments, but it is used less frequently than standard deviation.
According to mathematicians, when a data set is of normal distribution—that is, there aren't many outliers—standard deviation is generally the preferable gauge of variability. But when there are large outliers, standard deviation registers higher levels of dispersion (or deviation from the center) than mean absolute deviation.
(It is also known as Third Moment Business Decision)
It measures the asymmetry in the data. The two types of Skewness are:
Positive/right-skewed: Data is said to be positively skewed if most of the data is concentrated to the left side and has a tail towards the right.
Negative/left-skewed: Data is said to be negatively skewed if most of the data is concentrated to the right side and has a tail towards the left.
We refer to it as positively or right skewed because the long tail is positioned on the positive, or right, side of the centre value. Skewness also aids our comprehension of how the data is concentrated on one side and the extent to which outliers deviate from the mean.
A right-skewed or positive distribution means its tail is more pronounced on the right side than on the left. Since the distribution is positive, the assumption is that its value is positive. As such, most of the values end up left of the mean. This means that the most extreme values are on the right side.
In statistics, a positively skewed (or right-skewed) distribution is a type of distribution in which most values are clustered around the left tail of the distribution while the right tail of the distribution is longer. The positively skewed distribution is the direct opposite of the negatively skewed distribution.
Negative or left-skewed means the tail is more pronounced on the left rather than the right. Most values are found on the right side of the mean in negative skewness. As such, the most extreme values are found further to the left.
Unlike normally distributed data, where all measures of central tendency (mean, median, and mode) equal each other, with positively skewed data the measures are dispersed. The general relationship among the central tendency measures in a positively skewed distribution may be expressed using the following inequality: mode < median < mean.
In contrast to a negatively skewed distribution, in which the mean is located on the left from the peak of distribution, in a positively skewed distribution, the mean can be found on the right from the distribution’s peak. However, not all negatively skewed distributions follow the rules. You may encounter many exceptions in real life that violate the rules.
Since a high level of skewness can generate misleading results from statistical tests, the extreme positive skewness is not desirable for a distribution. In order to overcome such a problem, data transformation tools may be employed to make the skewed data closer to a normal distribution.
For positively skewed distributions, the most popular transformation is the log transformation. The log transformation implies the calculations of the natural logarithm for each value in the dataset. The method reduces the skew of a distribution. Statistical tests are usually run only when the transformation of the data is complete.
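As a sketch (using synthetic right-skewed data, not data from the text), SciPy's skew function shows how a log transformation reduces positive skew:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # synthetic right-skewed data

print("Skewness before log transform:", skew(data))          # large positive value
print("Skewness after log transform: ", skew(np.log(data)))  # close to 0 (roughly normal)
```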
In finance, the concept of skewness is utilized in the analysis of the distribution of the returns of investments. Although many finance theories and models assume that the returns from securities follow a normal distribution, in reality, the returns are usually skewed.
The positive skewness of a distribution indicates that an investor may expect frequent small losses and a few large gains from the investment. The positively skewed distributions of investment returns are generally more desired by investors since there is some probability of gaining huge profits that can cover all the frequent small losses.
Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur. Excess kurtosis is the tailedness of a distribution relative to a normal distribution. Like skewness, kurtosis is a statistical measure that is used to describe distribution. Whereas skewness differentiates extreme values in one versus the other tail, kurtosis measures extreme values in either tail. Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution (e.g., five or more standard deviations from the mean). Distributions with low kurtosis exhibit tail data that are generally less extreme than the tails of the normal distribution.
In other words, Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
Along with skewness, kurtosis is an important descriptive statistic of data distribution. However, the two concepts must not be confused with each other. Skewness essentially measures the symmetry of the distribution, while kurtosis determines the heaviness of the distribution tails.
In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high risk for an investment because it indicates high probabilities of extremely large and extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low. For investors, high kurtosis of the return distribution implies the investor will experience occasional extreme returns (either positive or negative), more extreme than the usual + or - three standard deviations from the mean that is predicted by the normal distribution of returns. This phenomenon is known as kurtosis risk.
Excess kurtosis is a metric that compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula below:
Excess Kurtosis = Kurtosis – 3
Distributions with medium kurtosis (medium tails) are mesokurtic. Distributions with low kurtosis (thin tails) are platykurtic. Distributions with high kurtosis (fat tails) are leptokurtic. Tails are the tapering ends on either side of a distribution. They represent the probability or frequency of values that are extremely high or low compared to the mean. In other words, tails represent how often outliers occur.
Example: types of kurtosis. Distributions can be categorized into three groups based on their kurtosis. For example, suppose a zoologist plots the frequency distribution of a sample and finds that it approximately follows a normal distribution; normal distributions are mesokurtic. She calculates the kurtosis of the sample, finds that the kurtosis is 3.09 and the excess kurtosis is 0.09, and concludes that the distribution is mesokurtic.
Leptokurtic indicates a positive excess kurtosis. The leptokurtic distribution shows heavy tails on either side, indicating large outliers. In finance, a leptokurtic distribution shows that the investment returns may be prone to extreme values on either side. Therefore, an investment whose returns follow a leptokurtic distribution is considered to be risky.
A platykurtic distribution shows a negative excess kurtosis. The kurtosis reveals a distribution with flat tails. The flat tails indicate the small outliers in a distribution. In the finance context, the platykurtic distribution of the investment returns is desirable for investors because there is a small probability that the investment would experience extreme returns.
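A short sketch (synthetic data) comparing the three cases with SciPy; note that scipy.stats.kurtosis returns excess kurtosis by default (fisher=True), i.e. kurtosis minus 3:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
samples = {
    "normal (mesokurtic)":   rng.normal(size=100_000),         # excess kurtosis close to 0
    "t, df=5 (leptokurtic)": rng.standard_t(5, size=100_000),  # positive excess kurtosis, fat tails
    "uniform (platykurtic)": rng.uniform(size=100_000),        # negative excess kurtosis (about -1.2)
}

for name, data in samples.items():
    print(name, kurtosis(data))   # excess kurtosis = kurtosis - 3
```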
Three terms that students often confuse in statistics are percentiles, quartiles, and quantiles.
Here’s a simple definition of each:
Percentiles: Range from 0 to 100.
Quartiles: Range from 0 to 4.
Note that percentiles and quartiles share the following relationship:
- 0th percentile = 0th quartile (also called the minimum)
- 25th percentile = 1st quartile
- 50th percentile = 2nd quartile (also called the median)
- 75th percentile = 3rd quartile
- 100th percentile = 4th quartile (also called the maximum)
Quantiles: Range from any value to any other value.
Note that percentiles and quartiles are simply types of quantiles.
Some types of quantiles even have specific names, including:
- 4-quantiles are called quartiles.
- 5-quantiles are called quintiles.
- 8-quantiles are called octiles.
- 10-quantiles are called deciles.
- 100-quantiles are called percentiles.
The word “quantile” comes from the word quantity. In simple terms, a quantile is where a sample is divided into equal-sized, adjacent, subgroups (that’s why it’s sometimes called a “fractile“). It can also refer to dividing a probability distribution into areas of equal probability.
The median is a quantile; the median is placed in a probability distribution so that exactly half of the data is lower than the median and half of the data is above the median. The median cuts a distribution into two equal areas and so it is sometimes called 2-quantile.
Quartiles are also quantiles; they divide the distribution into four equal parts. Percentiles are quantiles that divide a distribution into 100 equal parts, and deciles are quantiles that divide a distribution into 10 equal parts.
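A quick NumPy sketch (made-up data) tying the three terms together:

```python
import numpy as np

data = np.arange(1, 101)   # made-up data: 1, 2, ..., 100

print(np.percentile(data, 25))   # 25th percentile = 1st quartile
print(np.median(data))           # 50th percentile = 2nd quartile = median
print(np.quantile(data, 0.75))   # 0.75 quantile = 75th percentile = 3rd quartile
print(np.quantile(data, [0.2, 0.4, 0.6, 0.8]))   # quintiles (5-quantiles)
```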
In the most general sense, an outlier is a data point which differs significantly from other observations.
To explain IQR Method easily, let’s start with a box plot.
A box plot tells us, more or less, about the distribution of the data. It gives a sense of how spread out the data actually is, what its range is, and how skewed it is. A box plot lets us draw inferences from ordered data, i.e., it tells us about the various metrics of a dataset arranged in ascending order. In a box plot,
- minimum is the minimum value in the dataset,
- and maximum is the maximum value in the dataset.
So the difference between the two tells us about the range of dataset.
- The median is the median (or centre point), also called second quartile, of the data (resulting from the fact that the data is ordered).
- Q1 is the first quartile of the data, i.e., 25% of the data lies between the minimum and Q1.
- Q3 is the third quartile of the data, i.e., 75% of the data lies between the minimum and Q3.
- The difference between Q3 and Q1 is called the Inter-Quartile Range or IQR.
To detect outliers using this method, we define a new range, let's call it the decision range, and any data point lying outside this range is considered an outlier and is dealt with accordingly. The range is given below:
- Lower Bound = Q1 - 1.5 * IQR
- Upper Bound = Q3 + 1.5 * IQR
Any data point less than the Lower Bound or more than the Upper Bound is considered an outlier.
But the question was: Why only 1.5 times the IQR? Why not any other number?
Well, as you might have guessed, the number (here 1.5, hereinafter scale) clearly controls the sensitivity of the range and hence the decision rule. A bigger scale would cause genuine outliers to be treated as ordinary data points, while a smaller one would cause some ordinary data points to be perceived as outliers. And we're quite sure neither of these cases is desirable.
But this is an abstract way of explaining the reason, it’s quite effective, but naive nonetheless. So to what should we turn our heads for hope?
Maths! Of course! (You saw that coming, right? 😐)
You might be surprised if I tell you that this number, or scale, depends on the distribution followed by the data.
For example, let’s say our data follows, our beloved, Gaussian Distribution.
You all must have seen what a Gaussian Distribution looks like: the classic symmetric bell curve centred on the mean.
There are certain observations which can be inferred from this distribution:
- About 68.26% of the whole data lies within one standard deviation (±1σ) of the mean (μ), taking both sides into account.
- About 95.44% of the whole data lies within two standard deviations (±2σ) of the mean (μ), taking both sides into account.
- About 99.72% of the whole data lies within three standard deviations (±3σ) of the mean (μ), taking both sides into account.
- The remaining 0.28% of the data lies beyond three standard deviations (>3σ) from the mean (μ), taking both sides into account, and this part of the data is considered as outliers.
- The first and the third quartiles, Q1 and Q3, lie at -0.675σ and +0.675σ from the mean, respectively.
Lower Bound:
= Q1 - 1 * IQR
= Q1 - 1 * (Q3 - Q1)
= -0.675σ - 1 * (0.675 - [-0.675])σ
= -0.675σ - 1 * 1.35σ
= -2.025σ
Upper Bound:
= Q3 + 1 * IQR
= Q3 + 1 * (Q3 - Q1)
= 0.675σ + 1 * (0.675 - [-0.675])σ
= 0.675σ + 1 * 1.35σ
= 2.025σ
So, when the scale is taken as 1, then according to the IQR method any data point which lies beyond 2.025σ from the mean (μ), on either side, shall be considered an outlier. But as we know, data up to 3σ on either side of μ is still useful. So we cannot take scale = 1, because this makes the decision range too exclusive, meaning it results in too many outliers. In other words, the decision range gets so small (compared to 3σ) that it flags some genuine data points as outliers, which is not desirable.
Lower Bound:
= Q1 - 2 * IQR
= Q1 - 2 * (Q3 - Q1)
= -0.675σ - 2 * (0.675 - [-0.675])σ
= -0.675σ - 2 * 1.35σ
= -3.375σ
Upper Bound:
= Q3 + 2 * IQR
= Q3 + 2 * (Q3 - Q1)
= 0.675σ + 2 * (0.675 - [-0.675])σ
= 0.675σ + 2 * 1.35σ
= 3.375σ
So, when the scale is taken as 2, then according to the IQR method any data point which lies beyond 3.375σ from the mean (μ), on either side, shall be considered an outlier. But as we know, data up to 3σ on either side of μ is useful. So we cannot take scale = 2, because this makes the decision range too inclusive, meaning it results in too few outliers. In other words, the decision range gets so big (compared to 3σ) that it treats some outliers as ordinary data points, which is not desirable either.
Lower Bound:
= Q1 - 1.5 * IQR
= Q1 - 1.5 * (Q3 - Q1)
= -0.675σ - 1.5 * (0.675 - [-0.675])σ
= -0.675σ - 1.5 * 1.35σ
= -2.7σ
Upper Bound:
= Q3 + 1.5 * IQR
= Q3 + 1.5 * (Q3 - Q1)
= 0.675σ + 1.5 * (0.675 - [-0.675])σ
= 0.675σ + 1.5 * 1.35σ
= 2.7σ
When scale is taken as 1.5, then according to IQR Method any data which lies beyond 2.7σ from the mean (μ), on either side, shall be considered as outlier. And this decision range is the closest to what Gaussian Distribution tells us, i.e., 3σ. In other words, this makes the decision rule closest to what Gaussian Distribution considers for outlier detection, and this is exactly what we wanted.
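This is not part of the original derivation, but a quick simulation sketch can verify the numbers above: for (approximately) normally distributed data, the 1.5 x IQR bounds sit near ±2.7σ and flag roughly 0.7% of the points.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # standard normal, so sigma = 1

q1, q3 = np.percentile(data, [25, 75])            # approximately -0.675 and +0.675
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # approximately -2.7 and +2.7

fraction_outside = np.mean((data < lower) | (data > upper))
print(lower, upper)        # close to +/- 2.7 sigma, as derived above
print(fraction_outside)    # roughly 0.007, i.e. about 0.7% of points flagged as outliers
```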
Instead of analyzing entire population data, we take a sample for analysis. The problem with sampling is that the sample mean is a random variable: it varies for different samples. And the random sample we draw can never be an exact representation of the population. This phenomenon is called sample variation.
To deal with sample variation, we use the central limit theorem. According to the Central Limit Theorem:
- The distribution of sample means follows a normal distribution if the population is normal.
- The distribution of sample means follows a normal distribution even if the population is not normal, provided the sample size is large enough.
- The grand average of all the sample mean values gives us the population mean.
- Theoretically, the sample size should be 30. Practically, the conditions on the sample size (n) are: n > 10(k3)^2, where k3 is the sample skewness, and n > 10(k4), where k4 is the sample kurtosis.
In probability theory, the central limit theorem (CLT) states that the distribution of a sample variable approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population's actual distribution shape.
Put another way, CLT is a statistical premise that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all sampled variables from the same population will be approximately equal to the mean of the whole population. Furthermore, these samples approximate a normal distribution, with their variances being approximately equal to the variance of the population as the sample size gets larger, according to the law of large numbers.
According to the central limit theorem, the mean of a sample of data will be closer to the mean of the overall population in question, as the sample size increases, notwithstanding the actual distribution of the data. In other words, the data is accurate whether the distribution is normal or aberrant.
As a general rule, sample sizes of around 30-50 are deemed sufficient for the CLT to hold, meaning that the distribution of the sample means is fairly normally distributed. Therefore, the more samples one takes, the more the graphed results take the shape of a normal distribution. Note, however, that the central limit theorem will still be approximated in many cases for much smaller sample sizes, such as n=8 or n=5.
The central limit theorem is often used in conjunction with the law of large numbers, which states that the average of the sample means and standard deviations will come closer to equaling the population mean and standard deviation as the sample size grows, which is extremely useful in accurately predicting the characteristics of populations.
- Sampling is random. All samples must be selected at random so that they have the same statistical possibility of being selected.
- Samples should be independent. The selections or results from one sample should have no bearing on future samples or other sample results.
- Samples should be limited. It's often cited that a sample should be no more than 10% of a population if sampling is done without replacement. In general, larger population sizes warrant the use of larger sample sizes.
- Sample size is increasing. The central limit theorem is relevant as more samples are selected.
The CLT is useful when examining the returns of an individual stock or broader indices, because the analysis is simple, due to the relative ease of generating the necessary financial data. Consequently, investors of all types rely on the CLT to analyze stock returns, construct portfolios, and manage risk.
Say, for example, an investor wishes to analyze the overall return for a stock index that comprises 1,000 equities. In this scenario, that investor may simply study a random sample of stocks to cultivate estimated returns of the total index. To be safe, at least 30-50 randomly selected stocks across various sectors should be sampled for the central limit theorem to hold. Furthermore, previously selected stocks must be swapped out with different names to help eliminate bias.
The central limit theorem is useful when analyzing large data sets because it allows one to assume that the sampling distribution of the mean will be normally-distributed in most cases. This allows for easier statistical analysis and inference. For example, investors can use central limit theorem to aggregate individual security performance data and generate distribution of sample means that represent a larger population distribution for security returns over a period of time.
A sample size of 30 is fairly common across statistics. A sample size of 30 often increases the confidence interval of your population data set enough to warrant assertions against your findings. The higher your sample size, the more likely the sample will be representative of your population set.
The central limit theorem doesn't have its own formula, but it relies on sample mean and standard deviation. As sample means are gathered from the population, standard deviation is used to distribute the data across a probability distribution curve.
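A minimal simulation sketch (with a synthetic, clearly non-normal population) of what the theorem claims: the sample means cluster around the population mean and their spread shrinks like sigma/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, non-normal population

n = 40            # sample size
n_samples = 5_000
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(n_samples)])

print(population.mean(), sample_means.mean())             # grand average of sample means ~ population mean
print(population.std() / np.sqrt(n), sample_means.std())  # spread of sample means ~ sigma / sqrt(n)
# A histogram of sample_means is approximately bell-shaped despite the skewed population.
```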
A complete list of all the units in a population is called the sampling frame. A unit of population is a relative term. If all the workers in a factory make up a population, a single worker is a unit of the population. If all the factories in a country are being studied for some purpose, a single factory is a unit of the population of factories. The sampling frame contains all the units of the population. The units which are to be included in the frame must be clearly defined. The frame provides a base for the selection of the sample.
A sampling frame is a list of all the items in your population. It’s a complete list of everyone or everything you want to study. The difference between a population and a sampling frame is that the population is general and the frame is specific. For example, the population could be “People who live in Jacksonville, Florida.” The frame would name all of those people, from Adrian Abba to Felicity Zappa. A couple more examples:
- Population: People in STAT101.
- Sampling Frame: Adrian, Anna, Bob, Billy, Howie, Jess, Jin, Kate, Kaley, Lin, Manuel, Norah, Paul, Roger, Stu, Tim, Vanessa, Yasmin.
- Population: Birds that are pink.
- Sampling Frame: Brown-capped Rosy-Finch, White-winged Crossbill, American Flamingo, Roseate Spoonbill, Black Rosy-Finch, Cassin’s Finch.
A sampling frame is a list of things that you draw a sample from. A sample space is a list of all possible outcomes for an experiment. For example, you might have a sampling frame of names of people in a certain town for a survey you’re going to be conducting on family size. The sample space is all possible outcomes from your survey: 1 person, 2 people, 3 people…10 or more.
A sampling distribution is a probability distribution of a statistic obtained from a larger number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.
In statistics, a population is the entire pool from which a statistical sample is drawn. A population may refer to an entire group of people, objects, events, hospital visits, or measurements. A population can thus be said to be an aggregate observation of subjects grouped together by a common feature.
A lot of data drawn and used by academicians, statisticians, researchers, marketers, analysts, etc. are actually samples, not populations. A sample is a subset of a population. For example, a medical researcher that wanted to compare the average weight of all babies born in North America from 1995 to 2005 to those born in South America within the same time period cannot draw the data for the entire population of over a million childbirths that occurred over the ten-year time frame within a reasonable amount of time. They will instead only use the weight of, say, 100 babies, in each continent to make a conclusion. The weight of 100 babies used is the sample and the average weight calculated is the sample mean.
In sample studies, we have to make a plan regarding the size of the sample, selection of the sample, collection of the sample data and preparation of the final results based on the sample study. The whole procedure involved is called the sample design. The term sample survey is used for a detailed study of the sample. In general, the term sample survey is used for any study conducted on the sample taken from some real world data.
Non-random sampling methods select locations for sampling according to regular (i.e., systematic) patterns, by targeting specific features or events, by using personal or anecdotal (possibly inaccurate) information, or without any specific plan. Care must be exercised when using non-random sample selection methods because the samples may not be representative of the entire population. If this is the case, then inference cannot extend beyond the set of sampling units. Some common non-random sample design techniques are discussed below. Unless otherwise stated, the primary reference for these discussions was Elzinga et al. (2001).
Random sampling methods rely on randomization at some point in the sample design process in an attempt to achieve statistically unbiased samples. Random sampling methods are a form of design-based inference, where:
- the population being measured is assumed to have fixed parameters at the time it is sampled, and
- a randomly-selected set of samples from the population represents one realization of all possible sample sets (i.e., the sample set is a random variable).
There are many different random sampling techniques. Some of the most common techniques are described below. Unless otherwise stated, the primary source for information on these methods is Elzinga et al. (2001).
Simple random sampling is the foundation for all of the other random sampling techniques. With this approach, all of the sampling units are enumerated and a specified number of sampling units is selected at random from all the sampling units. Selection of samples for simple random sampling follows two criteria:
- each sampling unit has the same likelihood of being selected, and
- the selection of one sampling unit does not influence the selection of any other sampling unit.
Disadvantages of simple random sampling
- it does not take into account variability caused by other measurable factors (e.g., aspect, soils, elevation)
- it can yield high variance estimates and make detection of differences difficult if the population being sampled is not evenly distributed throughout the sample area.
- it can be an inefficient means of sampling because of the time required to visit all of the sample sites
- by chance, some areas may be heavily sampled while other areas are not sampled at all
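A minimal sketch of simple random sampling with NumPy, using a hypothetical frame of 500 numbered units and satisfying the two criteria listed above:

```python
import numpy as np

sampling_frame = np.arange(1, 501)   # hypothetical frame of 500 numbered sampling units

rng = np.random.default_rng(0)
# Draw 30 units at random, without replacement: every unit is equally likely to be chosen
sample = rng.choice(sampling_frame, size=30, replace=False)
print(np.sort(sample))
```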
Stratification is the process of dividing a set of sampling units into one or more subgroups (i.e., strata) prior to selection of units for sampling. Sampling units are then selected randomly within each stratum. The purpose of using stratification is to account for variability in a population that can be explained by another variable (e.g., vegetation type, aspect, soil type). Therefore, strata should be defined so that the population conditions are similar within the strata.
Sampling effort does not need to be equally allocated between strata. It is common for sampling intensity to be varied between strata based on either the variability of the population parameter within the strata or the size of the strata.
A stratified sample is a mini-reproduction of the population. Before sampling, the population is divided into characteristics of importance for the research. For example, by gender, social class, education level, religion, etc. Then the population is randomly sampled within each category or stratum. If 38% of the population is college-educated, then 38% of the sample is randomly selected from the college-educated population. Stratified samples are as good as or better than random samples, but they require fairly detailed advance knowledge of the population characteristics, and therefore are more difficult to construct.
Advantages of stratified random sampling
- it increases efficiency of sampling over simple random sampling when the variable of interest responds differently to some clearly definable features.
Disadvantages of stratified random sampling
- the formulas for estimating population parameters and conducting hypothesis tests are more complicated than for simple random selection.
- each stratum should be relatively homogeneous with regard to the population parameter being measured.
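A minimal pandas sketch of proportional stratified sampling (hypothetical population and stratification variable), mirroring the 38% college-educated example above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = pd.DataFrame({
    "id": range(1_000),
    "education": rng.choice(["college", "no_college"], size=1_000, p=[0.38, 0.62]),
})

# Proportional allocation: randomly sample 10% within each stratum
stratified_sample = population.groupby("education").sample(frac=0.10, random_state=0)

print(population["education"].value_counts(normalize=True))         # stratum shares in the population
print(stratified_sample["education"].value_counts(normalize=True))  # the sample preserves those shares
```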
Systematic sampling is the selection of units for sampling or the placement of sampling locations within an area according to a regularly-repeating pattern. Examples of systematic sampling are: locating sample sites on a 1km grid within a pasture, taking measurements every meter along a transect, or orienting transects along cardinal directions. Systematic techniques are commonly used to locate sub-plot sampling sites (e.g., points, transects, frames) within a sampling site where the location of the sampling site has been selected randomly. Alternatively, larger sampling units can be selected systematically and then the location of the specific sampling unit randomly selected within the larger unit (i.e., a form of two-stage sampling or restricted random sampling – see below). This technique is often used with regional- or national-scale assessment and monitoring programs like the NRCS Natural Resource Inventory (NRI) or the USFS Forest Inventory and Analysis (FIA) programs.
Advantages of systematic sampling are:
- it allows even sampling across an area
- it is quick and easy to implement
- it is often more efficient than random sampling, and can perform as well or better than random methods in some situations (see Elzinga et al. (2001), p. 125)
- when combined with an appropriate randomization method, the data can be analyzed as if it were a random design

Disadvantages of systematic sampling are:
- it can yield biased data if there are regularly-occurring patterns in the population being sampled. For instance, when sampling for road impacts, transects oriented along cardinal directions may yield biased estimates of road impacts because many roads are oriented along cardinal directions too (M. Duniway, pers. comm.)
- systematic sampling can miss or under-represent small or narrow features of a landscape if the sampling interval is too large.
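A minimal sketch of systematic sampling on a hypothetical list of 1,000 ordered units: pick a random start, then take every k-th unit.

```python
import numpy as np

population = np.arange(1, 1_001)   # hypothetical 1,000 units in a fixed order

k = 20                             # sampling interval
rng = np.random.default_rng(3)
start = rng.integers(0, k)         # random start, so the design can be analyzed like a random one
systematic_sample = population[start::k]   # every k-th unit after the random start

print(len(systematic_sample), systematic_sample[:5])
```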
Cluster sampling is a technique that can be applied when it is not possible or desirable to take a random sample from the entire population. With cluster sampling, the known or accessible sampling units are grouped into clusters. A random selection of clusters is then made and each sampling unit is measured within each of the selected clusters. Cluster sampling is typically applied to monitoring of rare plants or invasive species when the objective is to estimate a property related to individual plants (e.g., mean height, number of flowers per plant).
Advantages of cluster sampling are:
- It can be less expensive and more efficient to sample all of the sampling units within a cluster than to sample an equal number of units across the entire population.
- Cluster sampling can be an efficient choice when clusters naturally occur and when the clusters are similar to each other but have a high degree of internal variability.

Disadvantages of cluster sampling are:
- All elements within the selected clusters must be measured. If clusters are large or contain a large number of elements, then two-stage sampling may be more efficient.
- It can be difficult to determine how many clusters to sample versus how large the clusters should be.
- Analysis of sample data collected using a cluster sampling design is more complex than for other methods.
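A minimal sketch of cluster sampling (entirely synthetic data): randomly pick a few clusters, then measure every unit inside each selected cluster.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population: 20 clusters (e.g., plant patches), 15 measured units per cluster
clusters = {c: rng.normal(loc=rng.uniform(20, 40), scale=2.0, size=15) for c in range(20)}

selected = rng.choice(list(clusters.keys()), size=4, replace=False)   # random selection of clusters
measurements = np.concatenate([clusters[c] for c in selected])        # measure every unit in each cluster

print(selected)
print(measurements.mean())   # estimate (e.g., mean plant height) based on the sampled clusters
```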
In two-stage sampling, elements of the population are grouped together into large groups called primary sampling units. The individual sampling units within each primary sampling unit are called secondary sampling units. A random selection of the primary sampling units is made, and then a selection of secondary sampling units is made (usually random, but can be systematic) within each of the selected primary sampling units.
Two-stage sampling is a powerful sample design method for systems that are hierarchical in nature. For example, allotments within a BLM District could be considered primary sampling units. A random selection of allotments could be made and then sample sites selected within the selected allotments. This design would allow for inference at the allotment level (e.g., average allotment condition) as well as at the district level.
The concept of two-stage sampling can be generalized to multi-stage sampling where there are more than two hierarchical levels for sampling. However, as the number of stages increases, sample size requirements go up and degrees of freedom for statistical hypothesis testing decrease. Accordingly, the number of stages is generally small (i.e., two or three).
Advantages of two-stage sampling are:
- It is often more efficient to sample secondary sampling units within a limited number of primary sampling units than to sample the same number of secondary units randomly spread across a landscape.
- Inferences can be made at multiple scales (i.e., the scale of the primary sampling unit, and the entire population).

Disadvantages of two-stage sampling are:
- Calculation of sample statistics is more complicated than with other simpler sample designs.
Adaptive sampling refers to a technique where the sample design is modified in the field based on observations made at a set of pre-selected sampling units. Perhaps the best way to describe adaptive sampling is through an example. Consider sampling for the presence or abundance of rare plants. A random selection of sample units will yield many sample units where the plant is not detected, but the rare plant is likely to occur in sample units nearby to those units where it was detected. With adaptive sampling, the detection of the rare plant at one site triggers the selection and sampling of additional nearby sites that were not originally selected as part of the sample set. Thus the biggest difference between adaptive sampling and many other random selection techniques is that the observed conditions at one sampling unit influence the selection of other sampling units.
One typical implementation of adaptive sampling is that whenever a specified event occurs (e.g., detection of a target species, measurement over a specified threshold), all of the neighboring sample units are searched/sampled. This continues until no new detections occur.
Adaptive sampling introduces bias into the samples that must be corrected for. More specifically, adding additional units to the sample that contain high values for the parameter being measured will result in overestimation of the population mean (Thompson 1992). Various techniques are available for correcting for the bias introduced by adaptive sampling.
Advantages of Adaptive Sampling
- It is an efficient method for sampling rare species or events.
- It works well with populations that are naturally aggregated or clustered and does not require the exact nature of the aggregation to be known ahead of time.

Disadvantages of Adaptive Sampling
- Population estimates must be corrected for bias.
- Calculations for population parameter estimates and hypothesis tests are more complicated than for simpler sampling designs.
- Estimation of sample size requirements is difficult.
A Binomial Distribution describes the probability of an event that only has 2 possible outcomes. For Example, Heads or tails.
It can also be used to describe the probability of a series of independent events that each have only 2 possible outcomes. For example, flipping a coin 10 times and having it land on heads exactly 5 times, or flipping a coin 10 times and having at least 5 land on heads.
Furthermore, the number of successes in an experiment with n independent trials (equivalently, n trials with f failures) follows a binomial distribution.
Requirements and Conditions for a Binomial Distribution
- Must be a fixed number of trials.
- Continuous data are not binomial.
- Probability of success is constant, i.e., the same on every trial.
- Two states: two possible outcomes (true or false, hot or cold, success or failure, defective or not defective).
- Independent trials – trials are statistically independent.
- (For example: The flip of one coin means nothing to the results of the next coin flip.)
- Use Binomial Distribution when you are sampling with replacement.
You Cannot Use Binomial Distribution On:
- When the probability of success is not constant for an event.
- Ex. The probability of it snowing or not snowing in NYC would not fit the criteria for a Binomial Distribution because the probability of success is not constant. The chance of snow on winter days is higher than summer days.
- When you sample WITHOUT replacement.
- In that case, use Hypergeometric.
If an honest (fair) coin is tossed 10 times, what is the probability of obtaining exactly four heads?
With a coin toss there are only 2 possible outcomes; heads or tails. So you should be on alert for using binomial.
A manufacturing process creates 3.4% defective parts. A sample of 10 parts from the production process is selected. What is the probability that the sample contains exactly 3 defective parts?
As soon as you see the word defective, you should be alert to using the Binomial equation. Since defect in this sense means that a part is in a binary state, either functioning or defective, it meets our criteria.
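A short sketch of both worked examples using scipy.stats.binom (with the raw formula shown for comparison):

```python
import math
from scipy.stats import binom

# Fair coin: P(exactly 4 heads in 10 tosses)
print(binom.pmf(4, 10, 0.5))                  # ~0.205
print(math.comb(10, 4) * 0.5**4 * 0.5**6)     # same value from C(n, k) * p^k * (1 - p)^(n - k)

# Manufacturing: 3.4% defective parts, sample of 10, P(exactly 3 defective)
print(binom.pmf(3, 10, 0.034))                # ~0.0037, i.e. about 0.37%
```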
Examples of Cumulative Binomial Probability Questions
Cumulative Calculation Example: “At least”, “Fewer Than”, or “At Most”
79% of the students of a large class passed the final exam. A random sample of 4 students is selected to be analyzed by the school. What is the probability that the sample contains fewer than 2 students that passed the exam?
Since we are examining data that only has a binary state, pass or fail, you should be on alert to using the binomial equation.
Also, since you are asked for ‘fewer than 2’, you have to add up the probabilities of every outcome that qualifies.
In this example we must calculate the probability of EXACTLY one pass PLUS the probability of EXACTLY zero passes in the sample.
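A hedged sketch of the cumulative calculation, again assuming scipy: sum the probabilities of exactly 0 and exactly 1 passes, or equivalently use the CDF at 1.

```python
from scipy.stats import binom

# Fewer than 2 passes in a sample of 4, with a pass rate of 79%:
# P(X < 2) = P(X = 0) + P(X = 1)
p_zero = binom.pmf(k=0, n=4, p=0.79)
p_one = binom.pmf(k=1, n=4, p=0.79)
print(f"P(X < 2) = {p_zero + p_one:.4f}")                # approx. 0.0312

# The same value via the cumulative distribution function
print(f"P(X <= 1) = {binom.cdf(k=1, n=4, p=0.79):.4f}")
```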
We all have heard of the idiom ‘odd one out’, which means something unusual in comparison to the others in a group.
Similarly, an Outlier is an observation in a given dataset that lies far from the rest of the observations. That means an outlier is vastly larger or smaller than the remaining values in the set.
An outlier may occur due to the variability in the data, or due to experimental error/human error.
They may indicate an experimental error or heavy skewness in the data (a heavy-tailed distribution).
In statistics, we have three measures of central tendency namely Mean, Median, and Mode. They help us describe the data.
Mean is the accurate measure to describe the data when we do not have any outliers present.
Median is used if there is an outlier in the dataset.
Mode is used if there is an outlier AND about ½ or more of the data is the same.
The mean is the measure of central tendency most affected by outliers, which in turn inflates the standard deviation.
If our dataset is small, we can detect the outlier by just looking at the dataset. But what if we have a huge dataset, how do we identify the outliers then? We need to use visualization and mathematical techniques.
Below are some of the techniques of detecting outliers
- Boxplots
- Z-score
- Interquartile Range (IQR)
Detecting outliers using the Z-score
Criteria: any data point whose Z-score falls outside the 3rd standard deviation is an outlier.
Steps: loop through all the data points and compute the Z-score using the formula (Xi − mean)/std, define a threshold value of 3, and mark the data points whose absolute Z-score is greater than the threshold as outliers.
```python
import numpy as np

# Illustrative sample data (the variable `sample` was undefined in the original snippet)
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

def detect_outliers_zscore(data, thres=3):
    outliers = []
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > thres:
            outliers.append(i)
    return outliers

# Driver code
sample_outliers = detect_outliers_zscore(sample)
print("Outliers from Z-scores method:", sample_outliers)
```
Detecting outliers using the Interquartile Range (IQR)
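The IQR method itself is not spelled out above, so here is a minimal sketch reusing the same illustrative sample as the Z-score snippet: a point is flagged if it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.

```python
import numpy as np

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

def detect_outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return [x for x in data if x < lower_bound or x > upper_bound]

print("Outliers from IQR method:", detect_outliers_iqr(sample))
```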
So far we have learned how to detect outliers. The main question is: WHAT do we do with the outliers?
Below are some of the methods of treating the outliers
- Trimming/removing the outliers
- Quantile-based flooring and capping
- Mean/Median imputation
Trimming/removing the outliers: in this technique, we remove the outliers from the dataset, although it is generally not a good practice to follow.
Quantile-based flooring and capping: in this technique, outliers on the high end are capped at the 90th percentile value and outliers on the low end are floored at the 10th percentile value, as in the snippet below.
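The snippet that produced the output quoted on the next line is missing from this copy, so here is a minimal reconstruction. The sample array is inferred so that the 10th and 90th percentiles come out to 7.2 and 20.7, matching the printed result (up to float formatting).

```python
import numpy as np

# Sample data inferred from the quoted output (illustrative)
sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])

tenth = np.percentile(sample, 10)      # 7.2 for this data
ninetieth = np.percentile(sample, 90)  # 20.7 for this data

# Floor values below the 10th percentile, cap values above the 90th
capped = np.where(sample < tenth, tenth,
                  np.where(sample > ninetieth, ninetieth, sample))
print("New array:", [round(float(x), 1) for x in capped])
```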
The above code outputs: New array: [15, 20.7, 18, 7.2, 13, 16, 11, 20.7, 7.2, 15, 10, 9]
The data points that are less than the 10th percentile are replaced with the 10th percentile value, and the data points that are greater than the 90th percentile are replaced with the 90th percentile value.
Mean/Median imputation: as the mean value is highly influenced by outliers, it is advised to replace the outliers with the median value.
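A brief sketch of median imputation, reusing the illustrative sample and the 3-standard-deviation rule from the Z-score section: flagged points are overwritten with the median instead of being dropped.

```python
import numpy as np

sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])

# Replace any point more than 3 standard deviations from the mean with the median
median = np.median(sample)
z_scores = (sample - sample.mean()) / sample.std()
imputed = np.where(np.abs(z_scores) > 3, median, sample)
print("After median imputation:", imputed)
```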
A Normal Q-Q plot is a kind of scatter plot that is created from two sets of quantiles. It is used to check whether the data follow a normal distribution.
On the x-axis we have the theoretical (Z-score) quantiles and on the y-axis the actual sample quantiles. If the points fall roughly on a straight line, the data are said to be approximately normal.
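As a small illustration (not from the original text), scipy's probplot can draw a Normal Q-Q plot; the data below are randomly generated purely for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Draw a sample and compare its quantiles against a normal distribution
data = np.random.normal(loc=50, scale=5, size=200)
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```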
Statisticians have observed that data frequently occur in familiar patterns, and so they have sought to understand and define them. Frequently seen patterns include the normal distribution, uniform distribution, binomial distribution, etc.
The cumulative distribution function (CDF) is the probability that a random variable, say X, will take a value equal to or less than x.
For example, if you roll a die, the probability of obtaining a 1 or 2 or 3 or 4 or 5 or 6 is 16.667% (=1/6) individually. The cumulative distribution function (CDF) of 1 is the probability that the next roll will take a value less than or equal to 1 and is 16.667%. There is only one possible way to get a 1. The cumulative distribution function (CDF) of 2 is the probability that the next roll will take a value less than or equal to 2. The cumulative distribution function (CDF) of 2 is 33.33% as there are two possible ways to get a 2 or below (the roll giving a 1 or 2).
Cumulative distribution function at x = 2 for the roll of a die.
The cumulative distribution function (CDF) of 6 is 100%. The CDF of 6 is the probability that the next roll will take a value less than or equal to 6, and it is equal to 100% because all possible results will be less than or equal to 6.
The probability density function (PDF) is the probability that a random variable, say X, will take a value exactly equal to x. Note the difference between the cumulative distribution function (CDF) and the probability density function (PDF) – here the focus is on one specific value, whereas for the cumulative distribution function we are interested in the probability of taking a value equal to or less than the specified value. For discrete random variables, such as the roll of a die, this function is usually called the probability mass function (PMF), so do not get perturbed if you encounter that term.
For example, if you roll a die, the probability of obtaining 1, 2, 3, 4, 5, or 6 is 16.667% (=1/6). The probability density function (PDF) or the probability that you will get exactly 2 will be 16.667%. Whereas, the cumulative distribution function (CDF) of 2 is 33.33% as described above.
The CDF is the probability that a random variable, say X, takes a value less than or equal to x, whereas the PMF is the probability that X takes a value exactly equal to x.
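A tiny sketch of the die example makes the distinction concrete; the dictionaries below are purely illustrative.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}                            # P(X = x)
cdf = {x: sum(pmf[k] for k in outcomes if k <= x) for x in outcomes}   # P(X <= x)

print(float(pmf[2]))  # 0.1667 -> probability of rolling exactly 2
print(float(cdf[2]))  # 0.3333 -> probability of rolling 2 or less
```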
Probability Mass Function vs Cumulative Distribution Function for Continuous Distributions and Discrete Distributions
We have seen above that the probability mass function is relevant in the case of discrete distributions (the roll of a die). Why is the probability of one exact value not meaningful in the case of continuous distributions?
There is an infinite number of values between the min and max in the case of continuous distributions, so the probability of any one specific value is 1/infinity, i.e. practically zero. This is why, for continuous distributions, we work with the probability density function and compute probabilities over intervals (or via the CDF) rather than at single points.
Correlation coefficients take values between minus one and plus one. A positive correlation signifies that the ranks of both variables increase together. On the other hand, a negative correlation signifies that as the rank of one variable increases, the rank of the other variable decreases.
Correlation analyses can be used to test for associations in hypothesis testing. The null hypothesis is that there is no association between the variables under study. Thus, the purpose is to investigate the possible association in the underlying variables. It would be incorrect to write the null hypothesis as having no rank correlation between the variables.
A Pearson correlation is used when assessing the relationship between two continuous variables. The non-parametric equivalent of the Pearson correlation is the Spearman correlation (ρ), which is appropriate when at least one of the variables is measured on an ordinal scale.
Inferential statistical procedures generally fall into two possible categorizations: parametric and non-parametric. Depending on the level of the data you plan to examine (e.g., nominal, ordinal, continuous), a particular statistical approach should be followed.
Parametric tests rely on the assumption that the data you are testing resembles a particular distribution (often a normal or “bell-shaped” distribution).
Non-parametric tests are frequently referred to as distribution-free tests because they do not impose strict assumptions about the distribution of the data.
As a general rule of thumb, when the dependent variable’s level of measurement is nominal (categorical) or ordinal, then a non-parametric test should be selected. When the dependent variable is measured on a continuous scale, then a parametric test should typically be selected. Fortunately, the most frequently used parametric analyses have non-parametric counterparts. This can be useful when the assumptions of a parametric test are violated because you can choose the non-parametric alternative as a backup analysis.
Kendall's rank correlation provides a distribution free test of independence and a measure of the strength of dependence between two variables.
Spearman's rank correlation is satisfactory for testing a null hypothesis of independence between two variables but it is difficult to interpret when the null hypothesis is rejected. Kendall's rank correlation improves upon this by reflecting the strength of the dependence between the variables being compared.
There are two accepted measures of non-parametric rank correlations: Kendall’s tau and Spearman’s (rho) rank correlation coefficient.
Correlation analyses measure the strength of the relationship between two variables.
Kendall’s Tau and Spearman’s rank correlation coefficient assess statistical associations based on the ranks of the data. To rank data, the values of each variable are separately put in order and numbered.
Kendall’s Tau:
- Usually has smaller values than Spearman’s rho.
- Calculations are based on concordant and discordant pairs.
- Insensitive to error.
- P-values are more accurate with smaller sample sizes.
Spearman’s rho:
- Usually has larger values than Kendall’s Tau.
- Calculations are based on deviations.
- Much more sensitive to error and discrepancies in the data.
The main advantages of using Kendall’s tau are as follows:
- The distribution of Kendall’s tau has better statistical properties.
- The interpretation of Kendall’s tau in terms of the probabilities of observing agreeable (concordant) and non-agreeable (discordant) pairs is very direct.
- In most situations, the interpretations of Kendall’s tau and Spearman’s rank correlation coefficient are very similar and thus invariably lead to the same inferences.
Kendall’s Tau is a non-parametric measure of relationships between columns of ranked data. The Tau correlation coefficient takes values between -1 and +1, where:
0 is no relationship and 1 is a perfect relationship. Negative values (from -1 to 0) indicate that the two rankings are inversely related, i.e. one ordering is roughly the reverse of the other.
Several versions of Tau exist.
- Tau-A and Tau-B are usually used for square tables (with equal numbers of columns and rows). Tau-B adjusts for tied ranks.
- Tau-C is usually used for rectangular tables. For square tables, Tau-B and Tau-C are essentially the same.
Most statistical packages have Tau-B built in, but you can use the following formula to calculate it by hand:
Kendall’s Tau = (C − D) / (C + D), where C is the number of concordant pairs and D is the number of discordant pairs.
Example Problem
Sample Question: Two interviewers ranked 12 candidates (A through L) for a position. The results from most preferred to least preferred are:
Interviewer 1: ABCDEFGHIJKL. Interviewer 2: ABDCFEHGJILK. Calculate the Kendall Tau correlation.
Step 1: Make a table of rankings. The first column, “Candidate”, is optional and for reference only. The rankings for Interviewer 1 should be in ascending order (from least to greatest). A computational sketch of the full calculation is shown below.
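As a computational sketch of this example (assuming scipy), convert each interviewer's ordering into numeric ranks for candidates A through L and let kendalltau count the concordant and discordant pairs.

```python
from scipy.stats import kendalltau

# Ranks assigned to candidates A..L by each interviewer
interviewer_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # order: A B C D E F G H I J K L
interviewer_2 = [1, 2, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]  # order: A B D C F E H G J I L K

tau, p_value = kendalltau(interviewer_1, interviewer_2)
print(f"Kendall's tau = {tau:.2f}")  # (C - D) / (C + D) = (61 - 5) / 66, approx. 0.85
```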
Concordant pairs and discordant pairs refer to comparing two pairs of data points to see if they “match.” The meaning is slightly different depending on if you are finding these pairs from various coefficients (like Kendall’s Tau) or if you are performing experimental studies and clinical trials.
- In Coefficient Calculations: concordant and discordant pairs are used in Kendall’s Tau, in Goodman and Kruskal’s Gamma, and in logistic regression. They are calculated for ordinal (ordered) variables and tell you whether there is agreement (or disagreement) between scores. To calculate concordance or discordance, your data must be ordered and placed into pairs.
Example of Tied, Concordant and Discordant Pairs: suppose two interviewers rate a group of twelve job applicants, as in the ranking example above.
Symbolically, Spearman’s rank correlation coefficient is denoted by rs . It is given by the following formula:
r_s = 1 − (6 Σ d_i²) / (n (n² − 1))
*Here d_i represents the difference between the two ranks assigned to each item of the data, and n is the number of items.
This formula applies when there are no tied ranks. However, when the number of tied ranks is small, it still provides a sufficiently good approximation of Spearman’s rank correlation coefficient.
- Key terms:
Non-parametric test: it does not depend upon the assumptions of various underlying distributions; this means that it is distribution free.
Concordant pairs: if both members of one observation are larger than their respective members of the other observation
Discordant pairs: if the two numbers in one observation differ in opposite directions
The following guide covers when you should use Spearman's rank-order correlation to analyse your data, what assumptions you have to satisfy, how to calculate it, and how to report it.
When should you use the Spearman's rank-order correlation? The Spearman's rank-order correlation is the nonparametric version of the Pearson product-moment correlation. Spearman's correlation coefficient, (ρ, also signified by rs) measures the strength and direction of association between two ranked variables.
What are the assumptions of the test? You need two variables that are either ordinal, interval or ratio. Although you would normally hope to use a Pearson product-moment correlation on interval or ratio data, the Spearman correlation can be used when the assumptions of the Pearson correlation are markedly violated. However, Spearman's correlation determines the strength and direction of the monotonic relationship between your two variables, rather than the strength and direction of the linear relationship, which is what Pearson's correlation determines.
What is a monotonic relationship? A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the value of the other variable decreases. A relationship that changes direction (for example, one that rises and then falls) is non-monotonic.
Why is a monotonic relationship important to Spearman's correlation? Spearman's correlation measures the strength and direction of a monotonic association between two variables. Monotonicity is "less restrictive" than linearity; for example, a relationship that increases at a steadily changing rate is monotonic but not linear.
A monotonic relationship is not strictly an assumption of Spearman's correlation. That is, you can run a Spearman's correlation on a non-monotonic relationship to determine if there is a monotonic component to the association. However, you would normally pick a measure of association, such as Spearman's correlation, that fits the pattern of the observed data. That is, if a scatterplot shows that the relationship between your two variables looks monotonic you would run a Spearman's correlation because this will then measure the strength and direction of this monotonic relationship. On the other hand if, for example, the relationship appears linear (assessed via scatterplot) you would run a Pearson's correlation because this will measure the strength and direction of any linear relationship. You will not always be able to visually check whether you have a monotonic relationship, so in this case, you might run a Spearman's correlation anyway.
How to rank data? In some cases your data might already be ranked, but often you will find that you need to rank the data yourself (or use statistical software to do it for you). Thankfully, ranking data is not a difficult task and is easily achieved by working through your data in a table. Let us consider example data regarding the marks achieved by ten students in a maths and an English exam.
You need to rank the scores for maths and English separately. The score with the highest value should be labelled "1" and the lowest score should be labelled "10" (if your data set has more than 10 cases, the lowest score's rank will equal the number of cases you have). Suppose two individuals scored 61 in the English exam; they would share the joint rank of 6.5. This is because when you have two identical values in the data (called a "tie"), you need to take the average of the ranks that they would have otherwise occupied. We do this because we have no way of knowing which score should be put in rank 6 and which should be ranked 7. Therefore, the ranks of 6 and 7 do not exist for English; these two ranks have been averaged ((6 + 7)/2 = 6.5) and assigned to each of these "tied" scores. A small sketch of this ranking step follows.
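A small sketch of the ranking step, using hypothetical English marks that contain a tie at 61; scipy's rankdata averages tied ranks exactly as described.

```python
from scipy.stats import rankdata

# Hypothetical English marks containing a tie at 61
english = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]

# rankdata assigns rank 1 to the smallest value and averages tied ranks;
# subtracting from (n + 1) gives rank 1 to the highest score instead
ranks_low_to_high = rankdata(english, method="average")
ranks_high_to_low = len(english) + 1 - ranks_low_to_high
print(ranks_high_to_low)  # the two scores of 61 share the averaged rank 6.5
```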
What is the definition of Spearman's rank-order correlation? There are two methods to calculate Spearman's correlation, depending on whether: (1) your data does not have tied ranks or (2) your data has tied ranks. The formula for when there are no tied ranks is the one given above: r_s = 1 − (6 Σ d_i²) / (n (n² − 1)).
What values can the Spearman correlation coefficient, rs, take? The Spearman correlation coefficient, rs, can take values from +1 to -1. A rs of +1 indicates a perfect association of ranks, a rs of zero indicates no association between ranks and a rs of -1 indicates a perfect negative association of ranks. The closer rs is to zero, the weaker the association between the ranks.
An example of calculating Spearman's correlation: to calculate a Spearman rank-order correlation on data without any ties, we can use a small set of exam marks, as sketched below.
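The original data table is not reproduced here, so the sketch below uses hypothetical maths and English marks for ten students and lets scipy compute rho.

```python
from scipy.stats import spearmanr

# Hypothetical maths and English marks for ten students (no tied ranks)
maths   = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
english = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

rho, p_value = spearmanr(maths, english)
print(f"Spearman's rho = {rho:.2f}, p-value = {p_value:.3f}")  # rho approx. 0.67 for these marks
```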
Pearson's correlation is a measure of the linear relationship between two continuous random variables. It does not assume normality, although it does assume finite variances and finite covariance. When the variables are bivariate normal, Pearson's correlation provides a complete description of the association.
Spearman's correlation applies to ranks and so provides a measure of a monotonic relationship between two continuous random variables. It is also useful with ordinal data and is robust to outliers (unlike Pearson's correlation).
The distribution of either correlation coefficient will depend on the underlying distribution, although both are asymptotically normal because of the central limit theorem.
Permutation and combination form the principles of counting and they are applied in various situations. A permutation is a count of the different arrangements which can be made from a given set of things. In a permutation the details matter, as the order or sequence is important. Writing the names of three countries as {USA, Brazil, Australia}, {Australia, USA, Brazil} or {Brazil, Australia, USA} is different, and the sequence in which the names of the countries are written is important. In combinations, the three countries form just a single group, and the sequence or order does not matter. Let us learn more about permutation and combination in the content below.
Permutation and combination are the methods employed in counting how many outcomes are possible in various situations. Permutations are understood as arrangements and combinations are understood as selections. As per the fundamental principle of counting, there are the sum rule and the product rule, which make counting easier.
Suppose there are 14 boys and 9 girls. If a boy or a girl has to be selected to be the monitor of the class, the teacher can select 1 out of 14 boys or 1 out of 9 girls. She can do it in 14 + 9 = 23 ways (using the sum rule of counting). Let us look at another scenario. Suppose Sam usually takes one main course and a drink. Today he has the choice of burger, pizza, hot dog, watermelon juice, and orange juice. What are all the possible combinations that he can try? There are 3 food choices and 2 drink choices. We multiply to find the combinations: 3 × 2 = 6. Thus Sam can try 6 combinations, using the product rule of counting. This can also be shown using a tree diagram.
In order to understand permutation and combination, the concept of factorials has to be recalled. The product of the first n natural numbers is n! The number of ways of arranging n unlike objects is n!.
A permutation is an arrangement, in a definite order, of a number of objects taken some or all at a time. Let us take 10 numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The number of different 4-digit PINs which can be formed using these 10 numbers, without repeating a digit, is 5040: P(10, 4) = 5040. This is a simple example of permutations. The number of permutations of 4 numbers taken from 10 numbers equals the factorial of 10 divided by the factorial of the difference of 10 and 4. In general, permutations are easily calculated using P(n, r) = n! / (n − r)!, so P(10, 4) = 10!/6! = 5040.
A combination is all about grouping. The number of different groups which can be formed from the available things can be calculated using combinations. Let us try to understand this with a simple example. A team of 2 is formed from 5 students (William, James, Noah, Logan, and Oliver). The number of combinations of 'r' persons from the available 'n' persons is given as C(n, r) = n! / (r! (n − r)!). The team of 2 could be formed in the following 10 ways:
- William James
- William Noah
- William Logan
- William Oliver
- James Noah
- James Logan
- James Oliver
- Logan Noah
- Logan Oliver
- Oliver Noah
This is a simple example of combinations: C(5, 2) = 10.
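Both headline results can be verified with Python's standard library; math.perm and math.comb implement the formulas above.

```python
from math import comb, perm

# Number of 4-digit PINs from 10 digits (order matters, no repetition)
print(perm(10, 4))  # 5040 = 10! / (10 - 4)!

# Number of 2-person teams from 5 students (order does not matter)
print(comb(5, 2))   # 10 = 5! / (2! * 3!)
```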