1: Defining Exploratory Data Analysis with an overview of the whole project.
2: Importing libraries and Exploring the Dataset.
3: Checking missing values and Outliers.
4: Creating visual methods to analyze the data.
5: Analyzing trends, patterns, and relationships in the data; hypothesis testing.
In statistics, exploratory data analysis is an approach to analyzing
data sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what
the data can tell us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis was promoted by John Tukey to encourage statisticians
to explore the data, and possibly formulate hypotheses that could lead to new
data collection and experiments. EDA is different from initial data analysis (IDA),
which focuses more narrowly on checking assumptions required for model fitting and
hypothesis testing, and handling missing values and making transformations of variables
as needed. EDA encompasses IDA.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
import copy
sns.set()
insurance_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Expected output:
The data should consist of 1338 instances with 7 attributes: 2 integer columns, 2 float columns, and 3 object (string) columns.
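insurance_df.isna().apply(pd.Series.value_counts) # Count missing (True) and present (False) values per column; the top-level pd.value_counts is deprecated in recent pandas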
insurance_df.describe().T
Output should include this analysis:
- All the statistics seem reasonable.
- Age column: with a mean of about 39, the data looks representative of the age distribution of the adult population.
- Children column: few people have more than 2 children (75% of the people have 2 or fewer).
- Charges column: the claimed amount is highly skewed, since most people need only basic medical care and only a few have conditions that are expensive to treat.
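The outline lists an outlier check, but no outlier code appears in this step. A minimal sketch of one common approach, the 1.5 x IQR rule, is shown here. The helper name `iqr_outlier_counts` and the `demo` frame are hypothetical, and the values are synthetic rather than taken from the insurance data.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()

# Synthetic illustrative values only, not rows from the insurance data.
demo = pd.DataFrame({
    "bmi": [22.0, 25.5, 27.1, 30.2, 31.0, 28.4, 26.3, 29.9],
    "charges": [1200.0, 1500.0, 1800.0, 2100.0, 2500.0, 3000.0, 3500.0, 60000.0],
})
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

On the real data the same helper could be called as `iqr_outlier_counts(insurance_df)`. For a heavily skewed column such as charges, points flagged this way may be genuine high claims rather than errors.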
Output should include this analysis:
- bmi looks normally distributed.
- Age looks uniformly distributed.
- As seen in the previous step, charges are highly skewed.
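The visual impression of symmetry or skew can be backed by a number. The sketch below computes sample skewness with pandas on synthetic columns (a roughly normal one and a right-skewed one); the column names `bmi_like` and `charges_like` are hypothetical stand-ins, not the insurance data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one symmetric column and one right-skewed column.
demo = pd.DataFrame({
    "bmi_like": rng.normal(loc=30, scale=6, size=2000),
    "charges_like": rng.lognormal(mean=9, sigma=1, size=2000),
})

# Skewness near 0 suggests symmetry; a large positive value suggests a long right tail.
skewness = demo.skew()
print(skewness.round(2))
```

On the real data, `insurance_df[['bmi', 'age', 'charges']].skew()` would give the corresponding figures.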
Output should include this analysis:
- There are far more non-smokers than smokers.
- Instances are distributed evenly across all regions.
- Gender is also distributed evenly.
- Most instances have fewer than 3 children, and very few have 4 or 5 children.
# Label encoding the variables before doing a pairplot because pairplot ignores strings
insurance_df_encoded = copy.deepcopy(insurance_df)
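# Assign whole columns (rather than through .loc) so the encoded columns take an integer dtype;
# in recent pandas, .loc assignment can keep the original object dtype, and pairplot would then skip them
insurance_df_encoded[['sex', 'smoker', 'region']] = insurance_df_encoded[['sex', 'smoker', 'region']].apply(LabelEncoder().fit_transform)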
sns.pairplot(insurance_df_encoded) #pairplot
plt.show()
Output should include this analysis:
- There is an obvious correlation between 'charges' and 'smoker'.
- Smokers appear to have claimed more money than non-smokers.
- There is an interesting pattern between 'age' and 'charges': older people are charged more than younger ones.
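The smoker versus non-smoker gap seen in the pairplot can also be summarized numerically. A minimal sketch with a group-wise summary follows; the `demo` frame holds synthetic values chosen for illustration, not rows from the insurance data.

```python
import pandas as pd

# Synthetic illustrative values only, not rows from the insurance data.
demo = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no", "no"],
    "charges": [30000.0, 35000.0, 40000.0, 5000.0, 7000.0, 9000.0, 11000.0],
})

# Count, mean, and median of charges within each smoker group.
charges_by_smoker = demo.groupby("smoker")["charges"].agg(["count", "mean", "median"])
print(charges_by_smoker)
```

The median is included because a skewed column such as charges can have a mean that is pulled up by a few large claims.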
print("Do charges of people who smoke differ significantly from the people who don't?")
insurance_df.smoker.value_counts()
Do charges of people who smoke differ significantly from the people who don't?
no 1064
yes 274
Name: smoker, dtype: int64
There is no apparent relation between gender and charges
# T-test to check whether charges depend on smoking
Ho = "Charges of smoker and non-smoker are same" # Stating the Null Hypothesis
Ha = "Charges of smoker and non-smoker are not the same" # Stating the Alternate Hypothesis
x = np.array(insurance_df[insurance_df.smoker == 'yes'].charges) # Selecting charges corresponding to smokers as an array
y = np.array(insurance_df[insurance_df.smoker == 'no'].charges) # Selecting charges corresponding to non-smokers as an array
t, p_value = stats.ttest_ind(x,y, axis = 0) #Performing an Independent t-test
if p_value < 0.05: # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value}) > 0.05')
Charges of smoker and non-smoker are not the same as the p_value (8.271435842177219e-283) < 0.05
Thus, smokers seem to claim significantly more money than non-smokers.
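The test above uses the default equal-variance form of `stats.ttest_ind`. Because the two groups differ a lot in size (274 versus 1064), a variant worth considering is Welch's t-test (`equal_var=False`), optionally one-sided to match the claim that smokers are charged more. The sketch below uses synthetic groups with arbitrary means and spreads; only the group sizes mirror the counts shown earlier, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic groups with unequal sizes and unequal spreads (arbitrary units).
group_a = rng.normal(loc=30.0, scale=10.0, size=274)   # stands in for smokers
group_b = rng.normal(loc=10.0, scale=5.0, size=1064)   # stands in for non-smokers

# Welch's t-test: does not assume the two groups share a variance.
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-sided version: tests whether the mean of group_a is greater than that of group_b.
one_sided = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print(f"Welch t = {welch_t:.2f}, two-sided p = {welch_p:.3g}")
print(f"one-sided p = {one_sided.pvalue:.3g}")
```

On the real data, `x` and `y` from the cell above would take the place of `group_a` and `group_b`.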
#Does bmi of males differ significantly from that of females?
print ("Does bmi of males differ significantly from that of females?")
insurance_df.sex.value_counts() #Checking the distribution of males and females
Does bmi of males differ significantly from that of females?
male 676
female 662
Name: sex, dtype: int64
# T-test to check dependency of bmi on gender
Ho = "Gender has no effect on bmi" # Stating the Null Hypothesis
Ha = "Gender has an effect on bmi" # Stating the Alternate Hypothesis
x = np.array(insurance_df[insurance_df.sex == 'male'].bmi) # Selecting bmi values corresponding to males as an array
y = np.array(insurance_df[insurance_df.sex == 'female'].bmi) # Selecting bmi values corresponding to females as an array
t, p_value = stats.ttest_ind(x,y, axis = 0) #Performing an Independent t-test
if p_value < 0.05: # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05') # round(3), not round(), so a small p-value is not printed as 0.0
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
Gender has no effect on bmi as the p_value (0.09) > 0.05
The mean bmi of males and females does not differ significantly at the 5% level (this does not show that the two are identical).
#Is the proportion of smokers significantly different in different genders?
# Chi_square test to check if smoking habits are different for different genders
Ho = "Gender has no effect on smoking habits" # Stating the Null Hypothesis
Ha = "Gender has an effect on smoking habits" # Stating the Alternate Hypothesis
crosstab = pd.crosstab(insurance_df['sex'],insurance_df['smoker']) # Contingency table of sex and smoker attributes
chi, p_value, dof, expected = stats.chi2_contingency(crosstab)
if p_value < 0.05: # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
crosstab
Gender has an effect on smoking habits as the p_value (0.007) < 0.05
Proportion of smokers in males is significantly different from that of the females
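The chi-square result says the proportions differ but not by how much. A row-normalized contingency table shows the proportions directly. This sketch uses a tiny synthetic frame (`demo`), not the insurance data.

```python
import pandas as pd

# Synthetic illustrative values only, not rows from the insurance data.
demo = pd.DataFrame({
    "sex": ["male"] * 5 + ["female"] * 5,
    "smoker": ["yes", "yes", "no", "no", "no", "yes", "no", "no", "no", "no"],
})

# normalize="index" divides each row by its total, giving the share of smokers within each sex.
proportions = pd.crosstab(demo["sex"], demo["smoker"], normalize="index")
print(proportions)
```

On the real data, `pd.crosstab(insurance_df['sex'], insurance_df['smoker'], normalize='index')` would show the share of smokers within each gender. Note also that `stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default (`correction=True`).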
# Chi_square test to check if smoking habits are different for people of different regions
Ho = "Region has no effect on smoking habits" # Stating the Null Hypothesis
Ha = "Region has an effect on smoking habits" # Stating the Alternate Hypothesis
crosstab = pd.crosstab(insurance_df['smoker'], insurance_df['region']) # Contingency table of smoker and region attributes
chi, p_value, dof, expected = stats.chi2_contingency(crosstab)
if p_value < 0.05: # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
crosstab
Region has no effect on smoking habits as the p_value (0.062) > 0.05
- Smoking habits of people in different regions do not differ significantly.
# Is the mean bmi of women with no children, one child, and two children the same?
# One-way ANOVA to test whether mean bmi differs among females with different numbers of children
Ho = "No. of children has no effect on bmi" # Stating the Null Hypothesis
Ha = "No. of children has an effect on bmi" # Stating the Alternate Hypothesis
female_df = copy.deepcopy(insurance_df[insurance_df['sex'] == 'female'])
zero = female_df[female_df.children == 0]['bmi']
one = female_df[female_df.children == 1]['bmi']
two = female_df[female_df.children == 2]['bmi']
f_stat, p_value = stats.f_oneway(zero,one,two)
if p_value < 0.05: # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
No. of children has no effect on bmi as the p_value (0.716) > 0.05
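One-way ANOVA (`stats.f_oneway`) assumes roughly equal variances and approximately normal groups. A minimal sketch of two complementary checks is given here: Levene's test for equal variances, and the Kruskal-Wallis test as a rank-based alternative. The three groups are synthetic draws from the same distribution with arbitrary sizes; they are not the bmi values from the insurance data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic groups drawn from one distribution (hypothetical stand-ins for zero, one, two).
zero_like = rng.normal(30.0, 6.0, size=120)
one_like = rng.normal(30.0, 6.0, size=80)
two_like = rng.normal(30.0, 6.0, size=60)

# Levene's test: null hypothesis is that all groups have equal variances.
levene_stat, levene_p = stats.levene(zero_like, one_like, two_like)

# Kruskal-Wallis: rank-based test that does not assume normality.
kruskal_stat, kruskal_p = stats.kruskal(zero_like, one_like, two_like)

print(f"Levene p = {levene_p:.3f}, Kruskal-Wallis p = {kruskal_p:.3f}")
```

On the real data, the `zero`, `one`, and `two` series from the cell above could be passed to the same two functions.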
Website- RakibHHridoy