- Datasets
- Preprocessing Data in Python
- Exploratory Data Analysis (EDA)
- Model Development
- Model Evaluation and Refinement
Understanding Datasets
Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/
Data Format | Read | Save |
---|---|---|
csv | pd.read_csv() | df.to_csv() |
json | pd.read_json() | df.to_json() |
Excel | pd.read_excel() | df.to_excel() |
sql | pd.read_sql() | df.to_sql() |
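For example, a minimal sketch of reading and saving the automobile data (the exact file name and the absence of a header row are assumptions based on the UCI directory above):

import pandas as pd

# Assumed path: the imports-85.data file inside the UCI directory listed above.
# The raw file has no header row, so header=None keeps the first data row intact.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)

# Optionally assign meaningful column names afterwards:
# df.columns = headers  # where headers is a list of column names

# Save a local copy; index=False avoids writing the row index as an extra column
df.to_csv("automobile.csv", index=False)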
Basic insights from the data
- Understand your data before you begin any analysis
- Should check:
  - data types: df.dtypes
  - data distribution: df.describe(); df.describe(include="all") provides the full summary statistics, including unique, top, and freq for categorical columns
- Check data types to:
  - locate potential issues with the data
  - spot potential info and type mismatches
  - check compatibility with python methods
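A minimal sketch of these checks (assuming the DataFrame df from the read step above):

# Data type of each column
print(df.dtypes)

# Summary statistics for numeric columns only
print(df.describe())

# Full summary, including unique, top, and freq for categorical columns
print(df.describe(include="all"))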
Preprocessing Data in Python
- Identify and handle missing values
- Data formatting
- Data normalization (centering / scaling)
- Data binning
- Turning categorical values to numeric variables
Dealing with missing values:
- Check with the data collection source
- Drop the missing values
  - drop the variable
  - drop the data entry
- Replace the missing values (see the replacement sketch after the dropna example below)
  - replace it with an average (of similar data points)
  - replace it with the most frequent value (mode)
  - replace it based on other functions
- Leave it as missing data
df.dropna(subset=["price"], axis=0, inplace=True)
is equivalent to
df = df.dropna(subset=["price"], axis=0)
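A minimal sketch of replacing missing values with the column mean (the normalized-losses column is an illustrative choice, not mandated by the notes above):

import numpy as np
import pandas as pd

# Convert to numeric (non-numeric entries such as "?" become NaN),
# compute the mean of the valid values, then fill the gaps with it
df["normalized-losses"] = pd.to_numeric(df["normalized-losses"], errors="coerce")
mean_value = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean_value)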
Non-formatted:
- confusing
- hard to aggregate
- hard to compare
Formatted:
- more clear
- easy to aggregate
- easy to compare
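A small sketch of formatting work, e.g. converting fuel consumption from mpg to L/100km and renaming the column so the unit is explicit (the city-mpg column name is an assumption, and it must already be numeric):

# Convert miles-per-gallon to litres per 100 km, then rename the column
df["city-mpg"] = 235 / df["city-mpg"]
df = df.rename(columns={"city-mpg": "city-L/100km"})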
Correcting data types
- use df.dtypes to identify the data type of each column
- use df.astype() to convert a column to a new data type
  - e.g. df["price"] = df["price"].astype("int")
Approaches for normalization:
- Simple feature scaling: x_new = x_old / x_max
df["length"] = df["length"] / df["length"].max()
- Min-Max: x_new = (x_old - x_min) / (x_max - x_min)
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())
- Z-score: x_new = (x_old - μ) / σ
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()
import numpy as np

# Three equal-width bins over the price range, labelled Low/Medium/High
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
Exploratory Data Analysis (EDA)
- Question:
- "What are the characteristics which have the most impact on the car price?"
- Preliminary step in data analysis to:
- Summarize main characteristics of the data
- Gain better understanding of the data set
- Uncover relationships between variables
- Extract important variables
Learning Objectives:
- Descriptive Statistics
- GroupBy
- Correlation
- Correlation - Statistics
- Summarize numerical data using the pandas describe() method: df.describe()
- Summarize categorical data using the value_counts() method
- Box Plot
- Scatter Plot
  - each observation represented as a point
  - scatter plots show the relationship between two variables
  - predictor/independent variable on the x-axis
  - target/dependent variable on the y-axis
- use the df.groupby() method:
  - can be applied to categorical variables
  - groups data into categories
  - works on single or multiple variables
A table of this form isn't the easiest to read, and it's not very easy to visualize.
To make it easier to understand, we can transform this table into a pivot table by using the pivot() method.
The price data now becomes a rectangular grid, which is easier to visualize. This is similar to what is usually done in Excel spreadsheets. Another way to represent the pivot table is using a heat map plot.
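A minimal sketch of this groupby, pivot, and heat map flow (the grouping columns are illustrative assumptions):

import matplotlib.pyplot as plt

# Average price for each (drive-wheels, body-style) combination
df_grp = df[["drive-wheels", "body-style", "price"]].groupby(
    ["drive-wheels", "body-style"], as_index=False
).mean()

# Reshape the long table into a rectangular grid: one row per drive-wheels,
# one column per body-style, with the mean price in each cell
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")

# Visualize the grid as a heat map
plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()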
Correlation
The correlation coefficient reflects the noisiness and direction of a linear relationship, but not the slope of that relationship, nor many aspects of nonlinear relationships. N.B.: when Y is exactly constant (a flat line), the correlation coefficient is undefined because the variance of Y is zero.
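A quick sketch of computing a Pearson correlation coefficient and p-value with scipy (the two columns are illustrative and assumed to be numeric with no missing values):

from scipy import stats

# Coefficient close to +1/-1 => strong linear relationship;
# small p-value => the coefficient is statistically significant
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print(pearson_coef, p_value)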
Categorical variables
- use the Chi-square Test for Association (denoted χ²)
- The test is intended to test how likely it is that an observed distribution is due to chance
Chi-Square Test of association
- The Chi-square tests a null hypothesis that the variables are independent.
- The Chi-square does not tell you the type of relationship that exists between both variables; but only that a relationship exists.
See also: Chi-Square Test of Independence
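A minimal sketch of the test using scipy and a contingency table (the two categorical columns are assumptions for illustration):

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of observed counts for two categorical variables
cont_table = pd.crosstab(df["fuel-type"], df["aspiration"])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the expected frequencies under independence
chi2, p_value, dof, expected = chi2_contingency(cont_table)
print(chi2, p_value)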
Model Development
- simple linear regression
- multiple linear regression
- polynomial regression
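A minimal sketch of simple and multiple linear regression with scikit-learn (the predictor columns are illustrative):

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

# Simple linear regression: one predictor
lm.fit(df[["highway-mpg"]], df["price"])
print(lm.intercept_, lm.coef_)

# Multiple linear regression: several predictors
lm.fit(df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]], df["price"])
yhat = lm.predict(df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]])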
Regression plot gives us a good estimate of:
- the relationship between two variables
- the strength of the correlation
- the direction of the relationship (positive or negative)
Regression plot shows us a combination of:
- the scatter plot: each point represents a different observed value of y
- the fitted linear regression line (ŷ)
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
We expect the residuals to have zero mean and to be spread evenly around the x-axis with similar variance across the range of x.
import seaborn as sns
sns.residplot(x="highway-mpg", y="price", data=df)
A distribution plot compares the distribution of the predicted values with the distribution of the actual values. These plots are extremely useful for visualizing models with more than one independent variable or feature.
import seaborn as sns
# Note: distplot is deprecated in recent seaborn releases; kdeplot or displot
# can be used instead with similar arguments
ax1 = sns.distplot(df["price"], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Value", ax=ax1)
NumPy's polyfit function cannot perform multivariate polynomial regression. We use the preprocessing module in scikit-learn to create a polynomial-features object.
from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(x[['horsepower', 'curb-weight']])
As the dimensionality of the data gets larger, we may want to normalize multiple features at once. The scikit-learn preprocessing module simplifies this; for example, we can standardize several features simultaneously with StandardScaler.
from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[['horsepower', 'highway-mpg']])
x_scale = SCALE.transform(x_data[['horsepower', 'highway-mpg']])
We can simplify our code by using a Pipeline, which chains the preprocessing and modeling steps.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(degree=2)), ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y)
yhat = pipe.predict(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
Measures for In-Sample Evaluation
- A way to numerically determine how well the model fits the dataset
- Two important measures to determine the fit of a model:
- Mean Squared Error (MSE)
- R-squared (R2)
from sklearn.metrics import mean_squared_error
mean_squared_error(df['price'], Y_predict_simple_fit)
- The Coefficient of Determination or R-squared (R2)
- Is a measure to determine how close the data is to the fitted regression line.
- R2: the percentage of variation of the target variable (Y) that is explained by the linear model.
- think of it as comparing the regression model to a simple baseline model, i.e. the mean of the data points
R² = 1 − (MSE of the regression line) / (MSE of predicting with the average of the data)
- Generally the values of R² are between 0 and 1
- We can calculate the R2 as follows
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
lm.score(X, Y)  # 0.496591188
We can say that approximately 49.66% of the variation of price is explained by this simple linear model.
Training/Testing Sets
- Split dataset into:
- Training set (70%)
- Testing set (30%)
- Build and train the model with a training set
- Use testing set to assess the performance of a predictive model
- When we have completed testing our model, we should use all the data to retrain it and get the best performance (a split sketch follows below)
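A minimal sketch of the split with scikit-learn (the 30% test share mirrors the bullet above; x_data and y_data are assumed to hold the predictors and target):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0
)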
One of the most common out-of-sample evaluation techniques is cross-validation.
- In this method, the dataset is split into K equal groups.
- Each group is referred to as a fold. For example, four folds. Some of the folds can be used as a training set which we use to train the model and the remaining parts are used as a test set, which we use to test the model.
- For example, we can use three folds for training, then use one fold for testing. This is repeated until each partition is used for both training and testing.
- At the end, we use the average results as the estimate of out-of-sample error.
- The evaluation metric depends on the model, for example, the r squared.
The simplest way to apply cross-validation is to call the cross_val_score
function, which performs multiple out-of-sample evaluations.
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lr, x_data, y_data, cv=3)
np.mean(scores)
cross_val_predict():
- returns the prediction that was obtained for each element when it was in the test set
- has a similar interface to cross_val_score()
from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lr, x_data, y_data, cv=3)
Calculate different R-squared values as follows:
Rsqu_test = []
order = [1, 2, 3, 4]
for n in order:
    # Transform the single feature into polynomial features of degree n
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.fit_transform(x_test[['horsepower']])
    # Fit on the training data, then score (R²) on the test data
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))
Ridge regression is a regression technique employed in a multiple regression model when multicollinearity occurs. Multicollinearity means there is a strong relationship among the independent variables. Ridge regression is very common with polynomial regression.
In a table of fitted coefficients, the columns correspond to the different polynomial coefficients and the rows correspond to the different values of alpha.
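A minimal sketch of producing such a table (the feature columns and alpha values are illustrative assumptions):

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Second-degree polynomial features of a couple of predictors
pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(df[["horsepower", "curb-weight"]])

# Fit Ridge for several alpha values and collect the coefficients:
# one row per alpha, one column per polynomial coefficient
alphas = [0.001, 0.1, 1, 10, 100]
rows = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(x_poly, df["price"])
    rows.append(ridge.coef_)

coef_table = pd.DataFrame(rows, index=alphas)
print(coef_table)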
- As alpha increases, the parameters get smaller. This is most evident for the higher order polynomial features.
- But Alpha must be selected carefully.
- If alpha is too large, the coefficients will approach zero and underfit the data.
- If alpha is zero, the overfitting is evident.
- The term alpha in Ridge regression is called a hyperparameter.
- Scikit-learn has a means of automatically iterating over these hyperparameters using cross-validation called Grid Search.
Grid Search takes the model or objects you would like to train and different values of the hyperparameters. It then calculates the mean square error or R-squared for various hyperparameter values, allowing you to choose the best values.
Use the validation dataset to pick the best hyperparameters.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
RR = Ridge()
Grid1 = GridSearchCV(RR, parameters1, cv=4)
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
Grid1.best_estimator_
scores = Grid1.cv_results_
scores['mean_test_score']
One of the advantages of Grid Search is how quickly we can test multiple parameters.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Note: the 'normalize' parameter has been removed from Ridge in recent
# scikit-learn releases; on current versions, scale the features separately
# (e.g. with StandardScaler) instead of searching over it
parameters2 = [{'alpha': [0.001, 0.1, 1, 10, 100], 'normalize': [True, False]}]
RR = Ridge()
# return_train_score=True is needed for 'mean_train_score' to appear in cv_results_
Grid1 = GridSearchCV(RR, parameters2, cv=4, return_train_score=True)
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
Grid1.best_estimator_
scores = Grid1.cv_results_
for param, mean_test, mean_train in zip(scores['params'], scores['mean_test_score'], scores['mean_train_score']):
    print(param, "R^2 on test data:", mean_test, "R^2 on train data:", mean_train)