Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
Solution: homework_3.ipynb
In this homework, we will use the Car price dataset. You can download it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```
We'll keep working with the `MSRP` variable, and we'll transform it into a classification task.
For the rest of the homework, you'll need to use only these columns:
- `Make`
- `Model`
- `Year`
- `Engine HP`
- `Engine Cylinders`
- `Transmission Type`
- `Vehicle Style`
- `highway MPG`
- `city mpg`
- `MSRP`
- Select only the features from above and transform their names using the next line:

```python
data.columns = data.columns.str.replace(' ', '_').str.lower()
```
- Fill in the missing values of the selected features with 0.
- Rename the `MSRP` variable to `price`.
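The preparation steps above can be sketched as follows; the two-row frame here is a made-up stand-in for `data.csv`, just to show the transformations:

```python
import pandas as pd

# Hypothetical mini-frame standing in for data.csv; the real file has many more rows.
df = pd.DataFrame({
    'Make': ['BMW', 'Audi'],
    'Model': ['1 Series', 'A3'],
    'Year': [2011, 2015],
    'Engine HP': [335.0, None],
    'Engine Cylinders': [6.0, 4.0],
    'Transmission Type': ['MANUAL', 'AUTOMATIC'],
    'Vehicle Style': ['Coupe', 'Sedan'],
    'highway MPG': [26, 34],
    'city mpg': [19, 25],
    'MSRP': [46135, 30795],
})

# Normalize column names: spaces -> underscores, lowercase.
df.columns = df.columns.str.replace(' ', '_').str.lower()

# Fill missing values with 0 and rename msrp -> price.
df = df.fillna(0)
df = df.rename(columns={'msrp': 'price'})

print(df.columns.tolist())
print(df['engine_hp'].tolist())  # the missing HP becomes 0.0
```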
What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`
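One way to get the mode, shown on a small made-up sample rather than the real column:

```python
import pandas as pd

# Hypothetical sample; in the homework, use the full transmission_type column.
s = pd.Series(['AUTOMATIC', 'MANUAL', 'AUTOMATIC', 'AUTOMATED_MANUAL'])

# mode() returns a Series (there can be ties), so take the first entry.
most_frequent = s.mode()[0]
print(most_frequent)  # AUTOMATIC
```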
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
What are the two features that have the biggest correlation in this dataset?
- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`
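One way to find the strongest pair, sketched on a toy numeric frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy numeric frame; in the homework this would be the numeric columns of df.
num = pd.DataFrame({
    'engine_hp': [100, 150, 200, 300],
    'engine_cylinders': [4, 4, 6, 8],
    'highway_mpg': [35, 30, 26, 20],
})

# Pairwise Pearson correlations between every numeric column.
corr = num.corr()

# Flatten, drop the diagonal (self-correlation == 1), and take the
# pair with the largest absolute correlation.
pairs = corr.abs().unstack()
pairs = pairs[pairs < 1.0]
print(pairs.idxmax(), round(pairs.max(), 2))
```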
- Now we need to turn the `price` variable from numeric into a binary format.
- Let's create a variable `above_average`, which is `1` if the `price` is above its mean value and `0` otherwise.
- Split your data in train/val/test sets with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
- Make sure that the target value (`above_average`) is not in your dataframe.
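The 60%/20%/20% split can be done with two calls to `train_test_split`; the tiny frame below is only for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; the homework uses the prepared car-price dataframe.
df = pd.DataFrame({'x': range(10), 'above_average': [0, 1] * 5})

# 60/20/20: first carve off 20% for test, then 25% of the remainder
# for validation (0.25 * 0.8 == 0.2 of the full data).
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

# Pull out the target and drop it from the feature frame.
y_train = df_train['above_average'].values
df_train = df_train.drop(columns=['above_average'])

print(len(df_train), len(df_val), len(df_test))  # 6 2 2
```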
- Calculate the mutual information score between `above_average` and the other categorical variables in our dataset. Use the training set only.
- Round the scores to 2 decimals using `round(score, 2)`.
Which of these variables has the lowest mutual information score?
- `make`
- `model`
- `transmission_type`
- `vehicle_style`
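A sketch of the computation with `sklearn.metrics.mutual_info_score`, using made-up columns in place of the training set:

```python
from sklearn.metrics import mutual_info_score

# Hypothetical target and categorical column; in the homework, loop
# over all categorical columns of the training set.
above_average = [1, 1, 0, 0, 1, 0]
transmission = ['MANUAL', 'MANUAL', 'AUTOMATIC', 'AUTOMATIC', 'MANUAL', 'AUTOMATIC']

score = round(mutual_info_score(above_average, transmission), 2)
print(score)  # 0.69 -- perfectly dependent columns share ln(2) nats
```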
- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
- To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
```python
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
```
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
What accuracy did you get?
- 0.60
- 0.72
- 0.84
- 0.95
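The one-hot encoding plus logistic regression pipeline might look like this; `DictVectorizer` is one common way to do the encoding, and the toy train/validation frames are made up:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy train/validation frames; the homework uses the real splits.
train = pd.DataFrame({'transmission_type': ['MANUAL', 'AUTOMATIC'] * 4,
                      'engine_hp': [100, 300] * 4})
y_train = [0, 1] * 4
val = pd.DataFrame({'transmission_type': ['MANUAL', 'AUTOMATIC'],
                    'engine_hp': [110, 290]})
y_val = [0, 1]

# One-hot encode categoricals (numerics pass through unchanged).
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train.to_dict(orient='records'))
X_val = dv.transform(val.to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

acc = round(accuracy_score(y_val, model.predict(X_val)), 2)
print(acc)
```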
- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
Which of the following features has the smallest difference?
- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`
Note: the difference doesn't have to be positive
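The elimination loop can be sketched like this; the data is a toy stand-in, and `fit_acc` is a helper name introduced here, not part of the homework:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy training data; the homework runs this on the real feature set from Q4.
features = ['year', 'engine_hp', 'city_mpg']
train = pd.DataFrame({'year': [2010, 2015, 2012, 2017] * 3,
                      'engine_hp': [100, 300, 150, 250] * 3,
                      'city_mpg': [25, 15, 22, 18] * 3})
y_train = [0, 1, 0, 1] * 3

def fit_acc(cols):
    # Train with the same parameters as Q4 and score accuracy.
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(train[cols], y_train)
    return accuracy_score(y_train, model.predict(train[cols]))

base = fit_acc(features)
# Drop one feature at a time and record the accuracy difference.
diffs = {f: base - fit_acc([c for c in features if c != f]) for f in features}
print(diffs)  # the smallest difference marks the least useful feature
```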
- For this question, we'll see how to use a linear regression model from Scikit-Learn.
- We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model on the training data with the solver `'sag'`. Set the seed to `42`.
- This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
- Round your RMSE scores to 3 decimal digits.
Which of these alphas leads to the best RMSE on the validation set?
- 0
- 0.01
- 0.1
- 1
- 10
Note: If there are multiple options, select the smallest `alpha`.
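A sketch of the alpha search; the synthetic regression data below stands in for the real features and the `log(price)` target:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for the real train/validation sets.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(80, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)
X_val = rng.normal(size=(20, 3))
y_val = X_val @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

scores = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha, solver='sag', random_state=42)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    scores[alpha] = round(rmse, 3)

# Ties go to the smallest alpha: min() returns the first minimum,
# and the dict preserves the ascending insertion order.
best = min(scores, key=scores.get)
print(scores, best)
```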
- Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
- You can submit your solution multiple times. In this case, only the last submission will be used.
- If your answer doesn't match options exactly, select the closest one
The deadline for submitting is 2 October (Monday), 23:00 CEST.
After that, the form will be closed.