
Homework

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Solution: homework_3.ipynb

Dataset

In this homework, we will use the Car price dataset. Download it from here.

Or you can do it with wget:

wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

We'll keep working with the MSRP variable, and we'll transform it to a classification task.

Features

For the rest of the homework, you'll need to use only these columns:

  • Make,
  • Model,
  • Year,
  • Engine HP,
  • Engine Cylinders,
  • Transmission Type,
  • Vehicle Style,
  • highway MPG,
  • city mpg,
  • MSRP

Data preparation

  • Select only the features from above and transform their names using the next line:
    data.columns = data.columns.str.replace(' ', '_').str.lower()
    
  • Fill in the missing values of the selected features with 0.
  • Rename MSRP variable to price.
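The preparation steps above can be sketched in pandas. A tiny inline frame stands in for the real file here; in the homework you would load it with `pd.read_csv('data.csv')` instead:

```python
import pandas as pd

# Toy stand-in for data.csv; in the homework, use pd.read_csv('data.csv')
data = pd.DataFrame({
    'Make': ['BMW', 'Audi'],
    'Model': ['1 Series', 'A3'],
    'Year': [2011, 2013],
    'Engine HP': [335.0, None],
    'Engine Cylinders': [6.0, 4.0],
    'Transmission Type': ['MANUAL', 'AUTOMATIC'],
    'Vehicle Style': ['Coupe', 'Sedan'],
    'highway MPG': [26, 33],
    'city mpg': [19, 24],
    'MSRP': [46135, 30500],
})

# Keep only the required features
features = ['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
            'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg', 'MSRP']
data = data[features]

# Normalize column names: spaces -> underscores, everything lowercase
data.columns = data.columns.str.replace(' ', '_').str.lower()

# Fill missing values with 0 and rename msrp to price
data = data.fillna(0)
data = data.rename(columns={'msrp': 'price'})

print(data.columns.tolist())
```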

Question 1

What is the most frequent observation (mode) for the column transmission_type?

  • AUTOMATIC
  • MANUAL
  • AUTOMATED_MANUAL
  • DIRECT_DRIVE
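One way to get the mode is pandas' `Series.mode()`; the values below are a made-up sample, not the real distribution of the dataset:

```python
import pandas as pd

# Hypothetical sample; in the homework, use data.transmission_type
transmission_type = pd.Series(
    ['AUTOMATIC', 'MANUAL', 'AUTOMATIC', 'AUTOMATED_MANUAL', 'AUTOMATIC'])

# .mode() returns a Series because ties are possible; take the first entry
most_frequent = transmission_type.mode()[0]
print(most_frequent)
```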

Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

  • engine_hp and year
  • engine_hp and engine_cylinders
  • highway_mpg and engine_cylinders
  • highway_mpg and city_mpg
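A possible approach is `DataFrame.corr()` plus a scan for the largest off-diagonal entry. The numbers below are toy values, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy numeric frame; in the homework, select the numerical columns of the dataset
df = pd.DataFrame({
    'engine_hp': [100, 150, 200, 300],
    'engine_cylinders': [4, 4, 6, 8],
    'highway_mpg': [35, 30, 26, 20],
    'city_mpg': [28, 24, 19, 14],
})

# Pairwise Pearson correlations between all numerical columns
corr = df.corr()

# Zero the diagonal, then locate the largest absolute correlation
c = corr.values.copy()
np.fill_diagonal(c, 0)
i, j = np.unravel_index(np.abs(c).argmax(), c.shape)
print(corr.columns[i], corr.columns[j])
```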

Make price binary

  • Now we need to turn the price variable from numeric into a binary format.
  • Let's create a variable above_average which is 1 if the price is above its mean value and 0 otherwise.
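The binarization is a one-liner: compare each price to the mean and cast the boolean result to int. With hypothetical prices:

```python
import pandas as pd

# Hypothetical prices; in the homework this is the price column of the dataset
price = pd.Series([20000, 30000, 40000, 90000])

# 1 if above the mean price, 0 otherwise
above_average = (price > price.mean()).astype(int)
print(above_average.tolist())  # [0, 0, 0, 1] (mean is 45000)
```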

Split the data

  • Split your data in train/val/test sets with 60%/20%/20% distribution.
  • Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
  • Make sure that the target value (above_average) is not in your dataframe.
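One common way to get 60%/20%/20% with `train_test_split` is to split twice: first carve off 20% for test, then take 25% of the remaining 80% for validation (0.25 × 0.8 = 0.2). A minimal sketch on a toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared dataset
df = pd.DataFrame({'x': range(10), 'above_average': [0, 1] * 5})

# 60/20/20: first split off 20% for test, then 25% of the rest for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

# Keep the target out of the feature dataframes
y_train = df_train.above_average.values
y_val = df_val.above_average.values
df_train = df_train.drop(columns=['above_average'])
df_val = df_val.drop(columns=['above_average'])

print(len(df_train), len(df_val), len(df_test))  # 6 2 2
```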

Question 3

  • Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
  • Round the scores to 2 decimals using round(score, 2).

Which of these variables has the lowest mutual information score?

  • make
  • model
  • transmission_type
  • vehicle_style
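The score can be computed with `sklearn.metrics.mutual_info_score`, passing the categorical column and the target. A toy example with a perfectly informative feature (sklearn reports MI in nats, so the maximum for two balanced classes is ln 2 ≈ 0.69):

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy example: mutual information between a categorical feature and the target
df = pd.DataFrame({
    'transmission_type': ['MANUAL', 'MANUAL', 'AUTOMATIC', 'AUTOMATIC'],
    'above_average':     [0, 0, 1, 1],
})

score = mutual_info_score(df.transmission_type, df.above_average)
print(round(score, 2))
```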

Question 4

  • Now let's train a logistic regression.
  • Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
  • Fit the model on the training dataset.
    • To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    • model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
  • Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

  • 0.60
  • 0.72
  • 0.84
  • 0.95
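One way to one-hot encode the categorical columns is `DictVectorizer`, which encodes string values and passes numeric ones through. The frames below are toy stand-ins for the real train/validation splits:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy train/val frames; in the homework these come from the 60/20/20 split
df_train = pd.DataFrame({
    'transmission_type': ['MANUAL', 'AUTOMATIC', 'MANUAL', 'AUTOMATIC'],
    'engine_hp': [100.0, 300.0, 120.0, 350.0],
})
y_train = [0, 1, 0, 1]
df_val = pd.DataFrame({
    'transmission_type': ['MANUAL', 'AUTOMATIC'],
    'engine_hp': [110.0, 320.0],
})
y_val = [0, 1]

# One-hot encode categoricals (numerics pass through) via DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))

# Fixed parameters for reproducibility, as specified above
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_val, model.predict(X_val))
print(round(accuracy, 2))
```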

Question 5

  • Let's find the least useful feature using the feature elimination technique.
  • Train a model with all these features (using the same parameters as in Q4).
  • Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
  • For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

  • year
  • engine_hp
  • transmission_type
  • city_mpg

Note: the difference doesn't have to be positive.
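The elimination loop above can be sketched as follows. The data here is synthetic (from `make_classification`) with the candidate feature names attached purely for illustration; in the homework you would run this on the prepared car dataset with the full pipeline from Q4:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
df = pd.DataFrame(X, columns=['year', 'engine_hp', 'transmission_type', 'city_mpg'])
df_train, df_val, y_train, y_val = train_test_split(
    df, y, test_size=0.25, random_state=42)

def val_accuracy(columns):
    # Same parameters as in Q4
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(df_train[columns], y_train)
    return accuracy_score(y_val, model.predict(df_val[columns]))

base = val_accuracy(df.columns.tolist())

# Drop each feature in turn and record original accuracy minus reduced accuracy
diffs = {}
for feature in df.columns:
    rest = [c for c in df.columns if c != feature]
    diffs[feature] = base - val_accuracy(rest)

# Smallest difference in magnitude = least useful feature (can be negative)
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
print(least_useful, {f: round(d, 3) for f, d in diffs.items()})
```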

Question 6

  • For this question, we'll see how to use a linear regression model from Scikit-Learn.
  • We'll need to use the original column price. Apply the logarithmic transformation to this column.
  • Fit the Ridge regression model on the training data with the solver 'sag'. Set the seed to 42.
  • This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
  • Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

  • 0
  • 0.01
  • 0.1
  • 1
  • 10

Note: If there are multiple options, select the smallest alpha.
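The alpha sweep can be sketched as below. The data is a synthetic toy regression problem; in the homework the target is the log-transformed price (e.g. `np.log1p(df.price)`) and the features come from the earlier split:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression data; in the homework, y is the log-transformed price
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_val = rng.normal(size=(30, 3))
y_val = X_val @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

best_alpha, best_rmse = None, float('inf')
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha, solver='sag', random_state=42)
    model.fit(X_train, y_train)
    rmse = round(np.sqrt(mean_squared_error(y_val, model.predict(X_val))), 3)
    # Strict '<' keeps the smaller alpha on ties, since we iterate in ascending order
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

print(best_alpha, best_rmse)
```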

Submit the results

  • Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
  • You can submit your solution multiple times; in that case, only the last submission will be used.
  • If your answer doesn't match the options exactly, select the closest one.

Deadline

The deadline for submitting is 2 October (Monday), 23:00 CEST.

After that, the form will be closed.