scikit-learn-cheat-sheet

A compilation of main commands for scikit-learn with examples. Inspired by https://inria.github.io/scikit-learn-mooc/index.html.

1. Numerical data preprocessing

Standardizes data by removing the mean and scaling to unit variance.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaler.transform(data)

Transforms the data so that its values fall within the given range (by default [0, 1]).

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data)
scaler.transform(data)

Rescales each sample (i.e. each row of the data matrix) with at least one non-zero component independently of the other samples so that its norm (l1, l2, or max) equals one.

from sklearn.preprocessing import Normalizer
transformer = Normalizer()
transformer.fit(data)
transformer.transform(data)

Binarizes data (sets feature values to 0 or 1) according to a threshold.

from sklearn.preprocessing import Binarizer
transformer = Binarizer().fit(data)
transformer.transform(data)

Replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value. Parameters: missing_values specifies which values are treated as missing; strategy specifies what to replace them with.

import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data)
imputer.transform(data)
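
For illustration, a minimal sketch (with a small toy array assumed here) showing how the column means fill in the missing entries:

import numpy as np
from sklearn.impute import SimpleImputer

# toy array with one missing value per column (illustrative only)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer.fit_transform(X))  # NaNs are replaced by the column means 4.0 and 2.5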

Generates polynomial and interaction features. Parameters: degree specifies the maximum degree of the polynomial features.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly.fit_transform(data)
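
As a quick illustration (toy two-feature input assumed), degree=2 expands a sample [a, b] into 1, a, b, a^2, ab, b^2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])               # one sample with features a=2, b=3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))         # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out())  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2'] (scikit-learn >= 1.0)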

2. Encoding

OrdinalEncoder will encode each category with a different number.

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
data_encoded = encoder.fit_transform(data)

For a given feature, OneHotEncoder will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to the category will be set to 1 while all the columns of the other categories will be set to 0.

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
data_encoded = encoder.fit_transform(data)
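
A minimal sketch (toy DataFrame assumed) of the one-column-per-category behaviour; note that fit_transform returns a sparse matrix by default:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "red"]})  # toy categorical column
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df)   # sparse matrix by default
print(encoder.categories_)            # [array(['blue', 'red'], dtype=object)]
print(encoded.toarray())              # one column per category, a single 1 per row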

3. Column selection and transformation

Selects columns based on data type or column name.

Applies specific transformations to a subset of columns in the data.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, categorical_columns)],
    remainder='passthrough', sparse_threshold=0)
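
A slightly fuller sketch (transformer names chosen here for illustration) that scales the numerical columns and one-hot encodes the categorical ones in a single preprocessor, assuming data is a DataFrame as elsewhere in this sheet:

from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_columns = selector(dtype_include=object)(data)
numerical_columns = selector(dtype_exclude=object)(data)

preprocessor = ColumnTransformer([
    ('one_hot', OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ('scaler', StandardScaler(), numerical_columns)])
data_preprocessed = preprocessor.fit_transform(data)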

4. Pipelines

Allows you to construct a pipeline: a chain of transformers followed by a final estimator, executed sequentially.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)
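
A hedged usage sketch (assuming data and target as in the rest of this sheet): the pipeline is fitted on a training split and scored on a held-out test split, with every step applied in order:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500))
model.fit(data_train, target_train)         # fits the encoder, then the classifier
print(model.score(data_test, target_test))  # accuracy of the final estimator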

Allows you to visualize pipelines in Jupyter; this only needs to be set once at the beginning of your notebook.

from sklearn import set_config
set_config(display="diagram")

5. Model training

Split arrays or matrices into random train and test subsets.

from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

Allows you to see how model performance changes with the size of the training set.

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()

from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=30, test_size=0.2)

from sklearn.model_selection import learning_curve

train_sizes = [0.3, 0.6, 0.9]
results = learning_curve(
    regressor, data, target, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error", n_jobs=2)
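
A hedged sketch of how the returned values can be unpacked and summarized; with this scorer the scores are negated errors, so the sign is flipped:

train_sizes_abs, train_scores, test_scores = results
train_errors = -train_scores.mean(axis=1)   # average over the 30 CV splits
test_errors = -test_scores.mean(axis=1)
print(train_sizes_abs, train_errors, test_errors)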

6. Metrics

Mean squared error regression loss.

from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

Computes the precision. The average parameter is required for multiclass/multilabel targets.

from sklearn.metrics import precision_score
precision_score(y_true, y_pred, average='macro')

Computes the recall. The average parameter is required for multiclass/multilabel targets.

from sklearn.metrics import recall_score
recall_score(y_true, y_pred, average='macro')

Computes the balanced accuracy for binary and multiclass classification problems, useful for imbalanced datasets. It is defined as the average of the recall obtained on each class.

from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_true, y_pred)

Computes the confusion matrix to evaluate the accuracy of a classification; labels defines the order of the class labels in the matrix.

from sklearn.metrics import confusion_matrix
labels = ["a", "b", "c"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

Confusion Matrix visualization.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels=["a", "b", "c"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()

Computes the Receiver Operating Characteristic (ROC) curve. The pos_label parameter defines the label of the positive class.

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1)

Compute Area Under the Curve (AUC).

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1)
roc_auc = auc(fpr, tpr)

ROC Curve visualization.

from sklearn.metrics import roc_curve, auc, RocCurveDisplay
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1)
roc_auc = auc(fpr, tpr)
disp = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                       estimator_name='example estimator')
disp.plot()
plt.show()

Computes precision-recall pairs for different probability thresholds.

from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

Precision-Recall visualization.

from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
import matplotlib.pyplot as plt
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
disp = PrecisionRecallDisplay(precision=precision, recall=recall)
disp.plot()
plt.show()

7. Parameter tuning

Exhaustive search over specified parameter values for an estimator.

from sklearn.model_selection import GridSearchCV
param_grid = {
    'parameter_A': (0.01, 0.1, 1, 10),
    'parameter_B': (3, 10, 30)}
model_grid_search = GridSearchCV(model, param_grid=param_grid,
                                 n_jobs=2, cv=2)
model_grid_search.fit(data, target)
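
After fitting, the best parameter combination and its cross-validated score can be inspected (a small sketch; parameter_A and parameter_B above are placeholders):

print(model_grid_search.best_params_)            # best parameter combination found
print(model_grid_search.best_score_)             # its mean cross-validated score
best_model = model_grid_search.best_estimator_   # refitted on the whole data by default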

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    'parameter_A': (0.01, 0.1, 1, 10),
    'parameter_B': (3, 10, 30)}
model_random_search = RandomizedSearchCV(
    model, param_distributions=param_grid, n_iter=10,
    cv=5, verbose=1,
)
model_random_search.fit(data, target)
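
Because the values are sampled, continuous distributions from scipy.stats can be passed instead of fixed lists; a sketch with the same placeholder parameter names:

from scipy.stats import loguniform, randint
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'parameter_A': loguniform(1e-2, 1e1),  # sampled log-uniformly between 0.01 and 10
    'parameter_B': randint(3, 31)}         # sampled uniformly from the integers 3..30
model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions, n_iter=10, cv=5)
model_random_search.fit(data, target)
print(model_random_search.best_params_)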

8. Model selection

Evaluates metric(s) by cross-validation and also records fit/score times. The scoring parameter defines which metric(s) are computed on each fold. The cv parameter accepts any splitting strategy: k-fold, stratified, etc.

from sklearn.model_selection import cross_validate
cv_results = cross_validate(
    model, data, target, cv=5, scoring="neg_mean_absolute_error")
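
The returned dictionary holds one entry per fold; a hedged sketch of reading the test scores back (the sign is flipped because this scorer returns negated errors):

import pandas as pd

cv_results = pd.DataFrame(cv_results)   # columns: fit_time, score_time, test_score
test_errors = -cv_results["test_score"]
print(test_errors.mean(), test_errors.std())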

Equivalent to calling the cross_validate function and keeping only the test scores.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, data, target)

Determine training and test scores for varying parameter values.

from sklearn.model_selection import validation_curve
param_A = [1, 5, 10, 15, 20, 25]
train_scores, test_scores = validation_curve(
    model, data, target, param_name="param_A", param_range=param_A,
    cv=cv, scoring="neg_mean_absolute_error", n_jobs=2)

K-Folds cross-validator.

from sklearn.model_selection import KFold
cv = KFold(n_splits=2)
cv.get_n_splits(data)

Random permutation cross-validator.

from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, random_state=0)
cv.get_n_splits(data)

Stratified K-Folds cross-validator, generates test sets such that all contain the same distribution of classes, or as close as possible.

from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)
cv.get_n_splits(data, target)

K-fold iterator variant with non-overlapping groups: each group appears exactly once in the test set across all folds. groups should be an array of the same length as data, indicating for each row which group it belongs to.

from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=2)
cv.get_n_splits(data, target, groups=groups)
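
A self-contained sketch (toy arrays assumed) showing that a group is never split across the training and test sets:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1], [2], [3], [4]])
y = np.array([0, 0, 1, 1])
groups = np.array(["patient_1", "patient_1", "patient_2", "patient_2"])  # hypothetical group labels

cv = GroupKFold(n_splits=2)
for train_index, test_index in cv.split(X, y, groups=groups):
    print(train_index, test_index)  # each patient appears on only one side of the split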

Time Series cross-validator: provides train/test indices to split time series samples that are observed at fixed time intervals.

from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=2)
cv.get_n_splits(data, target)

Leave One Group Out cross-validator: provides train/test indices such that each training set consists of all samples except those belonging to one specific group. groups should be an array of the same length as data, indicating for each row which group it belongs to.

from sklearn.model_selection import LeaveOneGroupOut
cv = LeaveOneGroupOut()
cv.get_n_splits(data, target, groups=groups)

9. Dummy models

Predicts the same value based on a (simple) rule without using training features. strategy can be {“mean”, “median”, “quantile”, “constant”}.

from sklearn.dummy import DummyRegressor
model = DummyRegressor(strategy="mean")
model.fit(data, target)

Predicts the same class based on a (simple) rule without using training features. strategy can be {“most_frequent”, “prior”, “stratified”, “uniform”, “constant”}.

from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy="most_frequent")
model.fit(data, target)
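
Dummy models are typically used as a sanity-check baseline; a hedged sketch (assuming the train/test splits from section 5) comparing a dummy classifier's accuracy with a real model's:

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

dummy = DummyClassifier(strategy="most_frequent").fit(data_train, target_train)
model = LogisticRegression(max_iter=500).fit(data_train, target_train)
print(dummy.score(data_test, target_test))  # baseline accuracy to beat
print(model.score(data_test, target_test))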

10. Linear models

Ordinary least squares Linear Regression.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(data, target)

Linear least squares with l2 regularization. The alpha parameter defines the regularization strength.

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(data, target)

Ridge regression with built-in cross-validation. alphas defines the array of alpha values to try.

from sklearn.linear_model import RidgeCV
model = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1])
model.fit(data, target)
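
After fitting, the alpha selected by cross-validation can be read back (a small sketch):

print(model.alpha_)  # regularization strength chosen by cross-validation
print(model.coef_)   # coefficients of the model refitted with that alpha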

Logistic Regression classifier. The penalty parameter is l2 by default; C defines the inverse of the regularization strength and must be a positive float.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0)
model.fit(data, target)

11. kNN

Regression based on k-nearest neighbors. n_neighbors defines the number of neighbors to use.

from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=2)
model.fit(data, target)

Classification based on k-nearest neighbors. n_neighbors defines the number of neighbors to use.

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=2)
model.fit(data, target)

12. Tree models

A decision tree regressor. max_depth defines the maximum depth of a tree.

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=2)
model.fit(data, target)

A decision tree classifier. max_depth defines the maximum depth of a tree.

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2)
model.fit(data, target)
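
A fitted tree can be visualized with plot_tree (a hedged sketch, reusing the classifier fitted above):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(model, filled=True)  # model is the fitted DecisionTreeClassifier from above
plt.show()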

13. Ensemble models

Gradient Boosting trees for regression. max_depth defines the maximum depth of a tree, learning_rate defines the "contribution" of each tree, n_estimators controls the number of trained trees.

from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(max_depth=2, learning_rate=1.0, n_estimators=100)
model.fit(data, target)

Gradient Boosting for classification. max_depth defines the maximum depth of a tree, learning_rate defines the "contribution" of each tree, n_estimators controls the number of trained trees.

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(max_depth=2, learning_rate=1.0, n_estimators=100)
model.fit(data, target)

A random forest regressor. max_depth defines the maximum depth of a tree, n_estimators controls the number of trained trees.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(max_depth=2, n_estimators=100)
model.fit(data, target)

A random forest classifier. max_depth defines the maximum depth of a tree, n_estimators controls the number of trained trees.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=2, n_estimators=100)
model.fit(data, target)
