BUSINESS PROBLEM: explore a hotel booking dataset and predict cancellation

STAGE 1: preprecessing and EDA

Load Data and Explore

import pandas as pd
df_hotel = pd.read_csv('~/Downloads/hotel_bookings.csv')
df_hotel.head(1)

	hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	...	deposit_type	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date
0	Resort Hotel	0	342	2015	July	27	1	0	0	2	...	No Deposit	NaN	NaN	0	Transient	0.0	0	0	Check-Out	2015-07-01

1 rows × 32 columns

# Check Columns
df_hotel.columns

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

CONSIDERATIONS: Our target feature to predict is 'is canceled'. We should treat and select all other features. First we need to explore it.

# Check missing values
import numpy as np
import pandas as pd

def missing_zero_values_table(df):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(df_hotel)

Your selected dataframe has 32 columns and 119390 Rows.
There are 4 columns that have missing values.

	Zero Values	Missing Values	% of Total Values	Total Zero Missing Values	% Total Zero Missing Values	Data Type
company	0	112593	94.3	112593	94.3	float64
agent	0	16340	13.7	16340	13.7	float64
country	0	488	0.4	488	0.4	object
children	110796	4	0.0	110800	92.8	float64

CONSIDERATIONS: country and children doesn't have many missing values ('0' in children means it's not a missing value); our problem is with company and agent, we should explore and treat it.

# count of id companies when made reservation
len(df_hotel['company'].unique())

# Aggregate and check distribution from the variable
df_company_agg = df_hotel.groupby('company')['company'].count().sort_values(ascending = False)
df_company_agg.head(5)

company
40.0     927
223.0    784
67.0     267
45.0     250
153.0    215
Name: company, dtype: int64

CONSIDERATIONS: we've too many missing values for company; probably those missing values means that the reservation weren't scheduled by any company; it would be better to create a boolean feature to detect if it's a company reservation or not.

import numpy as np
# Create a new boolean feature considering if it's a company reservation or not
df_hotel['is_company'] = np.where(df_hotel['company'].isna(), False, True)
# Drop company column as we don't need it anymore and we can get rid from nan values
df_hotel.drop(['company'], axis=1)
# check the number of reservations by companies
df_hotel['is_company'].sum()

# Drop old column to clean our data (it will help plotting correlation matrix and fitting the model)
df_hotel = df_hotel.drop(['company'], axis=1)

CONSIDERATIONS: Most cancelations are from ONLINE booking and GROUPS. Direct bookings are rarely cancelled.

# Create a new boolean feature considering if it's a reservation made by an agent or not
df_hotel['is_agent'] = np.where(df_hotel['agent'].isna(), False, True)
df_hotel['is_agent'].sum()

CONSIDERATIONS: booking by agents are much more common than by companies; but probably direct reservations are also important.

# Aggegate and check which market segments cancels more by type of hotel
df_hotel.groupby(['market_segment', 'is_canceled'])['hotel'].count()

market_segment  is_canceled
Aviation        0                185
                1                 52
Complementary   0                646
                1                 97
Corporate       0               4303
                1                992
Direct          0              10672
                1               1934
Groups          0               7714
                1              12097
Offline TA/TO   0              15908
                1               8311
Online TA       0              35738
                1              20739
Undefined       1                  2
Name: hotel, dtype: int64

CONSIDERATIONS: Most cancelations are from ONLINE booking and GROUPS.

# Check types of reservation status by cancelation type
df_hotel.groupby(['reservation_status', 'is_canceled'])['hotel'].count()

reservation_status  is_canceled
Canceled            1              43017
Check-Out           0              75166
No-Show             1               1207
Name: hotel, dtype: int64

Transform Data

# Correct datetime
df_hotel[['reservation_status_date']] = df_hotel[['reservation_status_date']].astype('datetime64[ns]')

# Correct all categorical (except our target 'is_canceled') features including booleans (we can't use bool type or target variable as categorical to fit the model)
df_hotel[["hotel", "meal", "country", 
      "market_segment", "distribution_channel", 
      "reserved_room_type", "assigned_room_type", 
      "deposit_type", "customer_type", 
      "reservation_status"]] = df_hotel[["hotel", "meal", "country", "market_segment", 
                                       "distribution_channel", "reserved_room_type", "assigned_room_type", 
                                       "deposit_type", "customer_type", 
                                       "reservation_status",]].astype('category')

df_hotel['is_repeated_guest'] = df_hotel['is_repeated_guest'].astype(bool)



# Fill missing values from agent column and transform it to a categorical column
from statistics import mode
df_hotel[["agent"]] = df_hotel[["agent"]].astype(pd.Int32Dtype())
df_hotel_list = df_hotel[["agent"]].values.tolist()
def flatten(t):
    return [item for sublist in t for item in sublist]
flat_hotel = flatten(df_hotel_list)
flat_mode = mode(flat_hotel)
df_hotel[["agent"]] = df_hotel[["agent"]].fillna(flat_hotel)
df_hotel[["agent"]] = df_hotel[["agent"]].astype('int64')

# Fill na's from float column with the median
df_hotel['children'] = df_hotel['children'].fillna(df_hotel['children'].median()).astype('int64')

# Drop nan's rows from columns with inexpressive nan's count
df_hotel = df_hotel.dropna(subset = ['country', 'children'], axis = 0)

df_hotel.dtypes

hotel                                   category
is_canceled                                int64
lead_time                                  int64
arrival_date_year                          int64
arrival_date_month                        object
arrival_date_week_number                   int64
arrival_date_day_of_month                  int64
stays_in_weekend_nights                    int64
stays_in_week_nights                       int64
adults                                     int64
children                                   int64
babies                                     int64
meal                                    category
country                                 category
market_segment                          category
distribution_channel                    category
is_repeated_guest                           bool
previous_cancellations                     int64
previous_bookings_not_canceled             int64
reserved_room_type                      category
assigned_room_type                      category
booking_changes                            int64
deposit_type                            category
agent                                      int64
days_in_waiting_list                       int64
customer_type                           category
adr                                      float64
required_car_parking_spaces                int64
total_of_special_requests                  int64
reservation_status                      category
reservation_status_date           datetime64[ns]
is_company                                  bool
is_agent                                    bool
dtype: object

CONSIDERATION 1: We could have explored more our columns to set which could be ordinal encoded (such as months or year) but the time it would consume could be too much and it's not guaranteed it would help us;
CONSIDERATION 2: We could also have done one hot encoding for our categorical columns, but expand our dataframe and it would be more expensive to train the model or scale it if we would deploy based on big data;
CONSIDERATION 3: We could have predicted our missing values with a regression model, but in our case it wasn't necessary;
CONSIDERATION 4: We just need to treat datetimes and we are ready to fit for modelling! Let's create new features!

# Create a new feature which corrects datetime from reservation_status_date
df_hotel[['reservation_status_date_dt']] = df_hotel[['reservation_status_date']].astype('datetime64[ns]')

# Create a new feature which contains the full date of arrival and convert it to datetime
df_hotel['arrivel_date'] = df_hotel['arrival_date_year'].astype(str) + '-' + df_hotel['arrival_date_month'] + '-' + df_hotel['arrival_date_day_of_month'].astype(str)
df_hotel[['arrivel_date_dt']] = df_hotel[['arrivel_date']].astype('datetime64[ns]')

#Check transformation
df_hotel[['arrivel_date','arrivel_date_dt', 'reservation_status_date', 'reservation_status_date_dt']].tail(5)

	arrivel_date	arrivel_date_dt	reservation_status_date	reservation_status_date_dt
119385	2017-August-30	2017-08-30	2017-09-06	2017-09-06
119386	2017-August-31	2017-08-31	2017-09-07	2017-09-07
119387	2017-August-31	2017-08-31	2017-09-07	2017-09-07
119388	2017-August-31	2017-08-31	2017-09-07	2017-09-07
119389	2017-August-29	2017-08-29	2017-09-07	2017-09-07

from datetime import timedelta
# Create a datetime feature which tells how many days before the reservations was cancelled and check if it matches
df_hotel['lead_time_dt'] = df_hotel['lead_time'].apply(lambda x: timedelta(days = x))
df_hotel['booking_date'] = df_hotel['arrivel_date_dt'] - df_hotel['lead_time_dt']
df_hotel[['booking_date', 'arrivel_date', 'lead_time', 'lead_time_dt']].head(5)

	booking_date	arrivel_date	lead_time	lead_time_dt
0	2014-07-24	2015-July-1	342	342 days
1	2013-06-24	2015-July-1	737	737 days
2	2015-06-24	2015-July-1	7	7 days
3	2015-06-18	2015-July-1	13	13 days
4	2015-06-17	2015-July-1	14	14 days

# Drop old columns after check
# df_hotel = df_hotel.drop(['arrivel_date', 'lead_time', 'reservation_status_date'], axis=1)

# Create a new feature to identify HIGH SEASON BY COUNTRY and if the reservation was made in high or low season
df_hotel['n_rows_country'] = df_hotel.groupby(['country'])['country'].transform('count')
# Filter by most relevant countries
df_hotel['grouped_country'] = np.where(df_hotel['n_rows_country'] < 5000, 'other', df_hotel['country'])
table = df_hotel.groupby(['grouped_country', 'arrival_date_year', 'arrival_date_month']).agg(n_canceled = pd.NamedAgg('is_canceled', 'sum'), n_total = pd.NamedAgg('is_canceled', 'count')).reset_index()
table['country_year_total'] = (table.groupby(['grouped_country', 'arrival_date_year'])['n_total'].transform('sum')) / 12
# Set season by trheshold
table['season_index'] = table['n_total'] / table['country_year_total']
table['high_season'] = np.where(table['season_index'] < 1, 0, 1)
table_agg = table.sort_values(['grouped_country', 'arrival_date_year']).groupby(['grouped_country', 'arrival_date_month']).agg(n_high = pd.NamedAgg('high_season', 'sum'), n_total = pd.NamedAgg('high_season', 'count')).reset_index()
table_agg['high_season'] = np.where(table_agg['n_high'] >= table_agg['n_total'], 1, 0)
table_agg = table_agg[['grouped_country', 'arrival_date_month', 'high_season']]
df_hotel['grouped_country'] = df_hotel['grouped_country'].astype('category')
# Merge table created to define high season with original table
df_hotel = df_hotel.merge(table_agg, how = 'left', on = ['grouped_country', 'arrival_date_month'])
df_hotel

	hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	...	is_company	is_agent	reservation_status_date_dt	arrivel_date	arrivel_date_dt	lead_time_dt	booking_date	n_rows_country	grouped_country	high_season
0	Resort Hotel	0	342	2015	July	27	1	0	0	2	...	False	False	2015-07-01	2015-July-1	2015-07-01	342 days	2014-07-24	48590	PRT	0
1	Resort Hotel	0	737	2015	July	27	1	0	0	2	...	False	False	2015-07-01	2015-July-1	2015-07-01	737 days	2013-06-24	48590	PRT	0
2	Resort Hotel	0	7	2015	July	27	1	0	1	1	...	False	False	2015-07-02	2015-July-1	2015-07-01	7 days	2015-06-24	12129	GBR	1
3	Resort Hotel	0	13	2015	July	27	1	0	1	1	...	False	True	2015-07-02	2015-July-1	2015-07-01	13 days	2015-06-18	12129	GBR	1
4	Resort Hotel	0	14	2015	July	27	1	0	2	2	...	False	True	2015-07-03	2015-July-1	2015-07-01	14 days	2015-06-17	12129	GBR	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
118897	City Hotel	0	23	2017	August	35	30	2	5	2	...	False	True	2017-09-06	2017-August-30	2017-08-30	23 days	2017-08-07	2342	other	1
118898	City Hotel	0	102	2017	August	35	31	2	5	3	...	False	True	2017-09-07	2017-August-31	2017-08-31	102 days	2017-05-21	10415	FRA	1
118899	City Hotel	0	34	2017	August	35	31	2	5	2	...	False	True	2017-09-07	2017-August-31	2017-08-31	34 days	2017-07-28	7287	DEU	1
118900	City Hotel	0	109	2017	August	35	31	2	5	2	...	False	True	2017-09-07	2017-August-31	2017-08-31	109 days	2017-05-14	12129	GBR	0
118901	City Hotel	0	205	2017	August	35	29	2	7	2	...	False	True	2017-09-07	2017-August-29	2017-08-29	205 days	2017-02-05	7287	DEU	1

118902 rows × 41 columns

df_hotel['grouped_country'] = df_hotel['grouped_country'].astype('category')
df_hotel['grouped_country'].dtypes

CategoricalDtype(categories=['DEU', 'ESP', 'FRA', 'GBR', 'PRT', 'other'], ordered=False)

# BONUS: create new boolean feature which tells if it's deposit or not since it looks redundant
df_hotel['is_deposit'] = np.where(df_hotel['deposit_type'] == 'No Deposit', 0, 1)
df_hotel['deposit_type'].unique()
df_hotel['deposit_type'].value_counts()

No Deposit    104167
Non Refund     14573
Refundable       162
Name: deposit_type, dtype: int64

# drop old column to help correlation plot and fitting the model
df_hotel = df_hotel.drop(['is_deposit', 'n_rows_country'], axis=1)

Plot, explore, transform and interpret

import seaborn as sns
sns.displot(df_hotel, x="is_canceled")

Imbalanced distribution of our target variable.
We could have even better models if we would have a more data-centric approach and do more feature engineering.

sns.clustermap(df_hotel.corr())

Correlation and dependencies between our features

import seaborn as sns
sns.boxplot(x = df_hotel['lead_time'] + 1)

CONSIDERATION 1: boxplotting we can see that most bookings (our interquartile interval) is somewhere in between 1 and 180 days; above 400 are our outliers;
CONSIDERATION 2: boxplotting we can see that most bookings (our interquartile interval) is somewhere in between 1 and 180 days; above 400 are our outliers;
CONSIDERATION 3: we can see that most reservations are somewhere in between 80 and 180 days before;
CONSIDERATION 4: we can see that we still have a considerable ammount of bookings somewehere in between 1 and 10;

df_hotel['log_lead_time'] = np.log(df_hotel['lead_time'] + 1)
sns.histplot(df_hotel['lead_time'] + 1, log_scale = True)

CONSIDERATION 1: On our histogram we've observed the same as before, a there's a considerable ammount of bookings betwen day 1 and 10;
CONSIDERATION 2: On our histogram we've observed an odd behaviour ('non-linearity'), a lot of bookings are on day 1;
CONSIDERATION 3: That means we've two populations and distributions, which we can use to create a new feature: who books at the last minute and who schedule;

# Creating new feature: reservation_type
df_hotel['reservation_type'] = np.where((df_hotel['lead_time'] + 1) < 10, 'last_minute', 'scheduled')
df_hotel.groupby(['reservation_type', 'is_canceled'])['reservation_type'].count()

reservation_type  is_canceled
last_minute       0              18450
                  1               2102
scheduled         0              56295
                  1              42055
Name: reservation_type, dtype: int64

enc_minute = {'last_minute': True, 'scheduled': False}
df_hotel['is_last_minute'] = df_hotel['reservation_type'].map(enc_minute).astype(bool)

df_hotel.dtypes

hotel                                    category
is_canceled                                 int64
lead_time                                   int64
arrival_date_year                           int64
arrival_date_month                         object
arrival_date_week_number                    int64
arrival_date_day_of_month                   int64
stays_in_weekend_nights                     int64
stays_in_week_nights                        int64
adults                                      int64
children                                    int64
babies                                      int64
meal                                     category
country                                  category
market_segment                           category
distribution_channel                     category
is_repeated_guest                            bool
previous_cancellations                      int64
previous_bookings_not_canceled              int64
reserved_room_type                       category
assigned_room_type                       category
booking_changes                             int64
deposit_type                             category
agent                                       int64
days_in_waiting_list                        int64
customer_type                            category
adr                                       float64
required_car_parking_spaces                 int64
total_of_special_requests                   int64
reservation_status                       category
reservation_status_date            datetime64[ns]
is_company                                   bool
is_agent                                     bool
reservation_status_date_dt         datetime64[ns]
arrivel_date                               object
arrivel_date_dt                    datetime64[ns]
lead_time_dt                      timedelta64[ns]
booking_date                       datetime64[ns]
grouped_country                          category
high_season                                 int64
log_lead_time                             float64
reservation_type                           object
is_last_minute                               bool
dtype: object

import seaborn as sns
import matplotlib.pyplot as plt
#Using Pearson Correlation
plt.figure(figsize=(22,20))
cor = df_hotel.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

LIST OF RELEVANT CORRELATIONS TO A BUSINESS SPECIALIST INTERPRET:
       1) Is canceled with lead time, special request, parking spaces and previous cancellations; some with booking changes;
       2) Lead time with agent, days in waiting list, high season, adults and stays in week nights;
       3) Arrival date year with average daily rate;
       4) Adults and number of children with average daily rate;
       5) Is repeated guest with previous booking not canceled, is company, average daily rate;
       6) Previous booking not canceled with is company;
       7) Average daily rate with number of children, adults, special requests;
       8) Is company with previous booking not cancelled, is repeated guest, adult;
       9) Is Last minute booking is related to stays in week nights;

STAGE 2: Comparing two models

Linear model: logistic regression

import statsmodels.formula.api as fsm
model = fsm.logit(formula = 'is_canceled ~ log_lead_time' , data = df_hotel)
fit = model.fit()
fit.summary()

df_hotel['pred_baseline'] = fit.predict()
sns.scatterplot(data = df_hotel, x = 'lead_time', y = 'pred_baseline')

Optimization terminated successfully.
         Current function value: 0.602606
         Iterations 6

import matplotlib.pyplot as plt
import statsmodels.formula.api as fsm
model = fsm.logit(formula = 'is_canceled ~ log_lead_time : reservation_type * is_company * high_season', data = df_hotel)
fit = model.fit()
fit.summary()
df_hotel['pred_m1'] = fit.predict()

fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = df_hotel, x = 'lead_time', y = 'pred_m1', hue = 'is_company', size = 'high_season', ax = ax[0])
sns.scatterplot(data = df_hotel, x = 'log_lead_time', y = 'pred_m1', hue = 'is_company', size = 'high_season', ax = ax[1])

Optimization terminated successfully.
         Current function value: 0.600648
         Iterations 6

CONSIDERATION 1: We can see our break from two populations (last minute and scheduled) between day 0 and 10;
CONSIDERATION 2: We can assume that, when it is company, we've MUCH fewer cancellations;
CONSIDERATION 4: During high season, companies cancel more;
CONSIDERATION 5: During high season, cancellations done by non-companies slightly decrease;
CONSIDERATION 6: The gap between high season and low season cancellations get bigger as higher is the lead time;
CONSIDERATION 7: This gap is stronger when 'is company';

import matplotlib.pyplot as plt
import statsmodels.formula.api as fsm
model = fsm.logit(formula = 'is_canceled ~ log_lead_time : reservation_type * is_agent * high_season', data = df_hotel)
fit = model.fit()
fit.summary()
df_hotel['pred_cancel'] = fit.predict()

fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = df_hotel, x = 'lead_time', y = 'pred_cancel', hue = 'is_agent', size = 'high_season', ax = ax[0])
sns.scatterplot(data = df_hotel, x = 'log_lead_time', y = 'pred_cancel', hue = 'is_agent', size = 'high_season', ax = ax[1])

Optimization terminated successfully.
         Current function value: 0.601806
         Iterations 7

CONSIDERATION 1: After 10 days, when the book is done by an agent, it's more probable to cancel than when it's not booked by an agent.

We can plot a ROC curve to check the performance of a classification model at all classification thresholds,
considering the trade-off between sensitivity (True Positive Rate) and specificity (True Negative Rate).

from sklearn.metrics import f1_score

threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(df_hotel['pred_cancel'] < threshold, 0, 1)
    f1 = f1_score(df_hotel['is_canceled'], pred_label)
    f1_list.append(f1)
    
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)

Text(0.5, 1.0, 'Best Threshold: 0.26 w/ F-1: 0.6')

F-1 score finds the best spot between Precision and Recall and it's reasonable to use with binary classification or balanced multiclass data.
Most of the times our business problem determines if we should privilege Precision or Recall, but F-1 score is a good generalization to compare the performance of the models.
We can check and set the best threshold for our model so we can improve our desired score.

from sklearn.metrics import cohen_kappa_score, precision_score, roc_curve
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score

threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(df_hotel['pred_cancel'] < threshold, 0, 1)
    score = cohen_kappa_score(df_hotel['is_canceled'], pred_label)
    score_list.append(score)

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == max(df_score['score_score'])]
bt = df_score[df_score['score_score'] == max(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == max(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Kappa: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)

Text(0.5, 1.0, 'Best Threshold: 0.39 w/ Kappa: 0.24')

Cohen's Kappa score is a good measure for a classification model since it considers imbalanced data.

from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(df_hotel['is_canceled'], df_hotel['pred_cancel'])

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)

sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)

A ROC curve shows the performance of one classification model at all classification thresholds.
It can be used to evaluate the strength of a model. ROC Curves can also be used to compare two models and we've this mind.

CONCLUSIONS

We did a hard work to get here. We didn't achieve a reasonable score (0.6) for today parameters, though. But not all is about score.

Advantages
- Logistic regression is a simple linear model which is easy and fast to set;
- It's not computationally expensive, you can run on ordinary hardware;
- It's highly scalabe;
- It's not a complex model, no black box, it's highly explainable and interpretable; you can even use it to explore more your data;
- Excelent to plot, visualize and get insights that wouldn't be possible on complex models;
- Excelent for risk analysis such as bookings, so we could infer the rate of daily room rate considering risk of cancellation;
Disadvantage
- It's more handmade, demands more time, work and analytical capabilities from the data scientist who works as a craftsman;
- It's sensible to outliers and you should be minimalist, to choose less and more important features and treat outliers without loosing data;
- More dependant of a good feature engineering to get a better score;
- You can also stratify you modelling and build regressions on top of regressions, but most of the times it's not worth the hard work;
- You will never get a good result as state-of-art ensemble of hierarchical models and neural networks;

Ensemble of hierarchical models: XGBoost

Feature Selection: GridSearchCV

%%time
import numpy as np
import xgboost as xgb
from sklearn.experimental import enable_halving_search_cv 
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from datetime import datetime

# Automate best parameters search for XGBoost using GridSearch
folds = 3
param_comb = 20

# Choose parameters to compare
param_grid = {
     "max_depth": [6, 10, 20, 30],
     "min_child_weight": [1, 3, 10],
     "gamma"            : [0.2 , 0.7, 1, 2],
     "alpha"            : [1],
     "learning_rate":    [0.10, 1, 3],
     "missing":          [np.nan],                      
     "num_parallel_tree": [1, 2],                 
#     "use_rmm": True,                      
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7],
     "scale_pos_weight": [1, 3], 
     "reg_lambda": [1, 5, 10, 50],
#     "gradient_based": True,
     "subsample": [0.1, 0.5],                    
     }

# skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

# Halving Grid Search CV implementation
xgb_cl = xgb.XGBClassifier(n_estimators=1000,
                           objective='binary:logistic',
                           silent=True, 
                           nthread=40,
                           tree_method='gpu_hist', 
                           eval_metric='auc')

halving_cv = HalvingGridSearchCV(xgb_cl, 
                                 param_grid, 
                                 scoring="roc_auc", 
                                 #n_jobs=4, 
                                 min_resources="exhaust",
                                 factor=3,
                                 #scoring="neg_log_loss", 
                                 #cv=skf.split(X_train,Y_train),
                                 verbose=1, 
                                 random_state=1001,
                                )

# # Here we go
# start_time = timer(None) # timing starts from this point for "start_time" variable

# halving_cv.fit(X_train, Y_train)
# # Return set of parameters with the best performance
# halving_cv.best_params_
# # Return the performance metric score
# halving_cv.best_score_

# model.dump_model('dump.raw.txt')

# timer(start_time) # timing ends here for "start_time" variable

CPU times: user 80.7 ms, sys: 12.2 ms, total: 92.9 ms
Wall time: 95.6 ms

CONSIDERATION: We would like to automate the search for the best parameters, but it takes more time than we had.
We decided to securely set the parameters on a conservative approach as suggested by the community, after empirical testing.

BEST MODEL TO COMPARE: XGboost using all features with binary logistic and avoiding LEAKAGE

%%time
import os
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.metrics import cohen_kappa_score, precision_score
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score

# Define your features and your target to predict (selecting ALL columns WITHOUT LEAKAGE!)
X = df_hotel[['hotel', 'lead_time', 'arrival_date_year', 'arrival_date_month', 
                    'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 
                    'babies', 'meal','country', 'market_segment', 'distribution_channel',
                    'is_repeated_guest', 'previous_cancellations',
                    'previous_bookings_not_canceled', 'reserved_room_type',
                    'assigned_room_type', 'booking_changes', 'deposit_type',
                    'days_in_waiting_list', 'customer_type', 'adr',
                    'required_car_parking_spaces', 'total_of_special_requests',
                    'is_company','is_agent']]

Y = df_hotel['is_canceled']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)
Y_train = Y_train
Y_test = Y_test

full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), X_train.columns)], remainder='passthrough')

encoder = full_pipeline.fit(X_train)
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

# train the model
model = XGBRegressor(n_estimators= 200,
                     max_depth= 30,                         # Lower ratios avoid over-fitting. Default is 6.
                     objective = 'binary:logistic',         # Default is reg:squarederror. 'multi:softprob' for multiclass and get proba.  
                     #num_class = 2,                        # Use if softprob is set.
                     reg_lambda = 10,                       # Larger ratios avoid over-fitting. Default is 1.
                     gamma = 0.3,                           # Larger values avoid over-fitting. Default is 0. # Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
                     alpha = 1,                             # Larger ratios avoid over-fitting. Default is 0.
                     learning_rate= 0.10,                   # Lower ratios avoid over-fitting. Default is 3.
                     colsample_bytree= 0.7,                 # Lower ratios avoid over-fitting.
                     scale_pos_weight = 1,                  # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
                     subsample = 0.1,                       # Lower ratios avoid over-fitting. Default 1. 0.5 recommended. # 0.1 if using GPU.
                     min_child_weight = 3,                  # Larger ratios avoid over-fitting. Default is 1.
                     missing = np.nan,                      # Deal with missing values
                     num_parallel_tree = 2,                 # Parallel trees constructed during each iteration. Default is 1.
                     importance_type = 'weight',
                     eval_metric = 'auc',
                     #use_label_encoder = True,
                     #enable_categorical = True,
                     verbosity = 1,
                     nthread = -1,                          # Set -1 to use all threads.
                     #use_rmm = True,                       # Use GPU if available
                     tree_method = 'auto', # auto           # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
                     #gradient_based = True,                # If True you can set subsample as low as 0.1. Only use with gpu_hist 
                    )

# fit model              
model.fit(X_train_enc, Y_train.values.ravel(),
          # early_stopping_rounds=20
         )

# check best ntree limit
display(model.best_ntree_limit)

# extract the training set predictions
preds_train = model.predict(X_train_enc,
                            ntree_limit=model.best_ntree_limit
                           )
# extract the test set predictions
preds_test = model.predict(X_test_enc,
                           ntree_limit=model.best_ntree_limit
                           )

# save model
output_dir = "models"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# save in JSON format
model.save_model(f'{output_dir}/hotel_xgboost.json')
# save in text format
model.save_model(f'{output_dir}/hotel_xgboost.txt')

print('FINISHED!')

Found ntree_limit: 400

FINISHED!
CPU times: user 4min 11s, sys: 384 ms, total: 4min 12s
Wall time: 34.8 s

Setting our parameters at a conservative approach and with a number of steps of 200 took about 4 minutes.
We'll use this 'step' pattern to compare other settings.
If we would put this model in production we probably should expand our model to train for more time considering learning rate and steps.

Plot & Score

%%time
# Plot F1-Score and Threshold
from sklearn.metrics import f1_score

threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test < threshold, 0, 1)
    f1 = f1_score(Y_test, pred_label)
    f1_list.append(f1)

df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()

# Plot Kappa Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test < threshold, 0, 1)
    score = cohen_kappa_score(Y_test, pred_label)
    score_list.append(score)

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == max(df_score['score_score'])]
bt = df_score[df_score['score_score'] == max(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == max(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Kappa: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()

from sklearn.metrics import roc_curve

#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(Y_test, preds_test)

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)

sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.show()

CPU times: user 1min 13s, sys: 572 ms, total: 1min 14s
Wall time: 1min 13s

Our ROC curve is MUCH better on all thresholds compared with our logistic regression! We've a nice and even curve which doesn't appear to be overfitting!

from sklearn.metrics import roc_auc_score

best_preds = np.where(preds_test < bt, 0, 1)

print("Roc_auc = {}".format(roc_auc_score(Y_test, best_preds)))
print("Precision = {}".format(precision_score(Y_test, best_preds)))
print("Recall = {}".format(recall_score(Y_test, best_preds)))
print("F1 = {}".format(f1_score(Y_test, best_preds)))
print("Kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
print("Mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
print("Logloss_test = {}".format(log_loss(Y_test, best_preds)))

Roc_auc = 0.8492998797960843
Precision = 0.8289085545722714
Recall = 0.7955832389580973
F1 = 0.8119040739670615
Kappa_score = 0.7043968112863238
Matthews_corrcoef = 0.704763163688087
Mean_squared_error_test = 0.13687397502207646
Logloss_test = 4.727508371460749

Now we've a reasonable predictive model! It could be even improved findind best parameters.
Using the best threshold for Kappa score improved our overall scores compared with best F-1 score threshold, probably due to imbalanced classes.

XGBoost with manually selected features using Patsy and Softprob

%%time
import xgboost as xgb
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import patsy
# Selecting features I manually found interesting analyzing matthew's correlation and using patsy to automatic interact between features.
y, X = patsy.dmatrices('is_canceled ~ hotel + lead_time + arrival_date_year + arrival_date_month + stays_in_weekend_nights + \
                       stays_in_week_nights + adults + children + babies + meal + country + market_segment + distribution_channel + \
                       is_repeated_guest + previous_cancellations + previous_bookings_not_canceled + reserved_room_type + \
                       assigned_room_type + booking_changes + deposit_type + days_in_waiting_list + customer_type + adr + \
                       required_car_parking_spaces + total_of_special_requests + is_company + is_agent', data = df_hotel)



                    
# Display patsy features
#display(X)

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)

D_train = xgb.DMatrix(X_train, label=Y_train)#, enable_categorical=True)
D_test = xgb.DMatrix(X_test, label=Y_test)#, enable_categorical=True)


param = {
    'eta': 0.10,                      # Lower ratios avoid over-fitting. Default is 3.
    'max_depth': 30,                  # Lower ratios avoid over-fitting. Default is 6.
    "min_child_weight": 3,            # Larger ratios avoid over-fitting. Default is 1.
    "gamma": 0.3,                     # Larger values avoid over-fitting. Default is 0. 
    "colsample_bytree" : 0.7,         # Lower ratios avoid over-fitting. Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
    "scale_pos_weight": 1,            # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
    "reg_lambda": 10,                 # Larger ratios avoid over-fitting. Default is 1.
    "alpha": 1,                       # Larger ratios avoid over-fitting. Default is 0.
    'subsample':0.5,                  # Lower ratios avoid over-fitting. Default 1. 0.5 recommended.
    'num_parallel_tree': 2,           # Parallel trees constructed during each iteration. Default is 1.
    'objective': 'multi:softprob',    # Default is reg:squarederror. 'multi:softprob' for multiclass.  
    'num_class': 2,                   # Use if softprob is set.
    'verbosity':1,
    'eval_metric': 'auc',
    'use_rmm':False,                   # Use GPU if available
    'nthread':-1,                      # Set -1 to use all threads.
    'tree_method': 'auto',             # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
    'gradient_based': False,           # If True you can set subsample as low as 0.1. Only use with gpu_hist 
} 

steps = 200  # The number of training iterations

model = xgb.train(param, D_train, steps)
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

preds = model.predict(D_test)
best_preds = np.asarray([np.argmax(line) for line in preds])

print("Precision = {}".format(precision_score(Y_test, best_preds)))
print("Recall = {}".format(recall_score(Y_test, best_preds)))
print("f1 = {}".format(f1_score(Y_test, best_preds)))
print("kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
print("matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
#print("mean_squared_error_train = {}".format(mean_squared_error(Y_train, best_preds)))
# print("mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
print("logloss_test = {}".format(log_loss(Y_test, best_preds)))
#print("logloss_train = {}".format(log_loss(Y_train, best_preds)))

# from xgboost import plot_importance
# import matplotlib.pyplot as pyplot
# plot_importance(model)
# pyplot.show()

Precision = 0.8658173592094297
Recall = 0.8151552516534021
f1 = 0.8397228637413394
kappa_score = 0.7480610552387499
matthews_corrcoef = 0.7489019304166113
logloss_test = 4.031812977527266
CPU times: user 1h 16min 6s, sys: 6.13 s, total: 1h 16min 12s
Wall time: 10min 13s

Using patsy and softprob (contains predicted probability of each data point belonging to each class) improved our overall scores! And we didn't even explored our thresholds!

CONCLUSIONS

We achieved a F-1 score of more than 82% without GridSearchCV or moving our thresholds, within just about 4 minutes of training!

Advantages
- Hierarchical tree models, with root and leaves, reproduces better human stratified decision;
- It's much less sensible to outliers than linear models;
- It's a bit less dependant of engineering new features, you can provide to the model most existing features if leakage is avoided;
- Ensemble of tree models as XGBoost have great prediction capability;
- You don't have to be too analytical, it's more about engineering optimization and avoiding overfitting;
- It's less handmade; if you achieve to put in a production pipeline after a hard work, it's very powerful for a mass industry level automation;
Disadvantage
- It's more prone to overfit as the model gets more complex;
- It's less acurate than state-of-art neural networks models (we didn't trained neural networks for computational reasons);
- Although not as costly as neural networks, it still needs time and hardware for training with proper parameters;
- It's much harder to scale on big data and put in a production pipeline;
- It's a black box model, it's not as easy to be analytical and get insights as it's with linear regressions;
- It's hard to explain. Although we can explain with tools such as SHAP it's very expensive and it's not easy to explain in layman's terms to managers;

Final Remarks after 4 days of work (two weekends)

We did our exploratory data analysis (EDA) and feature engineering treating our data and creating more variables.
We've created two models using a notebook and this is our POC (proof-of-concept).
- LINEAR: A logistic regression which achieved a poor F-1 score of 60%, but was important to get insights on our business problem and its risks;
- HIERARCHICAL: An ensemble of tree models using XGBoost, we've great scores of more than 82%;
It would be better if we've a better understanding of our business problem by communicating with our client and product manager and owners;
If we would start from zero, we would first work on a better linear model improving our baseline model to deal with our business problem instantly;
We would have a data-centric approach to improve the quality of our dataset, by scrapping data or better querying our database, by feature engineering, by collecting more data;
We would refactorize and modularize our code so we could put a pipeline in production fast;
After we've a fast and good quality minimun viable product (MVP) automated we would start to think on creating more complex and costly models to put in production;
Probably we would opt for state-of-art neural network models using friendly APIs such as PyTorch and Tensorflow, using hardware resources from cloud services; but we can't be sure yet;

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
WIT		WIT
data		data
notebooks		notebooks
LICENSE		LICENSE
README.md		README.md
output_31_1.png		output_31_1.png
output_33_1.png		output_33_1.png
output_35_1.png		output_35_1.png
output_37_1.png		output_37_1.png
output_42_0.png		output_42_0.png
output_46_2.png		output_46_2.png
output_47_2.png		output_47_2.png
output_49_2.png		output_49_2.png
output_52_1.png		output_52_1.png
output_54_1.png		output_54_1.png
output_56_1.png		output_56_1.png
output_67_0.png		output_67_0.png
output_67_1.png		output_67_1.png
output_67_2.png		output_67_2.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BUSINESS PROBLEM: explore a hotel booking dataset and predict cancellation

STAGE 1: preprecessing and EDA

Load Data and Explore

Transform Data

Plot, explore, transform and interpret

STAGE 2: Comparing two models

Linear model: logistic regression

CONCLUSIONS

Ensemble of hierarchical models: XGBoost

Feature Selection: GridSearchCV

BEST MODEL TO COMPARE: XGboost using all features with binary logistic and avoiding LEAKAGE

XGBoost with manually selected features using Patsy and Softprob

CONCLUSIONS

Final Remarks after 4 days of work (two weekends)

About

Releases

Packages

Languages

License

ggnicolau/Project-18-Predict-Hotel-Cancellation

Folders and files

Latest commit

History

Repository files navigation

BUSINESS PROBLEM: explore a hotel booking dataset and predict cancellation

STAGE 1: preprecessing and EDA

Load Data and Explore

Transform Data

Plot, explore, transform and interpret

STAGE 2: Comparing two models

Linear model: logistic regression

CONCLUSIONS

Ensemble of hierarchical models: XGBoost

Feature Selection: GridSearchCV

BEST MODEL TO COMPARE: XGboost using all features with binary logistic and avoiding LEAKAGE

XGBoost with manually selected features using Patsy and Softprob

CONCLUSIONS

Final Remarks after 4 days of work (two weekends)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages