├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── data
│  ├── interim <- Intermediate data that has been transformed.
│  ├── processed <- The final, canonical data sets.
│  └── raw <- The original, immutable data dump.
│
├── models <- Trained and serialized models, model predictions, or model summaries.
│
├── notebooks <- Jupyter notebooks with steps for training and evaluating models.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│  └── figures <- Generated graphics and figures to be used in reporting
│
└── requirements.txt <- The requirements file for reproducing the analysis environment, e.g. generated with `pip freeze > requirements.txt`.
Project based on the cookiecutter data science project template. #cookiecutterdatascience
In the context of marketplaces, an algorithm is needed to predict whether a listed item is new or used.
Your tasks involve the data analysis, design, processing and modeling of a machine learning solution to predict whether an item is new or used, and then to evaluate the model over held-out test data.
To assist in that task, a dataset is provided in MLA_100k_checked_v3.jsonlines.
For the evaluation, you will use the accuracy metric and must achieve a result of at least 0.86. Additionally, you will have to choose an appropriate secondary metric and elaborate an argument on why that metric was chosen.
The deliverables are:
- The file, including all the code needed to define and evaluate a model.
- A document explaining the criteria applied to choose the features, the proposed secondary metric and the performance achieved on that metric.
- Optionally, you can deliver an EDA analysis in another format, such as .ipynb.
You will find our first selected columns in section 2.1; then you can check our definitive columns after treatment and feature engineering.
We didn't predict the classes (0 and 1) directly; instead, we predicted the probability for our binary classification problem, since it is more meaningful (literally, we compute the probability of belonging to class 0 or 1). Thus, we didn't calculate accuracy, precision, recall, F1-score, Kappa or other label-based metrics. For probability evaluation, we opted for mean squared error, log loss and Brier score (lower is better). We also used the ROC curve to evaluate the model and calculated the ROC AUC score (higher is better).
Our metrics only make sense when comparing models. We compared five models.
(a) Our first model is our baseline: a logistic regression with default parameters, which gave a poor result with a score of 0.69. We used it mainly because a linear model helps us get insights from the data;
(b) Our second model is more complex and less interpretable: an ensemble of non-linear hierarchical tree models, XGBoost. We got a ROC AUC of 0.89, which is better than our baseline and a good result, although with a high computational cost;
(c) For our third model, we first used embeddings (neural networks) to encode the high-cardinality categorical features (category and seller city), for which we couldn't use one-hot encoding (due to computational cost) or label encoding (since the unique values are independent of each other). After that, we simply applied a logistic regression. Impressively, we got a ROC AUC of 0.90 with a simple linear model for binary classification;
(d) For our fourth model we also used embeddings for encoding, but then trained an XGBoost. We got a ROC AUC of 0.93; as expected, a better result than the previous model;
(e) Finally, for our fifth model we again used embeddings for encoding, but then trained a four-layer neural network. We got the same results as the previous model.
- REMARKS: Personally, I think the embeddings encoding with logistic regression is the best model, because it is simpler and more interpretable (see the coefficient sketch after the table below). Occam's Razor states that, other things being equal, explanations that posit fewer entities, or fewer kinds of entities, are to be preferred to explanations that posit more.
Metric | XGBoost | Emb_Logistic | Emb_XGBoost | Emb_NNet
---|---|---|---|---
mean_squared_error_test | 0.36 | 0.36 | 0.32 | 0.32
Roc_auc | 0.89 | 0.90 | 0.93 | 0.93
Brier_error | 0.13 | 0.12 | 0.10 | 0.10
Logloss | 0.40 | 0.39 | 0.34 | 0.35
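To illustrate the interpretability point, the following is a minimal sketch on synthetic data (not our actual pipeline): a logistic regression trained on standardized features exposes one coefficient per feature, whose sign and relative magnitude can be read directly. The feature names and the simulated relationship are hypothetical, chosen only to mimic this problem.

```python
# Hedged sketch: why a linear model is easy to interpret (synthetic data, hypothetical effects).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "initial_quantity": rng.poisson(3, 1000),
    "log_price": rng.normal(5, 1, 1000),
    "warranty": rng.integers(0, 2, 1000),
})
# Hypothetical relationship: more stock and a warranty push towards "new" (class 0).
logit = -0.8 * X["initial_quantity"] - 1.2 * X["warranty"] + 0.3 * X["log_price"]
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_std, y)
# On standardized inputs the coefficient magnitudes are directly comparable.
print(pd.Series(clf.coef_[0], index=X.columns).sort_values())
```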
- The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of error.
Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation.
This is called the Root Mean Squared Error (or RMSE).
- Logistic loss (or log loss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.
The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
It heavily penalizes predicted probabilities far away from their expected value.
- The Brier score calculates the mean squared error between predicted probabilities and the expected values.
It's gentler than log loss but still penalizes proportionally to the distance from the expected value.
- Area Under ROC Curve (or ROC AUC for short) is a performance metric for binary classification problems.
The AUC represents a model’s ability to discriminate between positive and negative classes.
An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.
A ROC curve is a plot of the true positive rate versus the false positive rate for a given set of probability predictions, at different thresholds used to map the probabilities to class labels.
The AUC is then the approximate integral under the ROC curve.
It summarizes the likelihood of the model assigning a higher probability to a randomly chosen positive case than to a randomly chosen negative case.
*(machinelearningmastery.com)*
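As a minimal sketch of how these probability metrics can be computed with scikit-learn (the labels and predicted probabilities below are made up purely for illustration):

```python
# Hedged sketch: computing the probability metrics discussed above on toy predictions.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])                # true labels (0 = new, 1 = used)
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3])   # predicted P(class = 1), made up

rmse = np.sqrt(mean_squared_error(y_true, y_prob))   # RMSE: error back in the units of the target
print("RMSE     =", round(rmse, 3))
print("Log loss =", round(log_loss(y_true, y_prob), 3))          # heavily penalizes confident mistakes
print("Brier    =", round(brier_score_loss(y_true, y_prob), 3))  # mean squared error of the probabilities
print("ROC AUC  =", round(roc_auc_score(y_true, y_prob), 3))     # ranking quality, 0.5 = random
```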
- numpy
- pandas
- re
- matplotlib
- seaborn
- embedding_encoder
- sklearn
- xgboost
- keras
- Python version 3.9
- Git
- VS Studio
- Jupyter IPython
- Github
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
dfs = pd.read_json('MLA_100k_checked_v3.jsonlines', lines=True)
dfs = dfs.rename(columns = {'tags':'tag'})
dfs = dfs.rename(columns = {'id':'Id'})
# Get region
dfs['seller_country'] = dfs.apply(lambda x : x['seller_address']['country']['name'], axis = 1)
dfs['seller_state'] = dfs.apply(lambda x : x['seller_address']['state']['name'], axis = 1)
dfs['seller_city'] = dfs.apply(lambda x : x['seller_address']['city']['name'], axis = 1)
# Transform id (named as descriptions) column to get data
import ast
def str_to_dict(column):
    for i in range(len(column)):
        try:
            column[i] = ast.literal_eval(column[i][0])
        except:
            return
str_to_dict(dfs['descriptions'])
# get data from descriptions and shipping
dfs = pd.concat([dfs, dfs["descriptions"].apply(pd.Series)], axis=1)
dfs = pd.concat([dfs, dfs["shipping"].apply(pd.Series)], axis=1)
pd.set_option('display.max_columns', None)
dfs.head(5)
seller_address | warranty | sub_status | condition | deal_ids | base_price | shipping | non_mercado_pago_payment_methods | seller_id | variations | site_id | listing_type_id | price | attributes | buying_mode | tag | listing_source | parent_item_id | coverage_areas | category_id | descriptions | last_updated | international_delivery_mode | pictures | Id | official_store_id | differential_pricing | accepts_mercadopago | original_price | currency_id | thumbnail | title | automatic_relist | date_created | secure_thumbnail | stop_time | status | video_id | catalog_product_id | subtitle | initial_quantity | start_time | permalink | sold_quantity | available_quantity | seller_country | seller_state | seller_city | 0 | id | local_pick_up | methods | tags | free_shipping | mode | dimensions | free_methods | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'country': {'name': 'Argentina', 'id': 'AR'},... | None | [] | new | [] | 80.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 8208882349 | [] | MLA | bronze | 80.0 | [] | buy_it_now | [dragged_bids_and_visits] | MLA6553902747 | [] | MLA126406 | {'id': 'MLA4695330653-912855983'} | 2015-09-05T20:42:58.000Z | none | [{'size': '500x375', 'secure_url': 'https://a2... | MLA4695330653 | NaN | NaN | True | NaN | ARS | http://mla-s1-p.mlstatic.com/5386-MLA469533065... | Auriculares Samsung Originales Manos Libres Ca... | False | 2015-09-05T20:42:53.000Z | https://a248.e.akamai.net/mla-s1-p.mlstatic.co... | 2015-11-04 20:42:53 | active | None | NaN | NaN | 1 | 2015-09-05 20:42:53 | http://articulo.mercadolibre.com.ar/MLA4695330... | 0 | 1 | Argentina | Capital Federal | San CristĂłbal | NaN | MLA4695330653-912855983 | True | [] | [] | False | not_specified | None | NaN | |
1 | {'country': {'name': 'Argentina', 'id': 'AR'},... | NUESTRA REPUTACION | [] | used | [] | 2650.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 8141699488 | [] | MLA | silver | 2650.0 | [] | buy_it_now | [] | MLA7727150374 | [] | MLA10267 | {'id': 'MLA7160447179-930764806'} | 2015-09-26T18:08:34.000Z | none | [{'size': '499x334', 'secure_url': 'https://a2... | MLA7160447179 | NaN | NaN | True | NaN | ARS | http://mla-s1-p.mlstatic.com/23223-MLA71604471... | Cuchillo Daga Acero CarbĂłn Casco Yelmo Solinge... | False | 2015-09-26T18:08:30.000Z | https://a248.e.akamai.net/mla-s1-p.mlstatic.co... | 2015-11-25 18:08:30 | active | None | NaN | NaN | 1 | 2015-09-26 18:08:30 | http://articulo.mercadolibre.com.ar/MLA7160447... | 0 | 1 | Argentina | Capital Federal | Buenos Aires | NaN | MLA7160447179-930764806 | True | [] | [] | False | me2 | None | NaN | |
2 | {'country': {'name': 'Argentina', 'id': 'AR'},... | None | [] | used | [] | 60.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 8386096505 | [] | MLA | bronze | 60.0 | [] | buy_it_now | [dragged_bids_and_visits] | MLA6561247998 | [] | MLA1227 | {'id': 'MLA7367189936-916478256'} | 2015-09-09T23:57:10.000Z | none | [{'size': '375x500', 'secure_url': 'https://a2... | MLA7367189936 | NaN | NaN | True | NaN | ARS | http://mla-s1-p.mlstatic.com/22076-MLA73671899... | Antigua Revista Billiken, N° 1826, Año 1954 | False | 2015-09-09T23:57:07.000Z | https://a248.e.akamai.net/mla-s1-p.mlstatic.co... | 2015-11-08 23:57:07 | active | None | NaN | NaN | 1 | 2015-09-09 23:57:07 | http://articulo.mercadolibre.com.ar/MLA7367189... | 0 | 1 | Argentina | Capital Federal | Boedo | NaN | MLA7367189936-916478256 | True | [] | [] | False | me2 | None | NaN | |
3 | {'country': {'name': 'Argentina', 'id': 'AR'},... | None | [] | new | [] | 580.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 5377752182 | [] | MLA | silver | 580.0 | [] | buy_it_now | [] | None | [] | MLA86345 | {'id': 'MLA9191625553-932309698'} | 2015-10-05T16:03:50.306Z | none | [{'size': '441x423', 'secure_url': 'https://a2... | MLA9191625553 | NaN | NaN | True | NaN | ARS | http://mla-s2-p.mlstatic.com/183901-MLA9191625... | Alarma Guardtex Gx412 Seguridad Para El Automo... | False | 2015-09-28T18:47:56.000Z | https://a248.e.akamai.net/mla-s2-p.mlstatic.co... | 2015-12-04 01:13:16 | active | None | NaN | NaN | 1 | 2015-09-28 18:47:56 | http://articulo.mercadolibre.com.ar/MLA9191625... | 0 | 1 | Argentina | Capital Federal | Floresta | NaN | MLA9191625553-932309698 | True | [] | [] | False | me2 | None | NaN | |
4 | {'country': {'name': 'Argentina', 'id': 'AR'},... | MI REPUTACION. | [] | used | [] | 30.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 2938071313 | [] | MLA | bronze | 30.0 | [] | buy_it_now | [dragged_bids_and_visits] | MLA3133256685 | [] | MLA41287 | {'id': 'MLA7787961817-902981678'} | 2015-08-28T13:37:41.000Z | none | [{'size': '375x500', 'secure_url': 'https://a2... | MLA7787961817 | NaN | NaN | True | NaN | ARS | http://mla-s2-p.mlstatic.com/13595-MLA77879618... | Serenata - Jennifer Blake | False | 2015-08-24T22:07:20.000Z | https://a248.e.akamai.net/mla-s2-p.mlstatic.co... | 2015-10-23 22:07:20 | active | None | NaN | NaN | 1 | 2015-08-24 22:07:20 | http://articulo.mercadolibre.com.ar/MLA7787961... | 0 | 1 | Argentina | Buenos Aires | Tres de febrero | NaN | MLA7787961817-902981678 | True | [] | [] | False | not_specified | None | NaN |
# Get payment methods from dict
def convertCol(x, key, i):
    try:
        return x[i][key]
    except:
        return ''
for key in ['description']:  # ['description','id','type'] -- only description is interesting
    for i in range(0, 13):
        dfs[f'payment_{key}{i}'] = dfs['non_mercado_pago_payment_methods'].apply(lambda x: convertCol(x, key, i))
# Create a boolean column for each payment method
lista_c = []
for i in range(0, 13):
    lista = dfs[f'payment_description{i}'].unique()
    lista_c.extend(lista)
desc_uniques = set(lista_c)
desc_uniques.remove('')
desc_uniques
{'Acordar con el comprador',
'American Express',
'Cheque certificado',
'Contra reembolso',
'Diners',
'Efectivo',
'Giro postal',
'MasterCard',
'Mastercard Maestro',
'MercadoPago',
'Tarjeta de crédito',
'Transferencia bancaria',
'Visa',
'Visa Electron'}
# Rename column for an improved dataframe (#TODO: Use apply for performance)
for col in desc_uniques:
    col_name = col.replace(' ', '_')
    dfs[col_name] = dfs.isin([col]).any(axis=1)
# drop older columns
dfs = dfs.drop(dfs.loc[:, 'payment_description0':'payment_description12'], axis = 1)
import numpy as np
dfs = dfs.applymap(lambda x: x if x else np.nan)
dfs = dfs.dropna(how='all', axis=1)
COLUMNS THAT MATTER (see the sketch after these lists):
- warranty (new and used products have different kinds of warranties)
- sub_status (when a product ad is suspended, it might be due to its condition)
- base_price (prices differ between used and new items)
- seller_id (different sellers might sell used or new items)
- price (price again)
- buying_mode (the type of buying might imply something)
- parent_item_id (there might be correlation between similar products)
- last_updated (we'll check)
- id (we'll check)
- official_store_id (different stores sell different items and conditions)
- original_price (price again)
- currency_id (type of payment and currency might depend on the kind of seller and products)
- title (keep the title to identify the product)
- automatic_relist (we'll check)
- stop_time (time might have an influence)
- status (status might have an influence)
- video_id (stays; we'll check, but there might be videos for used products)
- initial_quantity (a good feature; used products have low counts)
- start_time (time again)
- sold_quantity (quantity again)
- available_quantity (quantity again)
- seller_country, state, city (used or new ads might be unevenly distributed across regions)
- local_pick_up (being new or used might influence whether local pickup is available)
- free_shipping (big sellers of new products might be more able to offer free shipping)
- Contra_reembolso (stays; payment methods matter)
- Giro_postal (stays)
- mode (stays; we don't know what it represents, but it has no missing values and 'not_specified' might be more common for used products)
- tags (we'll check the tags)
- tag (we'll check the tag)
- date_created
- category
TRANSFORM COLUMNS (TYPE OF PAYMENTS):
- Cheque_certificado
- Mastercard_Maestro
- Diners
- Transferencia_bancaria
- accepts_mercadopago
- MercadoPago
- Efectivo
- Tarjeta_de_crédito
- American_Express
- MasterCard
- Visa_Electron
- Visa
- Acordar_con_el_comprador
COLUMNS WE DON'T NEED:
- seller_address (too specific)
- deal_ids (nothing relevant, we checked)
- shipping (nothing relevant, we checked)
- non_mercado_pago_payment_methods (transformed)
- site_id (too specific)
- listing_type_id (dropped)
- descriptions (nothing relevant, we checked; it turned out to be an id)
- international_delivery_mode
- pictures (nothing relevant, we checked)
- thumbnail (nothing relevant, we checked)
- secure_thumbnail (nothing relevant, we checked)
- permalink (nothing relevant, we checked)
- free_methods (nothing relevant, we checked)
DOUBTS:
- variations
- attributes
- dimensions
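As a hedged sketch of how the notes above could be made explicit in code (the actual notebook reaches the same end state through renames, payment-method merges, `dropna` and a final column reorder; `dfs_trimmed` below is a hypothetical variable and is not used later):

```python
# Hedged sketch: turning the keep/drop notes above into explicit column lists.
# Assumes `dfs` from the cells above; the names mirror the notes, not a definitive spec.
keep_cols = [
    'warranty', 'sub_status', 'base_price', 'seller_id', 'price', 'buying_mode',
    'parent_item_id', 'last_updated', 'official_store_id', 'original_price',
    'currency_id', 'title', 'automatic_relist', 'stop_time', 'status', 'video_id',
    'initial_quantity', 'start_time', 'sold_quantity', 'available_quantity',
    'seller_country', 'seller_state', 'seller_city', 'local_pick_up',
    'free_shipping', 'Contra_reembolso', 'Giro_postal', 'mode', 'tags', 'tag',
    'date_created', 'category_id',
]
drop_cols = [
    'seller_address', 'deal_ids', 'shipping', 'non_mercado_pago_payment_methods',
    'site_id', 'listing_type_id', 'descriptions', 'international_delivery_mode',
    'pictures', 'thumbnail', 'secure_thumbnail', 'permalink', 'free_methods',
]
# Only drop columns that still exist; some were already consumed by earlier transformations.
dfs_trimmed = dfs.drop(columns=[c for c in drop_cols if c in dfs.columns])
```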
# Rename columns
dfs = dfs.rename(columns = {'id':'descr_id', 'Id': 'id'})
# Reorder columns
dfs = dfs[['title', 'condition', 'warranty','initial_quantity', 'available_quantity', 'sold_quantity',
'sub_status', 'buying_mode', 'original_price', 'base_price', 'price', 'currency_id',
'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
'Contra_reembolso','Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo', 'Transferencia_bancaria', 'Tarjeta_de_crédito',
'Mastercard_Maestro', 'MasterCard', 'Visa_Electron', 'Visa', 'Diners', 'American_Express',
'status', 'automatic_relist',
'accepts_mercadopago', 'MercadoPago',
'id', 'descr_id', 'deal_ids', 'parent_item_id', 'category_id', 'seller_id', 'official_store_id', 'video_id',
'date_created', 'start_time', 'last_updated', 'stop_time']]
dfs['accepts_mercadopago'].value_counts()
True 97781
Name: accepts_mercadopago, dtype: int64
dfs['MercadoPago'].value_counts()
True 720
Name: MercadoPago, dtype: int64
# Merge columns about same subjects
dfs['accepts_mercadopago'] = dfs['accepts_mercadopago'].fillna(dfs['MercadoPago'])
dfs['MasterCard'].value_counts()
True 647
Name: MasterCard, dtype: int64
dfs['MasterCard'] = dfs['Mastercard_Maestro'].fillna(dfs['MercadoPago'])
dfs['Visa'] = dfs['Visa_Electron'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'].value_counts()
True 24638
Name: Tarjeta_de_crédito, dtype: int64
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['MasterCard'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Diners'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['American_Express'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'].value_counts()
True 25928
Name: Tarjeta_de_crédito, dtype: int64
dfs = dfs.rename(columns = {'Tarjeta_de_crédito':'Aceptan_Tarjeta'})
# Drop used columns
dfs = dfs.drop(columns=['MercadoPago', 'Mastercard_Maestro', 'Visa_Electron'])
dfs = dfs.drop(columns=['Visa', 'MasterCard', 'Diners', 'American_Express'])
# Treat columns to access data
def try_join(l):
    try:
        return ','.join(map(str, l))
    except TypeError:
        return np.nan
dfs['sub_status'] = try_join(dfs['sub_status'])
dfs['tags'] = try_join(dfs['tags'])
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
'original_price', 'base_price', 'price', 'currency_id',
'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'id', 'descr_id', 'deal_ids',
'parent_item_id', 'category_id', 'seller_id', 'official_store_id',
'video_id', 'date_created', 'start_time', 'last_updated', 'stop_time'],
dtype='object')
dfs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 42 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 100000 non-null object
1 condition 100000 non-null object
2 warranty 39103 non-null object
3 initial_quantity 100000 non-null int64
4 available_quantity 100000 non-null int64
5 sold_quantity 16920 non-null float64
6 sub_status 100000 non-null object
7 buying_mode 100000 non-null object
8 original_price 143 non-null float64
9 base_price 100000 non-null float64
10 price 100000 non-null float64
11 currency_id 100000 non-null object
12 seller_country 99997 non-null object
13 seller_state 99997 non-null object
14 seller_city 99996 non-null object
15 Giro_postal 1665 non-null object
16 free_shipping 3016 non-null object
17 local_pick_up 79561 non-null object
18 mode 100000 non-null object
19 tags 100000 non-null object
20 tag 75090 non-null object
21 Contra_reembolso 648 non-null object
22 Acordar_con_el_comprador 7991 non-null object
23 Cheque_certificado 460 non-null object
24 Efectivo 67059 non-null object
25 Transferencia_bancaria 51469 non-null object
26 Aceptan_Tarjeta 25928 non-null object
27 status 100000 non-null object
28 automatic_relist 4697 non-null object
29 accepts_mercadopago 97781 non-null object
30 id 100000 non-null object
31 descr_id 41 non-null object
32 deal_ids 240 non-null object
33 parent_item_id 76989 non-null object
34 category_id 100000 non-null object
35 seller_id 100000 non-null int64
36 official_store_id 818 non-null float64
37 video_id 2985 non-null object
38 date_created 100000 non-null object
39 start_time 100000 non-null datetime64[ns]
40 last_updated 100000 non-null object
41 stop_time 100000 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(5), int64(3), object(32)
memory usage: 32.0+ MB
# Transform some columns to boolean type
dfs[['Giro_postal', 'free_shipping', 'local_pick_up', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'automatic_relist']] = dfs[['Giro_postal', 'free_shipping', 'local_pick_up', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'automatic_relist']].notna()
# Transform type of all columns
dfs = dfs.astype({'title':'str',
'condition': 'category', #bool
'warranty': 'category',
'initial_quantity': 'float', #int
'available_quantity': 'float', #int
'sold_quantity': 'float', #int
'sub_status': 'category', #bool?
'buying_mode': 'category',
'original_price': 'float',
'base_price': 'float',
'price': 'float',
'currency_id': 'category',
'seller_country': 'category',
'seller_state': 'category',
'seller_city': 'category',
'Giro_postal': 'bool',
'free_shipping': 'bool',
'local_pick_up': 'bool',
'mode': 'category',
'tags': 'category', #bool?
#'tag': 'category',
'Contra_reembolso': 'bool',
'Acordar_con_el_comprador': 'bool',
'Cheque_certificado': 'bool',
'Efectivo': 'bool',
'Transferencia_bancaria': 'bool',
'Aceptan_Tarjeta': 'bool',
'id': 'category',
'descr_id': 'category',
#'deal_ids': 'category',
'parent_item_id': 'category',
'category_id': 'category',
'seller_id': 'category',
'official_store_id': 'category',
'video_id': 'category',
#'date_created': 'datetime',
# 'start_time': 'datetime',
# 'last_updated': 'datetime',
# 'stop_time': 'datetime',
'status': 'category', #bool?
'automatic_relist': 'bool'
})
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
'original_price', 'base_price', 'price', 'currency_id',
'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'id', 'descr_id', 'deal_ids',
'parent_item_id', 'category_id', 'seller_id', 'official_store_id',
'video_id', 'date_created', 'start_time', 'last_updated', 'stop_time'],
dtype='object')
# Check missing values
import numpy as np
import pandas as pd
def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) +
          " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index=False)
    return mz_table
missing_zero_values_table(dfs)
Your selected dataframe has 42 columns and 100000 Rows.
There are 13 columns that have missing values.
Column | Zero Values | Missing Values | % of Total Values | Total Zero Missing Values | % Total Zero Missing Values | Data Type
---|---|---|---|---|---|---
descr_id | 0 | 99959 | 100.0 | 99959 | 100.0 | category |
original_price | 0 | 99857 | 99.9 | 99857 | 99.9 | float64 |
deal_ids | 0 | 99760 | 99.8 | 99760 | 99.8 | object |
official_store_id | 0 | 99182 | 99.2 | 99182 | 99.2 | category |
video_id | 0 | 97015 | 97.0 | 97015 | 97.0 | category |
sold_quantity | 0 | 83080 | 83.1 | 83080 | 83.1 | float64 |
warranty | 0 | 60897 | 60.9 | 60897 | 60.9 | category |
tag | 0 | 24910 | 24.9 | 24910 | 24.9 | object |
parent_item_id | 0 | 23011 | 23.0 | 23011 | 23.0 | category |
accepts_mercadopago | 0 | 2219 | 2.2 | 2219 | 2.2 | object |
seller_city | 0 | 4 | 0.0 | 4 | 0.0 | category |
seller_country | 0 | 3 | 0.0 | 3 | 0.0 | category |
seller_state | 0 | 3 | 0.0 | 3 | 0.0 | category |
display(dfs['seller_country'].value_counts())
dfs = dfs.drop(columns = 'seller_country') # We can drop Country column, it's always Argentina
display(dfs['seller_city'].mode()[0])
display(dfs['seller_state'].mode()[0])
dfs['seller_city'] = dfs['seller_city'].fillna(dfs['seller_city'].mode()[0])
dfs['seller_state'] = dfs['seller_state'].fillna(dfs['seller_state'].mode()[0])
Argentina 99997
Name: seller_country, dtype: int64
'CABA'
'Capital Federal'
dfs['accepts_mercadopago'] = dfs['accepts_mercadopago'].fillna(False)
dfs['sold_quantity'] = dfs['sold_quantity'].fillna(0) # Is it ok to fill sold_quantity with 0? [VALIDATE]
dfs['warranty'] = dfs['warranty'].replace(r'^\s*$', np.nan, regex=True)
dfs['warranty'].isna().sum()
60897
import pandas as pd
df_temp1 = dfs[dfs['warranty'].isnull()]
df_temp1['warranty'] = False
df_temp2 = dfs[~dfs['warranty'].isnull()]
df_temp2['warranty'] = True
frames = [df_temp1, df_temp2]
dfs = pd.concat(frames)
dfs = dfs.astype({'warranty':'bool'})
dfs['warranty'].value_counts()
False 60897
True 39103
Name: warranty, dtype: int64
display('number of sold_quantity', dfs.sold_quantity.nunique())
'number of sold_quantity'
317
def get_value_per_cat():
    flag = dfs.select_dtypes(include=['category']).shape[1]
    i = 0
    while i <= flag:  # note: runs one step past the last column, so the final print shows an empty dict
        print(dict(dfs.select_dtypes(include=['category']).iloc[:, i:i+1].nunique()))
        i = i + 1
get_value_per_cat()
{'condition': 2}
{'sub_status': 1}
{'buying_mode': 3}
{'currency_id': 2}
{'seller_state': 24}
{'seller_city': 3655}
{'mode': 4}
{'tags': 1}
{'status': 4}
{'id': 100000}
{'descr_id': 41}
{'parent_item_id': 76989}
{'category_id': 10907}
{'seller_id': 35915}
{'official_store_id': 198}
{'video_id': 2077}
{}
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
'original_price', 'base_price', 'price', 'currency_id', 'seller_state',
'seller_city', 'Giro_postal', 'free_shipping', 'local_pick_up', 'mode',
'tags', 'tag', 'Contra_reembolso', 'Acordar_con_el_comprador',
'Cheque_certificado', 'Efectivo', 'Transferencia_bancaria',
'Aceptan_Tarjeta', 'status', 'automatic_relist', 'accepts_mercadopago',
'id', 'descr_id', 'deal_ids', 'parent_item_id', 'category_id',
'seller_id', 'official_store_id', 'video_id', 'date_created',
'start_time', 'last_updated', 'stop_time'],
dtype='object')
import re
dfs['sub_status'] = dfs['sub_status'].str.replace('nan,','')
dfs['sub_status'] = dfs['sub_status'].str.replace(',nan','')
display(len(re.findall(r'suspended',dfs['sub_status'][1])))
display(dfs['sub_status'].value_counts().value_counts())
display(dfs.shape)
# We concluded this column is useless: every row has the same count of the same value ('suspended')
dfs = dfs.drop('sub_status', axis=1)
966
100000 1
Name: sub_status, dtype: int64
(100000, 41)
# dfs['tags'] = dfs['tags'].str.replace('nan,','')
# dfs['tags'] = dfs['tags'].str.replace(',nan','')
# from ast import literal_eval
# dfs['tags'] = dfs['tags'].apply(lambda x: literal_eval(str(x)))
# def deduplicate(column):
# flag = len(column)
# i = 0
# while i <= flag:
# try:
# # 1. Convert into list of tuples
# tpls = [tuple(x) for x in column[i]]
# # 2. Create dictionary with empty values and
# # 3. convert back to a list (dups removed)
# dct = list(dict.fromkeys(tpls))
# # 4. Convert list of tuples to list of lists
# dup_free = [list(x) for x in lst]
# # Print everything
# column[i] = list(map(''.join, dup_free))
# # [[1, 1], [0, 1], [0, 1], [1, 1]]
# i = i+1
# except:
# return
# deduplicate(dfs['tags'])
# display(dfs['tags'].value_counts().value_counts())
# display(dfs.shape)
# display(dfs['tag'].value_counts().value_counts())
# Other useless columns -- all rows have the same values
dfs = dfs.drop('tags', axis=1)
dfs = dfs.drop('tag', axis=1)
display('dataframe shape', dfs.shape)
display('unique ids', dfs.id.nunique())
display('number of sellers', dfs.seller_id.nunique())
display('number of categories', dfs.category_id.nunique())
#Drop useless column
dfs = dfs.drop(['id'], axis=1)
'dataframe shape'
(100000, 38)
'unique ids'
100000
'number of sellers'
35915
'number of categories'
10907
missing_zero_values_table(dfs)
Your selected dataframe has 37 columns and 100000 Rows.
There are 6 columns that have missing values.
Column | Zero Values | Missing Values | % of Total Values | Total Zero Missing Values | % Total Zero Missing Values | Data Type
---|---|---|---|---|---|---
descr_id | 0 | 99959 | 100.0 | 99959 | 100.0 | category |
original_price | 0 | 99857 | 99.9 | 99857 | 99.9 | float64 |
deal_ids | 0 | 99760 | 99.8 | 99760 | 99.8 | object |
official_store_id | 0 | 99182 | 99.2 | 99182 | 99.2 | category |
video_id | 0 | 97015 | 97.0 | 97015 | 97.0 | category |
parent_item_id | 0 | 23011 | 23.0 | 23011 | 23.0 | category |
dfs = dfs.dropna(axis=1) # drop all columns with missing values (we checked; they are not necessary or have too many missing values to impute properly)
from matplotlib import pyplot as plt
# Deal with datetimes to create new features
dfs['year_start'] = pd.to_datetime(dfs['start_time']).dt.year.astype('category')
dfs['month_start'] = pd.to_datetime(dfs['start_time']).dt.month.astype('category')
dfs['year_stop'] = pd.to_datetime(dfs['stop_time']).dt.year.astype('category')
dfs['month_stop'] = pd.to_datetime(dfs['stop_time']).dt.month.astype('category')
dfs['week_day'] = pd.to_datetime(dfs['stop_time']).dt.weekday.astype('category')
#dfs['days_active'] = (dfs['start_time'] - dfs['stop_time']).dt.days
dfs['days_active'] = [int(i.days) for i in (dfs.stop_time - dfs.start_time)]
dfs['days_active'] = dfs['days_active'].astype('int')
dfs = dfs.reset_index(drop=True)
#dfs = dfs.drop(['date_created', 'start_time', 'last_updated', 'stop_time'], axis=1)
boxplot = dfs.boxplot(column=['days_active'], showfliers=False)
plt.savefig('days_active.png', bbox_inches='tight', dpi = 300)
# empty list to read list from a file
selected_features = []
# open file and read the content in a list
with open(r'selected_features.txt', 'r') as fp:
    for line in fp:
        # remove linebreak from the current name;
        # the linebreak is the last character of each line
        x = line[:-1]
        # add current item to the list
        selected_features.append(x)
# display list
print(selected_features)
['base_price', 'seller_id', 'available_quantity', 'seller_state', 'price', 'week_day', 'sold_quantity', 'mode', 'Transferencia_bancaria', 'category_id', 'Aceptan_Tarjeta', 'seller_city', 'initial_quantity', 'warranty', 'automatic_relist']
from sklearn import preprocessing
# Encode categorical columns to pass through model
mylist = list(dfs.select_dtypes(include=['category']).columns)
dfs[mylist] = dfs[mylist].apply(preprocessing.LabelEncoder().fit_transform)
dfs['log_price'] = np.log(dfs['price'] + 1)
dfs['log_base_price'] = np.log(dfs['base_price'] + 1)
import statsmodels.formula.api as fsm
import matplotlib.pyplot as plt
import seaborn as sns
model = fsm.logit(formula = 'condition ~ log_price' , data = dfs)
fit = model.fit()
fit.summary()
dfs['pred_baseline'] = fit.predict()
fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_baseline', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'available_quantity', y = 'pred_baseline', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[1])
plt.savefig('logistic_baseline_plot.png', bbox_inches='tight', dpi = 300)
Optimization terminated successfully.
Current function value: 0.681740
Iterations 4
import matplotlib.pyplot as plt
import statsmodels.formula.api as fsm
model = fsm.logit(formula = 'condition ~ log_price : mode * seller_state', data = dfs)
fit = model.fit()
fit.summary()
dfs['pred_m1'] = fit.predict()
fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_m1', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'available_quantity', y = 'pred_m1', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[1])
plt.savefig('logistic_tarjeta_plot.png', bbox_inches='tight', dpi = 300)
Optimization terminated successfully.
Current function value: 0.690247
Iterations 4
fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_m1', hue = 'warranty', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'sold_quantity', y = 'pred_m1', hue = 'warranty', size = 'base_price', ax = ax[1])
plt.savefig('logistic_warranty_plot.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import f1_score
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(dfs['pred_m1'] < threshold, 0, 1)
    f1 = f1_score(dfs['condition'], pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.savefig('logistic_baseline_threshold.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import cohen_kappa_score, precision_score, roc_curve
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(dfs['pred_m1'] < threshold, 0, 1)
    score = cohen_kappa_score(dfs['condition'], pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == max(df_score['score_score'])]
bt = df_score[df_score['score_score'] == max(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == max(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Kappa: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.savefig('logistic_kappa_threshold.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(dfs['condition'], dfs['pred_baseline'])
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)
sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.savefig('logistic_baseline_roc_curve.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(dfs['condition'], dfs['pred_m1'])
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)
sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.savefig('logistic_kappa_roc_curve.png', bbox_inches='tight', dpi = 300)
%%time
import os
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.metrics import cohen_kappa_score, precision_score
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
scaled_features = dfs.copy()
col_names = ['warranty', 'initial_quantity', 'available_quantity', 'sold_quantity',
'base_price', 'price', 'Giro_postal', 'free_shipping', 'local_pick_up',
'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta',
'automatic_relist', 'accepts_mercadopago', 'days_active']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
X = scaled_features.drop(columns=['condition'], axis=1)
#X = dfs.drop(columns='condition')
y = scaled_features.condition
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=7)
Y_train = Y_train
Y_test = Y_test
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), X_train.columns)], remainder='passthrough')
encoder = full_pipeline.fit(X_train)
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
# train the model
model = xgb.XGBClassifier(n_estimators= 200,
max_depth= 30, # Lower ratios avoid over-fitting. Default is 6.
objective = 'binary:logistic', # Default is reg:squarederror. 'multi:softprob' for multiclass and get proba.
#num_class = 2, # Use if softprob is set.
reg_lambda = 10, # Larger ratios avoid over-fitting. Default is 1.
gamma = 0.3, # Larger values avoid over-fitting. Default is 0. # Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
alpha = 1, # Larger ratios avoid over-fitting. Default is 0.
learning_rate= 0.10, # Lower ratios avoid over-fitting. Default is 0.3.
colsample_bytree= 0.7, # Lower ratios avoid over-fitting.
scale_pos_weight = 1, # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
subsample = 0.1, # Lower ratios avoid over-fitting. Default 1. 0.5 recommended. # 0.1 if using GPU.
min_child_weight = 3, # Larger ratios avoid over-fitting. Default is 1.
missing = np.nan, # Deal with missing values
num_parallel_tree = 2, # Parallel trees constructed during each iteration. Default is 1.
importance_type = 'weight',
eval_metric = 'auc',
#use_label_encoder = True,
#enable_categorical = True,
verbosity = 1,
nthread = -1, # Set -1 to use all threads.
#use_rmm = True, # Use GPU if available
tree_method = 'auto', # auto # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
#gradient_based = True, # If True you can set subsample as low as 0.1. Only use with gpu_hist
)
# fit model
model.fit(X_train_enc, Y_train.values.ravel(),
# early_stopping_rounds=20
)
# check best ntree limit
display(model.best_ntree_limit)
# extract the training set predictions
preds_train = model.predict(X_train_enc,
ntree_limit=model.best_ntree_limit
)
# extract the test set predictions
preds_test = model.predict(X_test_enc,
ntree_limit=model.best_ntree_limit
)
# save model
output_dir = "models"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# save in JSON format
model.save_model(f'{output_dir}/meli_xgboost.json')
# save in text format
model.save_model(f'{output_dir}/meli_xgboost.txt')
print('FINISHED!')
/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
400
/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/core.py:105: UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.
warnings.warn(
FINISHED!
CPU times: user 12min 45s, sys: 2.88 s, total: 12min 48s
Wall time: 1min 56s
# extract the test set predictions
preds_test = model.predict_proba(X_test_enc,
ntree_limit=model.best_ntree_limit
)
/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/core.py:105: UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.
warnings.warn(
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    f1 = f1_score(Y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    score = brier_score_loss(Y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(Y_test, preds_test[:,1])
roc = roc_auc_score(Y_test, preds_test[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
#marker='.',
label=f"ROC curve (area ={'%.2f' % roc})"# % roc_auc["micro"],
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("XGBoost Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('xgboost_roc_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.505019, G-Mean=0.810
CPU times: user 1.8 s, sys: 629 ms, total: 2.43 s
Wall time: 1.65 s
# best_preds_score = np.where(preds_test < bt, 0, 1) # Uncomment if you want to change threshold... Lower, because threshold calculated on Brier Loss and lower is better
print("mean_squared_error_test = {}".format(mean_squared_error(Y_test, preds_test[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(Y_test, preds_test[:,1])))
print("Brier_error = {}".format(brier_score_loss(Y_test, preds_test[:,1])))
print("Logloss_test = {}".format(log_loss(Y_test, preds_test[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.3640702047156597
Roc_auc = 0.8888785004424654
Brier_error = 0.13254711396170238
Logloss_test = 0.4085390232688165
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(Y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.490, Roc_auc=0.81209
# evaluate each threshold
scores = [brier_score_loss(Y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.505, Brier=0.19180
# %%time
# import xgboost as xgb
# from sklearn.metrics import cohen_kappa_score
# from sklearn.metrics import matthews_corrcoef
# from sklearn.metrics import f1_score
# from sklearn.model_selection import train_test_split
# import patsy
# # Selecting features I've found and using patsy to automatic interact between features.
# y, X = patsy.dmatrices('condition ~ Aceptan_Tarjeta + category_id + Efectivo + Transferencia_bancaria + automatic_relist + available_quantity + \
# base_price + warranty + sold_quantity + free_shipping + initial_quantity + local_pick_up + mode + \
# price + seller_id + seller_city + seller_state+ \
# year_start + month_start + year_stop + month_stop + week_day + days_active', data = dfs)
# # Display patsy features
# #display(X)
# X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
# D_train = xgb.DMatrix(X_train, label=Y_train)#, enable_categorical=True)
# D_test = xgb.DMatrix(X_test, label=Y_test)#, enable_categorical=True)
# param = {
# 'eta': 0.10, # Lower ratios avoid over-fitting. Default is 3.
# 'max_depth': 30, # Lower ratios avoid over-fitting. Default is 6.
# "min_child_weight": 3, # Larger ratios avoid over-fitting. Default is 1.
# "gamma": 0.3, # Larger values avoid over-fitting. Default is 0.
# "colsample_bytree" : 0.7, # Lower ratios avoid over-fitting. Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
# "scale_pos_weight": 1, # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
# "reg_lambda": 10, # Larger ratios avoid over-fitting. Default is 1.
# "alpha": 1, # Larger ratios avoid over-fitting. Default is 0.
# 'subsample':0.5, # Lower ratios avoid over-fitting. Default 1. 0.5 recommended.
# 'num_parallel_tree': 2, # Parallel trees constructed during each iteration. Default is 1.
# 'objective': 'multi:softprob', # Default is reg:squarederror. 'multi:softprob' for multiclass.
# 'num_class': 2, # Use if softprob is set.
# 'verbosity':1,
# 'eval_metric': 'auc',
# 'use_rmm':False, # Use GPU if available
# 'nthread':-1, # Set -1 to use all threads.
# 'tree_method': 'auto', # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
# 'gradient_based': False, # If True you can set subsample as low as 0.1. Only use with gpu_hist
# }
# steps = 200 # The number of training iterations
# model = xgb.train(param, D_train, steps)
# import numpy as np
# from sklearn.metrics import precision_score, recall_score, accuracy_score
# preds = model.predict(D_test)
# best_preds = np.asarray([np.argmax(line) for line in preds])
# print("Precision = {}".format(precision_score(Y_test, best_preds)))
# print("Recall = {}".format(recall_score(Y_test, best_preds)))
# print("f1 = {}".format(f1_score(Y_test, best_preds)))
# print("kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
# print("matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
# #print("mean_squared_error_train = {}".format(mean_squared_error(Y_train, best_preds)))
# # print("mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
# print("logloss_test = {}".format(log_loss(Y_test, best_preds)))
# #print("logloss_train = {}".format(log_loss(Y_train, best_preds)))
# # from xgboost import plot_importance
# # import matplotlib.pyplot as pyplot
# # plot_importance(model)
# # pyplot.show()
# from sklearn.metrics import roc_auc_score
# best_preds = np.where(preds_test < bt, 0, 1)
# print("Roc_auc = {}".format(roc_auc_score(Y_test, best_preds)))
# print("Precision = {}".format(precision_score(Y_test, best_preds)))
# print("Recall = {}".format(recall_score(Y_test, best_preds)))
# print("F1 = {}".format(f1_score(Y_test, best_preds)))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
# print("Mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
# print("Logloss_test = {}".format(log_loss(Y_test, best_preds)))
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize
from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'buying_mode', 'base_price',
'price', 'currency_id', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'category_id', 'seller_id',
'date_created', 'start_time', 'last_updated', 'stop_time', 'year_start',
'month_start', 'year_stop', 'month_stop', 'week_day', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['category']).columns
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
'seller_city', 'mode', 'status', 'category_id', 'seller_id',
'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
dtype='object')
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']
X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
%%time
categorical_high = ["seller_city", "category_id"] #"seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric), ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low+categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low), (mode, high_cardinality_encoder, categorical_high), ("scale", scaler, numeric)])
    return make_pipeline(imputer, processor, LogisticRegression(max_iter=1000)) #RandomForestRegressor() #XGBClassifier()
embeddings_pipeline = build_pipeline("embeddings")
embeddings_pipeline.fit(X_train, y_train)
CPU times: user 5min 16s, sys: 1min 16s, total: 6min 32s
Wall time: 1min 12s
Pipeline(steps=[('columntransformerwithnames',
ColumnTransformerWithNames(transformers=[('numeric',
SimpleImputer(),
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')),
('categorical',
SimpleImputer(strategy='most_frequent'),
['buying_mode',
'currency_id',
'seller_state',
'mode', 'status',
'week...
'Transferencia_bancaria',
'Aceptan_Tarjeta',
'automatic_relist',
'accepts_mercadopago']),
('embeddings',
EmbeddingEncoder(task='classification'),
['seller_city',
'category_id']),
('scale', StandardScaler(),
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object'))])),
('logisticregression', LogisticRegression(max_iter=1000))])
y_pred_proba = embeddings_pipeline.predict_proba(X_test) #.decision_function(X_test)
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(y_pred_proba[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(y_pred_proba[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:,1])
roc = roc_auc_score(y_test, y_pred_proba[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    # marker='.',
    label=f"ROC curve (area ={'%.2f' % roc})",
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Embeddings + Logistic Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_logistic_roc_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.508526, G-Mean=0.823
CPU times: user 2.02 s, sys: 1.08 s, total: 3.1 s
Wall time: 1.78 s
# best_preds_score = np.where(y_pred_proba < bt, 0, 1)  # Uncomment to binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, y_pred_proba[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, y_pred_proba[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, y_pred_proba[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, y_pred_proba[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.3568355720384346
Roc_auc = 0.9007658597305785
Brier_error = 0.12733162547199683
Logloss_test = 0.40211995678282203
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):  # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(y_pred_proba[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.509, Roc_auc=0.82327
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(y_pred_proba[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.509, Brier=0.17665
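A caveat on the two sweeps above: once the probabilities are binarized at a single cutoff, roc_auc_score no longer measures ranking quality; it collapses to (TPR + TNR) / 2, i.e. the balanced accuracy at that cutoff, which is what the 0.82 figure reflects. A minimal sketch (reusing y_test, y_pred_proba and to_labels_max from the cells above; the 0.509 cutoff is the G-mean-optimal threshold printed earlier) to read off hard-label metrics at that operating point:
from sklearn.metrics import accuracy_score, balanced_accuracy_score

cutoff = 0.509  # assumed operating point: the G-mean-optimal threshold found above
hard_labels = to_labels_max(y_pred_proba[:, 1], cutoff)
print("Accuracy @ cutoff = {}".format(accuracy_score(y_test, hard_labels)))
print("Balanced accuracy @ cutoff = {}".format(balanced_accuracy_score(y_test, hard_labels)))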
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
#dfs = pd.read_parquet('cleaned_data_haha.parquet.gzip')
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'buying_mode', 'base_price',
'price', 'currency_id', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'category_id', 'seller_id',
'date_created', 'start_time', 'last_updated', 'stop_time', 'year_start',
'month_start', 'year_stop', 'month_stop', 'week_day', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['category']).columns
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
'seller_city', 'mode', 'status', 'category_id', 'seller_id',
'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
dtype='object')
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']
X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric),
                                          ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low + categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low),
                                   (mode, high_cardinality_encoder, categorical_high),
                                   ("scale", scaler, numeric)])
    return make_pipeline(imputer, processor, xgb.XGBClassifier(
        n_estimators=200,
        max_depth=30,                 # Lower values avoid over-fitting. Default is 6.
        objective='binary:logistic',  # Default is reg:squarederror; use 'multi:softprob' for multiclass probabilities.
        # num_class=2,                # Use if softprob is set.
        reg_lambda=10,                # Larger values avoid over-fitting. Default is 1.
        gamma=0.3,                    # Larger values avoid over-fitting. Default is 0. Use 0.3-0.8 with many columns (especially after one-hot encoding), 0.8-1 with few.
        alpha=1,                      # Larger values avoid over-fitting. Default is 0.
        learning_rate=0.10,           # Lower values avoid over-fitting. Default is 0.3.
        colsample_bytree=0.7,         # Lower values avoid over-fitting.
        scale_pos_weight=1,           # Default is 1. Balances positive and negative weights for unbalanced classes.
        subsample=0.1,                # Lower values avoid over-fitting. Default is 1; 0.5 is a common choice, ~0.1 with GPU gradient-based sampling.
        min_child_weight=3,           # Larger values avoid over-fitting. Default is 1.
        missing=np.nan,               # How missing values are represented.
        num_parallel_tree=2,          # Parallel trees constructed during each iteration. Default is 1.
        importance_type='weight',
        eval_metric='auc',
        use_label_encoder=False,      # The built-in label encoder is deprecated; False silences the warning.
        # enable_categorical=True,
        verbosity=1,
        nthread=-1,                   # -1 uses all threads.
        # use_rmm=True,               # Use GPU memory pool if available.
        tree_method='auto',           # 'auto' picks the fastest method for the data; 'gpu_hist' for GPU.
        # gradient_based=True,
    )) #RandomForestClassifier() #LogisticRegression()
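The XGBoost hyperparameters above were set by hand. If tuning inside the same pipeline were desired, a randomized search over the step-prefixed parameter names is one option; a hedged sketch, not part of the original run ('xgbclassifier' is the step name make_pipeline assigns, and the candidate values are illustrative only):
from sklearn.model_selection import RandomizedSearchCV

search_space = {
    "xgbclassifier__max_depth": [6, 10, 20, 30],
    "xgbclassifier__learning_rate": [0.05, 0.1, 0.3],
    "xgbclassifier__subsample": [0.1, 0.5, 0.8],
    "xgbclassifier__reg_lambda": [1, 5, 10],
}
search = RandomizedSearchCV(build_pipeline("embeddings"), search_space, n_iter=10,
                            scoring="roc_auc", cv=3, n_jobs=-1, random_state=42)
# search.fit(X_train, y_train)  # left commented: every candidate refits the embedding encoder, so this is expensive
# print(search.best_params_, search.best_score_)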
%%time
embeddings_pipeline = build_pipeline("embeddings")
embeddings_pipeline.fit(X_train, y_train)
embedding_preds = embeddings_pipeline.predict(X_test)
CPU times: user 18min 11s, sys: 15 s, total: 18min 26s
Wall time: 3min 6s
# Check accuracy for classes
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score, matthews_corrcoef
print("Accuracy = {}".format(accuracy_score(y_test, embedding_preds)))
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, embedding_preds)))
print("Precision = {}".format(precision_score(y_test, embedding_preds)))
print("Recall = {}".format(recall_score(y_test, embedding_preds)))
print("F1 = {}".format(f1_score(y_test, embedding_preds)))
print("Kappa_score = {}".format(cohen_kappa_score(y_test, embedding_preds)))
print("Matthews_corrcoef = {}".format(matthews_corrcoef(y_test, embedding_preds)))
Accuracy = 0.85885
Balanced accuracy = 0.8587227386116755
Precision = 0.839697904478247
Recall = 0.8571118349619978
F1 = 0.8483155123314169
Kappa_score = 0.7163577766348136
Matthews_corrcoef = 0.7164897258171989
# Check target column balance
dfs.condition.value_counts()
0 53758
1 46242
Name: condition, dtype: int64
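The two classes are close to balanced (roughly 54% vs 46%), which is why scale_pos_weight was left at its default of 1 in the XGBoost parameters above. For reference, a small sketch of how that ratio is conventionally derived from the counts (assuming the same 0/1 encoding of condition used here):
# scale_pos_weight is usually set to sum(negative) / sum(positive)
counts = dfs.condition.value_counts()
print(round(counts[0] / counts[1], 2))  # ~1.16 here, so keeping the default of 1 is reasonable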
%%time
embeddings_pipeline = build_pipeline("embeddings")
embeddings_pipeline.fit(X_train, y_train)
CPU times: user 15min 37s, sys: 8.47 s, total: 15min 45s
Wall time: 2min 24s
# Check probabilities score
embedding_preds = embeddings_pipeline.predict_proba(X_test)
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(embedding_preds[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(embedding_preds[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, embedding_preds[:,1])
roc = roc_auc_score(y_test, embedding_preds[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    # marker='.',
    label=f"ROC curve (area ={'%.2f' % roc})",
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Embeddings + XGBoost Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_xgboost_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.405652, G-Mean=0.865
CPU times: user 2.06 s, sys: 613 ms, total: 2.67 s
Wall time: 1.83 s
# best_preds_score = np.where(embedding_preds < bt, 0, 1)  # Uncomment to binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, embedding_preds[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, embedding_preds[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, embedding_preds[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, embedding_preds[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.31567197986751955
Roc_auc = 0.9358383924037762
Brier_error = 0.09964879887347967
Logloss_test = 0.3227270290231638
best_preds_score = np.where(embedding_preds < bt, 0, 1)  # Binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, best_preds_score[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, best_preds_score[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, best_preds_score[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, best_preds_score[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.36939139134527754
Roc_auc = 0.8636920061833471
Brier_error = 0.13645
Logloss_test = 4.712875409194756
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):  # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(embedding_preds[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.406, Roc_auc=0.86562
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(embedding_preds[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.528, Brier=0.13625
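Because the model comparison leans on probability metrics (Brier score, log loss), it can also help to inspect calibration directly. A minimal sketch with sklearn's calibration_curve, reusing y_test and embedding_preds from the cells above (this plot is an addition, not part of the original report):
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, embedding_preds[:, 1], n_bins=10)
plt.plot(prob_pred, prob_true, marker='o', label='Embeddings + XGBoost')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives (used)')
plt.legend(loc='upper left')
plt.show()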
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['category']).columns
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
'seller_city', 'mode', 'status', 'category_id', 'seller_id',
'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
dtype='object')
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']
X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
import keras
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import Dropout
categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric),
                                          ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low + categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low),
                                   (mode, high_cardinality_encoder, categorical_high),
                                   ("scale", scaler, numeric)])

    def twoLayerFeedForward():
        model = Sequential()
        model.add(keras.layers.Dense(300, activation=tf.nn.relu)) #input_dim=300
        model.add(keras.layers.Dense(128, activation=tf.nn.relu))
        model.add(keras.layers.Dense(64, activation=tf.nn.relu))
        model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # clf = KerasClassifier(TwoLayerFeedForward(), epochs=100, batch_size=500, verbose=0)
    model = KerasClassifier(twoLayerFeedForward, verbose=1, validation_split=0.15, shuffle=True, epochs=100, batch_size=512) #batch_size=32
    return make_pipeline(imputer, processor, model) #RandomForestClassifier() #LogisticRegression()
2022-08-03 18:14:41.595116: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-08-03 18:14:41.595137: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
%%time
embeddings_pipeline = build_pipeline("embeddings")
history = embeddings_pipeline.fit(X_train, y_train)
/tmp/ipykernel_700621/402245863.py:35: DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead. See https://www.adriangb.com/scikeras/stable/migration.html for help migrating.
model = KerasClassifier(twoLayerFeedForward, verbose=1, validation_split=0.15, shuffle=True, epochs=100, batch_size=512) #batch_size=32
2022-08-03 18:14:43.959308: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-08-03 18:14:43.959359: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-03 18:14:43.959388: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (brspobitanl1727): /proc/driver/nvidia/version does not exist
2022-08-03 18:14:43.959671: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/100
150/150 [==============================] - 1s 4ms/step - loss: 0.3302 - accuracy: 0.8555 - val_loss: 0.4014 - val_accuracy: 0.8274
Epoch 2/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2948 - accuracy: 0.8732 - val_loss: 0.3881 - val_accuracy: 0.8350
Epoch 3/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2824 - accuracy: 0.8800 - val_loss: 0.3828 - val_accuracy: 0.8364
Epoch 4/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2750 - accuracy: 0.8824 - val_loss: 0.3810 - val_accuracy: 0.8387
Epoch 5/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2694 - accuracy: 0.8867 - val_loss: 0.3842 - val_accuracy: 0.8399
Epoch 6/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2638 - accuracy: 0.8890 - val_loss: 0.3749 - val_accuracy: 0.8391
Epoch 7/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2601 - accuracy: 0.8903 - val_loss: 0.3817 - val_accuracy: 0.8413
Epoch 8/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2559 - accuracy: 0.8921 - val_loss: 0.3821 - val_accuracy: 0.8379
Epoch 9/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2502 - accuracy: 0.8957 - val_loss: 0.3781 - val_accuracy: 0.8457
Epoch 10/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2470 - accuracy: 0.8964 - val_loss: 0.3804 - val_accuracy: 0.8427
Epoch 11/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2415 - accuracy: 0.8997 - val_loss: 0.3774 - val_accuracy: 0.8443
Epoch 12/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2406 - accuracy: 0.9004 - val_loss: 0.3862 - val_accuracy: 0.8428
Epoch 13/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2356 - accuracy: 0.9015 - val_loss: 0.3810 - val_accuracy: 0.8388
Epoch 14/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2286 - accuracy: 0.9060 - val_loss: 0.3822 - val_accuracy: 0.8407
Epoch 15/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2257 - accuracy: 0.9068 - val_loss: 0.3956 - val_accuracy: 0.8415
Epoch 16/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2221 - accuracy: 0.9087 - val_loss: 0.4000 - val_accuracy: 0.8402
Epoch 17/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2181 - accuracy: 0.9107 - val_loss: 0.3942 - val_accuracy: 0.8445
Epoch 18/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2096 - accuracy: 0.9142 - val_loss: 0.4040 - val_accuracy: 0.8404
Epoch 19/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2074 - accuracy: 0.9143 - val_loss: 0.4052 - val_accuracy: 0.8436
Epoch 20/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2030 - accuracy: 0.9177 - val_loss: 0.4251 - val_accuracy: 0.8426
Epoch 21/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1960 - accuracy: 0.9199 - val_loss: 0.4199 - val_accuracy: 0.8422
Epoch 22/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1944 - accuracy: 0.9209 - val_loss: 0.4563 - val_accuracy: 0.8390
Epoch 23/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1916 - accuracy: 0.9237 - val_loss: 0.4386 - val_accuracy: 0.8420
Epoch 24/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1847 - accuracy: 0.9251 - val_loss: 0.4574 - val_accuracy: 0.8372
Epoch 25/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1788 - accuracy: 0.9272 - val_loss: 0.4759 - val_accuracy: 0.8353
Epoch 26/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1738 - accuracy: 0.9300 - val_loss: 0.4750 - val_accuracy: 0.8450
Epoch 27/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1708 - accuracy: 0.9308 - val_loss: 0.4869 - val_accuracy: 0.8407
Epoch 28/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1645 - accuracy: 0.9334 - val_loss: 0.4733 - val_accuracy: 0.8411
Epoch 29/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1609 - accuracy: 0.9350 - val_loss: 0.4808 - val_accuracy: 0.8321
Epoch 30/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1557 - accuracy: 0.9373 - val_loss: 0.5059 - val_accuracy: 0.8384
Epoch 31/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1507 - accuracy: 0.9401 - val_loss: 0.4927 - val_accuracy: 0.8382
Epoch 32/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1430 - accuracy: 0.9423 - val_loss: 0.5239 - val_accuracy: 0.8348
Epoch 33/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1410 - accuracy: 0.9434 - val_loss: 0.5344 - val_accuracy: 0.8355
Epoch 34/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1390 - accuracy: 0.9442 - val_loss: 0.5711 - val_accuracy: 0.8362
Epoch 35/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1318 - accuracy: 0.9470 - val_loss: 0.5636 - val_accuracy: 0.8361
Epoch 36/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1295 - accuracy: 0.9484 - val_loss: 0.5880 - val_accuracy: 0.8398
Epoch 37/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1216 - accuracy: 0.9520 - val_loss: 0.6103 - val_accuracy: 0.8346
Epoch 38/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1163 - accuracy: 0.9540 - val_loss: 0.6112 - val_accuracy: 0.8316
Epoch 39/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1173 - accuracy: 0.9536 - val_loss: 0.6456 - val_accuracy: 0.8292
Epoch 40/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1127 - accuracy: 0.9547 - val_loss: 0.6430 - val_accuracy: 0.8373
Epoch 41/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1061 - accuracy: 0.9580 - val_loss: 0.6648 - val_accuracy: 0.8347
Epoch 42/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1020 - accuracy: 0.9607 - val_loss: 0.7315 - val_accuracy: 0.8348
Epoch 43/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1013 - accuracy: 0.9598 - val_loss: 0.6618 - val_accuracy: 0.8333
Epoch 44/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0938 - accuracy: 0.9637 - val_loss: 0.7261 - val_accuracy: 0.8273
Epoch 45/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0941 - accuracy: 0.9627 - val_loss: 0.7338 - val_accuracy: 0.8279
Epoch 46/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0913 - accuracy: 0.9640 - val_loss: 0.8022 - val_accuracy: 0.8339
Epoch 47/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0849 - accuracy: 0.9672 - val_loss: 0.7733 - val_accuracy: 0.8305
Epoch 48/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0839 - accuracy: 0.9679 - val_loss: 0.8097 - val_accuracy: 0.8351
Epoch 49/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0822 - accuracy: 0.9686 - val_loss: 0.8593 - val_accuracy: 0.8363
Epoch 50/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0792 - accuracy: 0.9694 - val_loss: 0.8464 - val_accuracy: 0.8343
Epoch 51/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0766 - accuracy: 0.9709 - val_loss: 0.8365 - val_accuracy: 0.8360
Epoch 52/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0683 - accuracy: 0.9743 - val_loss: 0.9086 - val_accuracy: 0.8327
Epoch 53/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0726 - accuracy: 0.9721 - val_loss: 0.9122 - val_accuracy: 0.8352
Epoch 54/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0626 - accuracy: 0.9765 - val_loss: 0.9309 - val_accuracy: 0.8290
Epoch 55/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0742 - accuracy: 0.9711 - val_loss: 0.9134 - val_accuracy: 0.8314
Epoch 56/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0594 - accuracy: 0.9771 - val_loss: 0.9703 - val_accuracy: 0.8296
Epoch 57/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0592 - accuracy: 0.9777 - val_loss: 0.9761 - val_accuracy: 0.8267
Epoch 58/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0619 - accuracy: 0.9764 - val_loss: 0.9635 - val_accuracy: 0.8291
Epoch 59/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0585 - accuracy: 0.9777 - val_loss: 0.9953 - val_accuracy: 0.8311
Epoch 60/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0535 - accuracy: 0.9805 - val_loss: 1.0472 - val_accuracy: 0.8281
Epoch 61/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0503 - accuracy: 0.9809 - val_loss: 1.0811 - val_accuracy: 0.8307
Epoch 62/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0492 - accuracy: 0.9814 - val_loss: 1.1155 - val_accuracy: 0.8359
Epoch 63/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0521 - accuracy: 0.9813 - val_loss: 1.1467 - val_accuracy: 0.8324
Epoch 64/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0465 - accuracy: 0.9823 - val_loss: 1.1086 - val_accuracy: 0.8286
Epoch 65/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0453 - accuracy: 0.9834 - val_loss: 1.1806 - val_accuracy: 0.8213
Epoch 66/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0457 - accuracy: 0.9829 - val_loss: 1.1553 - val_accuracy: 0.8266
Epoch 67/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0521 - accuracy: 0.9806 - val_loss: 1.1109 - val_accuracy: 0.8237
Epoch 68/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0488 - accuracy: 0.9822 - val_loss: 1.1458 - val_accuracy: 0.8236
Epoch 69/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0400 - accuracy: 0.9856 - val_loss: 1.2181 - val_accuracy: 0.8319
Epoch 70/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0411 - accuracy: 0.9846 - val_loss: 1.2346 - val_accuracy: 0.8304
Epoch 71/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0433 - accuracy: 0.9839 - val_loss: 1.1918 - val_accuracy: 0.8281
Epoch 72/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0376 - accuracy: 0.9864 - val_loss: 1.3038 - val_accuracy: 0.8265
Epoch 73/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0361 - accuracy: 0.9870 - val_loss: 1.3390 - val_accuracy: 0.8274
Epoch 74/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0379 - accuracy: 0.9862 - val_loss: 1.2512 - val_accuracy: 0.8244
Epoch 75/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0419 - accuracy: 0.9852 - val_loss: 1.3643 - val_accuracy: 0.8255
Epoch 76/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0421 - accuracy: 0.9846 - val_loss: 1.2699 - val_accuracy: 0.8267
Epoch 77/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0398 - accuracy: 0.9858 - val_loss: 1.3021 - val_accuracy: 0.8292
Epoch 78/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0304 - accuracy: 0.9898 - val_loss: 1.3497 - val_accuracy: 0.8275
Epoch 79/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0407 - accuracy: 0.9856 - val_loss: 1.3319 - val_accuracy: 0.8291
Epoch 80/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0430 - accuracy: 0.9869 - val_loss: 1.3290 - val_accuracy: 0.8302
Epoch 81/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0337 - accuracy: 0.9891 - val_loss: 1.3899 - val_accuracy: 0.8301
Epoch 82/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0315 - accuracy: 0.9892 - val_loss: 1.3707 - val_accuracy: 0.8276
Epoch 83/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0336 - accuracy: 0.9875 - val_loss: 1.3784 - val_accuracy: 0.8274
Epoch 84/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0345 - accuracy: 0.9875 - val_loss: 1.4005 - val_accuracy: 0.8295
Epoch 85/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0307 - accuracy: 0.9893 - val_loss: 1.3823 - val_accuracy: 0.8269
Epoch 86/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0401 - accuracy: 0.9862 - val_loss: 1.4838 - val_accuracy: 0.8297
Epoch 87/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0352 - accuracy: 0.9872 - val_loss: 1.4347 - val_accuracy: 0.8322
Epoch 88/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0284 - accuracy: 0.9894 - val_loss: 1.4827 - val_accuracy: 0.8289
Epoch 89/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0369 - accuracy: 0.9873 - val_loss: 1.4705 - val_accuracy: 0.8270
Epoch 90/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0313 - accuracy: 0.9889 - val_loss: 1.5390 - val_accuracy: 0.8243
Epoch 91/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0290 - accuracy: 0.9895 - val_loss: 1.4780 - val_accuracy: 0.8302
Epoch 92/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0281 - accuracy: 0.9900 - val_loss: 1.5518 - val_accuracy: 0.8297
Epoch 93/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0284 - accuracy: 0.9901 - val_loss: 1.5659 - val_accuracy: 0.8321
Epoch 94/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0306 - accuracy: 0.9893 - val_loss: 1.4831 - val_accuracy: 0.8287
Epoch 95/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0359 - accuracy: 0.9867 - val_loss: 1.5319 - val_accuracy: 0.8230
Epoch 96/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0336 - accuracy: 0.9881 - val_loss: 1.5192 - val_accuracy: 0.8311
Epoch 97/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0279 - accuracy: 0.9902 - val_loss: 1.4872 - val_accuracy: 0.8316
Epoch 98/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0269 - accuracy: 0.9910 - val_loss: 1.5875 - val_accuracy: 0.8327
Epoch 99/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0270 - accuracy: 0.9902 - val_loss: 1.4886 - val_accuracy: 0.8284
Epoch 100/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0332 - accuracy: 0.9882 - val_loss: 1.5127 - val_accuracy: 0.8242
CPU times: user 7min 20s, sys: 19.6 s, total: 7min 39s
Wall time: 1min 51s
# from keras.utils.vis_utils import plot_model
# plot_model(model, to_file='model.png')
# import matplotlib.pyplot as plt
# plt.plot(history[0]['accuracy'])
# plt.plot(history[0]['val_accuracy'])
# plt.title('model accuracy')
# plt.ylabel('accuracy')
# plt.xlabel('epoch')
# plt.legend(['train', 'val'], loc='upper left')
# plt.show()
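The commented block above assumes history can be indexed like a list of dicts; here, fit returns the pipeline itself, and the Keras History object lives on the underlying model of the fitted wrapper. A hedged sketch of how the curves could be plotted ('kerasclassifier' is the step name make_pipeline assigns; the attribute layout may vary across Keras versions):
# tf.keras stores the last fit()'s History on model.history inside the wrapped classifier step
keras_step = embeddings_pipeline.named_steps['kerasclassifier']
hist = keras_step.model.history.history  # dict of per-epoch lists: 'loss', 'accuracy', 'val_loss', 'val_accuracy'
plt.plot(hist['accuracy'], label='train')
plt.plot(hist['val_accuracy'], label='val')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()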
import keras
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import Dropout
categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric),
                                          ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low + categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low),
                                   (mode, high_cardinality_encoder, categorical_high),
                                   ("scale", scaler, numeric)])

    def threeLayerFeedForward():
        model = Sequential()
        model.add(keras.layers.Dense(300, activation=tf.nn.relu, kernel_initializer='glorot_uniform'))
        model.add(keras.layers.Dropout(0.4))
        model.add(keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer='glorot_uniform'))
        model.add(keras.layers.Dropout(0.4))
        model.add(keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer='glorot_uniform'))
        model.add(keras.layers.Dropout(0.4))
        model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid, kernel_initializer='glorot_uniform')) # tf.nn.softmax if multiclass
        optimizer = tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Adamax')
        # optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam')
        model.compile(optimizer=optimizer, # 'adam' or SGD() also work
                      loss='binary_crossentropy', # categorical_crossentropy for one-hot multiclass targets
                      metrics=['accuracy'])
        return model

    es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    # clf = KerasClassifier(TwoLayerFeedForward(), epochs=100, batch_size=500, verbose=0)
    model = KerasClassifier(threeLayerFeedForward, verbose=1, validation_split=0.05, shuffle=True, epochs=100, batch_size=512, callbacks=[es_callback]) #batch_size=32
    return make_pipeline(imputer, processor, model) #RandomForestClassifier() #LogisticRegression()
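Note that EarlyStopping as configured keeps the weights from the last epoch that ran, not from the best validation epoch. If the latter is wanted, the callback inside build_pipeline could instead be created with restore_best_weights (a small tweak, not used in the run below):
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)  # roll back to the lowest-val_loss epoch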
%%time
embeddings_pipeline = build_pipeline("embeddings")
history = embeddings_pipeline.fit(X_train, y_train)
/tmp/ipykernel_700621/3919486494.py:57: DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead. See https://www.adriangb.com/scikeras/stable/migration.html for help migrating.
model = KerasClassifier(threeLayerFeedForward, verbose=1, validation_split=0.05, shuffle=True, epochs=100, batch_size=512, callbacks=[es_callback]) #batch_size=32
Epoch 1/100
167/167 [==============================] - 1s 5ms/step - loss: 0.4341 - accuracy: 0.7957 - val_loss: 0.4002 - val_accuracy: 0.8182
Epoch 2/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3534 - accuracy: 0.8501 - val_loss: 0.3905 - val_accuracy: 0.8218
Epoch 3/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3447 - accuracy: 0.8528 - val_loss: 0.3935 - val_accuracy: 0.8253
Epoch 4/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3396 - accuracy: 0.8553 - val_loss: 0.3886 - val_accuracy: 0.8260
Epoch 5/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3367 - accuracy: 0.8573 - val_loss: 0.3838 - val_accuracy: 0.8269
Epoch 6/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3344 - accuracy: 0.8592 - val_loss: 0.3834 - val_accuracy: 0.8278
Epoch 7/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3307 - accuracy: 0.8597 - val_loss: 0.3816 - val_accuracy: 0.8267
Epoch 8/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3289 - accuracy: 0.8610 - val_loss: 0.3795 - val_accuracy: 0.8269
Epoch 9/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3264 - accuracy: 0.8620 - val_loss: 0.3776 - val_accuracy: 0.8307
Epoch 10/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3238 - accuracy: 0.8642 - val_loss: 0.3795 - val_accuracy: 0.8298
Epoch 11/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3241 - accuracy: 0.8641 - val_loss: 0.3786 - val_accuracy: 0.8322
Epoch 12/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3193 - accuracy: 0.8656 - val_loss: 0.3772 - val_accuracy: 0.8327
Epoch 13/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3202 - accuracy: 0.8663 - val_loss: 0.3740 - val_accuracy: 0.8342
Epoch 14/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3167 - accuracy: 0.8672 - val_loss: 0.3747 - val_accuracy: 0.8331
Epoch 15/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3158 - accuracy: 0.8680 - val_loss: 0.3710 - val_accuracy: 0.8358
Epoch 16/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3159 - accuracy: 0.8697 - val_loss: 0.3698 - val_accuracy: 0.8360
Epoch 17/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3130 - accuracy: 0.8697 - val_loss: 0.3688 - val_accuracy: 0.8369
Epoch 18/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3109 - accuracy: 0.8706 - val_loss: 0.3679 - val_accuracy: 0.8384
Epoch 19/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3099 - accuracy: 0.8713 - val_loss: 0.3657 - val_accuracy: 0.8378
Epoch 20/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3077 - accuracy: 0.8724 - val_loss: 0.3652 - val_accuracy: 0.8380
Epoch 21/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3076 - accuracy: 0.8719 - val_loss: 0.3635 - val_accuracy: 0.8387
Epoch 22/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3064 - accuracy: 0.8734 - val_loss: 0.3649 - val_accuracy: 0.8364
Epoch 23/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3036 - accuracy: 0.8738 - val_loss: 0.3636 - val_accuracy: 0.8407
Epoch 24/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3024 - accuracy: 0.8753 - val_loss: 0.3615 - val_accuracy: 0.8422
Epoch 25/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3012 - accuracy: 0.8756 - val_loss: 0.3643 - val_accuracy: 0.8382
Epoch 26/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3007 - accuracy: 0.8766 - val_loss: 0.3591 - val_accuracy: 0.8456
Epoch 27/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2979 - accuracy: 0.8764 - val_loss: 0.3595 - val_accuracy: 0.8420
Epoch 28/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2971 - accuracy: 0.8765 - val_loss: 0.3581 - val_accuracy: 0.8442
Epoch 29/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2968 - accuracy: 0.8770 - val_loss: 0.3579 - val_accuracy: 0.8398
Epoch 30/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2935 - accuracy: 0.8793 - val_loss: 0.3575 - val_accuracy: 0.8436
Epoch 31/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2926 - accuracy: 0.8785 - val_loss: 0.3597 - val_accuracy: 0.8416
Epoch 32/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2921 - accuracy: 0.8803 - val_loss: 0.3559 - val_accuracy: 0.8458
Epoch 33/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2904 - accuracy: 0.8804 - val_loss: 0.3551 - val_accuracy: 0.8444
Epoch 34/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2901 - accuracy: 0.8802 - val_loss: 0.3555 - val_accuracy: 0.8418
Epoch 35/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2880 - accuracy: 0.8816 - val_loss: 0.3516 - val_accuracy: 0.8458
Epoch 36/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2873 - accuracy: 0.8814 - val_loss: 0.3551 - val_accuracy: 0.8451
Epoch 37/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2876 - accuracy: 0.8815 - val_loss: 0.3571 - val_accuracy: 0.8458
Epoch 38/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2853 - accuracy: 0.8826 - val_loss: 0.3512 - val_accuracy: 0.8473
Epoch 39/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2844 - accuracy: 0.8829 - val_loss: 0.3523 - val_accuracy: 0.8462
Epoch 40/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2830 - accuracy: 0.8829 - val_loss: 0.3554 - val_accuracy: 0.8520
Epoch 41/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2828 - accuracy: 0.8842 - val_loss: 0.3530 - val_accuracy: 0.8511
Epoch 42/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2798 - accuracy: 0.8853 - val_loss: 0.3543 - val_accuracy: 0.8473
Epoch 43/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2806 - accuracy: 0.8859 - val_loss: 0.3523 - val_accuracy: 0.8478
Epoch 44/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2795 - accuracy: 0.8861 - val_loss: 0.3570 - val_accuracy: 0.8473
Epoch 45/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2773 - accuracy: 0.8858 - val_loss: 0.3496 - val_accuracy: 0.8476
Epoch 46/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2770 - accuracy: 0.8867 - val_loss: 0.3506 - val_accuracy: 0.8551
Epoch 47/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2772 - accuracy: 0.8869 - val_loss: 0.3527 - val_accuracy: 0.8484
Epoch 48/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2747 - accuracy: 0.8870 - val_loss: 0.3520 - val_accuracy: 0.8518
Epoch 49/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2734 - accuracy: 0.8881 - val_loss: 0.3575 - val_accuracy: 0.8500
Epoch 50/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2733 - accuracy: 0.8882 - val_loss: 0.3517 - val_accuracy: 0.8544
Epoch 51/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2728 - accuracy: 0.8884 - val_loss: 0.3537 - val_accuracy: 0.8542
Epoch 52/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2730 - accuracy: 0.8887 - val_loss: 0.3493 - val_accuracy: 0.8507
Epoch 53/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2696 - accuracy: 0.8904 - val_loss: 0.3528 - val_accuracy: 0.8511
Epoch 54/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2701 - accuracy: 0.8887 - val_loss: 0.3534 - val_accuracy: 0.8478
Epoch 55/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2705 - accuracy: 0.8895 - val_loss: 0.3549 - val_accuracy: 0.8531
Epoch 56/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2691 - accuracy: 0.8902 - val_loss: 0.3511 - val_accuracy: 0.8529
Epoch 57/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2680 - accuracy: 0.8900 - val_loss: 0.3499 - val_accuracy: 0.8536
Epoch 58/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2666 - accuracy: 0.8908 - val_loss: 0.3526 - val_accuracy: 0.8531
Epoch 59/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2661 - accuracy: 0.8922 - val_loss: 0.3504 - val_accuracy: 0.8520
Epoch 60/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2647 - accuracy: 0.8920 - val_loss: 0.3479 - val_accuracy: 0.8538
Epoch 61/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2633 - accuracy: 0.8918 - val_loss: 0.3533 - val_accuracy: 0.8536
Epoch 62/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2654 - accuracy: 0.8919 - val_loss: 0.3530 - val_accuracy: 0.8524
Epoch 63/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2640 - accuracy: 0.8922 - val_loss: 0.3489 - val_accuracy: 0.8533
Epoch 64/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2622 - accuracy: 0.8923 - val_loss: 0.3552 - val_accuracy: 0.8502
Epoch 65/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2603 - accuracy: 0.8938 - val_loss: 0.3524 - val_accuracy: 0.8547
Epoch 66/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2614 - accuracy: 0.8938 - val_loss: 0.3515 - val_accuracy: 0.8576
Epoch 67/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8942 - val_loss: 0.3507 - val_accuracy: 0.8544
Epoch 68/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8946 - val_loss: 0.3518 - val_accuracy: 0.8542
Epoch 69/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8949 - val_loss: 0.3519 - val_accuracy: 0.8553
Epoch 70/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2569 - accuracy: 0.8956 - val_loss: 0.3536 - val_accuracy: 0.8569
CPU times: user 6min 8s, sys: 16.6 s, total: 6min 25s
Wall time: 1min 37s
# Pipeline.fit returns the fitted pipeline itself, so "history" here is the pipeline; extract the test set probabilities
preds_test = history.predict_proba(X_test)
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, preds_test[:,1])
roc = roc_auc_score(y_test, preds_test[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    # marker='.',
    label=f"ROC curve (area ={'%.2f' % roc})",
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("NNet Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_nnet_roc_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.348351, G-Mean=0.858
CPU times: user 1.36 s, sys: 657 ms, total: 2.02 s
Wall time: 1.19 s
# best_preds_score = np.where(preds_test < bt, 0, 1)  # Uncomment to binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, preds_test[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, preds_test[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, preds_test[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, preds_test[:,1])))
# print("Precision = {}".format(precision_score(y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(y_test, preds_test[:,1])))
mean_squared_error_test = 0.32670411986655384
Roc_auc = 0.9300543679940618
Brier_error = 0.10673558193777959
Logloss_test = nan
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):  # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.348, Roc_auc=0.85848
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.435, Brier=0.14370
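Since each model's probability metrics were printed in separate cells (and the train/test split was re-drawn between sections), a small helper makes the comparison easier to reproduce per split. A hedged sketch, applied here to the neural-network predictions that are still in memory:
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss, mean_squared_error

def probability_report(y_true, proba):
    """Collect the probability metrics used throughout this comparison into one dict."""
    proba = np.clip(proba, 1e-7, 1 - 1e-7)  # keep probabilities strictly inside (0, 1) before taking logs
    return {
        "rmse": mean_squared_error(y_true, proba, squared=False),
        "roc_auc": roc_auc_score(y_true, proba),
        "brier": brier_score_loss(y_true, proba),
        "log_loss": log_loss(y_true, proba),
    }

print(pd.Series(probability_report(y_test, preds_test[:, 1])))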
https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classifica
https://machinelearningmastery.com/how-to-score-probability-predictions-in-python/
https://www.machinelearningplus.com/statistics/brier-score/
0.0.5.0
- Guilherme Giuliano Nicolau: @ggnicolau (https://github.com/ggnicolau)