├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── data
│  ├── interim <- Intermediate data that has been transformed.
│  ├── processed <- The final, canonical data sets.
│  └── raw <- The original, immutable data dump.
│
├── models <- Trained and serialized models, model predictions, or model summaries.
│
├── notebooks <- Jupyter notebooks with steps for training and evaluating models.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│  └── figures <- Generated graphics and figures to be used in reporting
│
└── requirements.txt <- The requirements file for reproducing the analysis environment, e.g. generated with `pip freeze > requirements.txt`.
Project based on the cookiecutter data science project template. #cookiecutterdatascience
In the context of marketplaces, an algorithm is needed to predict whether a listed item is new or used.
Your tasks involve the data analysis, design, processing and modeling of a machine learning solution to predict whether an item is new or used, and then to evaluate the model over held-out test data.
To assist in that task, a dataset is provided in MLA_100k_checked_v3.jsonlines.
For the evaluation, you will use the accuracy metric and must achieve a result of at least 0.86. Additionally, you will have to choose an appropriate secondary metric and elaborate an argument on why that metric was chosen.
The deliverables are:
- The file, including all the code needed to define and evaluate a model.
- A document explaining the criteria applied to choose the features, the proposed secondary metric and the performance achieved on that metric.
- Optionally, you can deliver an EDA analysis in another format, such as .ipynb.
You will find our first selected columns in section 2.1; then you can check our definitive columns after treatment and feature engineering.
We didn't predict the classes (0 and 1) directly; instead, we predicted the probability for our binary classification problem, since it is more meaningful (literally, we compute the probability of belonging to class 0 or 1). Thus, we didn't calculate accuracy, precision, recall, F1-score, Kappa or other label-based metrics. For probability evaluation, we opted for mean squared error, log loss and Brier score (lower is better). We also used the ROC curve to evaluate the model and calculated the ROC AUC score (higher is better).
Our metrics only make sense when comparing models. We compared five models.
(a) Our first model is our baseline: a logistic regression with default parameters, which gave a poor result with a score of 0.69. We used it mainly because a linear model helps us get insights from the data;
(b) Our second model is more complex and less interpretable: an ensemble of non-linear hierarchical tree models, XGBoost. We got a ROC AUC of 0.89, which is better than our baseline and a good result, although with a high computational cost;
(c) For our third model, we first used embeddings (neural networks) to encode the high-cardinality categorical features (category and seller city), for which we couldn't use one-hot encoding (due to computational cost) or label encoding (since the unique values are independent of each other). After that, we simply applied a logistic regression. Impressively, we got a ROC AUC of 0.90 with a simple linear model for binary classification;
(d) For our fourth model we also used embeddings for encoding, but then trained an XGBoost. We got a ROC AUC of 0.93; as expected, a better result than the previous model;
(e) Finally, for our fifth model we again used embeddings for encoding, but then trained a four-layer neural network. We got the same results as the previous model.
- REMARKS: Personally, I think the embeddings encoding with logistic regression is the best model, because it is simpler and more interpretable (see the coefficient sketch after the table below). Occam's Razor states that, other things being equal, explanations that posit fewer entities, or fewer kinds of entities, are to be preferred to explanations that posit more.
Metric | XGBoost | Emb_Logistic | Emb_XGBoost | Emb_NNet
---|---|---|---|---
mean_squared_error_test | 0.36 | 0.36 | 0.32 | 0.32
Roc_auc | 0.89 | 0.90 | 0.93 | 0.93
Brier_error | 0.13 | 0.12 | 0.10 | 0.10
Logloss | 0.40 | 0.39 | 0.34 | 0.35
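To illustrate the interpretability point, the following is a minimal sketch on synthetic data (not our actual pipeline): a logistic regression trained on standardized features exposes one coefficient per feature, whose sign and relative magnitude can be read directly. The feature names and the simulated relationship are hypothetical, chosen only to mimic this problem.

```python
# Hedged sketch: why a linear model is easy to interpret (synthetic data, hypothetical effects).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "initial_quantity": rng.poisson(3, 1000),
    "log_price": rng.normal(5, 1, 1000),
    "warranty": rng.integers(0, 2, 1000),
})
# Hypothetical relationship: more stock and a warranty push towards "new" (class 0).
logit = -0.8 * X["initial_quantity"] - 1.2 * X["warranty"] + 0.3 * X["log_price"]
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_std, y)
# On standardized inputs the coefficient magnitudes are directly comparable.
print(pd.Series(clf.coef_[0], index=X.columns).sort_values())
```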
- The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of error.
Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation.
This is called the Root Mean Squared Error (or RMSE).
- Logistic loss (or log loss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.
The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
It heavily penalizes predicted probabilities far away from their expected value.
- The Brier score calculates the mean squared error between predicted probabilities and the expected values.
It's gentler than log loss but still penalizes proportionally to the distance from the expected value.
- Area Under ROC Curve (or ROC AUC for short) is a performance metric for binary classification problems.
The AUC represents a model’s ability to discriminate between positive and negative classes.
An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.
A ROC curve is a plot of the true positive rate versus the false positive rate for a given set of probability predictions, at different thresholds used to map the probabilities to class labels.
The AUC is then the approximate integral under the ROC curve.
It summarizes the likelihood of the model assigning a higher probability to a randomly chosen positive case than to a randomly chosen negative case.
*(machinelearningmastery.com)*
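As a minimal sketch of how these probability metrics can be computed with scikit-learn (the labels and predicted probabilities below are made up purely for illustration):

```python
# Hedged sketch: computing the probability metrics discussed above on toy predictions.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])                # true labels (0 = new, 1 = used)
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3])   # predicted P(class = 1), made up

rmse = np.sqrt(mean_squared_error(y_true, y_prob))   # RMSE: error back in the units of the target
print("RMSE     =", round(rmse, 3))
print("Log loss =", round(log_loss(y_true, y_prob), 3))          # heavily penalizes confident mistakes
print("Brier    =", round(brier_score_loss(y_true, y_prob), 3))  # mean squared error of the probabilities
print("ROC AUC  =", round(roc_auc_score(y_true, y_prob), 3))     # ranking quality, 0.5 = random
```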
- numpy
- pandas
- re
- matplotlib
- seaborn
- embedding_encoder
- sklearn
- xgboost
- keras
- Python version 3.9
- Git
- VS Studio
- Jupyter IPython
- Github
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
dfs = pd.read_json('MLA_100k_checked_v3.jsonlines', lines=True)
dfs = dfs.rename(columns = {'tags':'tag'})
dfs = dfs.rename(columns = {'id':'Id'})
# Get region
dfs['seller_country'] = dfs.apply(lambda x : x['seller_address']['country']['name'], axis = 1)
dfs['seller_state'] = dfs.apply(lambda x : x['seller_address']['state']['name'], axis = 1)
dfs['seller_city'] = dfs.apply(lambda x : x['seller_address']['city']['name'], axis = 1)
# Transform id (named as descriptions) column to get data
import ast
def str_to_dict(column):
    for i in range(len(column)):
        try:
            column[i] = ast.literal_eval(column[i][0])
        except:
            return
str_to_dict(dfs['descriptions'])
# get data from descriptions and shipping
dfs = pd.concat([dfs, dfs["descriptions"].apply(pd.Series)], axis=1)
dfs = pd.concat([dfs, dfs["shipping"].apply(pd.Series)], axis=1)
pd.set_option('display.max_columns', None)
dfs.head(5)
seller_address | warranty | sub_status | condition | deal_ids | base_price | shipping | non_mercado_pago_payment_methods | seller_id | variations | site_id | listing_type_id | price | attributes | buying_mode | tag | listing_source | parent_item_id | coverage_areas | category_id | descriptions | last_updated | international_delivery_mode | pictures | Id | official_store_id | differential_pricing | accepts_mercadopago | original_price | currency_id | thumbnail | title | automatic_relist | date_created | secure_thumbnail | stop_time | status | video_id | catalog_product_id | subtitle | initial_quantity | start_time | permalink | sold_quantity | available_quantity | seller_country | seller_state | seller_city | 0 | id | local_pick_up | methods | tags | free_shipping | mode | dimensions | free_methods | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'country': {'name': 'Argentina', 'id': 'AR'},... | None | [] | new | [] | 80.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 8208882349 | [] | MLA | bronze | 80.0 | [] | buy_it_now | [dragged_bids_and_visits] | MLA6553902747 | [] | MLA126406 | {'id': 'MLA4695330653-912855983'} | 2015-09-05T20:42:58.000Z | none | [{'size': '500x375', 'secure_url': 'https://a2... | MLA4695330653 | NaN | NaN | True | NaN | ARS | http://mla-s1-p.mlstatic.com/5386-MLA469533065... | Auriculares Samsung Originales Manos Libres Ca... | False | 2015-09-05T20:42:53.000Z | https://a248.e.akamai.net/mla-s1-p.mlstatic.co... | 2015-11-04 20:42:53 | active | None | NaN | NaN | 1 | 2015-09-05 20:42:53 | http://articulo.mercadolibre.com.ar/MLA4695330... | 0 | 1 | Argentina | Capital Federal | San CristĂłbal | NaN | MLA4695330653-912855983 | True | [] | [] | False | not_specified | None | NaN | |
1 | {'country': {'name': 'Argentina', 'id': 'AR'},... | NUESTRA REPUTACION | [] | used | [] | 2650.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 8141699488 | [] | MLA | silver | 2650.0 | [] | buy_it_now | [] | MLA7727150374 | [] | MLA10267 | {'id': 'MLA7160447179-930764806'} | 2015-09-26T18:08:34.000Z | none | [{'size': '499x334', 'secure_url': 'https://a2... | MLA7160447179 | NaN | NaN | True | NaN | ARS | http://mla-s1-p.mlstatic.com/23223-MLA71604471... | Cuchillo Daga Acero CarbĂłn Casco Yelmo Solinge... | False | 2015-09-26T18:08:30.000Z | https://a248.e.akamai.net/mla-s1-p.mlstatic.co... | 2015-11-25 18:08:30 | active | None | NaN | NaN | 1 | 2015-09-26 18:08:30 | http://articulo.mercadolibre.com.ar/MLA7160447... | 0 | 1 | Argentina | Capital Federal | Buenos Aires | NaN | MLA7160447179-930764806 | True | [] | [] | False | me2 | None | NaN | |
2 | {'country': {'name': 'Argentina', 'id': 'AR'},... | None | [] | used | [] | 60.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 8386096505 | [] | MLA | bronze | 60.0 | [] | buy_it_now | [dragged_bids_and_visits] | MLA6561247998 | [] | MLA1227 | {'id': 'MLA7367189936-916478256'} | 2015-09-09T23:57:10.000Z | none | [{'size': '375x500', 'secure_url': 'https://a2... | MLA7367189936 | NaN | NaN | True | NaN | ARS | http://mla-s1-p.mlstatic.com/22076-MLA73671899... | Antigua Revista Billiken, N° 1826, Año 1954 | False | 2015-09-09T23:57:07.000Z | https://a248.e.akamai.net/mla-s1-p.mlstatic.co... | 2015-11-08 23:57:07 | active | None | NaN | NaN | 1 | 2015-09-09 23:57:07 | http://articulo.mercadolibre.com.ar/MLA7367189... | 0 | 1 | Argentina | Capital Federal | Boedo | NaN | MLA7367189936-916478256 | True | [] | [] | False | me2 | None | NaN | |
3 | {'country': {'name': 'Argentina', 'id': 'AR'},... | None | [] | new | [] | 580.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 5377752182 | [] | MLA | silver | 580.0 | [] | buy_it_now | [] | None | [] | MLA86345 | {'id': 'MLA9191625553-932309698'} | 2015-10-05T16:03:50.306Z | none | [{'size': '441x423', 'secure_url': 'https://a2... | MLA9191625553 | NaN | NaN | True | NaN | ARS | http://mla-s2-p.mlstatic.com/183901-MLA9191625... | Alarma Guardtex Gx412 Seguridad Para El Automo... | False | 2015-09-28T18:47:56.000Z | https://a248.e.akamai.net/mla-s2-p.mlstatic.co... | 2015-12-04 01:13:16 | active | None | NaN | NaN | 1 | 2015-09-28 18:47:56 | http://articulo.mercadolibre.com.ar/MLA9191625... | 0 | 1 | Argentina | Capital Federal | Floresta | NaN | MLA9191625553-932309698 | True | [] | [] | False | me2 | None | NaN | |
4 | {'country': {'name': 'Argentina', 'id': 'AR'},... | MI REPUTACION. | [] | used | [] | 30.0 | {'local_pick_up': True, 'methods': [], 'tags':... | [{'description': 'Transferencia bancaria', 'id... | 2938071313 | [] | MLA | bronze | 30.0 | [] | buy_it_now | [dragged_bids_and_visits] | MLA3133256685 | [] | MLA41287 | {'id': 'MLA7787961817-902981678'} | 2015-08-28T13:37:41.000Z | none | [{'size': '375x500', 'secure_url': 'https://a2... | MLA7787961817 | NaN | NaN | True | NaN | ARS | http://mla-s2-p.mlstatic.com/13595-MLA77879618... | Serenata - Jennifer Blake | False | 2015-08-24T22:07:20.000Z | https://a248.e.akamai.net/mla-s2-p.mlstatic.co... | 2015-10-23 22:07:20 | active | None | NaN | NaN | 1 | 2015-08-24 22:07:20 | http://articulo.mercadolibre.com.ar/MLA7787961... | 0 | 1 | Argentina | Buenos Aires | Tres de febrero | NaN | MLA7787961817-902981678 | True | [] | [] | False | not_specified | None | NaN |
# Get payment methods from dict
def convertCol(x, key, i):
    try:
        return x[i][key]
    except:
        return ''
for key in ['description']:  # ['description','id','type'] -- only description is interesting
    for i in range(0, 13):
        dfs[f'payment_{key}{i}'] = dfs['non_mercado_pago_payment_methods'].apply(lambda x: convertCol(x, key, i))
# Create a boolean column for each payment method
lista_c = []
for i in range(0, 13):
    lista = dfs[f'payment_description{i}'].unique()
    lista_c.extend(lista)
desc_uniques = set(lista_c)
desc_uniques.remove('')
desc_uniques
{'Acordar con el comprador',
'American Express',
'Cheque certificado',
'Contra reembolso',
'Diners',
'Efectivo',
'Giro postal',
'MasterCard',
'Mastercard Maestro',
'MercadoPago',
'Tarjeta de crédito',
'Transferencia bancaria',
'Visa',
'Visa Electron'}
# Rename column for an improved dataframe (#TODO: Use apply for performance)
for col in desc_uniques:
    col_name = col.replace(' ', '_')
    dfs[col_name] = dfs.isin([col]).any(axis=1)
# drop older columns
dfs = dfs.drop(dfs.loc[:, 'payment_description0':'payment_description12'], axis = 1)
import numpy as np
dfs = dfs.applymap(lambda x: x if x else np.nan)
dfs = dfs.dropna(how='all', axis=1)
COLUMNS THAT MATTER (see the sketch after these lists):
- warranty (new and used products have different kinds of warranties)
- sub_status (when a product ad is suspended, it might be due to its condition)
- base_price (prices differ between used and new items)
- seller_id (different sellers might sell used or new items)
- price (price again)
- buying_mode (the type of buying might imply something)
- parent_item_id (there might be correlation between similar products)
- last_updated (we'll check)
- id (we'll check)
- official_store_id (different stores sell different items and conditions)
- original_price (price again)
- currency_id (type of payment and currency might depend on the kind of seller and products)
- title (keep the title to identify the product)
- automatic_relist (we'll check)
- stop_time (time might have an influence)
- status (status might have an influence)
- video_id (stays; we'll check, but there might be videos for used products)
- initial_quantity (a good feature; used products have low counts)
- start_time (time again)
- sold_quantity (quantity again)
- available_quantity (quantity again)
- seller_country, state, city (used or new ads might be unevenly distributed across regions)
- local_pick_up (being new or used might influence whether local pickup is available)
- free_shipping (big sellers of new products might be more able to offer free shipping)
- Contra_reembolso (stays; payment methods matter)
- Giro_postal (stays)
- mode (stays; we don't know what it represents, but it has no missing values and 'not_specified' might be more common for used products)
- tags (we'll check the tags)
- tag (we'll check the tag)
- date_created
- category
TRANSFORM COLUMNS (TYPE OF PAYMENTS):
- Cheque_certificado
- Mastercard_Maestro
- Diners
- Transferencia_bancaria
- accepts_mercadopago
- MercadoPago
- Efectivo
- Tarjeta_de_crédito
- American_Express
- MasterCard
- Visa_Electron
- Visa
- Acordar_con_el_comprador
COLUMNS WE DON'T NEED:
- seller_address (too specific)
- deal_ids (nothing relevant, we checked)
- shipping (nothing relevant, we checked)
- non_mercado_pago_payment_methods (transformed)
- site_id (too specific)
- listing_type_id (dropped)
- descriptions (nothing relevant, we checked; it turned out to be an id)
- international_delivery_mode
- pictures (nothing relevant, we checked)
- thumbnail (nothing relevant, we checked)
- secure_thumbnail (nothing relevant, we checked)
- permalink (nothing relevant, we checked)
- free_methods (nothing relevant, we checked)
DOUBTS:
- variations
- attributes
- dimensions
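As a hedged sketch of how the notes above could be made explicit in code (the actual notebook reaches the same end state through renames, payment-method merges, `dropna` and a final column reorder; `dfs_trimmed` below is a hypothetical variable and is not used later):

```python
# Hedged sketch: turning the keep/drop notes above into explicit column lists.
# Assumes `dfs` from the cells above; the names mirror the notes, not a definitive spec.
keep_cols = [
    'warranty', 'sub_status', 'base_price', 'seller_id', 'price', 'buying_mode',
    'parent_item_id', 'last_updated', 'official_store_id', 'original_price',
    'currency_id', 'title', 'automatic_relist', 'stop_time', 'status', 'video_id',
    'initial_quantity', 'start_time', 'sold_quantity', 'available_quantity',
    'seller_country', 'seller_state', 'seller_city', 'local_pick_up',
    'free_shipping', 'Contra_reembolso', 'Giro_postal', 'mode', 'tags', 'tag',
    'date_created', 'category_id',
]
drop_cols = [
    'seller_address', 'deal_ids', 'shipping', 'non_mercado_pago_payment_methods',
    'site_id', 'listing_type_id', 'descriptions', 'international_delivery_mode',
    'pictures', 'thumbnail', 'secure_thumbnail', 'permalink', 'free_methods',
]
# Only drop columns that still exist; some were already consumed by earlier transformations.
dfs_trimmed = dfs.drop(columns=[c for c in drop_cols if c in dfs.columns])
```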
# Rename columns
dfs = dfs.rename(columns = {'id':'descr_id', 'Id': 'id'})
# Reorder columns
dfs = dfs[['title', 'condition', 'warranty','initial_quantity', 'available_quantity', 'sold_quantity',
'sub_status', 'buying_mode', 'original_price', 'base_price', 'price', 'currency_id',
'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
'Contra_reembolso','Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo', 'Transferencia_bancaria', 'Tarjeta_de_crédito',
'Mastercard_Maestro', 'MasterCard', 'Visa_Electron', 'Visa', 'Diners', 'American_Express',
'status', 'automatic_relist',
'accepts_mercadopago', 'MercadoPago',
'id', 'descr_id', 'deal_ids', 'parent_item_id', 'category_id', 'seller_id', 'official_store_id', 'video_id',
'date_created', 'start_time', 'last_updated', 'stop_time']]
dfs['accepts_mercadopago'].value_counts()
True 97781
Name: accepts_mercadopago, dtype: int64
dfs['MercadoPago'].value_counts()
True 720
Name: MercadoPago, dtype: int64
# Merge columns about same subjects
dfs['accepts_mercadopago'] = dfs['accepts_mercadopago'].fillna(dfs['MercadoPago'])
dfs['MasterCard'].value_counts()
True 647
Name: MasterCard, dtype: int64
dfs['MasterCard'] = dfs['Mastercard_Maestro'].fillna(dfs['MercadoPago'])
dfs['Visa'] = dfs['Visa_Electron'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'].value_counts()
True 24638
Name: Tarjeta_de_crédito, dtype: int64
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['MasterCard'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Diners'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['American_Express'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'].value_counts()
True 25928
Name: Tarjeta_de_crédito, dtype: int64
dfs = dfs.rename(columns = {'Tarjeta_de_crédito':'Aceptan_Tarjeta'})
# Drop used columns
dfs = dfs.drop(columns=['MercadoPago', 'Mastercard_Maestro', 'Visa_Electron'])
dfs = dfs.drop(columns=['Visa', 'MasterCard', 'Diners', 'American_Express'])
# Treat columns to access data
def try_join(l):
    try:
        return ','.join(map(str, l))
    except TypeError:
        return np.nan
dfs['sub_status'] = try_join(dfs['sub_status'])
dfs['tags'] = try_join(dfs['tags'])
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
'original_price', 'base_price', 'price', 'currency_id',
'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'id', 'descr_id', 'deal_ids',
'parent_item_id', 'category_id', 'seller_id', 'official_store_id',
'video_id', 'date_created', 'start_time', 'last_updated', 'stop_time'],
dtype='object')
dfs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 42 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 100000 non-null object
1 condition 100000 non-null object
2 warranty 39103 non-null object
3 initial_quantity 100000 non-null int64
4 available_quantity 100000 non-null int64
5 sold_quantity 16920 non-null float64
6 sub_status 100000 non-null object
7 buying_mode 100000 non-null object
8 original_price 143 non-null float64
9 base_price 100000 non-null float64
10 price 100000 non-null float64
11 currency_id 100000 non-null object
12 seller_country 99997 non-null object
13 seller_state 99997 non-null object
14 seller_city 99996 non-null object
15 Giro_postal 1665 non-null object
16 free_shipping 3016 non-null object
17 local_pick_up 79561 non-null object
18 mode 100000 non-null object
19 tags 100000 non-null object
20 tag 75090 non-null object
21 Contra_reembolso 648 non-null object
22 Acordar_con_el_comprador 7991 non-null object
23 Cheque_certificado 460 non-null object
24 Efectivo 67059 non-null object
25 Transferencia_bancaria 51469 non-null object
26 Aceptan_Tarjeta 25928 non-null object
27 status 100000 non-null object
28 automatic_relist 4697 non-null object
29 accepts_mercadopago 97781 non-null object
30 id 100000 non-null object
31 descr_id 41 non-null object
32 deal_ids 240 non-null object
33 parent_item_id 76989 non-null object
34 category_id 100000 non-null object
35 seller_id 100000 non-null int64
36 official_store_id 818 non-null float64
37 video_id 2985 non-null object
38 date_created 100000 non-null object
39 start_time 100000 non-null datetime64[ns]
40 last_updated 100000 non-null object
41 stop_time 100000 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(5), int64(3), object(32)
memory usage: 32.0+ MB
# Transform some columns to boolean type
dfs[['Giro_postal', 'free_shipping', 'local_pick_up', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'automatic_relist']] = dfs[['Giro_postal', 'free_shipping', 'local_pick_up', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'automatic_relist']].notna()
# Transform type of all columns
dfs = dfs.astype({'title':'str',
'condition': 'category', #bool
'warranty': 'category',
'initial_quantity': 'float', #int
'available_quantity': 'float', #int
'sold_quantity': 'float', #int
'sub_status': 'category', #bool?
'buying_mode': 'category',
'original_price': 'float',
'base_price': 'float',
'price': 'float',
'currency_id': 'category',
'seller_country': 'category',
'seller_state': 'category',
'seller_city': 'category',
'Giro_postal': 'bool',
'free_shipping': 'bool',
'local_pick_up': 'bool',
'mode': 'category',
'tags': 'category', #bool?
#'tag': 'category',
'Contra_reembolso': 'bool',
'Acordar_con_el_comprador': 'bool',
'Cheque_certificado': 'bool',
'Efectivo': 'bool',
'Transferencia_bancaria': 'bool',
'Aceptan_Tarjeta': 'bool',
'id': 'category',
'descr_id': 'category',
#'deal_ids': 'category',
'parent_item_id': 'category',
'category_id': 'category',
'seller_id': 'category',
'official_store_id': 'category',
'video_id': 'category',
#'date_created': 'datetime',
# 'start_time': 'datetime',
# 'last_updated': 'datetime',
# 'stop_time': 'datetime',
'status': 'category', #bool?
'automatic_relist': 'bool'
})
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
'original_price', 'base_price', 'price', 'currency_id',
'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'id', 'descr_id', 'deal_ids',
'parent_item_id', 'category_id', 'seller_id', 'official_store_id',
'video_id', 'date_created', 'start_time', 'last_updated', 'stop_time'],
dtype='object')
# Check missing values
import numpy as np
import pandas as pd
def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) +
          " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index=False)
    return mz_table
missing_zero_values_table(dfs)
Your selected dataframe has 42 columns and 100000 Rows.
There are 13 columns that have missing values.
Column | Zero Values | Missing Values | % of Total Values | Total Zero Missing Values | % Total Zero Missing Values | Data Type
---|---|---|---|---|---|---
descr_id | 0 | 99959 | 100.0 | 99959 | 100.0 | category |
original_price | 0 | 99857 | 99.9 | 99857 | 99.9 | float64 |
deal_ids | 0 | 99760 | 99.8 | 99760 | 99.8 | object |
official_store_id | 0 | 99182 | 99.2 | 99182 | 99.2 | category |
video_id | 0 | 97015 | 97.0 | 97015 | 97.0 | category |
sold_quantity | 0 | 83080 | 83.1 | 83080 | 83.1 | float64 |
warranty | 0 | 60897 | 60.9 | 60897 | 60.9 | category |
tag | 0 | 24910 | 24.9 | 24910 | 24.9 | object |
parent_item_id | 0 | 23011 | 23.0 | 23011 | 23.0 | category |
accepts_mercadopago | 0 | 2219 | 2.2 | 2219 | 2.2 | object |
seller_city | 0 | 4 | 0.0 | 4 | 0.0 | category |
seller_country | 0 | 3 | 0.0 | 3 | 0.0 | category |
seller_state | 0 | 3 | 0.0 | 3 | 0.0 | category |
display(dfs['seller_country'].value_counts())
dfs = dfs.drop(columns = 'seller_country') # We can drop Country column, it's always Argentina
display(dfs['seller_city'].mode()[0])
display(dfs['seller_state'].mode()[0])
dfs['seller_city'] = dfs['seller_city'].fillna(dfs['seller_city'].mode()[0])
dfs['seller_state'] = dfs['seller_state'].fillna(dfs['seller_state'].mode()[0])
Argentina 99997
Name: seller_country, dtype: int64
'CABA'
'Capital Federal'
dfs['accepts_mercadopago'] = dfs['accepts_mercadopago'].fillna(False)
dfs['sold_quantity'] = dfs['sold_quantity'].fillna(0) # Is it ok to fill sold_quantity with 0? [VALIDATE]
dfs['warranty'] = dfs['warranty'].replace(r'^\s*$', np.nan, regex=True)
dfs['warranty'].isna().sum()
60897
import pandas as pd
df_temp1 = dfs[dfs['warranty'].isnull()]
df_temp1['warranty'] = False
df_temp2 = dfs[~dfs['warranty'].isnull()]
df_temp2['warranty'] = True
frames = [df_temp1, df_temp2]
dfs = pd.concat(frames)
dfs = dfs.astype({'warranty':'bool'})
dfs['warranty'].value_counts()
False 60897
True 39103
Name: warranty, dtype: int64
display('number of sold_quantity', dfs.sold_quantity.nunique())
'number of sold_quantity'
317
def get_value_per_cat():
    flag = dfs.select_dtypes(include=['category']).shape[1]
    i = 0
    while i <= flag:  # note: runs one step past the last column, so the final print shows an empty dict
        print(dict(dfs.select_dtypes(include=['category']).iloc[:, i:i+1].nunique()))
        i = i + 1
get_value_per_cat()
{'condition': 2}
{'sub_status': 1}
{'buying_mode': 3}
{'currency_id': 2}
{'seller_state': 24}
{'seller_city': 3655}
{'mode': 4}
{'tags': 1}
{'status': 4}
{'id': 100000}
{'descr_id': 41}
{'parent_item_id': 76989}
{'category_id': 10907}
{'seller_id': 35915}
{'official_store_id': 198}
{'video_id': 2077}
{}
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
'original_price', 'base_price', 'price', 'currency_id', 'seller_state',
'seller_city', 'Giro_postal', 'free_shipping', 'local_pick_up', 'mode',
'tags', 'tag', 'Contra_reembolso', 'Acordar_con_el_comprador',
'Cheque_certificado', 'Efectivo', 'Transferencia_bancaria',
'Aceptan_Tarjeta', 'status', 'automatic_relist', 'accepts_mercadopago',
'id', 'descr_id', 'deal_ids', 'parent_item_id', 'category_id',
'seller_id', 'official_store_id', 'video_id', 'date_created',
'start_time', 'last_updated', 'stop_time'],
dtype='object')
import re
dfs['sub_status'] = dfs['sub_status'].str.replace('nan,','')
dfs['sub_status'] = dfs['sub_status'].str.replace(',nan','')
display(len(re.findall(r'suspended',dfs['sub_status'][1])))
display(dfs['sub_status'].value_counts().value_counts())
display(dfs.shape)
# We concluded this column is useless: every row has the same count of the same value ('suspended')
dfs = dfs.drop('sub_status', axis=1)
966
100000 1
Name: sub_status, dtype: int64
(100000, 41)
# dfs['tags'] = dfs['tags'].str.replace('nan,','')
# dfs['tags'] = dfs['tags'].str.replace(',nan','')
# from ast import literal_eval
# dfs['tags'] = dfs['tags'].apply(lambda x: literal_eval(str(x)))
# def deduplicate(column):
# flag = len(column)
# i = 0
# while i <= flag:
# try:
# # 1. Convert into list of tuples
# tpls = [tuple(x) for x in column[i]]
# # 2. Create dictionary with empty values and
# # 3. convert back to a list (dups removed)
# dct = list(dict.fromkeys(tpls))
# # 4. Convert list of tuples to list of lists
# dup_free = [list(x) for x in lst]
# # Print everything
# column[i] = list(map(''.join, dup_free))
# # [[1, 1], [0, 1], [0, 1], [1, 1]]
# i = i+1
# except:
# return
# deduplicate(dfs['tags'])
# display(dfs['tags'].value_counts().value_counts())
# display(dfs.shape)
# display(dfs['tag'].value_counts().value_counts())
# Other useless columns -- all rows have the same values
dfs = dfs.drop('tags', axis=1)
dfs = dfs.drop('tag', axis=1)
display('dataframe shape', dfs.shape)
display('unique ids', dfs.id.nunique())
display('number of sellers', dfs.seller_id.nunique())
display('number of categories', dfs.category_id.nunique())
#Drop useless column
dfs = dfs.drop(['id'], axis=1)
'dataframe shape'
(100000, 38)
'unique ids'
100000
'number of sellers'
35915
'number of categories'
10907
missing_zero_values_table(dfs)
Your selected dataframe has 37 columns and 100000 Rows.
There are 6 columns that have missing values.
Column | Zero Values | Missing Values | % of Total Values | Total Zero Missing Values | % Total Zero Missing Values | Data Type
---|---|---|---|---|---|---
descr_id | 0 | 99959 | 100.0 | 99959 | 100.0 | category |
original_price | 0 | 99857 | 99.9 | 99857 | 99.9 | float64 |
deal_ids | 0 | 99760 | 99.8 | 99760 | 99.8 | object |
official_store_id | 0 | 99182 | 99.2 | 99182 | 99.2 | category |
video_id | 0 | 97015 | 97.0 | 97015 | 97.0 | category |
parent_item_id | 0 | 23011 | 23.0 | 23011 | 23.0 | category |
dfs = dfs.dropna(axis=1) # drop all columns with missing values (we checked; they are not necessary or have too many missing values to impute properly)
from matplotlib import pyplot as plt
# Deal with datetimes to create new features
dfs['year_start'] = pd.to_datetime(dfs['start_time']).dt.year.astype('category')
dfs['month_start'] = pd.to_datetime(dfs['start_time']).dt.month.astype('category')
dfs['year_stop'] = pd.to_datetime(dfs['stop_time']).dt.year.astype('category')
dfs['month_stop'] = pd.to_datetime(dfs['stop_time']).dt.month.astype('category')
dfs['week_day'] = pd.to_datetime(dfs['stop_time']).dt.weekday.astype('category')
#dfs['days_active'] = (dfs['start_time'] - dfs['stop_time']).dt.days
dfs['days_active'] = [int(i.days) for i in (dfs.stop_time - dfs.start_time)]
dfs['days_active'] = dfs['days_active'].astype('int')
dfs = dfs.reset_index(drop=True)
#dfs = dfs.drop(['date_created', 'start_time', 'last_updated', 'stop_time'], axis=1)
boxplot = dfs.boxplot(column=['days_active'], showfliers=False)
plt.savefig('days_active.png', bbox_inches='tight', dpi = 300)
# empty list to read list from a file
selected_features = []
# open file and read the content in a list
with open(r'selected_features.txt', 'r') as fp:
    for line in fp:
        # remove linebreak from the current name;
        # the linebreak is the last character of each line
        x = line[:-1]
        # add current item to the list
        selected_features.append(x)
# display list
print(selected_features)
['base_price', 'seller_id', 'available_quantity', 'seller_state', 'price', 'week_day', 'sold_quantity', 'mode', 'Transferencia_bancaria', 'category_id', 'Aceptan_Tarjeta', 'seller_city', 'initial_quantity', 'warranty', 'automatic_relist']
from sklearn import preprocessing
# Encode categorical columns to pass through model
mylist = list(dfs.select_dtypes(include=['category']).columns)
dfs[mylist] = dfs[mylist].apply(preprocessing.LabelEncoder().fit_transform)
dfs['log_price'] = np.log(dfs['price'] + 1)
dfs['log_base_price'] = np.log(dfs['base_price'] + 1)
import statsmodels.formula.api as fsm
import matplotlib.pyplot as plt
import seaborn as sns
model = fsm.logit(formula = 'condition ~ log_price' , data = dfs)
fit = model.fit()
fit.summary()
dfs['pred_baseline'] = fit.predict()
fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_baseline', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'available_quantity', y = 'pred_baseline', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[1])
plt.savefig('logistic_baseline_plot.png', bbox_inches='tight', dpi = 300)
Optimization terminated successfully.
Current function value: 0.681740
Iterations 4
import matplotlib.pyplot as plt
import statsmodels.formula.api as fsm
model = fsm.logit(formula = 'condition ~ log_price : mode * seller_state', data = dfs)
fit = model.fit()
fit.summary()
dfs['pred_m1'] = fit.predict()
fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_m1', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'available_quantity', y = 'pred_m1', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[1])
plt.savefig('logistic_tarjeta_plot.png', bbox_inches='tight', dpi = 300)
Optimization terminated successfully.
Current function value: 0.690247
Iterations 4
fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_m1', hue = 'warranty', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'sold_quantity', y = 'pred_m1', hue = 'warranty', size = 'base_price', ax = ax[1])
plt.savefig('logistic_warranty_plot.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import f1_score
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(dfs['pred_m1'] < threshold, 0, 1)
    f1 = f1_score(dfs['condition'], pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.savefig('logistic_baseline_threshold.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import cohen_kappa_score, precision_score, roc_curve
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(dfs['pred_m1'] < threshold, 0, 1)
    score = cohen_kappa_score(dfs['condition'], pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == max(df_score['score_score'])]
bt = df_score[df_score['score_score'] == max(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == max(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Kappa: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.savefig('logistic_kappa_threshold.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(dfs['condition'], dfs['pred_baseline'])
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)
sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.savefig('logistic_baseline_roc_curve.png', bbox_inches='tight', dpi = 300)
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(dfs['condition'], dfs['pred_m1'])
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)
sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.savefig('logistic_kappa_roc_curve.png', bbox_inches='tight', dpi = 300)
%%time
import os
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.metrics import cohen_kappa_score, precision_score
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
scaled_features = dfs.copy()
col_names = ['warranty', 'initial_quantity', 'available_quantity', 'sold_quantity',
'base_price', 'price', 'Giro_postal', 'free_shipping', 'local_pick_up',
'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta',
'automatic_relist', 'accepts_mercadopago', 'days_active']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
X = scaled_features.drop(columns=['condition'], axis=1)
#X = dfs.drop(columns='condition')
y = scaled_features.condition
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=7)
Y_train = Y_train
Y_test = Y_test
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), X_train.columns)], remainder='passthrough')
encoder = full_pipeline.fit(X_train)
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
# train the model
model = xgb.XGBClassifier(n_estimators= 200,
max_depth= 30, # Lower ratios avoid over-fitting. Default is 6.
objective = 'binary:logistic', # Default is reg:squarederror. 'multi:softprob' for multiclass and get proba.
#num_class = 2, # Use if softprob is set.
reg_lambda = 10, # Larger ratios avoid over-fitting. Default is 1.
gamma = 0.3, # Larger values avoid over-fitting. Default is 0. # Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
alpha = 1, # Larger ratios avoid over-fitting. Default is 0.
learning_rate= 0.10, # Lower ratios avoid over-fitting. Default is 0.3.
colsample_bytree= 0.7, # Lower ratios avoid over-fitting.
scale_pos_weight = 1, # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
subsample = 0.1, # Lower ratios avoid over-fitting. Default 1. 0.5 recommended. # 0.1 if using GPU.
min_child_weight = 3, # Larger ratios avoid over-fitting. Default is 1.
missing = np.nan, # Deal with missing values
num_parallel_tree = 2, # Parallel trees constructed during each iteration. Default is 1.
importance_type = 'weight',
eval_metric = 'auc',
#use_label_encoder = True,
#enable_categorical = True,
verbosity = 1,
nthread = -1, # Set -1 to use all threads.
#use_rmm = True, # Use GPU if available
tree_method = 'auto', # auto # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
#gradient_based = True, # If True you can set subsample as low as 0.1. Only use with gpu_hist
)
# fit model
model.fit(X_train_enc, Y_train.values.ravel(),
# early_stopping_rounds=20
)
# check best ntree limit
display(model.best_ntree_limit)
# extract the training set predictions
preds_train = model.predict(X_train_enc,
ntree_limit=model.best_ntree_limit
)
# extract the test set predictions
preds_test = model.predict(X_test_enc,
ntree_limit=model.best_ntree_limit
)
# save model
output_dir = "models"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# save in JSON format
model.save_model(f'{output_dir}/meli_xgboost.json')
# save in text format
model.save_model(f'{output_dir}/meli_xgboost.txt')
print('FINISHED!')
/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
400
/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/core.py:105: UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.
warnings.warn(
FINISHED!
CPU times: user 12min 45s, sys: 2.88 s, total: 12min 48s
Wall time: 1min 56s
# extract the test set predictions
preds_test = model.predict_proba(X_test_enc,
ntree_limit=model.best_ntree_limit
)
/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/core.py:105: UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.
warnings.warn(
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    f1 = f1_score(Y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    score = brier_score_loss(Y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(Y_test, preds_test[:,1])
roc = roc_auc_score(Y_test, preds_test[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
#marker='.',
label=f"ROC curve (area ={'%.2f' % roc})"# % roc_auc["micro"],
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("XGBoost Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('xgboost_roc_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.505019, G-Mean=0.810
CPU times: user 1.8 s, sys: 629 ms, total: 2.43 s
Wall time: 1.65 s
# best_preds_score = np.where(preds_test < bt, 0, 1) # Uncomment if you want to change threshold... Lower, because threshold calculated on Brier Loss and lower is better
print("mean_squared_error_test = {}".format(mean_squared_error(Y_test, preds_test[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(Y_test, preds_test[:,1])))
print("Brier_error = {}".format(brier_score_loss(Y_test, preds_test[:,1])))
print("Logloss_test = {}".format(log_loss(Y_test, preds_test[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.3640702047156597
Roc_auc = 0.8888785004424654
Brier_error = 0.13254711396170238
Logloss_test = 0.4085390232688165
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(Y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.490, Roc_auc=0.81209
# evaluate each threshold
scores = [brier_score_loss(Y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.505, Brier=0.19180
# %%time
# import xgboost as xgb
# from sklearn.metrics import cohen_kappa_score
# from sklearn.metrics import matthews_corrcoef
# from sklearn.metrics import f1_score
# from sklearn.model_selection import train_test_split
# import patsy
# # Selecting features I've found and using patsy to automatic interact between features.
# y, X = patsy.dmatrices('condition ~ Aceptan_Tarjeta + category_id + Efectivo + Transferencia_bancaria + automatic_relist + available_quantity + \
# base_price + warranty + sold_quantity + free_shipping + initial_quantity + local_pick_up + mode + \
# price + seller_id + seller_city + seller_state+ \
# year_start + month_start + year_stop + month_stop + week_day + days_active', data = dfs)
# # Display patsy features
# #display(X)
# X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
# D_train = xgb.DMatrix(X_train, label=Y_train)#, enable_categorical=True)
# D_test = xgb.DMatrix(X_test, label=Y_test)#, enable_categorical=True)
# param = {
# 'eta': 0.10, # Lower ratios avoid over-fitting. Default is 3.
# 'max_depth': 30, # Lower ratios avoid over-fitting. Default is 6.
# "min_child_weight": 3, # Larger ratios avoid over-fitting. Default is 1.
# "gamma": 0.3, # Larger values avoid over-fitting. Default is 0.
# "colsample_bytree" : 0.7, # Lower ratios avoid over-fitting. Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
# "scale_pos_weight": 1, # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
# "reg_lambda": 10, # Larger ratios avoid over-fitting. Default is 1.
# "alpha": 1, # Larger ratios avoid over-fitting. Default is 0.
# 'subsample':0.5, # Lower ratios avoid over-fitting. Default 1. 0.5 recommended.
# 'num_parallel_tree': 2, # Parallel trees constructed during each iteration. Default is 1.
# 'objective': 'multi:softprob', # Default is reg:squarederror. 'multi:softprob' for multiclass.
# 'num_class': 2, # Use if softprob is set.
# 'verbosity':1,
# 'eval_metric': 'auc',
# 'use_rmm':False, # Use GPU if available
# 'nthread':-1, # Set -1 to use all threads.
# 'tree_method': 'auto', # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
# 'gradient_based': False, # If True you can set subsample as low as 0.1. Only use with gpu_hist
# }
# steps = 200 # The number of training iterations
# model = xgb.train(param, D_train, steps)
# import numpy as np
# from sklearn.metrics import precision_score, recall_score, accuracy_score
# preds = model.predict(D_test)
# best_preds = np.asarray([np.argmax(line) for line in preds])
# print("Precision = {}".format(precision_score(Y_test, best_preds)))
# print("Recall = {}".format(recall_score(Y_test, best_preds)))
# print("f1 = {}".format(f1_score(Y_test, best_preds)))
# print("kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
# print("matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
# #print("mean_squared_error_train = {}".format(mean_squared_error(Y_train, best_preds)))
# # print("mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
# print("logloss_test = {}".format(log_loss(Y_test, best_preds)))
# #print("logloss_train = {}".format(log_loss(Y_train, best_preds)))
# # from xgboost import plot_importance
# # import matplotlib.pyplot as pyplot
# # plot_importance(model)
# # pyplot.show()
# from sklearn.metrics import roc_auc_score
# best_preds = np.where(preds_test < bt, 0, 1)
# print("Roc_auc = {}".format(roc_auc_score(Y_test, best_preds)))
# print("Precision = {}".format(precision_score(Y_test, best_preds)))
# print("Recall = {}".format(recall_score(Y_test, best_preds)))
# print("F1 = {}".format(f1_score(Y_test, best_preds)))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
# print("Mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
# print("Logloss_test = {}".format(log_loss(Y_test, best_preds)))
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize
from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'buying_mode', 'base_price',
'price', 'currency_id', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'category_id', 'seller_id',
'date_created', 'start_time', 'last_updated', 'stop_time', 'year_start',
'month_start', 'year_stop', 'month_stop', 'week_day', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['category']).columns
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
'seller_city', 'mode', 'status', 'category_id', 'seller_id',
'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
dtype='object')
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']
X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
%%time
categorical_high = ["seller_city", "category_id"] #"seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric), ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low+categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low), (mode, high_cardinality_encoder, categorical_high), ("scale", scaler, numeric)])
    return make_pipeline(imputer, processor, LogisticRegression(max_iter=1000)) #RandomForestRegressor() #XGBClassifier()
embeddings_pipeline = build_pipeline("embeddings")
embeddings_pipeline.fit(X_train, y_train)
CPU times: user 5min 16s, sys: 1min 16s, total: 6min 32s
Wall time: 1min 12s
Pipeline(steps=[('columntransformerwithnames',
ColumnTransformerWithNames(transformers=[('numeric',
SimpleImputer(),
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')),
('categorical',
SimpleImputer(strategy='most_frequent'),
['buying_mode',
'currency_id',
'seller_state',
'mode', 'status',
'week...
'Transferencia_bancaria',
'Aceptan_Tarjeta',
'automatic_relist',
'accepts_mercadopago']),
('embeddings',
EmbeddingEncoder(task='classification'),
['seller_city',
'category_id']),
('scale', StandardScaler(),
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object'))])),
('logisticregression', LogisticRegression(max_iter=1000))])
y_pred_proba = embeddings_pipeline.predict_proba(X_test) #.decision_function(X_test)
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(y_pred_proba[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(y_pred_proba[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:,1])
roc = roc_auc_score(y_test, y_pred_proba[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    # marker='.',
    label=f"ROC curve (area ={'%.2f' % roc})",
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Embeddings + Logistic Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_logistic_roc_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.508526, G-Mean=0.823
CPU times: user 2.02 s, sys: 1.08 s, total: 3.1 s
Wall time: 1.78 s
# best_preds_score = np.where(y_pred_proba < bt, 0, 1)  # Uncomment to binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, y_pred_proba[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, y_pred_proba[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, y_pred_proba[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, y_pred_proba[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.3568355720384346
Roc_auc = 0.9007658597305785
Brier_error = 0.12733162547199683
Logloss_test = 0.40211995678282203
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):  # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(y_pred_proba[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.509, Roc_auc=0.82327
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(y_pred_proba[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.509, Brier=0.17665
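A caveat on the two sweeps above: once the probabilities are binarized at a single cutoff, roc_auc_score no longer measures ranking quality; it collapses to (TPR + TNR) / 2, i.e. the balanced accuracy at that cutoff, which is what the 0.82 figure reflects. A minimal sketch (reusing y_test, y_pred_proba and to_labels_max from the cells above; the 0.509 cutoff is the G-mean-optimal threshold printed earlier) to read off hard-label metrics at that operating point:
from sklearn.metrics import accuracy_score, balanced_accuracy_score

cutoff = 0.509  # assumed operating point: the G-mean-optimal threshold found above
hard_labels = to_labels_max(y_pred_proba[:, 1], cutoff)
print("Accuracy @ cutoff = {}".format(accuracy_score(y_test, hard_labels)))
print("Balanced accuracy @ cutoff = {}".format(balanced_accuracy_score(y_test, hard_labels)))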
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
#dfs = pd.read_parquet('cleaned_data_haha.parquet.gzip')
dfs.columns
Index(['title', 'condition', 'warranty', 'initial_quantity',
'available_quantity', 'sold_quantity', 'buying_mode', 'base_price',
'price', 'currency_id', 'seller_state', 'seller_city', 'Giro_postal',
'free_shipping', 'local_pick_up', 'mode', 'Contra_reembolso',
'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
'automatic_relist', 'accepts_mercadopago', 'category_id', 'seller_id',
'date_created', 'start_time', 'last_updated', 'stop_time', 'year_start',
'month_start', 'year_stop', 'month_stop', 'week_day', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['category']).columns
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
'seller_city', 'mode', 'status', 'category_id', 'seller_id',
'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
dtype='object')
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']
X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric),
                                          ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low + categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low),
                                   (mode, high_cardinality_encoder, categorical_high),
                                   ("scale", scaler, numeric)])
    return make_pipeline(imputer, processor, xgb.XGBClassifier(
        n_estimators=200,
        max_depth=30,                 # Lower values avoid over-fitting. Default is 6.
        objective='binary:logistic',  # Default is reg:squarederror; use 'multi:softprob' for multiclass probabilities.
        # num_class=2,                # Use if softprob is set.
        reg_lambda=10,                # Larger values avoid over-fitting. Default is 1.
        gamma=0.3,                    # Larger values avoid over-fitting. Default is 0. Use 0.3-0.8 with many columns (especially after one-hot encoding), 0.8-1 with few.
        alpha=1,                      # Larger values avoid over-fitting. Default is 0.
        learning_rate=0.10,           # Lower values avoid over-fitting. Default is 0.3.
        colsample_bytree=0.7,         # Lower values avoid over-fitting.
        scale_pos_weight=1,           # Default is 1. Balances positive and negative weights for unbalanced classes.
        subsample=0.1,                # Lower values avoid over-fitting. Default is 1; 0.5 is a common choice, ~0.1 with GPU gradient-based sampling.
        min_child_weight=3,           # Larger values avoid over-fitting. Default is 1.
        missing=np.nan,               # How missing values are represented.
        num_parallel_tree=2,          # Parallel trees constructed during each iteration. Default is 1.
        importance_type='weight',
        eval_metric='auc',
        use_label_encoder=False,      # The built-in label encoder is deprecated; False silences the warning.
        # enable_categorical=True,
        verbosity=1,
        nthread=-1,                   # -1 uses all threads.
        # use_rmm=True,               # Use GPU memory pool if available.
        tree_method='auto',           # 'auto' picks the fastest method for the data; 'gpu_hist' for GPU.
        # gradient_based=True,
    )) #RandomForestClassifier() #LogisticRegression()
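The XGBoost hyperparameters above were set by hand. If tuning inside the same pipeline were desired, a randomized search over the step-prefixed parameter names is one option; a hedged sketch, not part of the original run ('xgbclassifier' is the step name make_pipeline assigns, and the candidate values are illustrative only):
from sklearn.model_selection import RandomizedSearchCV

search_space = {
    "xgbclassifier__max_depth": [6, 10, 20, 30],
    "xgbclassifier__learning_rate": [0.05, 0.1, 0.3],
    "xgbclassifier__subsample": [0.1, 0.5, 0.8],
    "xgbclassifier__reg_lambda": [1, 5, 10],
}
search = RandomizedSearchCV(build_pipeline("embeddings"), search_space, n_iter=10,
                            scoring="roc_auc", cv=3, n_jobs=-1, random_state=42)
# search.fit(X_train, y_train)  # left commented: every candidate refits the embedding encoder, so this is expensive
# print(search.best_params_, search.best_score_)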
%%time
embeddings_pipeline = build_pipeline("embeddings")
embeddings_pipeline.fit(X_train, y_train)
embedding_preds = embeddings_pipeline.predict(X_test)
CPU times: user 18min 11s, sys: 15 s, total: 18min 26s
Wall time: 3min 6s
# Check accuracy for classes
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score, matthews_corrcoef
print("Accuracy = {}".format(accuracy_score(y_test, embedding_preds)))
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, embedding_preds)))
print("Precision = {}".format(precision_score(y_test, embedding_preds)))
print("Recall = {}".format(recall_score(y_test, embedding_preds)))
print("F1 = {}".format(f1_score(y_test, embedding_preds)))
print("Kappa_score = {}".format(cohen_kappa_score(y_test, embedding_preds)))
print("Matthews_corrcoef = {}".format(matthews_corrcoef(y_test, embedding_preds)))
Accuracy = 0.85885
Balanced accuracy = 0.8587227386116755
Precision = 0.839697904478247
Recall = 0.8571118349619978
F1 = 0.8483155123314169
Kappa_score = 0.7163577766348136
Matthews_corrcoef = 0.7164897258171989
# Check target column balance
dfs.condition.value_counts()
0 53758
1 46242
Name: condition, dtype: int64
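The two classes are close to balanced (roughly 54% vs 46%), which is why scale_pos_weight was left at its default of 1 in the XGBoost parameters above. For reference, a small sketch of how that ratio is conventionally derived from the counts (assuming the same 0/1 encoding of condition used here):
# scale_pos_weight is usually set to sum(negative) / sum(positive)
counts = dfs.condition.value_counts()
print(round(counts[0] / counts[1], 2))  # ~1.16 here, so keeping the default of 1 is reasonable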
%%time
embeddings_pipeline = build_pipeline("embeddings")
embeddings_pipeline.fit(X_train, y_train)
CPU times: user 15min 37s, sys: 8.47 s, total: 15min 45s
Wall time: 2min 24s
# Check probabilities score
embedding_preds = embeddings_pipeline.predict_proba(X_test)
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(embedding_preds[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(embedding_preds[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, embedding_preds[:,1])
roc = roc_auc_score(y_test, embedding_preds[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    # marker='.',
    label=f"ROC curve (area ={'%.2f' % roc})",
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Embeddings + XGBoost Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_xgboost_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.405652, G-Mean=0.865
CPU times: user 2.06 s, sys: 613 ms, total: 2.67 s
Wall time: 1.83 s
# best_preds_score = np.where(embedding_preds < bt, 0, 1)  # Uncomment to binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, embedding_preds[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, embedding_preds[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, embedding_preds[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, embedding_preds[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.31567197986751955
Roc_auc = 0.9358383924037762
Brier_error = 0.09964879887347967
Logloss_test = 0.3227270290231638
best_preds_score = np.where(embedding_preds < bt, 0, 1)  # Binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, best_preds_score[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, best_preds_score[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, best_preds_score[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, best_preds_score[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.36939139134527754
Roc_auc = 0.8636920061833471
Brier_error = 0.13645
Logloss_test = 4.712875409194756
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):  # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(embedding_preds[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.406, Roc_auc=0.86562
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(embedding_preds[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.528, Brier=0.13625
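Because the model comparison leans on probability metrics (Brier score, log loss), it can also help to inspect calibration directly. A minimal sketch with sklearn's calibration_curve, reusing y_test and embedding_preds from the cells above (this plot is an addition, not part of the original report):
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, embedding_preds[:, 1], n_bins=10)
plt.plot(prob_pred, prob_true, marker='o', label='Embeddings + XGBoost')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives (used)')
plt.legend(loc='upper left')
plt.show()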
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
'price', 'days_active'],
dtype='object')
dfs.select_dtypes(include=['category']).columns
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
'seller_city', 'mode', 'status', 'category_id', 'seller_id',
'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
dtype='object')
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']
X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)
dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
import keras
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import Dropout
categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric),
                                          ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low + categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low),
                                   (mode, high_cardinality_encoder, categorical_high),
                                   ("scale", scaler, numeric)])

    def twoLayerFeedForward():
        model = Sequential()
        model.add(keras.layers.Dense(300, activation=tf.nn.relu)) #input_dim=300
        model.add(keras.layers.Dense(128, activation=tf.nn.relu))
        model.add(keras.layers.Dense(64, activation=tf.nn.relu))
        model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # clf = KerasClassifier(TwoLayerFeedForward(), epochs=100, batch_size=500, verbose=0)
    model = KerasClassifier(twoLayerFeedForward, verbose=1, validation_split=0.15, shuffle=True, epochs=100, batch_size=512) #batch_size=32
    return make_pipeline(imputer, processor, model) #RandomForestClassifier() #LogisticRegression()
2022-08-03 18:14:41.595116: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-08-03 18:14:41.595137: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
%%time
embeddings_pipeline = build_pipeline("embeddings")
history = embeddings_pipeline.fit(X_train, y_train)
/tmp/ipykernel_700621/402245863.py:35: DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead. See https://www.adriangb.com/scikeras/stable/migration.html for help migrating.
model = KerasClassifier(twoLayerFeedForward, verbose=1, validation_split=0.15, shuffle=True, epochs=100, batch_size=512) #batch_size=32
2022-08-03 18:14:43.959308: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-08-03 18:14:43.959359: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-03 18:14:43.959388: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (brspobitanl1727): /proc/driver/nvidia/version does not exist
2022-08-03 18:14:43.959671: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/100
150/150 [==============================] - 1s 4ms/step - loss: 0.3302 - accuracy: 0.8555 - val_loss: 0.4014 - val_accuracy: 0.8274
Epoch 2/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2948 - accuracy: 0.8732 - val_loss: 0.3881 - val_accuracy: 0.8350
Epoch 3/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2824 - accuracy: 0.8800 - val_loss: 0.3828 - val_accuracy: 0.8364
Epoch 4/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2750 - accuracy: 0.8824 - val_loss: 0.3810 - val_accuracy: 0.8387
Epoch 5/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2694 - accuracy: 0.8867 - val_loss: 0.3842 - val_accuracy: 0.8399
Epoch 6/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2638 - accuracy: 0.8890 - val_loss: 0.3749 - val_accuracy: 0.8391
Epoch 7/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2601 - accuracy: 0.8903 - val_loss: 0.3817 - val_accuracy: 0.8413
Epoch 8/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2559 - accuracy: 0.8921 - val_loss: 0.3821 - val_accuracy: 0.8379
Epoch 9/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2502 - accuracy: 0.8957 - val_loss: 0.3781 - val_accuracy: 0.8457
Epoch 10/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2470 - accuracy: 0.8964 - val_loss: 0.3804 - val_accuracy: 0.8427
Epoch 11/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2415 - accuracy: 0.8997 - val_loss: 0.3774 - val_accuracy: 0.8443
Epoch 12/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2406 - accuracy: 0.9004 - val_loss: 0.3862 - val_accuracy: 0.8428
Epoch 13/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2356 - accuracy: 0.9015 - val_loss: 0.3810 - val_accuracy: 0.8388
Epoch 14/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2286 - accuracy: 0.9060 - val_loss: 0.3822 - val_accuracy: 0.8407
Epoch 15/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2257 - accuracy: 0.9068 - val_loss: 0.3956 - val_accuracy: 0.8415
Epoch 16/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2221 - accuracy: 0.9087 - val_loss: 0.4000 - val_accuracy: 0.8402
Epoch 17/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2181 - accuracy: 0.9107 - val_loss: 0.3942 - val_accuracy: 0.8445
Epoch 18/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2096 - accuracy: 0.9142 - val_loss: 0.4040 - val_accuracy: 0.8404
Epoch 19/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2074 - accuracy: 0.9143 - val_loss: 0.4052 - val_accuracy: 0.8436
Epoch 20/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2030 - accuracy: 0.9177 - val_loss: 0.4251 - val_accuracy: 0.8426
Epoch 21/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1960 - accuracy: 0.9199 - val_loss: 0.4199 - val_accuracy: 0.8422
Epoch 22/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1944 - accuracy: 0.9209 - val_loss: 0.4563 - val_accuracy: 0.8390
Epoch 23/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1916 - accuracy: 0.9237 - val_loss: 0.4386 - val_accuracy: 0.8420
Epoch 24/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1847 - accuracy: 0.9251 - val_loss: 0.4574 - val_accuracy: 0.8372
Epoch 25/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1788 - accuracy: 0.9272 - val_loss: 0.4759 - val_accuracy: 0.8353
Epoch 26/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1738 - accuracy: 0.9300 - val_loss: 0.4750 - val_accuracy: 0.8450
Epoch 27/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1708 - accuracy: 0.9308 - val_loss: 0.4869 - val_accuracy: 0.8407
Epoch 28/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1645 - accuracy: 0.9334 - val_loss: 0.4733 - val_accuracy: 0.8411
Epoch 29/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1609 - accuracy: 0.9350 - val_loss: 0.4808 - val_accuracy: 0.8321
Epoch 30/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1557 - accuracy: 0.9373 - val_loss: 0.5059 - val_accuracy: 0.8384
Epoch 31/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1507 - accuracy: 0.9401 - val_loss: 0.4927 - val_accuracy: 0.8382
Epoch 32/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1430 - accuracy: 0.9423 - val_loss: 0.5239 - val_accuracy: 0.8348
Epoch 33/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1410 - accuracy: 0.9434 - val_loss: 0.5344 - val_accuracy: 0.8355
Epoch 34/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1390 - accuracy: 0.9442 - val_loss: 0.5711 - val_accuracy: 0.8362
Epoch 35/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1318 - accuracy: 0.9470 - val_loss: 0.5636 - val_accuracy: 0.8361
Epoch 36/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1295 - accuracy: 0.9484 - val_loss: 0.5880 - val_accuracy: 0.8398
Epoch 37/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1216 - accuracy: 0.9520 - val_loss: 0.6103 - val_accuracy: 0.8346
Epoch 38/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1163 - accuracy: 0.9540 - val_loss: 0.6112 - val_accuracy: 0.8316
Epoch 39/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1173 - accuracy: 0.9536 - val_loss: 0.6456 - val_accuracy: 0.8292
Epoch 40/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1127 - accuracy: 0.9547 - val_loss: 0.6430 - val_accuracy: 0.8373
Epoch 41/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1061 - accuracy: 0.9580 - val_loss: 0.6648 - val_accuracy: 0.8347
Epoch 42/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1020 - accuracy: 0.9607 - val_loss: 0.7315 - val_accuracy: 0.8348
Epoch 43/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1013 - accuracy: 0.9598 - val_loss: 0.6618 - val_accuracy: 0.8333
Epoch 44/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0938 - accuracy: 0.9637 - val_loss: 0.7261 - val_accuracy: 0.8273
Epoch 45/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0941 - accuracy: 0.9627 - val_loss: 0.7338 - val_accuracy: 0.8279
Epoch 46/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0913 - accuracy: 0.9640 - val_loss: 0.8022 - val_accuracy: 0.8339
Epoch 47/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0849 - accuracy: 0.9672 - val_loss: 0.7733 - val_accuracy: 0.8305
Epoch 48/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0839 - accuracy: 0.9679 - val_loss: 0.8097 - val_accuracy: 0.8351
Epoch 49/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0822 - accuracy: 0.9686 - val_loss: 0.8593 - val_accuracy: 0.8363
Epoch 50/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0792 - accuracy: 0.9694 - val_loss: 0.8464 - val_accuracy: 0.8343
Epoch 51/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0766 - accuracy: 0.9709 - val_loss: 0.8365 - val_accuracy: 0.8360
Epoch 52/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0683 - accuracy: 0.9743 - val_loss: 0.9086 - val_accuracy: 0.8327
Epoch 53/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0726 - accuracy: 0.9721 - val_loss: 0.9122 - val_accuracy: 0.8352
Epoch 54/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0626 - accuracy: 0.9765 - val_loss: 0.9309 - val_accuracy: 0.8290
Epoch 55/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0742 - accuracy: 0.9711 - val_loss: 0.9134 - val_accuracy: 0.8314
Epoch 56/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0594 - accuracy: 0.9771 - val_loss: 0.9703 - val_accuracy: 0.8296
Epoch 57/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0592 - accuracy: 0.9777 - val_loss: 0.9761 - val_accuracy: 0.8267
Epoch 58/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0619 - accuracy: 0.9764 - val_loss: 0.9635 - val_accuracy: 0.8291
Epoch 59/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0585 - accuracy: 0.9777 - val_loss: 0.9953 - val_accuracy: 0.8311
Epoch 60/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0535 - accuracy: 0.9805 - val_loss: 1.0472 - val_accuracy: 0.8281
Epoch 61/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0503 - accuracy: 0.9809 - val_loss: 1.0811 - val_accuracy: 0.8307
Epoch 62/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0492 - accuracy: 0.9814 - val_loss: 1.1155 - val_accuracy: 0.8359
Epoch 63/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0521 - accuracy: 0.9813 - val_loss: 1.1467 - val_accuracy: 0.8324
Epoch 64/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0465 - accuracy: 0.9823 - val_loss: 1.1086 - val_accuracy: 0.8286
Epoch 65/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0453 - accuracy: 0.9834 - val_loss: 1.1806 - val_accuracy: 0.8213
Epoch 66/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0457 - accuracy: 0.9829 - val_loss: 1.1553 - val_accuracy: 0.8266
Epoch 67/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0521 - accuracy: 0.9806 - val_loss: 1.1109 - val_accuracy: 0.8237
Epoch 68/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0488 - accuracy: 0.9822 - val_loss: 1.1458 - val_accuracy: 0.8236
Epoch 69/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0400 - accuracy: 0.9856 - val_loss: 1.2181 - val_accuracy: 0.8319
Epoch 70/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0411 - accuracy: 0.9846 - val_loss: 1.2346 - val_accuracy: 0.8304
Epoch 71/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0433 - accuracy: 0.9839 - val_loss: 1.1918 - val_accuracy: 0.8281
Epoch 72/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0376 - accuracy: 0.9864 - val_loss: 1.3038 - val_accuracy: 0.8265
Epoch 73/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0361 - accuracy: 0.9870 - val_loss: 1.3390 - val_accuracy: 0.8274
Epoch 74/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0379 - accuracy: 0.9862 - val_loss: 1.2512 - val_accuracy: 0.8244
Epoch 75/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0419 - accuracy: 0.9852 - val_loss: 1.3643 - val_accuracy: 0.8255
Epoch 76/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0421 - accuracy: 0.9846 - val_loss: 1.2699 - val_accuracy: 0.8267
Epoch 77/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0398 - accuracy: 0.9858 - val_loss: 1.3021 - val_accuracy: 0.8292
Epoch 78/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0304 - accuracy: 0.9898 - val_loss: 1.3497 - val_accuracy: 0.8275
Epoch 79/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0407 - accuracy: 0.9856 - val_loss: 1.3319 - val_accuracy: 0.8291
Epoch 80/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0430 - accuracy: 0.9869 - val_loss: 1.3290 - val_accuracy: 0.8302
Epoch 81/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0337 - accuracy: 0.9891 - val_loss: 1.3899 - val_accuracy: 0.8301
Epoch 82/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0315 - accuracy: 0.9892 - val_loss: 1.3707 - val_accuracy: 0.8276
Epoch 83/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0336 - accuracy: 0.9875 - val_loss: 1.3784 - val_accuracy: 0.8274
Epoch 84/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0345 - accuracy: 0.9875 - val_loss: 1.4005 - val_accuracy: 0.8295
Epoch 85/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0307 - accuracy: 0.9893 - val_loss: 1.3823 - val_accuracy: 0.8269
Epoch 86/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0401 - accuracy: 0.9862 - val_loss: 1.4838 - val_accuracy: 0.8297
Epoch 87/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0352 - accuracy: 0.9872 - val_loss: 1.4347 - val_accuracy: 0.8322
Epoch 88/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0284 - accuracy: 0.9894 - val_loss: 1.4827 - val_accuracy: 0.8289
Epoch 89/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0369 - accuracy: 0.9873 - val_loss: 1.4705 - val_accuracy: 0.8270
Epoch 90/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0313 - accuracy: 0.9889 - val_loss: 1.5390 - val_accuracy: 0.8243
Epoch 91/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0290 - accuracy: 0.9895 - val_loss: 1.4780 - val_accuracy: 0.8302
Epoch 92/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0281 - accuracy: 0.9900 - val_loss: 1.5518 - val_accuracy: 0.8297
Epoch 93/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0284 - accuracy: 0.9901 - val_loss: 1.5659 - val_accuracy: 0.8321
Epoch 94/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0306 - accuracy: 0.9893 - val_loss: 1.4831 - val_accuracy: 0.8287
Epoch 95/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0359 - accuracy: 0.9867 - val_loss: 1.5319 - val_accuracy: 0.8230
Epoch 96/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0336 - accuracy: 0.9881 - val_loss: 1.5192 - val_accuracy: 0.8311
Epoch 97/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0279 - accuracy: 0.9902 - val_loss: 1.4872 - val_accuracy: 0.8316
Epoch 98/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0269 - accuracy: 0.9910 - val_loss: 1.5875 - val_accuracy: 0.8327
Epoch 99/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0270 - accuracy: 0.9902 - val_loss: 1.4886 - val_accuracy: 0.8284
Epoch 100/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0332 - accuracy: 0.9882 - val_loss: 1.5127 - val_accuracy: 0.8242
CPU times: user 7min 20s, sys: 19.6 s, total: 7min 39s
Wall time: 1min 51s
# from keras.utils.vis_utils import plot_model
# plot_model(model, to_file='model.png')
# import matplotlib.pyplot as plt
# plt.plot(history[0]['accuracy'])
# plt.plot(history[0]['val_accuracy'])
# plt.title('model accuracy')
# plt.ylabel('accuracy')
# plt.xlabel('epoch')
# plt.legend(['train', 'val'], loc='upper left')
# plt.show()
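The commented block above assumes history can be indexed like a list of dicts; here, fit returns the pipeline itself, and the Keras History object lives on the underlying model of the fitted wrapper. A hedged sketch of how the curves could be plotted ('kerasclassifier' is the step name make_pipeline assigns; the attribute layout may vary across Keras versions):
# tf.keras stores the last fit()'s History on model.history inside the wrapped classifier step
keras_step = embeddings_pipeline.named_steps['kerasclassifier']
hist = keras_step.model.history.history  # dict of per-epoch lists: 'loss', 'accuracy', 'val_loss', 'val_accuracy'
plt.plot(hist['accuracy'], label='train')
plt.plot(hist['val_accuracy'], label='val')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()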
import keras
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import Dropout
categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)
def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric),
                                          ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low + categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low),
                                   (mode, high_cardinality_encoder, categorical_high),
                                   ("scale", scaler, numeric)])

    def threeLayerFeedForward():
        model = Sequential()
        model.add(keras.layers.Dense(300, activation=tf.nn.relu, kernel_initializer='glorot_uniform'))
        model.add(keras.layers.Dropout(0.4))
        model.add(keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer='glorot_uniform'))
        model.add(keras.layers.Dropout(0.4))
        model.add(keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer='glorot_uniform'))
        model.add(keras.layers.Dropout(0.4))
        model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid, kernel_initializer='glorot_uniform')) # tf.nn.softmax if multiclass
        optimizer = tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Adamax')
        # optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam')
        model.compile(optimizer=optimizer, # 'adam' or SGD() also work
                      loss='binary_crossentropy', # categorical_crossentropy for one-hot multiclass targets
                      metrics=['accuracy'])
        return model

    es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    # clf = KerasClassifier(TwoLayerFeedForward(), epochs=100, batch_size=500, verbose=0)
    model = KerasClassifier(threeLayerFeedForward, verbose=1, validation_split=0.05, shuffle=True, epochs=100, batch_size=512, callbacks=[es_callback]) #batch_size=32
    return make_pipeline(imputer, processor, model) #RandomForestClassifier() #LogisticRegression()
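Note that EarlyStopping as configured keeps the weights from the last epoch that ran, not from the best validation epoch. If the latter is wanted, the callback inside build_pipeline could instead be created with restore_best_weights (a small tweak, not used in the run below):
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)  # roll back to the lowest-val_loss epoch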
%%time
embeddings_pipeline = build_pipeline("embeddings")
history = embeddings_pipeline.fit(X_train, y_train)
/tmp/ipykernel_700621/3919486494.py:57: DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead. See https://www.adriangb.com/scikeras/stable/migration.html for help migrating.
model = KerasClassifier(threeLayerFeedForward, verbose=1, validation_split=0.05, shuffle=True, epochs=100, batch_size=512, callbacks=[es_callback]) #batch_size=32
Epoch 1/100
167/167 [==============================] - 1s 5ms/step - loss: 0.4341 - accuracy: 0.7957 - val_loss: 0.4002 - val_accuracy: 0.8182
Epoch 2/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3534 - accuracy: 0.8501 - val_loss: 0.3905 - val_accuracy: 0.8218
Epoch 3/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3447 - accuracy: 0.8528 - val_loss: 0.3935 - val_accuracy: 0.8253
Epoch 4/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3396 - accuracy: 0.8553 - val_loss: 0.3886 - val_accuracy: 0.8260
Epoch 5/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3367 - accuracy: 0.8573 - val_loss: 0.3838 - val_accuracy: 0.8269
Epoch 6/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3344 - accuracy: 0.8592 - val_loss: 0.3834 - val_accuracy: 0.8278
Epoch 7/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3307 - accuracy: 0.8597 - val_loss: 0.3816 - val_accuracy: 0.8267
Epoch 8/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3289 - accuracy: 0.8610 - val_loss: 0.3795 - val_accuracy: 0.8269
Epoch 9/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3264 - accuracy: 0.8620 - val_loss: 0.3776 - val_accuracy: 0.8307
Epoch 10/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3238 - accuracy: 0.8642 - val_loss: 0.3795 - val_accuracy: 0.8298
Epoch 11/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3241 - accuracy: 0.8641 - val_loss: 0.3786 - val_accuracy: 0.8322
Epoch 12/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3193 - accuracy: 0.8656 - val_loss: 0.3772 - val_accuracy: 0.8327
Epoch 13/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3202 - accuracy: 0.8663 - val_loss: 0.3740 - val_accuracy: 0.8342
Epoch 14/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3167 - accuracy: 0.8672 - val_loss: 0.3747 - val_accuracy: 0.8331
Epoch 15/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3158 - accuracy: 0.8680 - val_loss: 0.3710 - val_accuracy: 0.8358
Epoch 16/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3159 - accuracy: 0.8697 - val_loss: 0.3698 - val_accuracy: 0.8360
Epoch 17/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3130 - accuracy: 0.8697 - val_loss: 0.3688 - val_accuracy: 0.8369
Epoch 18/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3109 - accuracy: 0.8706 - val_loss: 0.3679 - val_accuracy: 0.8384
Epoch 19/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3099 - accuracy: 0.8713 - val_loss: 0.3657 - val_accuracy: 0.8378
Epoch 20/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3077 - accuracy: 0.8724 - val_loss: 0.3652 - val_accuracy: 0.8380
Epoch 21/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3076 - accuracy: 0.8719 - val_loss: 0.3635 - val_accuracy: 0.8387
Epoch 22/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3064 - accuracy: 0.8734 - val_loss: 0.3649 - val_accuracy: 0.8364
Epoch 23/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3036 - accuracy: 0.8738 - val_loss: 0.3636 - val_accuracy: 0.8407
Epoch 24/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3024 - accuracy: 0.8753 - val_loss: 0.3615 - val_accuracy: 0.8422
Epoch 25/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3012 - accuracy: 0.8756 - val_loss: 0.3643 - val_accuracy: 0.8382
Epoch 26/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3007 - accuracy: 0.8766 - val_loss: 0.3591 - val_accuracy: 0.8456
Epoch 27/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2979 - accuracy: 0.8764 - val_loss: 0.3595 - val_accuracy: 0.8420
Epoch 28/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2971 - accuracy: 0.8765 - val_loss: 0.3581 - val_accuracy: 0.8442
Epoch 29/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2968 - accuracy: 0.8770 - val_loss: 0.3579 - val_accuracy: 0.8398
Epoch 30/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2935 - accuracy: 0.8793 - val_loss: 0.3575 - val_accuracy: 0.8436
Epoch 31/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2926 - accuracy: 0.8785 - val_loss: 0.3597 - val_accuracy: 0.8416
Epoch 32/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2921 - accuracy: 0.8803 - val_loss: 0.3559 - val_accuracy: 0.8458
Epoch 33/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2904 - accuracy: 0.8804 - val_loss: 0.3551 - val_accuracy: 0.8444
Epoch 34/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2901 - accuracy: 0.8802 - val_loss: 0.3555 - val_accuracy: 0.8418
Epoch 35/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2880 - accuracy: 0.8816 - val_loss: 0.3516 - val_accuracy: 0.8458
Epoch 36/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2873 - accuracy: 0.8814 - val_loss: 0.3551 - val_accuracy: 0.8451
Epoch 37/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2876 - accuracy: 0.8815 - val_loss: 0.3571 - val_accuracy: 0.8458
Epoch 38/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2853 - accuracy: 0.8826 - val_loss: 0.3512 - val_accuracy: 0.8473
Epoch 39/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2844 - accuracy: 0.8829 - val_loss: 0.3523 - val_accuracy: 0.8462
Epoch 40/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2830 - accuracy: 0.8829 - val_loss: 0.3554 - val_accuracy: 0.8520
Epoch 41/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2828 - accuracy: 0.8842 - val_loss: 0.3530 - val_accuracy: 0.8511
Epoch 42/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2798 - accuracy: 0.8853 - val_loss: 0.3543 - val_accuracy: 0.8473
Epoch 43/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2806 - accuracy: 0.8859 - val_loss: 0.3523 - val_accuracy: 0.8478
Epoch 44/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2795 - accuracy: 0.8861 - val_loss: 0.3570 - val_accuracy: 0.8473
Epoch 45/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2773 - accuracy: 0.8858 - val_loss: 0.3496 - val_accuracy: 0.8476
Epoch 46/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2770 - accuracy: 0.8867 - val_loss: 0.3506 - val_accuracy: 0.8551
Epoch 47/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2772 - accuracy: 0.8869 - val_loss: 0.3527 - val_accuracy: 0.8484
Epoch 48/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2747 - accuracy: 0.8870 - val_loss: 0.3520 - val_accuracy: 0.8518
Epoch 49/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2734 - accuracy: 0.8881 - val_loss: 0.3575 - val_accuracy: 0.8500
Epoch 50/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2733 - accuracy: 0.8882 - val_loss: 0.3517 - val_accuracy: 0.8544
Epoch 51/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2728 - accuracy: 0.8884 - val_loss: 0.3537 - val_accuracy: 0.8542
Epoch 52/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2730 - accuracy: 0.8887 - val_loss: 0.3493 - val_accuracy: 0.8507
Epoch 53/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2696 - accuracy: 0.8904 - val_loss: 0.3528 - val_accuracy: 0.8511
Epoch 54/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2701 - accuracy: 0.8887 - val_loss: 0.3534 - val_accuracy: 0.8478
Epoch 55/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2705 - accuracy: 0.8895 - val_loss: 0.3549 - val_accuracy: 0.8531
Epoch 56/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2691 - accuracy: 0.8902 - val_loss: 0.3511 - val_accuracy: 0.8529
Epoch 57/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2680 - accuracy: 0.8900 - val_loss: 0.3499 - val_accuracy: 0.8536
Epoch 58/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2666 - accuracy: 0.8908 - val_loss: 0.3526 - val_accuracy: 0.8531
Epoch 59/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2661 - accuracy: 0.8922 - val_loss: 0.3504 - val_accuracy: 0.8520
Epoch 60/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2647 - accuracy: 0.8920 - val_loss: 0.3479 - val_accuracy: 0.8538
Epoch 61/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2633 - accuracy: 0.8918 - val_loss: 0.3533 - val_accuracy: 0.8536
Epoch 62/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2654 - accuracy: 0.8919 - val_loss: 0.3530 - val_accuracy: 0.8524
Epoch 63/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2640 - accuracy: 0.8922 - val_loss: 0.3489 - val_accuracy: 0.8533
Epoch 64/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2622 - accuracy: 0.8923 - val_loss: 0.3552 - val_accuracy: 0.8502
Epoch 65/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2603 - accuracy: 0.8938 - val_loss: 0.3524 - val_accuracy: 0.8547
Epoch 66/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2614 - accuracy: 0.8938 - val_loss: 0.3515 - val_accuracy: 0.8576
Epoch 67/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8942 - val_loss: 0.3507 - val_accuracy: 0.8544
Epoch 68/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8946 - val_loss: 0.3518 - val_accuracy: 0.8542
Epoch 69/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8949 - val_loss: 0.3519 - val_accuracy: 0.8553
Epoch 70/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2569 - accuracy: 0.8956 - val_loss: 0.3536 - val_accuracy: 0.8569
CPU times: user 6min 8s, sys: 16.6 s, total: 6min 25s
Wall time: 1min 37s
# Pipeline.fit returns the fitted pipeline itself, so "history" here is the pipeline; extract the test set probabilities
preds_test = history.predict_proba(X_test)
%%time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin
# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)
f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.show()
# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)
score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)
df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.show()
from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, preds_test[:,1])
roc = roc_auc_score(y_test, preds_test[:,1])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    # marker='.',
    label=f"ROC curve (area ={'%.2f' % roc})",
)
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("NNet Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_nnet_roc_curve.png', bbox_inches='tight', dpi = 300)
plt.show()
Best Threshold=0.348351, G-Mean=0.858
CPU times: user 1.36 s, sys: 657 ms, total: 2.02 s
Wall time: 1.19 s
# best_preds_score = np.where(preds_test < bt, 0, 1)  # Uncomment to binarize at the Brier-optimal threshold bt found above (lower Brier is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, preds_test[:,1], squared=False)))  # squared=False returns the RMSE
print("Roc_auc = {}".format(roc_auc_score(y_test, preds_test[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, preds_test[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, preds_test[:,1])))
# print("Precision = {}".format(precision_score(y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(y_test, preds_test[:,1])))
mean_squared_error_test = 0.32670411986655384
Roc_auc = 0.9300543679940618
Brier_error = 0.10673558193777959
Logloss_test = nan
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):  # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.348, Roc_auc=0.85848
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.435, Brier=0.14370
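Since each model's probability metrics were printed in separate cells (and the train/test split was re-drawn between sections), a small helper makes the comparison easier to reproduce per split. A hedged sketch, applied here to the neural-network predictions that are still in memory:
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss, mean_squared_error

def probability_report(y_true, proba):
    """Collect the probability metrics used throughout this comparison into one dict."""
    proba = np.clip(proba, 1e-7, 1 - 1e-7)  # keep probabilities strictly inside (0, 1) before taking logs
    return {
        "rmse": mean_squared_error(y_true, proba, squared=False),
        "roc_auc": roc_auc_score(y_true, proba),
        "brier": brier_score_loss(y_true, proba),
        "log_loss": log_loss(y_true, proba),
    }

print(pd.Series(probability_report(y_test, preds_test[:, 1])))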
https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classifica
https://machinelearningmastery.com/how-to-score-probability-predictions-in-python/
https://www.machinelearningplus.com/statistics/brier-score/
0.0.5.0
- Guilherme Giuliano Nicolau: @ggnicolau (https://github.com/ggnicolau)