
An example with categorical features #51

Open
prcastro opened this issue Apr 25, 2019 · 2 comments
Labels: documentation (Missing documentation or improvements in the existing one), good first issue (Good for newcomers)

Comments

@prcastro
Contributor

Add a tutorial on dealing with categorical features in a machine learning problem, including the use of the tools in fklearn.training.transformation.
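
A minimal sketch of the kind of snippet such a tutorial could start from (not an official example; the col==value column naming is assumed from the worked example later in this thread):

import pandas as pd
from fklearn.training.transformation import onehot_categorizer

toy = pd.DataFrame({'color': ['red', 'blue', 'red'], 'y': [1, 0, 1]})
# onehot_categorizer is curried: configure it first, then apply it to a dataframe
encode_fn, toy_encoded, log = onehot_categorizer(columns_to_categorize=['color'])(toy)
# toy_encoded now carries columns like color==blue / color==red instead of 'color';
# encode_fn applies the same mapping to new data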

@prcastro added the documentation label on Apr 25, 2019
@prcastro changed the title from "An example of using categorical features" to "An example with categorical features" on Apr 25, 2019
@caique-lima added the good first issue label on Apr 26, 2019
@victor-ab
Contributor

@prcastro, is there a way to include a transformation tool like the onehot_categorizer inside a pipeline? It generates new columns, so won't that break the pipeline?
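
For reference, a rough sketch of the pattern (the worked example below follows it): put the categorizer first in build_pipeline and configure the downstream steps with the post-encoding column names, assumed here to follow the col==value convention.

import pandas as pd
from fklearn.training.pipeline import build_pipeline
from fklearn.training.transformation import onehot_categorizer
from fklearn.training.classification import logistic_classification_learner

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue'], 'y': [1, 0, 1, 0]})
# The categorizer runs first, so the learner can already reference the encoded columns.
pipeline = build_pipeline(
    onehot_categorizer(columns_to_categorize=['color']),
    logistic_classification_learner(features=['color==blue', 'color==red'], target='y'),
)
predict_fn, scored, logs = pipeline(df)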

@vultor33
Contributor

vultor33 commented May 22, 2019

I am not from Nubank; I just put this example together for myself and am sharing it here.

# This is not an official example; it comes with no warranty of any kind.

import pandas as pd
# DATA WAS OBTAINED AT: https://www.kaggle.com/c/titanic/data
# IT WAS ALSO ADDED TO THE COMMENT IN FKLEARN ISSUE #51
DATA_FILE = 'titanic-train.txt'
data = pd.read_csv(DATA_FILE, delimiter=',', dtype=str)
data.loc[:, 'Age'] = data.loc[:, 'Age'].astype(float)
data.loc[:, 'Fare'] = data.loc[:, 'Fare'].astype(float)
data.loc[:, 'Parch'] = data.loc[:, 'Parch'].astype(int)
data.loc[:, 'Pclass'] = data.loc[:, 'Pclass'].astype(int)
data.loc[:, 'SibSp'] = data.loc[:, 'SibSp'].astype(int)
AUXILIARY = ['PassengerId', 'Name', 'Cabin', 'Ticket']
TARGET = ['Survived']
FEATURES = set(data.columns) - set(AUXILIARY) - set(TARGET)


from fklearn.training.transformation import onehot_categorizer
# ONE HOT ENCODER DEFINITION
my_onehotencoder = onehot_categorizer(columns_to_categorize=['Embarked', 'Sex'])
# FEATURE NAMES ARE NEEDED (SEE ISSUE #68 FOR MORE)
_, data_after_encoding, _ = my_onehotencoder(data)  # applying the encoder to the training dataset
NEW_FEATURES = ['Pclass',
                'Age',
                'SibSp',
                'Parch',
                'Fare',
                'Embarked==C',
                'Embarked==Q',
                'Embarked==S',
                'Sex==female',
                'Sex==male']  # These names appear in "data_after_encoding"; I just typed them out here.
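# An alternative (untested sketch, not from the original comment): derive the
# encoded names from the dataframe instead of typing them by hand, e.g.
# NEW_FEATURES = [c for c in data_after_encoding.columns
#                 if c not in AUXILIARY + TARGET]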


from fklearn.training.imputation import imputer
from fklearn.training.transformation import standard_scaler
from fklearn.training.classification import xgb_classification_learner
# SOME OTHER TRANSFORMATIONS
my_imputer = imputer(columns_to_impute=NEW_FEATURES, impute_strategy='median')
my_scaler = standard_scaler(columns_to_scale=NEW_FEATURES)
# MODEL DEFINITION
my_model = xgb_classification_learner(features=NEW_FEATURES,
                                      target=TARGET[0])



from fklearn.training.pipeline import build_pipeline
# PIPELINE DEFINITION
my_learner = build_pipeline(my_onehotencoder, my_imputer, my_scaler, my_model)
# TRAINING
prediction_function, data_trained, logs = my_learner(data)



# EVALUATION
from sklearn.metrics import accuracy_score
# 'Survived' was read in as a string column, so threshold the predicted
# probability and compare it as a string as well.
Survived_prediction = ['1' if p > 0.5 else '0' for p in data_trained.prediction]
print('Train accuracy: ', accuracy_score(data_trained.Survived, Survived_prediction))
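
The evaluation above is on the training set only. As a rough, untested sketch (not part of the original example), the returned prediction_function can also score held-out data, since it replays the whole pipeline on a new dataframe:

# UNTESTED SKETCH: fit on a training split and score a held-out split
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=0)
prediction_function, _, _ = my_learner(train)   # fit the whole pipeline on the training split
scored_test = prediction_function(test)         # apply encoder, imputer, scaler and model to new data
test_prediction = ['1' if p > 0.5 else '0' for p in scored_test.prediction]
print('Test accuracy: ', accuracy_score(scored_test.Survived, test_prediction))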

DATASET

titanic-train.txt
