MLHackFest 2019 Repo
- Create your own `conda` or virtual environment.
- Run `pip install -r requirements.txt`.
- Add Kaggle API credentials. See instructions here.
- Get wild.

The `test` folder contains all experiment notebooks from Kaggle competitions, built on the pipeline created in `utils/models`.

The `src` folder is reserved for competition purposes. Copy `tests/utils` into this folder to use the modules.

The `data` folder is used for downloading datasets from Kaggle using the Kaggle API.

The `src` and `test` folders must each have their own `submission` folder when generating predictions in competitions/challenges, for easier tracking of the submission files.

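The folder convention above can be set up with a few lines of Python. This is only a sketch: `ensure_submission_dirs` is a hypothetical helper, not part of this repo's `utils`, and it assumes the `submission` folder sits directly under `src` and `test`.

```python
from pathlib import Path
import tempfile


def ensure_submission_dirs(repo_root=".", parents=("src", "test")):
    """Create <parent>/submission under repo_root for each parent folder.

    Hypothetical helper for illustration; existing folders are left untouched.
    """
    created = []
    for parent in parents:
        path = Path(repo_root) / parent / "submission"
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrate in a throwaway directory so the working tree is not modified.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_submission_dirs(tmp)
    print([d.relative_to(tmp).as_posix() for d in dirs])
    # prints ['src/submission', 'test/submission']
```

Calling `ensure_submission_dirs()` from the repo root would create the folders in place.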
- You must join the Kaggle competition first before you can download its dataset through the Kaggle API.
- When using `CatBoostCV`, specify the `obj` in init, i.e. `regression` or `binary`, so that the correct algorithm is used. `LGBMCV` works for both, so there is no need to specify the `obj`.
- When dealing with a regression problem, transform the target with `np.log` for easier training, then transform the predictions back to the original scale using `np.exp`. If negative values are encountered in the prediction values, just use the `pd.Series.clip` function to clip the values to the target's min and max.
- When dealing with a classification problem on a large dataset (500K to 1M+ instances), consider downsampling the majority class for a faster feedback loop. Don't use SMOTE or other stuff, that doesn't work!
- When nothing else works, use target encoding under `utils/cat_encoding.py`. It will automagically make the model better, but of course make sure you have a solid CV and DO NOT OVERFIT.
- RandomForest is the only model in sklearn that Rafael trusts for use in competitions. If you ensemble/stack predictions, use LogisticRegression.
- Submission files make it easier for us to check whether our cross-validation score correlates with the Public Leaderboard on Kaggle. Name each submission file using the format `{model_used}_{challenge}_{my_cv_score}.csv`.
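
Three of the tips above (the log/exp round trip with clipping, majority-class downsampling, and submission file naming) can be sketched roughly as follows. All function names here are hypothetical and are not the helpers in `utils`. The sketch also assumes a strictly positive regression target, downsampling every class to the size of the smallest one, and five decimal places for the CV score in the file name.

```python
import numpy as np
import pandas as pd


def log_target(y: pd.Series) -> pd.Series:
    """Log-transform a strictly positive regression target for training."""
    return np.log(y)


def restore_predictions(log_preds, y_train: pd.Series) -> pd.Series:
    """Map log-scale predictions back with np.exp, then clip to the training target's range."""
    preds = pd.Series(np.exp(log_preds))
    return preds.clip(lower=y_train.min(), upper=y_train.max())


def downsample_majority(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Randomly downsample each class to the size of the smallest class (an assumed ratio)."""
    n_minority = int(df[target].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


def submission_name(model_used: str, challenge: str, cv_score: float) -> str:
    """Build a file name following {model_used}_{challenge}_{my_cv_score}.csv."""
    return f"{model_used}_{challenge}_{cv_score:.5f}.csv"


# Regression: train on log_target(y_train), then restore the model's predictions.
y_train = pd.Series([10.0, 100.0, 1000.0])
log_preds = np.array([0.0, np.log(50.0), 10.0])  # stand-in for model output
preds = restore_predictions(log_preds, y_train)   # values now lie in [10, 1000]

# Classification: balance an 8-vs-2 toy dataset down to 2-vs-2.
toy = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})
balanced = downsample_majority(toy, "label")

print(submission_name("lgbm", "challenge1", 0.87654))
# prints lgbm_challenge1_0.87654.csv
```

Clipping to the training target's range is one reading of "its min, and max"; bounds chosen from domain knowledge would work the same way with `pd.Series.clip`.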