- first, we need to analyze the data, which is the most important part
- the data is provided in CSV format, which I read with pandas
- the columns are ['index', 'title', 'genre', 'summary']
- the 'genre' column will be used as the label
- the 'summary' column will be our feature
- the distribution of the 'genre' column shows the data is imbalanced
- since the data is imbalanced, accuracy is not a good metric; we will check precision, recall, F1-score, and the confusion matrix
- the summaries also include non-alphabet characters, which should be cleaned for both the train and test data
- the NLTK library and word2vec are used for word preprocessing
thriller      1023
fantasy        876
science        647
history        600
horror         600
crime          500
romance        111
psychology     100
sports         100
travel         100
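The loading and cleaning steps above can be sketched as follows. The `clean_text` helper is an illustrative implementation of "remove non-alphabet characters"; the commented pandas lines assume the column names listed earlier (stop-word removal via NLTK would be an additional, optional step):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip non-alphabetic characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)    # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical usage on the dataset (column names from the notes above):
# import pandas as pd
# df = pd.read_csv("Data/data.csv")
# print(df["genre"].value_counts())          # reveals the class imbalance
# df["summary"] = df["summary"].map(clean_text)

print(clean_text("A Thriller, set in 1920s Chicago!"))  # a thriller set in s chicago
```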
- then, the next step is the model architecture:
- since supervised data is provided (sequence → label), this is a sequence-modeling problem
- it is a many-to-one architecture, as in recurrent problems
- since the input sequences are long, gated architectures like GRU (faster) and LSTM are preferred to avoid the vanishing-gradient problem
- Attention and the Transformer (attention without an RNN) can also be applied
- the final layer is a Dense layer with softmax activation, since the problem is multiclass classification
- as a pretrained model, we can use RoBERTa from HuggingFace with PyTorch code; it is based on the Transformer architecture (a modified version of BERT) and is also used for text classification
- since the number of model parameters exceeds the number of training samples, we will encounter overfitting; regularization techniques can be applied (Dropout, early stopping, weight decay, ...)
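A minimal PyTorch sketch of the many-to-one GRU described above. The vocabulary size, embedding/hidden dimensions, and the 10 classes (matching the genre counts above) are placeholder assumptions; `CrossEntropyLoss` applies softmax internally, so the final Dense layer is a plain `Linear` here:

```python
import torch
import torch.nn as nn

class GenreClassifier(nn.Module):
    """Many-to-one GRU: a summary (token-id sequence) in, one genre out."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128,
                 num_classes=10, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)       # regularization against overfit
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        emb = self.embedding(x)                  # (batch, seq_len, embed_dim)
        _, h = self.gru(emb)                     # h: (1, batch, hidden_dim)
        return self.fc(self.dropout(h[-1]))      # last hidden state -> class logits

model = GenreClassifier()
dummy = torch.randint(1, 5000, (4, 20))          # batch of 4 fake summaries
print(model(dummy).shape)                        # torch.Size([4, 10])
```

Swapping `nn.GRU` for `nn.LSTM` changes only the hidden-state unpacking (LSTM also returns a cell state).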
Config
Config.py : config reader for the JSON file
config.json : contains the parameters to set, like learning rate, batch size, etc.
Data
data.csv : our dataset
modules:
callbacks.py : list of callbacks called in the training loop / model fit
datareader.py : reads data as a pandas dataframe, extracts and preprocesses features and labels, splits train and validation sets
evaluation.py : metrics for evaluation (customizable)
model.py : the model architecture
train.py : model training and saving
inference.py : for testing
Training loss(for one batch) at step 0:2.3178
Training loss(for one batch) at step 200:0.0169
Training loss(for one batch) at step 400:0.0010
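Loss lines like the ones above come from a training loop of roughly this shape — a sketch with a stand-in linear model and random data; in the repo, the model and batches would come from model.py and datareader.py:

```python
import torch
import torch.nn as nn

# Stand-in model and data; replace with the real model and dataloader.
model = nn.Linear(10, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(401):
    x = torch.randn(32, 10)                  # fake batch of features
    y = torch.randint(0, 3, (32,))           # fake labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        print(f"Training loss(for one batch) at step {step}:{loss.item():.4f}")
```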
- we need three files:
- Dockerfile (basically for one service)
- docker-compose.yaml (for serving several services) [for our problem there is just one service: the Django app]
- requirements.txt [packages to be installed (referenced in the Dockerfile), especially django]
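Hedged sketches of the first two files; the service name `app`, Python version, working directory, and port are assumptions, not taken from the repo:

```dockerfile
# Dockerfile
FROM python:3.9-slim
WORKDIR /code
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
```

```yaml
# docker-compose.yaml
services:
  app:
    build: .
    command: python manage.py runserver 0.0.0.0:8000
    volumes:
      - .:/code
    ports:
      - "8000:8000"
```

requirements.txt would then list at least `django` plus the model's dependencies.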
- run these commands in the above directory:
- docker-compose build
- docker-compose run --rm app django-admin startproject DjangoProject .
- run both builds the image (if needed) and runs the command
- after the DjangoProject is added to the current directory, we can add the model prediction to it (go to step Django)
- docker-compose up
as the command 'django-admin startproject DjangoProject' has already been executed via docker compose, we skip this step. we can try two different ways:
- without using startapp to create a new app, and just:
- create a views.py inside the DjangoProject directory
- modify urls.py and settings.py (add DjangoProject to INSTALLED_APPS)
- create a 'templates' directory in DjangoProject, add index.html, and code the templates
- with using startapp:
- after python manage.py startapp DjangoApp, a new app folder will be created
- inside this app folder, create another folder called models, and save the model checkpoint inside it
- add DjangoApp to INSTALLED_APPS in settings.py
- loading the model checkpoint happens inside apps.py, where we create a class
- the prediction code goes in views.py
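A rough sketch of how apps.py and views.py could fit together (not runnable outside a configured Django project; the checkpoint path, endpoint, and response shape are hypothetical):

```python
# DjangoApp/apps.py -- load the checkpoint once at startup
from django.apps import AppConfig

class DjangoAppConfig(AppConfig):
    name = "DjangoApp"

    def ready(self):
        import torch
        # Loaded once when Django starts; reused by every request.
        self.model = torch.load("DjangoApp/models/checkpoint.pt")
        self.model.eval()

# DjangoApp/views.py -- prediction endpoint
from django.apps import apps
from django.http import JsonResponse

def predict(request):
    summary = request.GET.get("summary", "")
    model = apps.get_app_config("DjangoApp").model
    # clean/tokenize summary -> tensor, run model, map argmax to a genre name
    # (preprocessing details omitted here)
    return JsonResponse({"genre": "..."})
```

Loading the model in `AppConfig.ready()` keeps the expensive checkpoint read out of the request path.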