README

This repository contains the implementation of the project for the Machine Learning course.

Models and Classifiers

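Numerous models were tested; each has its own script or notebook in this repository (for example NB_on_count_vector.py, nb_svm++_count_vector_ngrams.py, multi_task_GRU_model.ipynb and Universal_Sentence_Encoder.ipynb).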

Preprocessing steps applied to the dataset:

Publishers are extracted from the URLs so that articles can be grouped by publisher (useful for splitting them into training and test sets).
The most common sentences (at least 2 words long) that occurred at least 50 times are removed, because these are mostly advertisements or footnotes on the scraped sites.
Whitespace and filler characters are removed ('\n', '....').
The word 'Advertisement' is removed.
Only articles with amount_of_tokens > 400 && amount_of_tokens < 3000 are kept. An A4 page can hold 3000 characters.
All articles by a given publisher appear in either the training set or the test set, never both.
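
The steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in this repository. The field names ("url", "text"), the function names, the naive sentence splitter, the whitespace tokenisation and the 20% test fraction are all assumptions made for the example.

```python
import random
import re
from collections import Counter
from urllib.parse import urlparse

# All names below (the "url"/"text" fields, function names, defaults) are
# illustrative assumptions, not taken from the repository's scripts.


def extract_publisher(url):
    """Use the host name (without a leading 'www.') as the publisher key."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_common_sentences(texts, min_words=2, min_count=50):
    """Sentences of at least min_words words seen at least min_count times."""
    counts = Counter(
        s
        for text in texts
        for s in split_sentences(text)
        if len(s.split()) >= min_words
    )
    return {s for s, n in counts.items() if n >= min_count}


def clean_text(text, common_sentences):
    """Drop common sentences, 'Advertisement', dot runs and extra whitespace."""
    kept = [s for s in split_sentences(text) if s not in common_sentences]
    text = " ".join(kept).replace("Advertisement", " ")
    text = re.sub(r"\.{2,}", " ", text)  # runs of dots such as '....'
    return re.sub(r"\s+", " ", text).strip()  # newlines, tabs, double spaces


def has_valid_length(text, min_tokens=400, max_tokens=3000):
    """True if min_tokens < number of whitespace-separated tokens < max_tokens."""
    return min_tokens < len(text.split()) < max_tokens


def split_by_publisher(articles, test_fraction=0.2, seed=0):
    """Assign whole publishers to either the training set or the test set."""
    publishers = sorted({extract_publisher(a["url"]) for a in articles})
    random.Random(seed).shuffle(publishers)
    n_test = max(1, round(len(publishers) * test_fraction))
    test_publishers = set(publishers[:n_test])
    train = [a for a in articles if extract_publisher(a["url"]) not in test_publishers]
    test = [a for a in articles if extract_publisher(a["url"]) in test_publishers]
    return train, test
```

A typical call order would be: compute the common sentences over all article texts, clean each text, filter with has_valid_length, then call split_by_publisher on the remaining articles. If scikit-learn is available, its GroupShuffleSplit class provides a comparable group-wise split.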

Files in this repository:

.ipynb_checkpoints
NB_on_count_vector.py
NB_on_tf_idf_vector.py
README.md
Universal_Sentence_Encoder.ipynb
calculate_count_vector.py
calculate_tf-idf_vector.py
check_schema.ipynb
common_sentences.csv
descriptive_analysis.ipynb
hyperp_task_GRU_model.ipynb
model_hyperp.h5
multi_task_GRU_model.ipynb
multi_task_GRU_model_glove.ipynb
nb_svm++_count_vector_ngrams.py
results_baselines.txt