Collection of tutorials on text analytics/NLP, including vector space models, neural language models and topic models on the Pivotal MPP platform (Greenplum/HAWQ).
1. Tokenization, stemming, and generation of unigrams, bigrams, trigrams and skip-grams.
2. Bag-of-words model for classification on the 20-news-groups dataset.
3. tf-idf weighting for classification on the 20-news-groups dataset.
4. Feature hashing for classification on the 20-news-groups dataset.
5. Grid search on model parameters for Elastic Net on the tf-idf representation.
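The feature-extraction steps listed above can be sketched in plain Python. This is a minimal illustration only, not the code used in the notebooks: the function names (`tokenize`, `ngrams`, `skipgrams`, `tf_idf`, `hash_features`) are hypothetical, the tf-idf variant (raw count times `log(N / df)`) is one of several common definitions, and MD5 is used for hashing only because it is available in the standard library.

```python
import hashlib
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, k):
    """Return n-token skip-grams that skip at most k tokens in total."""
    grams = []
    for start in range(len(tokens)):
        # The remaining n-1 tokens are drawn, in order, from the next n-1+k tokens.
        window = tokens[start + 1:start + n + k]
        for rest in combinations(range(len(window)), n - 1):
            grams.append((tokens[start],) + tuple(window[j] for j in rest))
    return grams


def tf_idf(docs):
    """Weight each term by raw count * log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def hash_features(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector via the hashing trick."""
    vec = [0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % n_buckets] += 1
    return vec


tokens = tokenize("The quick, brown fox!")
print(ngrams(tokens, 2))
print(skipgrams(tokens, 2, 1))
print(hash_features(tokens, n_buckets=8))
```

A term that occurs in every document gets a tf-idf weight of zero under this definition, which is why tf-idf tends to down-weight very common words relative to a plain bag-of-words count.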
1. LDA topic models on the 20-news-groups dataset.
2. Grid search for LDA hyperparameters on the 20-news-groups dataset.
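A grid search evaluates a model once per combination of candidate hyperparameter values and keeps the best-scoring combination. The sketch below shows only that enumeration logic; it does not train an LDA model. The helper names (`parameter_grid`, `grid_search`), the grid values, and the parameter names `num_topics`, `alpha` and `beta` are assumptions for illustration, and the scoring function is left to the caller (for example, a held-out likelihood computed elsewhere).

```python
from itertools import product


def parameter_grid(param_values):
    """Yield one dict per combination of the given hyperparameter values."""
    names = sorted(param_values)
    for combo in product(*(param_values[name] for name in names)):
        yield dict(zip(names, combo))


def grid_search(param_values, score_fn):
    """Return (best_params, best_score), where higher scores are better."""
    best_params, best_score = None, float("-inf")
    for params in parameter_grid(param_values):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: names and values are illustrative only.
lda_grid = {"num_topics": [10, 20, 50], "alpha": [0.1, 1.0], "beta": [0.01, 0.1]}
print(len(list(parameter_grid(lda_grid))))  # 3 * 2 * 2 = 12 combinations
```

The number of model fits grows multiplicatively with each added hyperparameter, so the candidate lists are usually kept short.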
1. Classification models on the paragraph vector representation of the 20-news-groups dataset, built with the `doc2vec` package in `gensim`.
These exercises have the following client and server side dependencies:
- Client side: We encourage you to install Anaconda Python for your Jupyter Notebooks. The notebooks in these exercises use matplotlib and seaborn for data visualization, and pandas and psycopg2 to query the backend database.
- Server side: You'll need to install sklearn (and its dependencies).
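One way to confirm that the packages named above are importable is a small check with the standard library. This is a sketch under assumptions: `missing_packages` is a hypothetical helper, and the package lists simply mirror the names mentioned above. The server-side check would have to run in the Python environment on the database hosts, not on the client.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above.
CLIENT_PACKAGES = ["matplotlib", "seaborn", "pandas", "psycopg2"]
SERVER_PACKAGES = ["sklearn"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


print("Missing client packages:", missing_packages(CLIENT_PACKAGES))
```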
These notebooks have been uploaded only to show code snippets. They are not meant to be a complete tutorial as is, because a narration accompanies these exercises.