This repository contains the code I wrote while following the Machine Learning Track of MLH Global Hack Week: Init 2022.
Link to data used for training the models
Rule-based Model -> Machine Learning Model (Logistic Regression) -> Artificial Neural Network Model (TensorFlow and Keras)
NLP stands for Natural Language Processing. NLP is the ability of a computer program to understand human language as it is spoken and written.
- The smallest unit of NLP data is a character.
character = 'g'
- A sequence of characters that makes up a "word" is called a token.
token = "good"
- A sequence of tokens that conveys meaning on its own is called a document.
document = "A. R. Rahman is a good film composer and songwriter."
- A collection of documents is called a corpus.
corpus = [
"A. R. Rahman is a good film composer and songwriter.",
"Pineapple on pizzas is a very bad idea.",
"I like anime. Steins;Gate is my favourite",
"My introvert friend is terrible at communicating.",
]
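The four levels above nest inside one another, which a few lines of Python can make concrete. This is only a sketch: splitting on whitespace is an assumed stand-in for a real tokenizer, so punctuation stays attached to tokens such as "songwriter.".

```python
corpus = [
    "A. R. Rahman is a good film composer and songwriter.",
    "Pineapple on pizzas is a very bad idea.",
    "I like anime. Steins;Gate is my favourite",
    "My introvert friend is terrible at communicating.",
]

# A corpus is a collection of documents.
document = corpus[0]

# A document is a sequence of tokens (naive whitespace split, for illustration only).
tokens = document.split()

# A token is a sequence of characters.
token = tokens[5]
characters = list(token)

print(token)       # good
print(characters)  # ['g', 'o', 'o', 'd']
```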
- Tokenization: breaking text down into smaller semantic units, such as words or single clauses.
- Part-of-speech tagging: marking up words as nouns, verbs, adjectives, adverbs, pronouns, etc.
- Stemming and lemmatization: standardizing words by reducing them to their root forms.
- Stop word removal: filtering out common words that add little or no unique information, for example, prepositions and articles (at, to, a, the).
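Three of the steps above (tokenization, stop word removal, and stemming) can be sketched in plain Python. The stop word list, the suffix list, and every function name here are hypothetical and chosen only for illustration. The stemmer is a crude suffix stripper, not the Porter algorithm or a lemmatizer. Part-of-speech tagging is left out because it needs a trained tagger from an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "at", "to", "is", "on", "my"}

# Hypothetical suffixes for a naive stemmer, checked in this order.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and extract runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Pineapple on pizzas is a very bad idea."))
# ['pineapple', 'pizza', 'very', 'bad', 'idea']
```

Note that the naive stemmer can produce non-words: "communicating" becomes "communicat". Real stemmers do this too, whereas lemmatizers map words to dictionary forms.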
Machine Learning Track Part 2: Intro to NLP
Machine Learning Track Part 3: Logistic Regression and Neural Networks
Machine Learning Track Part 4: TensorFlow, Keras, and Overfitting