The script performs the following tasks:
- Tokenization: splits the text into tokens (words, phrases, symbols, or other meaningful elements) and outputs them as a list.
- Stop-word removal: removes common words (such as 'is', 'the', 'and') that carry little meaning on their own.
- Punctuation removal: removes punctuation from the text.
- Word frequency: counts how often each word occurs in the text and prints the 5 most common words.
- Lemmatization: reduces each word to its base form (for example, 'running' to 'run').
- Part-of-speech tagging: labels each word with its part of speech (noun, verb, adjective, and so on).
- Named entity recognition: identifies named entities in the text and classifies them into predefined categories such as person names, organizations, and locations.
- Dependency visualization: draws the grammatical structure of each sentence, showing how the words relate to one another.

The script requires:
- Python
- spaCy
- en_core_web_sm (spaCy model)
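
The tasks listed above could be sketched with spaCy roughly as follows. This is a minimal illustration and not the script's actual code: the function names (`content_tokens`, `most_common_words`, `analyze`), the `svg_path` parameter, and the use of displaCy for the visualization are assumptions made for the example.

```python
from collections import Counter


def content_tokens(doc):
    """Keep tokens that are not stop words, punctuation, or whitespace."""
    return [t for t in doc if not (t.is_stop or t.is_punct or t.is_space)]


def most_common_words(doc, n=5):
    """Count lowercased content words and return the n most frequent."""
    counts = Counter(t.text.lower() for t in content_tokens(doc))
    return counts.most_common(n)


def analyze(text, svg_path="parse.svg"):
    """Run each listed task on `text` (hypothetical helper, not the script's own code)."""
    # Imported lazily so the two helpers above can be used without spaCy installed.
    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    print("Tokens:", [t.text for t in doc])
    print("Top words:", most_common_words(doc))
    print("Lemmas:", [(t.text, t.lemma_) for t in content_tokens(doc)])
    print("POS tags:", [(t.text, t.pos_) for t in doc])
    print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])

    # With jupyter=False, displacy.render returns the markup as a string.
    svg = displacy.render(doc, style="dep", jupyter=False)
    with open(svg_path, "w", encoding="utf-8") as fh:
        fh.write(svg)
```

Calling `analyze("Some English text.")` would print the results of each step and write the dependency diagram to `parse.svg`. It assumes spaCy and the model are already installed, which is typically done with `pip install spacy` followed by `python -m spacy download en_core_web_sm`.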