This project deals with real-time streaming data arriving from Twitter streams.
I implement the following framework using Apache Spark Streaming, Stanford CoreNLP, the Twitter Developer REST API, ElasticSearch and Kibana. The framework performs sentiment analysis of particular hashtags in Twitter data in real time. For example, we want to do sentiment analysis for all the tweets for #guncontrolnow and show their statistics (e.g., positive, neutral, negative) in Kibana.
For this, I get the tweets via a scraper. Next, I write a sentiment analysis program to predict the sentiment of each tweet message. Finally, I visualize my findings using ElasticSearch/Kibana.
Scraper -> Sentiment Analyzer/Common topic finder -> Visualizer (ElasticSearch/Kibana)
We provide a sample scraper (stream.py). However, I need to extend the code to support the following functionality.
The scraper collects tweets and pre-processes them for analytics. It is a standalone Python program built on the Twitter dev REST API and should perform the following:
- Collect tweets in real time with a particular hashtag. For example, I collect all tweets with #guncontrolnow.
- After getting tweets, I filter them by removing emoji symbols and special characters and discard any noisy tweets that do not belong to #guncontrolnow. Note that a returned tweet contains both metadata (e.g., location) and text content. I have to keep at least the text content and the location metadata.
- After filtering, I convert the location metadata of each tweet to geolocation info by calling the Google Geocoding API and send the text and geolocation info to Spark Streaming.
- My scraper program runs infinitely and should take hashtags as input parameters while running.
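The filtering step above can be sketched as a small helper. This is a minimal sketch, not the provided stream.py: the function name `clean_tweet`, the emoji ranges, and the character whitelist are my own assumptions.

```python
import re

# Hypothetical helper illustrating the filtering step: strip emoji and
# special characters, and drop tweets that do not mention the tracked hashtag.
# The Unicode ranges below cover common emoji blocks but are not exhaustive.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]",
    flags=re.UNICODE,
)

def clean_tweet(text, hashtag="#guncontrolnow"):
    """Return cleaned tweet text, or None if the tweet is noise for this hashtag."""
    if hashtag.lower() not in text.lower():
        return None  # discard noisy tweets that do not belong to the hashtag
    text = EMOJI_PATTERN.sub("", text)        # remove emoji symbols
    text = re.sub(r"[^\w\s#@]", "", text)     # remove special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

The scraper would call this on each incoming status text and skip the tweet entirely when it returns None.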
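For the geocoding step, one way to resolve a tweet's location string is the Google Geocoding HTTP endpoint. A sketch under assumptions: the helper names `geocode` and `extract_latlng` are mine, and you need your own API key.

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def extract_latlng(response):
    """Pull (lat, lng) out of a Google Geocoding API response dict."""
    if response.get("status") != "OK":
        return None  # e.g. ZERO_RESULTS for an unresolvable location string
    loc = response["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

def geocode(place, api_key):
    """Resolve a tweet's location metadata (e.g. 'Washington, DC') to coordinates."""
    url = GEOCODE_URL + "?" + urlencode({"address": place, "key": api_key})
    with urlopen(url) as resp:
        return extract_latlng(json.loads(resp.read().decode("utf-8")))
```

The `googlemaps` Python client would work equally well; the raw HTTP form just keeps the sketch dependency-free.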
Sentiment Analyzer determines whether a tweet is positive, neutral, or negative. For example, I use a third-party sentiment analyzer such as Stanford CoreNLP for the sentiment analysis.
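A sketch of calling Stanford CoreNLP for this, assuming a CoreNLP server is already running locally on port 9000 (started separately via its Java server class); the helper names and the averaging of per-sentence scores are my own choices.

```python
import json
from urllib.request import Request, urlopen
from urllib.parse import quote

# Assumes: java ... edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
CORENLP_URL = "http://localhost:9000/?properties=" + quote(
    json.dumps({"annotators": "sentiment", "outputFormat": "json"})
)

def label_from_value(value):
    """Collapse CoreNLP's 0-4 sentiment scale to negative/neutral/positive."""
    if value <= 1:
        return "negative"   # 0 = very negative, 1 = negative
    if value == 2:
        return "neutral"
    return "positive"       # 3 = positive, 4 = very positive

def analyze(text):
    """POST one tweet to the CoreNLP server and return an overall label."""
    req = Request(CORENLP_URL, data=text.encode("utf-8"))
    doc = json.loads(urlopen(req).read().decode("utf-8"))
    values = [int(s["sentimentValue"]) for s in doc["sentences"]]
    return label_from_value(round(sum(values) / len(values)))
```

The `stanfordcorenlp` pip package wraps the same server API if you prefer not to issue HTTP requests by hand.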
In summary, for each hashtag, I perform sentiment analysis using the sentiment analysis tools discussed above and output the sentiment and geolocation of each tweet to an external store (either saved in a JSON file or sent to ElasticSearch/Kibana for visualization).
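The per-tweet output record can be sketched as follows; the field names (tweet, sentiment, location, timestamp) are my own choice, shaped so the location works as an ElasticSearch geo_point and the timestamp satisfies the dashboard requirement below.

```python
import json
from datetime import datetime, timezone

def make_doc(text, sentiment, lat, lon):
    """Build one output document per tweet, for a JSON file or ElasticSearch."""
    return {
        "tweet": text,
        "sentiment": sentiment,                # positive / neutral / negative
        "location": {"lat": lat, "lon": lon},  # geo_point-shaped for Kibana maps
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

doc = make_doc("Rally now #GunControlNow", "positive", 38.9, -77.0)
print(json.dumps(doc))
```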
I install ElasticSearch and Kibana and create an index for visualization. I create a data table to show the sentiment of each tweet, i.e., "sentiment | tweet". Then I create a number of geo coordinate maps to show the geolocation distribution of tweets. More specifically, the first geo coordinate map shows the geolocation distribution of all tweets related to #guncontrolnow, regardless of sentiment. The second and third geo coordinate maps show the geolocation distributions of positive tweets and negative tweets, respectively. When I send data from Spark to ElasticSearch, I need to add a timestamp. In the dashboard, I set the refresh interval to 2 min as an example.
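For the geo coordinate maps to work, the index mapping must declare the location field as geo_point and the timestamp as a date. A minimal mapping sketch (the index and field names are my own choice, matching the fields described above):

```json
{
  "mappings": {
    "properties": {
      "tweet":     { "type": "text" },
      "sentiment": { "type": "keyword" },
      "location":  { "type": "geo_point" },
      "timestamp": { "type": "date" }
    }
  }
}
```

Mapping sentiment as keyword (rather than text) lets the Kibana data table and the positive/negative map filters aggregate on exact values.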
- Install the tweepy, stanfordcorenlp, googlemaps and elasticsearch packages in your IDE.
- Install ElasticSearch and Kibana locally.
- Run MyScraper.py first, then run main.py.
- If you have already installed Kibana or use Kibana cloud, you can access my dashboard at My Tweets Dashboard.
Feel free to email me about how to run the project. My email: [email protected]