This repository is free and open source. It is released under the Apache 2.0 license, so you are free to use, modify, and distribute it under the terms of that license.
All contributions are welcome: ideas, pull requests, issues, documentation improvements, complaints.
This repository aims to provide a fully working "out-of-the-box" data pipeline for doing machine learning on Twitter data using the ELK (Elasticsearch, Logstash, and Kibana) stack.
If you are not familiar with Logstash, you may want to follow this tutorial first.
Once ELK is installed, you should be able to visualize dashboards like the following within 5 minutes:
The pipeline can be modeled by the following flow chart:
Here are some slides that present the Logstash part of the pipeline: https://www.slideshare.net/hypto/machine-learning-in-a-twitter-etl-using-elk
Let's have a look at the different parts covered by this pipeline:
The input is Twitter. You can use it to track users, keywords, or tweets in a specific location.
A lot of filters are applied and they are in charge of the following tasks:
- Remove deprecated fields
- Split the tweet into two or three events (users and tweet)
- Flatten the JSON
- Remove unused fields
Two outputs are defined:
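To make the splitting and flattening steps concrete, here is a minimal Python sketch of the idea. It is not the Logstash filter configuration used by this repository: the helper names `flatten` and `split_tweet`, the `type` and `user_id` fields, and the sample tweet are all assumptions made for illustration, and only the two-event case (one user, one tweet) is shown.

```python
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Collapse nested dictionaries into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def split_tweet(tweet: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split one raw tweet into a user event and a tweet event (hypothetical layout)."""
    tweet = dict(tweet)  # shallow copy so the caller's dict is left untouched
    user = tweet.pop("user", {})
    user_event = {"type": "user", **flatten(user)}
    # Keep a reference to the author so the two events can be linked later.
    tweet_event = {"type": "tweet", "user_id": user.get("id"), **flatten(tweet)}
    return [user_event, tweet_event]


# Made-up sample data, only loosely shaped like a Twitter payload.
raw = {
    "id": 1,
    "text": "hello",
    "user": {"id": 42, "screen_name": "alice"},
    "place": {"country": "FR", "bounding_box": {"type": "Polygon"}},
}
events = split_tweet(raw)
```

In this sketch the nested `place` object ends up as `place_country` and `place_bounding_box_type` on the tweet event, which is the kind of flat structure that is easier to map and filter on.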
- Elasticsearch: To allow a better search of your data
- MongoDB: To store your data
A mapping is provided and offers the following:
- A parent/child relationship between the tweet author and their tweets
- On text fields (Tweet content, User description, User location):
  - 3 analyzers
  - Storing of the term vectors (for the 3 analyzers)
  - Storing of the token numbers (for the 3 analyzers)
- One geofield to locate the provenance of the tweet (if available)
- Many "keyword" and "integer" fields to allow data filtering
The 3 analyzers are:
- Standard
- English
- A custom analyzer that keeps emoticons and punctuation, which is useful for sentiment and emotion analysis
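To illustrate why keeping emoticons and punctuation matters, here is a small Python sketch of a tokenizer that preserves them. It only mimics the idea; the actual analyzer is defined in the Elasticsearch mapping of this repository, and the pattern and the `tokenize` name below are assumptions that cover just a handful of western-style emoticons.

```python
import re
from typing import List

# Alternatives are tried in order, so ":-)" is matched as one emoticon
# before the punctuation rule can split it into ":", "-", ")".
TOKEN_PATTERN = re.compile(
    r"""
      [:;=]-?[)\](\[dDpP]     # a few emoticons such as :) ;-( =D :P
    | \w+(?:'\w+)?            # words, including contractions like don't
    | [^\w\s]                 # any other single punctuation mark or symbol
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> List[str]:
    """Split text into tokens while keeping emoticons and punctuation."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("I don't like it :-( ...")
```

A standard word-based analysis would typically drop `:-(` and the trailing dots, although both can carry sentiment information.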
The mapping is not dynamic. Because Twitter has a lot of fields that are undocumented or poorly documented, this avoids data pollution and keeps only the wanted data.
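As a rough sketch of what such a mapping can look like, the Python snippet below builds a mapping body with one sub-field per analyzer, term vectors, token counts, a geo field, and dynamic mapping turned off. The field names (`text`, `coordinates`, `lang`, `retweet_count`) and the analyzer name `emoticon_analyzer` are hypothetical, the parent/child relationship is left out because its syntax depends on the Elasticsearch version, and where this body nests inside the index definition also varies by version. Refer to the mapping shipped with the repository for the real definition.

```python
import json
from typing import Any, Dict, List


def text_field(analyzers: List[str]) -> Dict[str, Any]:
    """Multi-field text mapping: per analyzer, a term-vector field and a token count."""
    fields: Dict[str, Any] = {}
    for name in analyzers:
        fields[name] = {"type": "text", "analyzer": name, "term_vector": "yes"}
        fields[f"{name}_token_count"] = {"type": "token_count", "analyzer": name}
    return {"type": "text", "fields": fields}


# "standard" and "english" are built-in analyzers; "emoticon_analyzer" is a
# hypothetical name that would have to be defined in the index settings.
ANALYZERS = ["standard", "english", "emoticon_analyzer"]

mapping = {
    "dynamic": False,  # fields that are not listed below are not indexed
    "properties": {
        "text": text_field(ANALYZERS),
        "coordinates": {"type": "geo_point"},
        "lang": {"type": "keyword"},
        "retweet_count": {"type": "integer"},
    },
}

mapping_json = json.dumps(mapping, indent=2)
```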
On the Kibana side, the repository offers:
- A dashboard for general data visualization
- A dashboard for comparing positive and negative tweets
- Different kinds of visualizations
Logstash makes it simple to integrate a machine learning model directly into your pipeline using the rest filter. A small "API" has been created to give you an idea of how you can use the rest filter to "label" your tweets on the fly before indexing. You can find this toy API here:
https://github.com/melvynator/toy_sentiment_API
The model is a dummy model, but you can easily plug in your own, more complex model in the form of such an API.
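To give an idea of what such an API boils down to, here is a minimal sketch that uses only the Python standard library. It is not the code of the toy API (which relies on Flask and Scikit-Learn): the word lists, the `text` and `sentiment` JSON keys, and port 5000 are assumptions chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Dummy lexicon standing in for a trained model.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}


def predict_sentiment(text: str) -> str:
    """Label text by comparing counts of positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class SentimentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        """Read a JSON body such as {"text": "..."} and answer with a label."""
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            text = str(payload.get("text", ""))
        except (ValueError, AttributeError):
            self.send_error(400, "Expected a JSON object with a 'text' field")
            return
        body = json.dumps({"sentiment": predict_sentiment(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 5000) -> None:
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), SentimentHandler).serve_forever()
```

Calling `serve()` starts the service. To use a real model, only `predict_sentiment` would need to change, while the HTTP contract seen by the pipeline stays the same.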
For the pipeline to work, you need a Twitter developer account, which you can obtain here: https://dev.twitter.com/resources/signup
This guide assumes that you have already installed Elasticsearch, Logstash and Kibana. All three need to be installed properly in order to use this pipeline.
Once ELK is installed, run the following commands to configure Elasticsearch to start automatically when the system boots:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
Elasticsearch can be started and stopped as follows:
sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service
(Note that the same steps can be used for Kibana and Logstash)
brew install elasticsearch
brew install logstash
brew install kibana
Clone the repository:
git clone https://github.com/melvynator/ELK_twitter.git
Make sure that you don't already have an index named twitter.
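One way to check for an existing index is to send a HEAD request for it to Elasticsearch, which answers 200 when the index exists and 404 when it does not. The sketch below does this with the Python standard library. The `index_exists` helper is hypothetical, and it assumes an unsecured Elasticsearch on `http://localhost:9200`; adjust the URL if your setup differs.

```python
import urllib.error
import urllib.request
from typing import Optional


def index_exists(index: str, base_url: str = "http://localhost:9200",
                 timeout: float = 3.0) -> Optional[bool]:
    """Return True/False if Elasticsearch answers, or None if it is unreachable."""
    request = urllib.request.Request(f"{base_url}/{index}", method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True  # 200: the index exists
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # 404: no such index
        raise  # any other status (for example 401) needs a closer look
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...
```

For example, `index_exists("twitter")` should return `False` before the pipeline is run for the first time.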
Download the toy API:
git clone https://github.com/melvynator/toy_sentiment_API
Go into the toy API repository and create a virtual environment:
cd toy_sentiment_API
virtualenv -p python3 venv
source venv/bin/activate
Then install Flask and Scikit-Learn (for the machine learning part):
pip install -r requirements.txt
Then you can launch your local server:
python sentiment_server.py
To start configuring Logstash, open the configuration file:
ELK_twitter/src/twitter-pipeline/config/twitter-pipeline.conf
Replace each <YOUR-KEY> with your corresponding Twitter key:
consumer_key => "<YOUR-KEY>"
consumer_secret => "<YOUR-KEY>"
oauth_token => "<YOUR-KEY>"
oauth_token_secret => "<YOUR-KEY>"
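A forgotten placeholder is an easy mistake to make, so a quick check before starting the pipeline can help. The Python sketch below lists the settings that still contain `<YOUR-KEY>`. The `unset_credentials` helper and the sample values are made up for illustration; the regular expression only handles simple `name => "value"` lines like the ones above.

```python
import re
from typing import List

PLACEHOLDER = "<YOUR-KEY>"
# Matches lines of the form: name => "value"
SETTING = re.compile(r'^\s*(\w+)\s*=>\s*"([^"]*)"', re.MULTILINE)


def unset_credentials(config_text: str) -> List[str]:
    """Return the names of settings whose value is still the placeholder."""
    return [name for name, value in SETTING.findall(config_text) if value == PLACEHOLDER]


# Made-up sample; in practice, read the text of twitter-pipeline.conf instead.
sample = '''
consumer_key => "abc123"
consumer_secret => "<YOUR-KEY>"
oauth_token => "def456"
oauth_token_secret => "<YOUR-KEY>"
'''
missing = unset_credentials(sample)
```

An empty list means every credential has been filled in.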
Now go into twitter-pipeline:
cd ../src/twitter-pipeline
Make sure that Elasticsearch is started and running on port 9200.
In addition, you also have to manually install the following plugins for Logstash:
- MongoDB for Logstash (allows you to store your data in MongoDB)
sudo /usr/share/logstash/bin/logstash-plugin install logstash-output-mongodb
- REST for Logstash (allows you to make API calls)
sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-rest
Then adapt the rest filter in the config file:
ELK_twitter/src/twitter-pipeline/config/twitter-pipeline.conf
Don't forget to specify your own endpoint and data.
Then, you can run the pipeline using:
sudo /usr/share/logstash/bin/logstash -f config/twitter-pipeline.conf
Or add logstash to your SYSTEM_PATH and run the following:
logstash -f config/twitter-pipeline.conf
You should see some logs that end with:
Successfully started Logstash API endpoint {:port=>9600}
Now go to Kibana: http://localhost:5601/
Management => Index Patterns => Create Index Pattern
In the text box Index name or pattern, type: twitter
In the drop-down box Time Filter field name, choose: inserted_in_es_at
Click on Create.
Now go to:
Management => Saved Objects => import
And select the file in:
ELK_twitter/src/twitter-pipeline/kibana-visualization/kibana_charts.json
You can now go to Dashboard.
This GIF summarizes the different steps if you are lost.
Thanks to the Stack Overflow community and the Elastic community for the answers they provided.
https://www.elastic.co/guide/en/logstash/current/introduction.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html