Skip to content

virup/TrendingTopics_VOLTDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Trending Topics application
===========================

A prototype of Twitter's trending topics feature. This program shows the top 10 most frequently used word in Twitter in the last 5 minutes

The following things can be customizable in the program :

   - Whether or not to show the trending topics in the output
   - Whether or not to show a leaderboard like stats of the database reads and write rates 
   - Control the number of worker threads which write and read from the database

These can be adjusted by the command line arguments


How to run the program:
---------------------
First the ./run.sh script has to be run. The following tells what the script takes as arguments and what they do.


run.sh actions
-----------

run.sh               : compile all Java clients and stored procedures, build the catalog, and start the server
run.sh clean         : remove compiled files

After the server is started, the script 'readTweets.py' can be run from the command line. The script takes the following arguments: 

python readTweet.py       	: Runs the program but does not show any output
python readTweet.py -t    	: Runs the program and shows the trending topics every 5 seconds
python readTweet.py -l    	: Runs the program and shows a leaderboard-like statistics
python readTweet.py -w <count>  : Runs the program using <count> number of worker threads. The worker threads
				  interact with the database
 
The options can be combined with each other

NOTE: Press CTRL+C to end the program


Interpreting the Results
------------------------
The trending topics are a simple list of 10 words. 

The leaderboard outputs the following list :

<count> words per second  	: This is the number of words processed by the application
Total tweets : <count>		: This is the total number of tweets handled till now
Reads per sec :  <count>	: This is the average number of database reads per sec happened in the last 5 seconds
Writes per sec : <count>	: The average number of database writes per sec in the last 5 seconds
Updates per sec : <count>	: The average number of database updates per sec in the last 5 seconds



Details of the application
-------------------------
This is  a prototype of the Twitter's 'trending topics'. This application does not use any form of machine learning to reveal topics, but have implemented a rather dumb version which simply shows the top 10 most frequently used words in Twitter in the last 5 minutes. In order to accomplish this task, I have used VoltDB (www.voltdb.com) as my data store to keep a record of the words that have been encountered and their counts.

In order to get the real time tweets from the twitter, Twitter's streaming API has been used. Since the Twitter libraries are best supported in Python, the actual code is in Python and JSON over HTTP has been used to operate on the VoltDB. 

The application outputs a leader board which gives the number of words processed per second, the number of tweets processed per second, the number of database reads per second, the number of database writes per second, and the number of database update operations per second. 

The streaming API is supposed to provide me with all the actual tweets happening per second, but instead it really provides me with only 1% of all the tweets at any time. This results in a data transfer which is not as fast as I had initially expected. Currently I am clocking around 1500 reads per second, and around the same rate of writes and updates with the bottleneck being the inflow of tweets. I know it is negligible compared to what VoltDB is designed to handle. 

About

Voltdb sample application of Twitter's trending topic feature

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published