-
Notifications
You must be signed in to change notification settings - Fork 1
virup/TrendingTopics_VOLTDB
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Trending Topics application =========================== A prototype of Twitter's trending topics feature. This program shows the top 10 most frequently used word in Twitter in the last 5 minutes The following things can be customizable in the program : - Whether or not to show the trending topics in the output - Whether or not to show a leaderboard like stats of the database reads and write rates - Control the number of worker threads which write and read from the database These can be adjusted by the command line arguments How to run the program: --------------------- First the ./run.sh script has to be run. The following tells what the script takes as arguments and what they do. run.sh actions ----------- run.sh : compile all Java clients and stored procedures, build the catalog, and start the server run.sh clean : remove compiled files After the server is started, the script 'readTweets.py' can be run from the command line. The script takes the following arguments: python readTweet.py : Runs the program but does not show any output python readTweet.py -t : Runs the program and shows the trending topics every 5 seconds python readTweet.py -l : Runs the program and shows a leaderboard-like statistics python readTweet.py -w <count> : Runs the program using <count> number of worker threads. The worker threads interact with the database The options can be combined with each other NOTE: Press CTRL+C to end the program Interpreting the Results ------------------------ The trending topics are a simple list of 10 words. The leaderboard outputs the following list : <count> words per second : This is the number of words processed by the application Total tweets : <count> : This is the total number of tweets handled till now Reads per sec : <count> : This is the average number of database reads per sec happened in the last 5 seconds Writes per sec : <count> : The average number of database writes per sec in the last 5 seconds Updates per sec : <count> : The average number of database updates per sec in the last 5 seconds Details of the application ------------------------- This is a prototype of the Twitter's 'trending topics'. This application does not use any form of machine learning to reveal topics, but have implemented a rather dumb version which simply shows the top 10 most frequently used words in Twitter in the last 5 minutes. In order to accomplish this task, I have used VoltDB (www.voltdb.com) as my data store to keep a record of the words that have been encountered and their counts. In order to get the real time tweets from the twitter, Twitter's streaming API has been used. Since the Twitter libraries are best supported in Python, the actual code is in Python and JSON over HTTP has been used to operate on the VoltDB. The application outputs a leader board which gives the number of words processed per second, the number of tweets processed per second, the number of database reads per second, the number of database writes per second, and the number of database update operations per second. The streaming API is supposed to provide me with all the actual tweets happening per second, but instead it really provides me with only 1% of all the tweets at any time. This results in a data transfer which is not as fast as I had initially expected. Currently I am clocking around 1500 reads per second, and around the same rate of writes and updates with the bottleneck being the inflow of tweets. I know it is negligible compared to what VoltDB is designed to handle.
About
Voltdb sample application of Twitter's trending topic feature
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published