Collection Verification
Several checks are needed to verify that all three processes (collect, process, and insert) are running and are collecting or processing data. First, list the running python processes:
ps -aux | grep python
or use:
ps -ef | grep python
In the output, the first string after "start" or "restart" is the "project id". The second is either the "collection id" or the social media platform (e.g. "twitter" or "facebook") that the process is inserting or processing for. Example output:
python __main__.py controller collect restart 54ccfc300603632dcbdff02c 54ccfc660603632dd2ccf1a4
python __main__.py controller process start 54ccfc300603632dcbdff02c twitter
python __main__.py controller insert start 54ccfc300603632dcbdff02c twitter
For a detailed check:
- Check for a "collect" line for every collection id in every project that is running.
- Check for a "process" and "insert" line for every project id that is running.
- If any of these python processes are not running, then something is not working properly. Otherwise, all processes are running normally.
NOTE: Just because these python processes are running does not mean they are getting data. It only means that the collect, process, and insert scripts are running on the server.
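The process check above can be sketched in Python. This is a minimal illustration, not part of STACK: the function names and the regular expression are assumptions based only on the sample output lines shown earlier.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   python __main__.py controller collect restart <project id> <collection id>
# The pattern is an assumption derived from the sample ps output.
CONTROLLER_RE = re.compile(
    r"__main__\.py\s+controller\s+(collect|process|insert)\s+"
    r"(?:start|restart)\s+(\S+)\s+(\S+)"
)


def running_controllers(ps_output):
    """Map each project id to the roles found in ps output and their targets."""
    found = defaultdict(lambda: defaultdict(set))
    for line in ps_output.splitlines():
        match = CONTROLLER_RE.search(line)
        if match:
            role, project_id, target = match.groups()
            found[project_id][role].add(target)
    return found


def missing_roles(ps_output, project_id):
    """Return the roles that have no running process for the given project."""
    roles = running_controllers(ps_output).get(project_id, {})
    return [r for r in ("collect", "process", "insert") if r not in roles]
```

To run it against a live server, pass in the text returned by `subprocess.check_output(["ps", "aux"], text=True)`. An empty list from `missing_roles` only shows that the scripts are running, not that they are receiving data.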
To check whether collections are working and reaching the database, follow up by checking Mongo and by looking in key directories on your server.
From the command line, execute:
mongo admin
*If mongo does not open, the Mongo server process may have failed.
From here, depending on how mongo security is configured, you may need to authenticate with the following:
db.auth('<username>','<password>')
If '1' is printed, the username and password were accepted. Otherwise, retry. If mongo is not configured with security, skip this step.
In mongo, to see available databases execute the command:
show dbs
Identify the database you want to check, e.g. POTUS1_54cc09df1ed75a10b8ca9f07
Execute the command to switch to this database:
use POTUS1_54cc09df1ed75a10b8ca9f07
Execute the command to count the tweets in the database:
db.tweets.count()
If the collector, processor, and inserter are functioning correctly and tweets are expected, the result should be a number greater than zero. Check that this number is in line with what you expect. If it is, exit mongo.
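The same count can be taken from a script. The sketch below assumes the third-party pymongo driver (version 3.7 or later, which provides `count_documents`); the helper name `count_tweets` and the connection string in the comments are illustrative assumptions, not part of STACK.

```python
def count_tweets(client, db_name):
    """Count the documents in the tweets collection of one project database."""
    # count_documents({}) with an empty filter counts every document,
    # which corresponds to db.tweets.count() in the mongo shell.
    return client[db_name]["tweets"].count_documents({})


# Hypothetical usage, assuming a local server without authentication:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   print(count_tweets(client, "POTUS1_54cc09df1ed75a10b8ca9f07"))
```

Running the count twice a few minutes apart shows whether new tweets are still being inserted.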
You should be back at the Ubuntu command line.
If the collectors, inserters, and processors are running and working correctly, then specific files should exist in key directories on the server.
Navigate to the location of the stack directory and the network collector, e.g.:
cd /home/bits/stack/data
Navigate into the raw_tweets_ directory.
Execute the following command to check the directories in data/:
ls
*If the directory is empty, the collector is not working.
If there are files here, change directory to the collector you wish to check.
cd <collector_string>
Look at the files in this directory also. If no files exist, the collector may not be working. There should be a "twitter", "raw", and other directories.
Navigate into "raw" and look at the files in this directory. If more than one file exists, the processor may have stopped. We generally expect one .json file in the raw directory.
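Verify that one or more files exist that follow the standard naming convention. File names start with the date and hour (YYYYMMDD-HH), continue with three hyphen-separated fields (in the example below, the last two match the project id and collection id from the process listing), and end with tweets_out.json. At least one file in the directory should be less than one hour old, so HH on at least one file should be the current hour of the current day. Example file name: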
20150807-07-potustrack2-54ccfc300603632dcbdff02c-54ccfc660603632dd2ccf1a4-tweets_out.json
*Note that if the most recent file is more than an hour old, the date and time stamp indicates when the processor stopped working.
Note that if there is more than one file in the raw collection directory, the processor may have stopped working even if its process (see the ps command above) is running.
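The timestamp check on raw file names can be sketched as follows. The regular expression and function names are assumptions based on the example file name above, and the comparison assumes the file stamps use the same clock and time zone as the machine running the check.

```python
import re
from datetime import datetime

# YYYYMMDD-HH at the start, "tweets_out.json" at the end (assumed pattern).
RAW_NAME_RE = re.compile(r"^(\d{8})-(\d{2})-.*-tweets_out\.json$")


def file_hour(filename):
    """Return the date and hour encoded in a raw file name, or None."""
    match = RAW_NAME_RE.match(filename)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H")
    except ValueError:
        # Digits were present but do not form a valid date or hour.
        return None


def has_current_file(filenames, now=None):
    """True if at least one file is stamped with the current date and hour."""
    now = now or datetime.now()
    current = now.replace(minute=0, second=0, microsecond=0)
    return any(file_hour(name) == current for name in filenames)
```

Passing `os.listdir(".")` from inside the raw directory as `filenames` gives a quick yes or no on whether a file for the current hour exists.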
To find out how many files are in the directory, execute:
ls | grep -c json
Files are processed fairly quickly with the default processor, so running this command every minute or two should show a reduction in the number of files in the folder.
*Note: if the number of files is growing or staying the same, the processor may not be working.
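Watching the file count over time can also be scripted. This is a minimal sketch with hypothetical helper names. It counts entries whose names contain "json", like the grep command above, except that it also sees hidden files, which ls omits.

```python
import os
import time


def count_json_files(directory):
    """Count directory entries whose name contains 'json'."""
    return sum(1 for name in os.listdir(directory) if "json" in name)


def backlog_trend(directory, wait_seconds=60):
    """Sample the file count twice and report how the backlog changed."""
    before = count_json_files(directory)
    time.sleep(wait_seconds)
    after = count_json_files(directory)
    if after < before:
        return "shrinking"
    if after > before:
        return "growing"
    return "unchanged"
```

A result of "growing" or "unchanged" across several samples suggests the processor may not be working, as the note above describes.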
Navigate to the tweet_archive_ directory
Verify that two files exist with timestamps more than one hour but less than two hours old, e.g.:
20150807-06-potustrack2-54ccfc300603632dcbdff02c-54ccfc660603632dd2ccf1a4-tweets_out.json
20150807-06-potustrack2-54ccfc300603632dcbdff02c-54ccfc660603632dd2ccf1a4-tweets_out_processed.json
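The check for the two archived files can be sketched as below. It assumes that "more than one but less than two hours old" means files stamped with the previous clock hour, and that the stamps use the local server time. The pattern and function name are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Captures the YYYYMMDD-HH stamp and whether the name ends in _processed.json.
ARCHIVE_RE = re.compile(r"^(\d{8}-\d{2})-.*-tweets_out(_processed)?\.json$")


def previous_hour_pair(filenames, now=None):
    """Return (raw_found, processed_found) for files from the previous hour."""
    now = now or datetime.now()
    stamp = (now - timedelta(hours=1)).strftime("%Y%m%d-%H")
    raw_found = processed_found = False
    for name in filenames:
        match = ARCHIVE_RE.match(name)
        if match and match.group(1) == stamp:
            if match.group(2):
                processed_found = True
            else:
                raw_found = True
    return raw_found, processed_found
```

Both values should be true when the raw file and its processed counterpart for the previous hour are present in the directory.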
This directory may have zero or more files in it. If there are none, the inserter may have processed the last file and correctly inserted its data into mongo. If there are many files, the inserter may have stopped. Alternatively, the previous data processing step may have been restarted and be working through a backlog of data. Since insertion can take longer than the previous processing step, files can reasonably back up in the directory. In this case, watch the number of files, or check that new files are moving into the directory while others are moving out.
Navigate to insert_queue_
Execute ls
To find out how many files are in the directory, execute: ls | grep -c json