
Collection Verification

Alexander O. Smith edited this page May 24, 2019 · 24 revisions

Verification that a collection is running

Verifying a collection takes several checks: that all three processes (collect, process, and insert) are running, and that they are actually collecting and processing data.

Check processes on the Ubuntu command line:

ps -aux | grep python

or use:

ps -ef | grep python

Three or more processes should be running per project id, as in the example lines below.

In each line, the first string after "start" or "restart" is the "project id". The second string is either the "collection id" (as on the "collect" line) or the social media platform (e.g. "twitter" or "facebook") that the process is processing or inserting for.

python __main__.py controller collect restart 54ccfc300603632dcbdff02c 54ccfc660603632dd2ccf1a4

python __main__.py controller process start 54ccfc300603632dcbdff02c twitter

python __main__.py controller insert start 54ccfc300603632dcbdff02c twitter

For a detailed check:

  1. Check for a "collect" line for every collection id in every project that is running.

  2. Check for a "process" and "insert" line for every project id that is running.

  3. If any of these python processes are not running, then something is not working properly. Otherwise, all processes are running normally.
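The detailed check above can be scripted. The following is a minimal sketch, not part of the stack itself: it assumes the command lines look like the examples above (controller, then collect/process/insert, then start/restart, then the project id, then a collection id or platform). The names controller_processes and missing_roles are hypothetical helpers introduced for illustration.

```python
ROLES = ("collect", "process", "insert")


def controller_processes(ps_lines, project_id):
    """Map each role to the collection ids or platforms seen for project_id.

    ps_lines is the output of `ps -ef` (or `ps -aux`) split into lines.
    """
    found = {role: set() for role in ROLES}
    for line in ps_lines:
        tokens = line.split()
        if "controller" not in tokens:
            continue
        # Assumed shape: controller <role> <start|restart> <project id> <target>
        args = tokens[tokens.index("controller") + 1:]
        if (
            len(args) >= 4
            and args[0] in ROLES
            and args[1] in ("start", "restart")
            and args[2] == project_id
        ):
            found[args[0]].add(args[3])
    return found


def missing_roles(ps_lines, project_id):
    """Return the roles that have no running process for project_id."""
    found = controller_processes(ps_lines, project_id)
    return [role for role in ROLES if not found[role]]
```

To use it, capture the ps output (for example with subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout.splitlines()) and pass it in with a project id. An empty list from missing_roles only means the processes exist; as noted below, it does not show that data is flowing.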

NOTE: Just because these python lines are running does not mean they are getting data. It only means that the collect, process, and insert scripts are running on the server.

To check that collections are working and that data is reaching the database, follow up by checking Mongo and by looking in key directories on your server.

Check mongo:

From the command line, execute:

mongo admin

*If mongo does not open, the mongo process may have failed.

Depending on how mongo security is configured, you may need to authenticate with the following:

db.auth('<username>','<password>')

If the command outputs '1', the username and password worked. Otherwise, retry. If mongo is not set up with security, skip this step.

In mongo, to see available databases execute the command:

show dbs

Identify the database you want to check, e.g. POTUS1_54cc09df1ed75a10b8ca9f07

Execute the command to switch to this database:

use POTUS1_54cc09df1ed75a10b8ca9f07

Execute the command to count the tweets in the database:

db.tweets.count()

If the collector, processor, and inserter are functioning correctly and tweets are expected, the count should be greater than zero. Check that the number is what you expect. If it is, exit mongo.

You should be back at the Ubuntu command line.

Check files:

If the collectors, inserters, and processors are running and working correctly, then specific files should exist in key directories on the server.

Navigate to the location of the stack directory and the network collector, e.g.:

cd /home/bits/stack/data

Raw collection files:

Navigate into the raw_tweets_ directory. Execute the following command to check the directories in data/:

ls

*If the directory is empty, the collector is not working.

If there are files here, change directory to the collector you wish to check.

cd <collector_string>

Look at the files in this directory also. If no files exist, the collector may not be working. There should be a "twitter", "raw", and other directories.

Navigate into "raw" and look at the files in this directory. If more than one file exists, the processor may have stopped. We generally expect one .json file in the raw directory.

Verify that one or more files exist that follow the standard naming convention: the name starts with the date and hour (YYYYMMDD-HH) and ends with tweets_out.json. At least one file in the directory should have a timestamp less than one hour old, so HH on at least one file should be the current hour of the current day. Example file name:

20150807-07-potustrack2-54ccfc300603632dcbdff02c-54ccfc660603632dd2ccf1a4-tweets_out.json

*Note that if the most recent file is more than an hour old, the date and time stamp indicates when the processor stopped working.
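The timestamp check can be automated by parsing the leading date and hour from each file name. This is a rough sketch under stated assumptions: the pattern is inferred from the example file name above, the timestamps are assumed to be in the same time zone as the clock value you pass in, and file_hour and has_current_hour_file are hypothetical names.

```python
import re
from datetime import datetime

# Inferred from the example: YYYYMMDD-HH-<anything>-tweets_out.json
RAW_NAME = re.compile(r"^(\d{8})-(\d{2})-.+-tweets_out\.json$")


def file_hour(filename):
    """Return the date and hour encoded in a raw file name, or None."""
    match = RAW_NAME.match(filename)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1) + "-" + match.group(2), "%Y%m%d-%H")
    except ValueError:
        # Digits were present but did not form a valid date or hour.
        return None


def has_current_hour_file(filenames, now):
    """True if at least one file name carries the current day and hour."""
    current = now.replace(minute=0, second=0, microsecond=0)
    return any(file_hour(name) == current for name in filenames)
```

Calling has_current_hour_file(os.listdir("."), datetime.now()) from inside the raw directory would then give a quick yes/no answer, assuming the server clock and the file names use the same time zone.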

Processing files: When the raw dir has more than one file

Note that if there is more than one file in the raw collection dir, the processor may have stopped working even if the process (see the ps -aux command above) is running.

To find out how many files are in the directory, execute:

ls | grep -c json

Files are processed fairly quickly with the default processor, so running this command every minute or two should show a reduction in the number of files in the folder.

*Note: if the number of files is growing or staying the same, the processor may not be working.
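Repeating the count by hand every minute or two can be replaced with a small script. The sketch below is illustrative only: count_json mirrors ls | grep -c json (it counts non-hidden entries whose names contain "json"), and backlog_trend is a hypothetical helper that compares the first and last of a series of counts.

```python
import os


def count_json(directory):
    """Count visible entries whose names contain 'json', like `ls | grep -c json`."""
    return sum(
        1
        for name in os.listdir(directory)
        if "json" in name and not name.startswith(".")
    )


def backlog_trend(counts):
    """Classify a series of counts taken over time.

    Returns 'shrinking', 'growing', or 'flat' based on the first and last
    samples. Needs at least two samples.
    """
    if len(counts) < 2:
        raise ValueError("need at least two samples to see a trend")
    if counts[-1] < counts[0]:
        return "shrinking"
    if counts[-1] > counts[0]:
        return "growing"
    return "flat"
```

For example, collecting count_json("raw") a few times with time.sleep(60) between samples and passing the list to backlog_trend gives 'shrinking' when the processor is keeping up. A result of 'growing' or 'flat' matches the warning in the note above.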

Processing files: When the raw dir has a single file or if the number of files is declining

Navigate to the tweet_archive_ directory

Verify that two files exist with timestamps more than one hour but less than two hours old, e.g.

20150807-06-potustrack2-54ccfc300603632dcbdff02c-54ccfc660603632dd2ccf1a4-tweets_out.json
20150807-06-potustrack2-54ccfc300603632dcbdff02c-54ccfc660603632dd2ccf1a4-tweets_out_processed.json
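Checking that every archived raw file has a processed partner can also be scripted. This is a sketch that assumes the naming shown in the two example files above, where the processed file has the same name with _processed added before .json. unprocessed_archives is a hypothetical helper name.

```python
def unprocessed_archives(filenames):
    """Return raw archive names that lack a matching *_processed.json file."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        if name.endswith("-tweets_out.json"):
            # Assumed convention: <stem>-tweets_out_processed.json
            processed = name[: -len(".json")] + "_processed.json"
            if processed not in names:
                missing.append(name)
    return missing
```

Running it on os.listdir() of the archive directory returns an empty list when every raw file has been processed. The most recent hour may legitimately appear in the result if its file is still being processed.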

Insert queue file

The insert queue directory may have zero or more files in it. If it has zero, the inserter may have processed the last file and correctly inserted the data into mongo. If there are many files, the inserter may have stopped. Alternatively, the previous data processing step may have been restarted and be working through a backlog of data. Since insertion can take longer than the processing step, files can reasonably back up in the directory. In this case, watch the number of files, or check that new files are moving into the directory while others are moving out.

Navigate to insert_queue_

Execute ls

To find out how many files are in the directory, execute: ls | grep -c json
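A file count alone cannot distinguish a stalled queue from one where files arrive as fast as they leave. One way to see movement is to compare two directory listings taken a few minutes apart. The sketch below assumes nothing about the stack beyond file names; queue_movement is a hypothetical helper.

```python
def queue_movement(before, after):
    """Compare two listings of the insert queue taken at different times.

    Returns which files arrived, which left, and which are still waiting.
    """
    before, after = set(before), set(after)
    return {
        "arrived": sorted(after - before),
        "left": sorted(before - after),
        "waiting": sorted(before & after),
    }
```

If "left" is non-empty, the inserter is consuming files. If "left" stays empty over several checks while "arrived" keeps growing, the inserter may have stopped.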