Background: Data Format
The RITHM Parser works with data dumped from the Twitter Streaming API in its native JavaScript Object Notation (JSON) file format. In general, the parser iterates over any available files that follow the naming convention YYYYMMDDHHMMSS*.json. For example:
- 20140101000000_foo.json is a file that was created in the first second of the year 2014. The file is further identified with an optional "_foo" descriptor, which is ignored by the parser.
- 20170523162003.json was created at 16:20:03 on 05-23-2017, and the filename includes no optional descriptors. This file name includes the bare minimum amount of information.
- File names that do not begin with YYYYMMDDHHMMSS will not be parsed. However, you could assign an arbitrary 14-digit date (00000000000000 through 99999999999999 are valid).
- File names that do not end with .json will not be parsed.

Files dumped from the Twitter Streaming API are actually in quasi-JSON format, meaning that each tweet is in JSON format but the overall files do not provide containers or comma separators between tweet objects. The parser handles this discrepancy, but you should be aware that converting these raw data files to JSON structures by other means will likely fail.
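The naming rules and the quasi-JSON layout described above can be illustrated with a short Python sketch. This is not the RITHM Parser's own code: the names is_parsable_name and split_tweets are hypothetical, the pattern is one plausible (case-sensitive) reading of the YYYYMMDDHHMMSS*.json convention, and the decoder simply assumes that tweet objects follow one another with only optional whitespace in between.

```python
import json
import re

# Assumed reading of the convention: 14 digits, any optional descriptor, ".json".
FILENAME_PATTERN = re.compile(r"\d{14}.*\.json")


def is_parsable_name(filename):
    """Return True if the name starts with 14 digits and ends with .json."""
    return FILENAME_PATTERN.fullmatch(filename) is not None


def split_tweets(raw_text):
    """Decode back-to-back JSON objects that lack an enclosing array and commas."""
    decoder = json.JSONDecoder()
    tweets = []
    position = 0
    while position < len(raw_text):
        # raw_decode does not skip leading whitespace, so step over it manually.
        if raw_text[position].isspace():
            position += 1
            continue
        # raw_decode returns the decoded object and the index where it ended.
        tweet, position = decoder.raw_decode(raw_text, position)
        tweets.append(tweet)
    return tweets
```

Passing the same quasi-JSON text straight to json.loads raises a JSONDecodeError ("Extra data") as soon as a second object is reached, which is the failure mode the paragraph above warns about.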
By default, the current parser uses the "HiMem" setting to read in each .json file, hold it in memory, reformat it to proper JSON format, and then iterate through each tweet for criteria matching and output. This approach has a memory cost, as very large files or JSON objects may not fit in working memory. The "LoMem" setting works around this by reading one line at a time and building smaller JSON objects, but this approach is considerably slower. See the parser documentation for additional details.
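The trade-off between the two settings can be sketched roughly as follows. This is only an illustration of the two strategies, not the parser's implementation: read_himem and read_lomem are hypothetical names, and the sketch assumes each tweet occupies exactly one line of the file, which may not hold for every dump.

```python
import json


def read_himem(path):
    """Load the whole file, rewrite it as one JSON array, then parse it."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: one tweet object per non-empty line.
        lines = [line.strip() for line in handle if line.strip()]
    # Add the missing container and comma separators in memory.
    return json.loads("[" + ",".join(lines) + "]")


def read_lomem(path):
    """Yield one tweet at a time so only a single line is parsed at once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

The first function returns a complete list, so its peak memory use grows with file size. The second is a generator, so a caller can filter and write out each tweet before the next line is read.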