
Background: Data Format

Jason Colditz edited this page Jun 19, 2019 · 2 revisions

The RITHM Parser works with data that has been dumped in native JavaScript Object Notation (JSON) file format from the Twitter Streaming API. In general, the parser iterates over any available files with the naming convention of YYYYMMDDHHMMSS*.json. For example:

  • 20140101000000_foo.json is a file that was created in the first second of the year 2014. The optional "_foo" descriptor further identifies the data file and is ignored by the parser.
  • 20170523162003.json was created at 16:20:03 on 05-23-2017. The file name includes no optional descriptor, so it carries the bare minimum amount of information.
  • File names that do not begin with a 14-digit YYYYMMDDHHMMSS prefix will not be parsed. If your files lack a real timestamp, you can assign an arbitrary 14-digit prefix instead (00000000000000 through 99999999999999 are valid).
  • File names that do not end with .json will not be parsed.
  • Files dumped from the Twitter Streaming API are actually in quasi-JSON format: each tweet is a valid JSON object, but the file as a whole has no enclosing container and no comma separators between tweet objects. The parser handles this discrepancy, but be aware that loading these raw data files with other tools as if they were single JSON structures will likely fail.
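
The naming rules above can be sketched as a small filter. This is an illustration only, not the parser's own code: the function names `is_parseable` and `select_files` are hypothetical, and case-sensitive matching of the `.json` extension is an assumption.

```python
import re

# Assumed pattern: exactly 14 leading digits, any optional descriptor,
# then a literal ".json" suffix (treated as case-sensitive here).
FILENAME_PATTERN = re.compile(r"\d{14}.*\.json")


def is_parseable(filename):
    """Return True if the name follows the YYYYMMDDHHMMSS*.json convention."""
    return FILENAME_PATTERN.fullmatch(filename) is not None


def select_files(filenames):
    """Keep only matching names, sorted so they are in timestamp order."""
    return sorted(name for name in filenames if is_parseable(name))
```

Because the timestamp is a fixed-width prefix, a plain lexical sort of the matching names also orders them chronologically. Note that the sketch checks only for 14 digits and does not validate that they form a real date, which is consistent with arbitrary 14-digit prefixes being accepted.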

By default, the current parser uses the "HiMem" setting: it reads in each .json file, holds it in memory, reformats it to proper JSON, and then iterates through each tweet for criteria matching and output. This approach has memory implications, because very large files (and the JSON objects built from them) may not fit in working memory. The "LoMem" setting works around this by reading one line at a time and building smaller JSON objects, but this approach is considerably slower. See the parser documentation for additional details.
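
The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the parser's actual implementation: the function names `read_himem` and `read_lomem` are hypothetical, and the sketch assumes that each tweet object occupies exactly one line of the raw file.

```python
import json


def read_himem(raw_text):
    """Whole-file approach: repair the text into one JSON array, then parse it."""
    # Assumption: one tweet object per non-empty line.
    lines = [line for line in raw_text.splitlines() if line.strip()]
    # Add the missing comma separators and the enclosing container.
    return json.loads("[" + ",".join(lines) + "]")


def read_lomem(lines):
    """Line-at-a-time approach: parse and yield one small object per line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank keep-alive lines
            yield json.loads(line)
```

In this sketch, `read_himem` needs the full file contents as a string (for example from `open(path).read()`), while `read_lomem` accepts any iterable of lines, including an open file handle, so only one tweet is held in memory at a time.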
