Make data easier to consume in bulk #191

Closed
4 tasks
fortuna opened this issue Mar 16, 2018 · 9 comments
Assignees
Labels
enhancement New feature request or improvement to existing functionality ooni/pipeline Issues related to https://github.com/ooni/pipeline priority/low Nice to have

Comments

fortuna commented Mar 16, 2018

I'll use this bug to collect ideas to make it easier to consume OONI data in bulk.

  • Move files under <test_name> directories (e.g. autoclaved/jsonl/web_connectivity/...)
    This will allow one to fetch or process a test type without having to go over all the tests. One could list all files, filter, then use that list as input; however, thousands of filenames as input do not work well with big data tools.
    Even if the test_name is under the date directory, that would already be useful (I could process one date at a time). Similarly, the filename prefix could carry the test name (it doesn't help to have it in the middle).

  • Move files under <country> directories (e.g. autoclaved/jsonl/web_connectivity/IT/...)
    This will allow one to fetch or process a test in a single country without going over all the data. The country could also go in a filename prefix.

  • Use a uniform file format with a balanced number of measurements per file. We could have split files (e.g. measurements-XXXXX-of-01000.jsonl.gz) with about the same number of measurements in each.
    The grouping by report is not useful to consumers. The mix of json.lz4 and tar.lz4 prevents effective use of big data tools. Furthermore, LZ4 is not well supported, which rules out many tools.

  • Move HTTP bodies to a separate datastore.
    Most of the data is HTTP request/response bodies, which are often not useful for bulk processing. They would live in parallel files instead, perhaps with the same index as the measurements to make joining easy (e.g. measurements-01234-of-10000 joins with bodies-01234-of-10000). A rough sketch of this layout follows below.
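For illustration, here is a minimal Python sketch of what such a layout could look like. Every path, chunk size, and field name is an assumption for the example, not the pipeline's actual convention:

import gzip
import json
import os

CHUNK_SIZE = 10_000  # assumed target number of measurements per split file

def write_split(measurements, out_dir, test_name, country):
    # Layout assumption: <test_name>/<country>/measurements-XXXXX-of-NNNNN.jsonl.gz,
    # with HTTP bodies moved into a parallel bodies-XXXXX-of-NNNNN.jsonl.gz file.
    chunks = [measurements[i:i + CHUNK_SIZE] for i in range(0, len(measurements), CHUNK_SIZE)]
    prefix = os.path.join(out_dir, test_name, country)
    os.makedirs(prefix, exist_ok=True)
    for idx, chunk in enumerate(chunks):
        suffix = "%05d-of-%05d.jsonl.gz" % (idx, len(chunks))
        with gzip.open(os.path.join(prefix, "measurements-" + suffix), "wt") as meas_f, \
             gzip.open(os.path.join(prefix, "bodies-" + suffix), "wt") as body_f:
            for m in chunk:
                # Move the (large, hypothetical) "body" field into the parallel file,
                # keyed by measurement id so the two files join on the same index.
                body = m.pop("body", None)
                body_f.write(json.dumps({"id": m.get("id"), "body": body}) + "\n")
                meas_f.write(json.dumps(m) + "\n")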

fortuna (Author) commented Mar 16, 2018

Are any of those low-hanging fruit? For example, maybe rename the files to have the test name and country as a prefix?

cc @darkk @hellais

hellais (Member) commented Mar 19, 2018

@fortuna I think the first point, "Move files under <test_name> directories", is pretty important and would benefit a larger audience too, but it's not exactly "low hanging fruit".

Currently our data processing pipeline depends quite a bit on having files laid out in a certain way, with every measurement for a particular date inside the bucket prefix (YYYY-MM-DD), so it's not going to be so simple to change the schema of the already emitted measurements (i.e. the autoclaved data).

It's probably going to be easier to add support for another format, such as parquet, and write the data out in a more convenient layout.
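One possible direction (a sketch, not the pipeline's implementation) would be a separate export step that reads the existing JSONL and writes Parquet, projecting only a few top-level fields; the field names below are just examples:

import json

import pyarrow as pa
import pyarrow.parquet as pq

def jsonl_to_parquet(jsonl_path, parquet_path,
                     fields=("test_name", "probe_cc", "measurement_start_time")):
    # Project a handful of top-level fields out of a line-delimited JSON file
    # and write them as a flat Parquet table.
    columns = {f: [] for f in fields}
    with open(jsonl_path) as fh:
        for line in fh:
            record = json.loads(line)
            for f in fields:
                columns[f].append(record.get(f))
    pq.write_table(pa.Table.from_pydict(columns), parquet_path)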

Move files under <country> directories (e.g. autoclaved/jsonl/web_connectivity/IT/...)

I am not sure this is something we should actually do. In most cases, when you are consuming data in batch, you will be interested in measurements from all countries, and our country classification sometimes has bugs (geoip is inaccurate), so we probably don't want to do this.

Use a uniform file format with balanced number of measurements

I think this makes sense when formatting files for other export formats, though I believe this file format is also used internally by the data processing pipeline.
@darkk can probably say more on how hard this actually is.

Move HTTP bodies to a separate datastore

I can see why this is something that would be desirable for certain use cases, but maybe this is better solved by having specific pre-processed slices of the data that don't include fields that are not interesting to a user of that dataset.

As a general note, something to keep in mind is that the autoclaved/ files are also the master dataset that our data processing pipeline uses internally. I think that to meet the needs of end users, we are probably better off having some other batch export designed specifically for end-user consumption, rather than trying to adapt the internals of our data processing pipeline to the needs of end users.

Does this make sense?

fortuna (Author) commented Mar 19, 2018

test_name prefix
I understand the need for the date buckets. I actually ran into the same issue when trying to figure out how to keep my data in sync. We can keep those, but having a <date_bucket>/<test_name> prefix would be helpful.

Public format
Good to know that autoclaved is your internal format. I agree with you: having an export to a public format for end users makes sense.

On the body size, one test-agnostic heuristic I use is to trim all strings in the JSON to 1000 bytes. It saves a lot in the body field. Notice that if we had a columnar format, we could potentially just ignore that column, which would be great.
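For reference, that heuristic is roughly the following (the recursion over dicts and lists is my own sketch; 1000 bytes is the cutoff mentioned above):

def trim_strings(obj, max_len=1000):
    # Recursively truncate every string in a JSON-like structure to max_len bytes.
    if isinstance(obj, str):
        return obj.encode("utf-8")[:max_len].decode("utf-8", errors="ignore")
    if isinstance(obj, dict):
        return {k: trim_strings(v, max_len) for k, v in obj.items()}
    if isinstance(obj, list):
        return [trim_strings(v, max_len) for v in obj]
    return obj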

Parquet and friends
It seems parquet supports nested objects in its file format, but the serialization in the parquet, avro and Arrow Python libraries doesn't, as far as I can tell (at least I couldn't figure it out). You'd need to roll your own encoder. The pyspark library does seem to support nesting, according to this tutorial.
That would be convenient if you decide to use Spark.
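An untested sketch of what that looks like with pyspark (nested Rows are inferred as struct columns and written to Parquet as such; the fields are illustrative):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("ooni-export").getOrCreate()

measurements = [
    Row(report_id="r1", probe_cc="IT",
        test_keys=Row(blocking="dns", accessible=False)),
]
# Nested Rows become struct<> columns in the resulting Parquet schema.
spark.createDataFrame(measurements).write.mode("overwrite").parquet("measurements.parquet")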

One more thing: for a columnar format we may need to revisit the schema. Currently there are fields like "headers", where keys can be arbitrary strings. Formats that require schemas don't like that: all keys must be known in advance.
Instead of:

"headers": {
  "Foo": "foo",
  "Bar": "bar",
}

You'd need to restructure to something like

"headers": {[
  {"name": "Foo", "value": "foo"},
  {"name": "Bar", "value": "bar"},
]}

You wouldn't need to worry about duplicating the key names, because the columnar format compresses that away for you.
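In Arrow/Parquet terms, the restructured field would be a list of structs; a sketch with pyarrow:

import pyarrow as pa

headers_type = pa.list_(pa.struct([("name", pa.string()), ("value", pa.string())]))
schema = pa.schema([("headers", headers_type)])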

Thanks for the follow up!

darkk commented Apr 29, 2019

maybe this is better solved by having specific pre-processed slices

We're following the same idea in other places. E.g. PostgreSQL has the fields that are "interesting for sure". Maybe it actually makes sense to have a "shaved" version of the autoclaved files without HTTP bodies, replacing them with derived values {title, body_length, body_sha256, body_simhash, body_text_simhash}.

HTTP bodies take 97% of the lz4-compressed data size, so producing a slice of the dataset that is 130 GiB instead of 4200 GiB sounds like both a useful activity and a "low hanging fruit".
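A rough sketch of that shaving step (the simhash fields would need an external library, so only the length/sha256 part is shown; the assumption that bodies live under test_keys.requests[].response.body is mine):

import hashlib

def shave_measurement(measurement):
    # Replace each HTTP response body with small derived values.
    for req in measurement.get("test_keys", {}).get("requests", []):
        response = req.get("response") or {}
        body = response.pop("body", None)
        if isinstance(body, str):
            data = body.encode("utf-8", errors="ignore")
            response["body_length"] = len(data)
            response["body_sha256"] = hashlib.sha256(data).hexdigest()
    return measurement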

fortuna (Author) commented Jul 1, 2019

That would be helpful. The new metadata DB requires a complicated and fragile setup. I just need a data dump stripped of the fields I don't need.

hellais (Member) commented Nov 19, 2019

@FedericoCeratto this is a useful thread to keep in mind.

@hellais hellais transferred this issue from ooni/pipeline Jan 13, 2020
@hellais hellais added enhancement New feature request or improvement to existing functionality ooni/pipeline Issues related to https://github.com/ooni/pipeline labels Jan 13, 2020
FedericoCeratto (Contributor) commented:

Related to #203

FedericoCeratto (Contributor) commented:

While implementing database backups (#766) we discussed publishing the fastpath and JSONL tables to provide users with another way to access measurement data.

The jsonl table also provides an index of the files in the S3 data bucket. This could be used to selectively download and reprocess only the most interesting measurements.
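Something along these lines, where the jsonl column names and the bucket name are hypothetical placeholders rather than the real schema:

import boto3
import psycopg2

conn = psycopg2.connect("dbname=metadb")
cur = conn.cursor()
# Hypothetical columns: the actual jsonl table schema may differ.
cur.execute(
    "SELECT s3path FROM jsonl WHERE test_name = %s AND probe_cc = %s LIMIT 100",
    ("web_connectivity", "IT"),
)

s3 = boto3.client("s3")
for (s3path,) in cur.fetchall():
    # Substitute the real OONI data bucket name here.
    s3.download_file("ooni-data-bucket", s3path, s3path.split("/")[-1])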

hellais (Member) commented Jan 23, 2025

We now have better options for this and have mostly implemented the suggestions from @fortuna for the new data.

It's also worth noting that researchers are being given access to a notebook server where they can query our database directly which addresses a lot of these issues.

I suggest any future work on this is done as part of ooni/data#59. Closing.
