Make data easier to consume in bulk #191

Closed
4 tasks
fortuna opened this issue Mar 16, 2018 · 9 comments
Assignees
Labels
enhancement New feature request or improvement to existing functionality ooni/pipeline Issues related to https://github.com/ooni/pipeline priority/low Nice to have

Comments

fortuna commented Mar 16, 2018

I'll use this bug to collect ideas to make it easier to consume OONI data in bulk.

  • Move files under <test_name> directories (e.g. autoclaved/jsonl/web_connectivity/...)
    This will allow one to fetch or process a test type without having to go over all the tests. One could list all files, filter, then use that list as input; however, thousands of filenames as input do not work well with big data tools.
    Even if the test_name is under the date directory, that would already be useful (I could process one date at a time). Similarly, the filename prefix could carry the test name (it doesn't help to have it in the middle).

  • Move files under <country> directories (e.g. autoclaved/jsonl/web_connectivity/IT/...)
    This will allow one to fetch or process a test in a single country without going over all the data. The country could also go in a filename prefix.

  • Use a uniform file format with a balanced number of measurements per file. We could have split files (e.g. measurements-XXXXX-of-01000.jsonl.gz) with about the same number of measurements in each.
    The grouping by report is not useful to consumers. The mix of json.lz4 and tar.lz4 prevents effective use of big data tools. Furthermore, LZ4 is not well supported, which rules out many tools.

  • Move HTTP bodies to a separate datastore.
    Most of the data is HTTP request/response bodies, which are often not useful for bulk processing. They would live in parallel files instead, perhaps with the same index as the measurements to make joining easy (e.g. measurements-01234-of-10000 joins with bodies-01234-of-10000). A rough sketch of this layout follows below.
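For illustration, here is a minimal Python sketch of what such a layout could look like. Every path, chunk size, and field name is an assumption for the example, not the pipeline's actual convention:

import gzip
import json
import os

CHUNK_SIZE = 10_000  # assumed target number of measurements per split file

def write_split(measurements, out_dir, test_name, country):
    # Layout assumption: <test_name>/<country>/measurements-XXXXX-of-NNNNN.jsonl.gz,
    # with HTTP bodies moved into a parallel bodies-XXXXX-of-NNNNN.jsonl.gz file.
    chunks = [measurements[i:i + CHUNK_SIZE] for i in range(0, len(measurements), CHUNK_SIZE)]
    prefix = os.path.join(out_dir, test_name, country)
    os.makedirs(prefix, exist_ok=True)
    for idx, chunk in enumerate(chunks):
        suffix = "%05d-of-%05d.jsonl.gz" % (idx, len(chunks))
        with gzip.open(os.path.join(prefix, "measurements-" + suffix), "wt") as meas_f, \
             gzip.open(os.path.join(prefix, "bodies-" + suffix), "wt") as body_f:
            for m in chunk:
                # Move the (large, hypothetical) "body" field into the parallel file,
                # keyed by measurement id so the two files join on the same index.
                body = m.pop("body", None)
                body_f.write(json.dumps({"id": m.get("id"), "body": body}) + "\n")
                meas_f.write(json.dumps(m) + "\n")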

fortuna (Author) commented Mar 16, 2018

Are any of those low-hanging fruit? For example, maybe rename the files to have the test name and country as a prefix?

cc @darkk @hellais

hellais (Member) commented Mar 19, 2018

@fortuna I think the first point, "Move files under <test_name> directories", is pretty important and would benefit a larger audience too, but it's not exactly "low hanging fruit".

Currently our data processing pipeline depends quite a bit on having files laid out in a certain way, with every measurement for a particular date inside the bucket prefix (YYYY-MM-DD), so it's not going to be so simple to change the schema of the already emitted measurements (i.e. the autoclaved data).

It's probably going to be easier to add support for another format, such as parquet, and write the data out in a more convenient layout.
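One possible direction (a sketch, not the pipeline's implementation) would be a separate export step that reads the existing JSONL and writes Parquet, projecting only a few top-level fields; the field names below are just examples:

import json

import pyarrow as pa
import pyarrow.parquet as pq

def jsonl_to_parquet(jsonl_path, parquet_path,
                     fields=("test_name", "probe_cc", "measurement_start_time")):
    # Project a handful of top-level fields out of a line-delimited JSON file
    # and write them as a flat Parquet table.
    columns = {f: [] for f in fields}
    with open(jsonl_path) as fh:
        for line in fh:
            record = json.loads(line)
            for f in fields:
                columns[f].append(record.get(f))
    pq.write_table(pa.Table.from_pydict(columns), parquet_path)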

Move files under <country> directories (e.g. autoclaved/jsonl/web_connectivity/IT/...)

I am not sure this is something we should actually do. In most cases, when you are consuming data in batch, you will be interested in measurements from all countries, and our country classification sometimes has bugs (geoip is inaccurate), so we probably don't want to do this.

Use a uniform file format with balanced number of measurements

I think this makes sense when formatting files for other export formats, though I believe this file format is also used internally by the data processing pipeline.
@darkk can probably say more on how hard this actually is.

Move HTTP bodies to a separate datastore

I can see why this is something that would be desirable for certain use cases, but maybe this is better solved by having specific pre-processed slices of the data that don't include fields that are not interesting to a user of that dataset.

As a general note, something to keep in mind is that the autoclaved/ files are also the master dataset that our data processing pipeline uses internally. I think that to meet the needs of end users, we are probably better off having some other batch export designed specifically for end-user consumption, rather than trying to adapt the internals of our data processing pipeline to the needs of end users.

Does this make sense?

fortuna (Author) commented Mar 19, 2018

test_name prefix
I understand the need for the date buckets. I actually ran into the same issue when trying to figure out how to keep my data in sync. We can keep those, but having a <date_bucket>/<test_name> prefix would be helpful.

Public format
Good to know that autoclaved is your internal format. I agree with you: having an export to a public format for end users makes sense.

On the body size, one test-agnostic heuristic I use is to trim all strings in the JSON to 1000 bytes. It saves a lot in the body field. Notice that if we had a columnar format, we could potentially just ignore that column, which would be great.
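For reference, that heuristic is roughly the following (the recursion over dicts and lists is my own sketch; 1000 bytes is the cutoff mentioned above):

def trim_strings(obj, max_len=1000):
    # Recursively truncate every string in a JSON-like structure to max_len bytes.
    if isinstance(obj, str):
        return obj.encode("utf-8")[:max_len].decode("utf-8", errors="ignore")
    if isinstance(obj, dict):
        return {k: trim_strings(v, max_len) for k, v in obj.items()}
    if isinstance(obj, list):
        return [trim_strings(v, max_len) for v in obj]
    return obj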

Parquet and friends
It seems parquet supports nested objects in its file format, but the serialization in the parquet, avro and Arrow Python libraries doesn't, as far as I can tell (at least I couldn't figure it out). You'd need to roll your own encoder. The pyspark library does seem to support nesting, according to this tutorial.
That would be convenient if you decide to use Spark.
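An untested sketch of what that looks like with pyspark (nested Rows are inferred as struct columns and written to Parquet as such; the fields are illustrative):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("ooni-export").getOrCreate()

measurements = [
    Row(report_id="r1", probe_cc="IT",
        test_keys=Row(blocking="dns", accessible=False)),
]
# Nested Rows become struct<> columns in the resulting Parquet schema.
spark.createDataFrame(measurements).write.mode("overwrite").parquet("measurements.parquet")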

One more thing: for a columnar format we may need to revisit the schema. Currently there are fields like "headers", where keys can be arbitrary strings. Formats that require schemas don't like that: all keys must be known in advance.
Instead of:

"headers": {
  "Foo": "foo",
  "Bar": "bar",
}

You'd need to restructure to something like

"headers": {[
  {"name": "Foo", "value": "foo"},
  {"name": "Bar", "value": "bar"},
]}

You wouldn't need to worry about duplicating the key names, because the columnar format compresses that away for you.
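In Arrow/Parquet terms, the restructured field would be a list of structs; a sketch with pyarrow:

import pyarrow as pa

headers_type = pa.list_(pa.struct([("name", pa.string()), ("value", pa.string())]))
schema = pa.schema([("headers", headers_type)])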

Thanks for the follow up!

darkk commented Apr 29, 2019

maybe this is better solved by having specific pre-processed slices

We're following the same idea in other places. E.g. PostgreSQL has the fields that are "interesting for sure". Maybe it actually makes sense to have a "shaved" version of the autoclaved files without HTTP bodies, replacing them with derived values {title, body_length, body_sha256, body_simhash, body_text_simhash}.

HTTP bodies take 97% of the lz4-compressed data size, so producing a slice of the dataset that is 130 GiB instead of 4200 GiB sounds like both a useful activity and a "low hanging fruit".
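A rough sketch of that shaving step (the simhash fields would need an external library, so only the length/sha256 part is shown; the assumption that bodies live under test_keys.requests[].response.body is mine):

import hashlib

def shave_measurement(measurement):
    # Replace each HTTP response body with small derived values.
    for req in measurement.get("test_keys", {}).get("requests", []):
        response = req.get("response") or {}
        body = response.pop("body", None)
        if isinstance(body, str):
            data = body.encode("utf-8", errors="ignore")
            response["body_length"] = len(data)
            response["body_sha256"] = hashlib.sha256(data).hexdigest()
    return measurement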

fortuna (Author) commented Jul 1, 2019

That would be helpful. The new metadata DB requires a complicated and fragile setup. I just need a data dump stripped of the fields I don't need.

hellais (Member) commented Nov 19, 2019

@FedericoCeratto this is a useful thread to keep in mind.

@hellais hellais transferred this issue from ooni/pipeline Jan 13, 2020
@hellais hellais added enhancement New feature request or improvement to existing functionality ooni/pipeline Issues related to https://github.com/ooni/pipeline labels Jan 13, 2020
FedericoCeratto (Contributor) commented:

Related to #203

FedericoCeratto (Contributor) commented:

While implementing database backups (#766) we discussed publishing the fastpath and JSONL tables to provide users with another way to access measurement data.

The jsonl table also provides an index of the files in the S3 data bucket. This could be used to selectively download and reprocess only the most interesting measurements.
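Something along these lines, where the jsonl column names and the bucket name are hypothetical placeholders rather than the real schema:

import boto3
import psycopg2

conn = psycopg2.connect("dbname=metadb")
cur = conn.cursor()
# Hypothetical columns: the actual jsonl table schema may differ.
cur.execute(
    "SELECT s3path FROM jsonl WHERE test_name = %s AND probe_cc = %s LIMIT 100",
    ("web_connectivity", "IT"),
)

s3 = boto3.client("s3")
for (s3path,) in cur.fetchall():
    # Substitute the real OONI data bucket name here.
    s3.download_file("ooni-data-bucket", s3path, s3path.split("/")[-1])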

hellais (Member) commented Jan 23, 2025

We now have better options for this and have mostly implemented the suggestions from @fortuna for the new data.

It's also worth noting that researchers are being given access to a notebook server where they can query our database directly which addresses a lot of these issues.

I suggest any future work on this is done as part of ooni/data#59. Closing.
