Make data easier to consume in bulk #191
Comments
@fortuna I think the first point, "Move files under `<test_name>` directories", is pretty important to do and would benefit a larger audience too, but it's not exactly "low hanging fruit". Currently our data processing pipeline depends quite a bit on having files laid out in a certain way, with every measurement for a particular date living inside the same bucket prefix. It's probably going to be easier to add support for another format, such as Parquet, and write those files in a more comfortable layout.
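A minimal sketch, assuming gzip-compressed JSONL input, of what writing measurements out as Parquet with pyarrow could look like; the selected columns are only illustrative, not the pipeline's actual schema:

```python
import gzip
import json

import pyarrow as pa
import pyarrow.parquet as pq


def jsonl_to_parquet(jsonl_gz_path: str, parquet_path: str) -> None:
    # Read gzip-compressed JSONL measurements and keep a few flat fields.
    rows = []
    with gzip.open(jsonl_gz_path, "rt") as f:
        for line in f:
            m = json.loads(line)
            rows.append({
                "report_id": m.get("report_id"),
                "test_name": m.get("test_name"),
                "probe_cc": m.get("probe_cc"),
                "measurement_start_time": m.get("measurement_start_time"),
            })
    # Write a columnar Parquet file; consumers can then read only the
    # columns they care about.
    pq.write_table(pa.Table.from_pylist(rows), parquet_path, compression="zstd")
```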
I am not sure this is something we should actually do. In most cases, when you are consuming data in bulk, you will be interested in having measurements from all the countries, and our country classification sometimes has bugs (geoip is inaccurate), so we probably don't want to do this.
I think this makes sense when we write the files out in other data formats, though I believe the current file format is also something used internally by the data processing pipeline.
I can see why this would be desirable for certain use cases, but maybe it's better solved by having specific pre-processed slices of the data that don't include fields that are not interesting to a user of that dataset. As a general note, something to keep in mind is that […] Does this make sense?
test_name prefix

Public format

On the body size, one test-agnostic heuristic I use is to trim all strings in the JSON to 1000 bytes (a rough sketch of this follows after this comment). It saves a lot in the body field. Notice that if we had a columnar format, we could potentially just ignore that column, which would be great.

Parquet and friends

One more thing: for a columnar format we may need to revisit the schema. Currently there are fields like "headers", where keys can be arbitrary strings. The formats that require schemas don't like that: all keys must be known.
You'd need to restructure to something like
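For illustration, the idea would be roughly the following (the header names are just hypothetical examples):

```python
# Instead of a map with arbitrary header names as keys...
headers_as_map = {"Content-Type": "text/html", "Server": "nginx"}

# ...use a list of fixed-schema key/value records, which schema-first
# columnar formats such as Parquet can describe.
headers_as_list = [
    {"key": "Content-Type", "value": "text/html"},
    {"key": "Server", "value": "nginx"},
]
```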
You wouldn't need to worry about the duplication of the keys because the format removes that for you. Thanks for the follow-up!
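A rough sketch of the string-trimming heuristic mentioned above, assuming plain decoded JSON; this is one possible implementation, not the pipeline's actual code:

```python
import json

MAX_BYTES = 1000


def trim_strings(obj, max_bytes=MAX_BYTES):
    # Recursively truncate every string value in a decoded JSON document.
    if isinstance(obj, str):
        return obj.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    if isinstance(obj, list):
        return [trim_strings(v, max_bytes) for v in obj]
    if isinstance(obj, dict):
        return {k: trim_strings(v, max_bytes) for k, v in obj.items()}
    return obj


# Usage: measurement = trim_strings(json.loads(line))
```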
We're following the same idea in other places. E.g. PostgreSQL has the fields that are "interesting for sure". Maybe it actually makes sense to have a "shaved" version of the autoclaved files without HTTP bodies, replacing them with derived values. HTTP bodies take 97% of the lz4-compressed data size, so producing a slice of the dataset that is 130 GiB instead of 4200 GiB sounds like both a useful activity and a "low hanging fruit".
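A sketch of what "shaving" a single entry could look like, replacing the raw body with derived values such as its length and a hash; the field names are assumptions for illustration, not the actual autoclaved schema:

```python
import hashlib


def shave_body(entry: dict) -> dict:
    # Replace a raw HTTP body with derived values: its length and a sha256.
    body = entry.pop("body", None)  # "body" is a hypothetical field name
    if isinstance(body, str):
        raw = body.encode("utf-8", errors="ignore")
        entry["body_length"] = len(raw)
        entry["body_sha256"] = hashlib.sha256(raw).hexdigest()
    return entry
```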
That would be helpful. The new metadata DB requires a complicated and fragile setup. I just need a data dump without most of the stuff I don't need.
@FedericoCeratto this is a useful thread to keep in mind.
Related to #203
While implementing database backups (#766) we discussed publishing the fastpath and JSONL tables to provide users with another way to access measurement data. The jsonl table also provides an index of the files on the S3 data bucket. This could be used to selectively download and reprocess only the most interesting measurements.
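A minimal sketch of such a selective download, assuming the list of S3 keys has already been selected from the index; the bucket name and key layout are placeholders, not the real ones:

```python
import os

import boto3


def download_selected(bucket: str, keys: list, dest_dir: str = ".") -> None:
    # Download only the objects whose keys were selected from the index.
    s3 = boto3.client("s3")
    for key in keys:
        local_path = os.path.join(dest_dir, os.path.basename(key))
        s3.download_file(bucket, key, local_path)
```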
We now have better options for this and have mostly implemented the suggestions from @fortuna for the new data. It's also worth noting that researchers are being given access to a notebook server where they can query our database directly, which addresses a lot of these issues. I suggest any future work on this be done as part of ooni/data#59. Closing.
I'll use this bug to collect ideas to make it easier to consume OONI data in bulk.
- Move files under `<test_name>` directories (e.g. `autoclaved/jsonl/web_connectivity/...`). This will allow one to fetch or process a test type without having to go over all the tests. One may list all files, filter, then use that as input; however, thousands of filenames as input do not work well with big data tools. Even if the `test_name` is under the date directory, that would already be useful (I could process one date at a time). Similarly, the filename prefix could have the test name (it doesn't help to have it in the middle).
- Move files under `<country>` directories (e.g. `autoclaved/jsonl/web_connectivity/IT/...`). This will allow one to fetch or process a test in a country without going over all the data. This could also be done as a filename prefix.
- Use a uniform file format with a balanced number of measurements. We can have split files (e.g. `measurements-XXXXX-of-01000.jsonl.gz`) with about the same number of measurements each. The grouping by report is not useful to consumers. The mix of `json.lz4` and `tar.lz4` prevents effective use of big data tools. Furthermore, LZ4 is not well supported, which also prevents the use of many tools.
- Move HTTP bodies to a separate datastore. Most of the data is HTTP request/response bodies, which are often not useful for bulk processing. They would live in parallel files instead, maybe with the same index as the measurements to make joining easy (e.g. measurements-01234-of-10000 joins with bodies-01234-of-10000); a rough sketch of this split layout follows after this list.
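A minimal sketch of the split layout described in the last two points, assuming gzip-compressed JSONL shards and a plain `body` field; both choices are illustrative rather than an agreed format:

```python
import gzip
import json
from itertools import islice

SHARD_SIZE = 10_000  # measurements per shard; an arbitrary choice


def write_shards(measurements, total_label="10000"):
    # Split a stream of measurement dicts into same-sized .jsonl.gz shards,
    # writing HTTP bodies to parallel shards with the same index so they
    # can be joined back by filename.
    it = iter(measurements)
    shard = 0
    while True:
        batch = list(islice(it, SHARD_SIZE))
        if not batch:
            break
        meas_path = f"measurements-{shard:05d}-of-{total_label}.jsonl.gz"
        body_path = f"bodies-{shard:05d}-of-{total_label}.jsonl.gz"
        with gzip.open(meas_path, "wt") as mf, gzip.open(body_path, "wt") as bf:
            for m in batch:
                body = m.pop("body", None)  # "body" is a hypothetical field name
                mf.write(json.dumps(m) + "\n")
                bf.write(json.dumps({"body": body}) + "\n")
        shard += 1
```

With this layout, a consumer that doesn't need bodies can read only the `measurements-*` shards, while the shared shard index makes joining bodies back trivial.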