The current implementation partitions parquet datasets by time, space, and input file (i.e. each file in the final parquet dataset contains data from exactly one input file, located in one of the tiles composing the grid of the data, and pertaining to a specific period of time). This can generate a very large number of files, some of which may contain very few rows (files containing a single row are possible under this framework).
Querying a parquet dataset remotely gets slower as the number of files grows, because the reader needs to read the metadata section of each file to know which files contain data of interest for a given query. The helper functions provided in this repository minimise this issue by relying heavily on the hive structure of the parquet dataset, filtering files based on the names of the folders they are contained in. This is done by computing the ID(s) of the tile and time period of interest when running any sort of spatiotemporal query, doing manually something that should ideally be handled by the parquet reader itself.
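As a rough illustration of what such helpers have to do (the partition keys, tiling scheme, and function names below are hypothetical, not the actual ParquetDataQuery.py API), a spatiotemporal query is translated into hive partition filters before any data file is opened:

```python
# Hypothetical sketch of the manual hive-partition filtering described above.
# The partition keys ("tile_id", "month") and the regular-grid tiling are
# assumptions, not the layout actually used by this repository.
import math

import pyarrow.dataset as ds


def tiles_for_bbox(xmin, ymin, xmax, ymax, tile_size=1.0):
    """Return the IDs of the grid tiles intersecting the query bounding box."""
    cols = range(math.floor(xmin / tile_size), math.floor(xmax / tile_size) + 1)
    rows = range(math.floor(ymin / tile_size), math.floor(ymax / tile_size) + 1)
    return [f"{c}_{r}" for c in cols for r in rows]


def query(dataset_path, bbox, months):
    dataset = ds.dataset(dataset_path, format="parquet", partitioning="hive")
    # Only files under the matching tile/month folders are ever opened.
    tile_filter = ds.field("tile_id").isin(tiles_for_bbox(*bbox))
    time_filter = ds.field("month").isin(months)
    return dataset.to_table(filter=tile_filter & time_filter)
```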
The current solution makes it efficient to extract data for a specific point in time or space and easy to modify the dataset if some input files are reprocessed, but it has two main disadvantages:
Querying datasets with standard parquet libraries is possible but very inefficient; the functions in ParquetDataQuery.py are required to make efficient use of the structure
Querying on fields other than the partition keys is not efficient because the reader still needs to look at every file
Solutions trying to minimise these disadvantages should focus on reducing the number of files created for each parquet dataset. Some possible ideas are:
Don't partition by input files
Remove the partitioning by input files and add a column containing the name of the input file for each row. This would still make it possible to know which rows need to be changed when an input file has been reprocessed; each partition containing rows from the reprocessed file would have to be rewritten completely, trading off some data-management efficiency for querying efficiency.
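A minimal sketch of this idea, assuming a pyarrow workflow and a hypothetical source_file column (neither is part of the current codebase):

```python
# Sketch: keep the input-file name as a regular "source_file" column instead of
# a partition key. Reprocessing an input file then means rewriting every
# partition that contains rows coming from that file.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq


def replace_reprocessed_rows(partition_path, reprocessed_name, new_rows: pa.Table):
    table = pq.read_table(partition_path)
    # Drop the rows that came from the reprocessed input file...
    mask = pc.not_equal(table["source_file"], reprocessed_name)
    kept = table.filter(mask)
    # ...then append the reprocessed rows (same schema assumed) and rewrite
    # the whole partition.
    updated = pa.concat_tables([kept, new_rows])
    pq.write_table(updated, partition_path)
```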
Don't partition by time
Remove the temporal partitions, make sure that the data inside each partition is sorted temporally, and set row-group sizes appropriately to make temporal queries efficient.
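A minimal sketch of this approach with pyarrow; the timestamp column name and the row-group size are assumptions that would need benchmarking:

```python
# Sketch: drop the temporal partitions, sort each partition by time and keep
# row groups small, so that temporal predicates can skip most row groups using
# the per-row-group min/max statistics.
import pyarrow.parquet as pq

ROWS_PER_GROUP = 50_000  # assumed value; needs benchmarking

table = pq.read_table("partition.parquet")
table = table.sort_by([("timestamp", "ascending")])  # assumed column name
pq.write_table(table, "partition_sorted.parquet", row_group_size=ROWS_PER_GROUP)

# A temporal query can then prune row groups via predicate pushdown, e.g.:
# pq.read_table("partition_sorted.parquet",
#               filters=[("timestamp", ">=", start), ("timestamp", "<", end)])
```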
Don't partition by space
Remove the spatial partitions and use a spatial shuffling algorithm to organise the data into groups of spatially close rows. This could be computationally intensive depending on the amount of data it is applied to, but it can make spatial queries efficient. To get the best efficiency on spatial queries, GeoParquet datasets should include a bbox column as described by the GeoParquet 1.1 spec. This point is not compatible with the previous one, but the rigid grid currently in use to split the data is one of the main causes of very small partitions.
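One possible sketch of such a spatial shuffle, using a Z-order (Morton) key as one of several space-filling-curve options; the lon/lat column names are assumptions, and the bbox struct follows the spirit of the GeoParquet 1.1 bbox column (degenerate boxes for point data):

```python
# Sketch: order rows along a Z-order (Morton) curve so that spatially close
# rows end up in the same row groups, and add a bbox struct column
# (xmin/ymin/xmax/ymax) in the spirit of the GeoParquet 1.1 spec.
import pyarrow as pa
import pyarrow.parquet as pq


def morton_key(lon, lat, bits=16):
    """Interleave the bits of quantised lon/lat into a single Z-order key."""
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key


def spatial_shuffle(table: pa.Table, lon_col="lon", lat_col="lat") -> pa.Table:
    lons = table[lon_col].to_pylist()
    lats = table[lat_col].to_pylist()
    keys = pa.array([morton_key(x, y) for x, y in zip(lons, lats)], pa.uint64())
    lon_arr = pa.array(lons, pa.float64())
    lat_arr = pa.array(lats, pa.float64())
    # Degenerate per-row bounding boxes, assuming point geometries.
    bbox = pa.StructArray.from_arrays(
        [lon_arr, lat_arr, lon_arr, lat_arr],
        names=["xmin", "ymin", "xmax", "ymax"],
    )
    table = table.append_column("bbox", bbox).append_column("_zorder", keys)
    return table.sort_by("_zorder").drop_columns(["_zorder"])


# Usage: pq.write_table(spatial_shuffle(pq.read_table("partition.parquet")),
#                       "partition_shuffled.parquet", row_group_size=50_000)
```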
Only partition by input files
Use exclusively the input files for partitioning. This assumes that data in each file will be close in space and/or in time; one of the two previous solutions (temporal or spatial sort) can be applied to each file to make spatiotemporal queries efficient while still allowing input files to be reprocessed very easily. E.g. for a dataset with one input file per day covering a large area, we can convert each file into a single parquet partition, apply a spatial shuffle to it, and set the row groups to be small enough (some benchmarking will be needed to find the right size).
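A rough sketch of that conversion, assuming CSV inputs with lon/lat columns and a hive-style source_file folder name; the sort is a crude stand-in for the spatial shuffle discussed above, and the row-group size needs benchmarking:

```python
# Sketch: one parquet partition per input file with small row groups; a spatial
# shuffle (see the sketch under "Don't partition by space") or a temporal sort
# would be applied to each table before writing.
import pathlib

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

ROWS_PER_GROUP = 20_000  # assumed value; needs benchmarking

for input_file in sorted(pathlib.Path("input").glob("*.csv")):  # assuming CSV inputs
    table = pacsv.read_csv(input_file)
    # Crude stand-in for a spatial shuffle: sort by lat/lon (assumed columns).
    table = table.sort_by([("lat", "ascending"), ("lon", "ascending")])
    out = pathlib.Path("dataset") / f"source_file={input_file.stem}" / "part-0.parquet"
    out.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(table, str(out), row_group_size=ROWS_PER_GROUP)
```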
Build an index file
Create an index file such as the one proposed by OvertureMaps. This could be applied in combination with any of the other solutions. It still requires additional logic on top of plain parquet, but it is the closest thing to a shared standard we have at the moment, meaning it has the greatest chance of eventually becoming something supported natively by libraries.
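A minimal sketch of what such an index file could contain, derived from the per-file parquet statistics; the column names and layout are hypothetical, loosely inspired by the OvertureMaps proposal rather than taken from it:

```python
# Sketch: build a small index parquet file listing, for each data file, its
# spatial and temporal extent, derived from the per-file parquet statistics.
import glob

import pyarrow as pa
import pyarrow.parquet as pq


def file_extent(path, lon_col="lon", lat_col="lat", time_col="timestamp"):
    md = pq.ParquetFile(path).metadata
    names = md.schema.names
    idx = {c: names.index(c) for c in (lon_col, lat_col, time_col)}
    groups = [md.row_group(i) for i in range(md.num_row_groups)]

    def minmax(col):
        stats = [rg.column(idx[col]).statistics for rg in groups]
        return min(s.min for s in stats), max(s.max for s in stats)

    (xmin, xmax), (ymin, ymax) = minmax(lon_col), minmax(lat_col)
    tmin, tmax = minmax(time_col)
    return {"path": path, "xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax,
            "tmin": tmin, "tmax": tmax}


rows = [file_extent(p) for p in glob.glob("dataset/**/*.parquet", recursive=True)]
pq.write_table(pa.Table.from_pylist(rows), "dataset/_index.parquet")

# A query helper would read only "_index.parquet", intersect the query bbox and
# time range with these extents, and open just the matching data files.
```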