Store aggregated zone stats in a hypertable #1
First, we have to get all zone IDs with their related area and municipality. We can do so by running a query like the one in dd-zone-stats-aggregator/db.py, lines 15 to 25 (commit 163d441).
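The referenced db.py snippet is not shown inline here. A minimal sketch of what such a query could look like — table and column names (`zones`, `zone_id`, `area`, `municipality`) are assumptions; the actual ones live in db.py:

```python
# Hypothetical query: the table/column names are assumptions, not the
# actual schema from db.py. ST_AsGeoJSON is the PostGIS function that
# serializes a geometry column to GeoJSON.
ZONES_QUERY = """
SELECT zone_id,
       ST_AsGeoJSON(area) AS area,
       municipality
FROM zones;
"""

def fetch_zones(cursor):
    """Run the query on any DB-API cursor and return a list of dicts."""
    cursor.execute(ZONES_QUERY)
    return [
        {"zone_id": zone_id, "area": area, "municipality": municipality}
        for zone_id, area, municipality in cursor.fetchall()
    ]
```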
This works: we get a list with zone IDs, the area as GeoJSON, and the municipality. But getting the … One thing we tried is adding a … We think of another solution: we can do a database query at the moment the script starts and store the results locally in a script variable. Only if zones are updated or removed do we reload the data of that specific zone. We will add a …
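The caching idea above can be sketched as follows. This is a sketch, not the actual implementation: the `loader`/`load_one` callables and the `zone_id` key are assumed names standing in for whatever db.py exposes.

```python
class ZoneCache:
    """In-memory cache of zone rows, loaded once at script start.

    `load_all` is a callable returning all zone dicts from the database;
    `load_one` fetches a single zone dict by ID (or None if it was removed).
    Both names are hypothetical stand-ins for the real db.py functions.
    """

    def __init__(self, load_all, load_one):
        self._load_all = load_all
        self._load_one = load_one
        self._zones = {}

    def warm(self):
        # One database round-trip when the script starts.
        self._zones = {z["zone_id"]: z for z in self._load_all()}

    def zones(self):
        return list(self._zones.values())

    def refresh(self, zone_id):
        # Called only when a zone is updated or removed.
        zone = self._load_one(zone_id)
        if zone is None:
            self._zones.pop(zone_id, None)  # zone was removed upstream
        else:
            self._zones[zone_id] = zone
```

This keeps the hot loop free of database round-trips: the full query runs once at startup, and only changed zones are re-fetched afterwards.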
Now that we have all zones, we loop over them and count the vehicles in each. These counts will be stored in the database table with aggregated zone stats. At the moment, calculating the vehicle counts for 3,800 zones takes 22 seconds: dd-zone-stats-aggregator/main.py, lines 23 to 28 (commit 163d441).
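The per-zone counting step boils down to a point-in-polygon test for every vehicle against every zone. A minimal self-contained sketch (the real code likely uses a geometry library; this is the classic ray-casting test on a plain list of `(x, y)` coordinate pairs):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of (x, y) pairs)?"""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Toggle whenever a horizontal ray from (x, y) crosses edge (i, j).
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def count_vehicles_per_zone(zones, vehicles):
    """zones: {zone_id: polygon}; vehicles: list of (x, y) positions."""
    return {
        zone_id: sum(1 for (x, y) in vehicles if point_in_polygon(x, y, polygon))
        for zone_id, polygon in zones.items()
    }
```

The cost is O(zones × vehicles × vertices), which is why highly detailed municipality borders dominate the 22 seconds.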
The reason it takes a while is that the zones are very detailed, with many geopoints defining the area boundaries. To improve the speed we could create a shadow GeoJSON object that is less detailed (has fewer points); there are libraries for this. The aggregated stats would then be somewhat less accurate, but the calculation speed would improve. It is mostly the municipality borders that are very detailed; custom zones are not, though this could change in the future. Another way to cope with the calculation time is to aggregate data for the custom zones every minute, while the municipality zones only get a precision of 5 minutes. This way we have the best of both worlds: highest accuracy and detailed stats (1 to 5 minute precision).
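Libraries such as shapely (`geometry.simplify(tolerance)`) do this simplification out of the box; the underlying idea is the Ramer–Douglas–Peucker algorithm, sketched here in plain Python to show the accuracy/speed trade-off — the tolerance value below is illustrative, not tuned:

```python
import math


def _perpendicular_distance(pt, a, b):
    """Distance from pt to the infinite line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop points closer than `tolerance` to the hull."""
    if len(points) < 3:
        return points
    # Find the point farthest from the segment joining the endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > tolerance:
        # Keep the far point and recurse on both halves.
        left = simplify(points[: index + 1], tolerance)
        right = simplify(points[index:], tolerance)
        return left[:-1] + right
    # Everything in between is within tolerance: keep only the endpoints.
    return [points[0], points[-1]]
```

A larger tolerance yields fewer points (faster point-in-polygon tests) at the cost of boundary accuracy, which is exactly the trade-off described above.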
We want to create a table for the aggregated zone stats; we want to be able to filter on zone, on modality, and on operator. The aggregated zone stats table:
TODO:
We choose TIMESTAMPTZ so we can easily check by hand whether the data is correct.
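A sketch of what the table and its conversion to a hypertable could look like. The table and column names are assumptions based on the filter requirements above (zone, modality, operator), not the actual schema:

```python
# Hypothetical DDL: names are assumptions, only TIMESTAMPTZ and the
# hypertable conversion come from the discussion above.
CREATE_TABLE = """
CREATE TABLE zone_stats (
    time          TIMESTAMPTZ NOT NULL,
    zone_id       INTEGER     NOT NULL,
    modality      TEXT        NOT NULL,
    operator      TEXT        NOT NULL,
    vehicle_count INTEGER     NOT NULL
);
"""

# TimescaleDB's create_hypertable() partitions the plain table into
# time-based chunks, which is what makes time-range queries fast.
CREATE_HYPERTABLE = "SELECT create_hypertable('zone_stats', 'time');"
```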
Example query for getting stats at the minute level:
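The original query is not shown here; a sketch of what a minute-level query could look like, using TimescaleDB's `time_bucket` function (table and column names assumed as in the schema sketch above):

```python
# Hypothetical query: column names are assumptions. time_bucket() is the
# TimescaleDB function for grouping rows into fixed time intervals.
MINUTE_STATS_QUERY = """
SELECT time_bucket('1 minute', time) AS minute,
       zone_id,
       modality,
       operator,
       avg(vehicle_count) AS avg_vehicles
FROM zone_stats
WHERE time > now() - INTERVAL '1 day'
GROUP BY minute, zone_id, modality, operator
ORDER BY minute;
"""
```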
See database size:
Expected storage size, based on first results:
We think we can make this more efficient. One hint we have is:
We expect the database will be at most 100G per year if we enable compression and aggregate 3,000 zones every 15 minutes. We will aggregate microhubs every 15 minutes, but city zones only every hour, so if there are not many microhubs, the amount of storage space needed will be less than 100G (we expect). Things we want to do as next steps:
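The row-count arithmetic behind that estimate, assuming one row per zone per 15-minute interval (in reality rows multiply by the modality × operator breakdown, so this is a lower bound on rows and the per-row figure is only implied):

```python
# Assumptions: 3,000 zones, one aggregated row per zone every 15 minutes.
zones = 3000
intervals_per_year = 4 * 24 * 365        # 15-minute intervals in a year
rows_per_year = zones * intervals_per_year
print(rows_per_year)                     # 105,120,000 rows per year

# A 100G/year budget implies roughly this many bytes per row on disk:
implied_bytes_per_row = 100 * 10**9 / rows_per_year
print(round(implied_bytes_per_row))      # ~951 bytes/row before compression
```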
We want to make it possible to do data analysis on zone level.
To make this possible, we have to create a table with aggregated zone stats.
To make querying the table fast, we make it a TimescaleDB Hypertable.
This issue can be closed if the following is done: