
Store aggregated zone stats in a hypertable #1

Open
1 task
bartwr opened this issue Oct 26, 2022 · 6 comments


bartwr commented Oct 26, 2022

We want to make it possible to do data analysis at the zone level.

To make this possible, we have to create a table with aggregated zone stats.

To make querying the table fast, we make it a TimescaleDB Hypertable.

Hypertables are PostgreSQL tables with special features that make it easy to handle time-series data. Anything you can do with regular PostgreSQL tables, you can do with hypertables. In addition, you get the benefits of improved performance and user experience for time-series data.

This issue can be closed if the following is done:

  • We have a database table containing aggregated zone stats
bartwr added the enhancement label Oct 26, 2022
bartwr self-assigned this Oct 26, 2022

bartwr commented Oct 26, 2022

First, we have to get all zone IDs and related area + municipality. We can do so by running a query like:

SELECT zones.zone_id,
  -- Add a margin of 10 m to take GPS inaccuracy into account.
  -- Changed 10 m to 30 m as an experiment.
  -- This 30 m should only be added in case of custom zones.
  json_build_object(
    'type', 'Feature',
    'geometry', ST_AsGeoJSON(ST_GeometryN(area, 1)::geography)::json,
    'properties', json_build_object()
  ) AS area,
  municipality
FROM zones
WHERE zone_type IN ('municipality', 'residential_area', 'country', 'custom');

This works: we get a list with zone IDs, the area as GeoJSON, and the municipality. But fetching the area column takes a lot of time: for 3800 zones it takes 34 seconds. We have to make this faster.

One thing we tried is adding an area_geojson column with the GeoJSON precomputed from area. But this doesn't solve anything: the response time is still huge, i.e. 38 seconds. The bottleneck is simply the amount of data we request from the database.

We are thinking of another solution: run the database query once, at the moment the script starts, and store the results locally in a script variable. Only if zones are updated or removed do we reload the data of those specific zones. We will add a last_updated column to the zones table to keep track of changes. With a system like this we only pay the ~34 seconds of response time at the start of the script, and from then on it has quick access to the zone data for months. A sketch of the schema change and refresh query follows below.
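A minimal sketch of this approach, assuming we can alter the zones table directly; the trigger and the refresh query are hypothetical illustrations, not the final implementation:

-- Track when a zone was last changed, so the script can refresh its local cache.
ALTER TABLE zones
  ADD COLUMN last_updated TIMESTAMPTZ NOT NULL DEFAULT NOW();

-- Bump last_updated automatically on every update (hypothetical trigger).
CREATE OR REPLACE FUNCTION touch_zone_last_updated() RETURNS trigger AS $$
BEGIN
  NEW.last_updated := NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER zones_touch_last_updated
  BEFORE UPDATE ON zones
  FOR EACH ROW EXECUTE FUNCTION touch_zone_last_updated();

-- Periodically reload only the zones that changed since the previous refresh.
SELECT zone_id, area, municipality
FROM zones
WHERE last_updated > '2022-10-26 00:00:00+00';  -- timestamp of the last refresh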


bartwr commented Oct 26, 2022

Now that we have all zones, we loop over them and get the number of vehicles in each. These counts will be stored in the database table with the aggregated zone stats.

At the moment, calculating the number of vehicles for 3800 zones takes 22 seconds:

for zone in zones:
    # Ask Tile38 for all vehicles currently inside this zone's geometry.
    result = await tile38.get_vehicles(zone.area)
    if result is None:
        continue
    # Count the vehicles per modality (e.g. bicycle).
    counts = count_modes(result=result)
    print(counts)

The reason it takes a while is that the zones are very detailed, with many points defining the area boundaries.

To improve the speed we could create a shadow GeoJSON object that is less detailed (has fewer points). There are libraries for this. If we do this, the aggregated stats become somewhat less accurate, but the calculation speed improves. It is mostly the municipality borders that are very detailed; custom zones are not so detailed, but this could change in the future. See the sketch below.
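Since the zone geometries already live in PostGIS, one option is to let the database do the simplification. A minimal sketch; the 0.0001-degree tolerance (roughly 10 m) is an assumption to tune:

-- Precompute a simplified shadow geometry per zone.
-- ST_SimplifyPreserveTopology drops points while keeping the polygon valid.
SELECT zone_id,
  ST_AsGeoJSON(
    ST_SimplifyPreserveTopology(area, 0.0001)  -- tolerance in degrees, ~10 m
  )::json AS area_simplified
FROM zones
WHERE zone_type = 'municipality';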

Another way to cope with the time it takes to calculate the aggregated zone data: we could aggregate the custom zones every minute, while the municipality zones get a precision of only 5 minutes. This way we have the best of both worlds: the highest accuracy where it matters, and detailed stats (1 to 5 minute precision).


bartwr commented Oct 26, 2022

We want to create a table for the aggregated zone stats.

We want to be able to filter on zone, on modality and on operator.

The aggregated zone stats table aggregated_zone_stats could look like this:

ID | timestamp | zone_id | provider_id | modality
---|-----------|---------|-------------|---------
 … | …         | 1       | …           | bicycle
 … | …         | …       | NULL        | NULL
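One possible reading of the NULL values (an assumption, not confirmed above): they mark roll-up rows that aggregate over all providers and/or modalities. Such rows can be produced with GROUP BY ROLLUP; a hypothetical sketch, where vehicles stands in for whatever source holds the current vehicle positions:

-- Hypothetical: per provider/modality counts plus roll-up rows, where
-- NULL provider_id/modality means "aggregated over all of them".
SELECT zone_id, provider_id, modality, COUNT(*) AS number_of_vehicles
FROM vehicles
GROUP BY zone_id, ROLLUP (provider_id, modality);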

TODO:

  • In the zone stats aggregator: store aggregated stats into the hypertable


bartwr commented Oct 26, 2022

CREATE TABLE stats_number_of_vehicles_parked (
  time TIMESTAMPTZ NOT NULL,
  zone_id INTEGER NOT NULL,
  system_id VARCHAR(30) NOT NULL,
  modality VARCHAR(20) NOT NULL,
  number_of_vehicles_parked INT NOT NULL
);

We choose TIMESTAMPTZ so that a human can easily check whether the data is correct.

SELECT create_hypertable('stats_number_of_vehicles_parked', 'time');

-- Keep the index name within PostgreSQL's 63-character identifier limit
-- (the longer original name would be silently truncated).
CREATE INDEX stats_parked_zone_system_modality_time_idx
  ON stats_number_of_vehicles_parked (zone_id, system_id, modality, time DESC);
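For reference, the zone stats aggregator would then write rows like this (a hypothetical example; the operator name and count are made up, the zone ID is taken from the query further below):

INSERT INTO stats_number_of_vehicles_parked
  (time, zone_id, system_id, modality, number_of_vehicles_parked)
VALUES
  (NOW(), 51748, 'some_operator', 'bicycle', 12);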

bartwr changed the title from "Store aggragated zone stats in a hypertable" to "Store aggregated zone stats in a hypertable" Oct 26, 2022

bartwr commented Oct 28, 2022

Example query for getting stats for minute level:

SELECT
  system_id, modality,
  time_bucket('1 minute', time) AS bucket,
  AVG(number_of_vehicles_parked) AS avg_vehicles_parked
FROM stats_number_of_vehicles_parked
WHERE time > NOW() - INTERVAL '1 hour'
  AND zone_id = 51748
GROUP BY bucket, system_id, modality
ORDER BY bucket ASC, system_id;

See database size:

SELECT hypertable_size('stats_number_of_vehicles_parked');
SELECT * FROM hypertable_detailed_size('stats_number_of_vehicles_parked') ORDER BY node_name;

Expected storage size based on first results:

  • ~1.5 GB per day for all zones.

We think we can make this more efficient. One hint we got:

  • Turn on TimescaleDB's native compression (see the sketch below).
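A minimal sketch of enabling native compression, assuming TimescaleDB 2.x; the segmentby/orderby columns and the 7-day policy are assumptions to tune:

-- Enable native compression on the hypertable.
ALTER TABLE stats_number_of_vehicles_parked SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'zone_id, system_id, modality',
  timescaledb.compress_orderby = 'time DESC'
);

-- Automatically compress chunks older than 7 days.
SELECT add_compression_policy('stats_number_of_vehicles_parked', INTERVAL '7 days');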


bartwr commented Oct 28, 2022

We expect the database to be at most 100 GB per year if we enable compression and aggregate 3000 zones every 15 minutes.

We will aggregate microhubs every 15 minutes, but city zones only every hour. So if there aren't many microhubs, the amount of storage space needed will (we expect) be less than 100 GB.

Things we want to do as next steps:

  1. Create a separate database for stats_number_of_vehicles_parked
  2. Set up compression
  3. Create an API endpoint for all kinds of data we want to request
  4. Create the user interface for detailed microhub analysis
