
Store aggregated zone stats in a hypertable #1

Open
1 task
bartwr opened this issue Oct 26, 2022 · 6 comments


bartwr commented Oct 26, 2022

We want to make it possible to do data analysis at the zone level.

To make this possible, we have to create a table with aggregated zone stats.

To make querying the table fast, we make it a TimescaleDB Hypertable.

Hypertables are PostgreSQL tables with special features that make it easy to handle time-series data. Anything you can do with regular PostgreSQL tables, you can do with hypertables. In addition, you get the benefits of improved performance and user experience for time-series data.

This issue can be closed if the following is done:

  • We have a database table containing aggregated zone stats
bartwr added the enhancement label Oct 26, 2022
bartwr self-assigned this Oct 26, 2022

bartwr commented Oct 26, 2022

First, we have to get all zone IDs and related area + municipality. We can do so by running a query like:

SELECT zones.zone_id,
  -- Add a margin of 10 m to take GPS inaccuracy into account.
  -- Changed 10 m to 30 m as an experiment.
  -- This 30 m should only be added in case of custom zones.
  json_build_object(
    'type', 'Feature',
    'geometry', ST_AsGeoJSON(ST_GeometryN(area, 1)::geography)::json,
    'properties', json_build_object()
  ) AS area,
  municipality
FROM zones
WHERE zone_type IN ('municipality', 'residential_area', 'country', 'custom');

This works: we get a list with zone IDs, the area as GeoJSON, and the municipality. But fetching the area column takes a lot of time: for 3800 zones it takes 34 seconds. We have to make this faster.

One thing we tried is adding an area_geojson column with the GeoJSON precomputed from area. But this doesn't solve anything: the response time is still huge, i.e. 38 seconds. The bottleneck is simply the amount of data we request from the database.

We are thinking of another solution: run the database query once, at the moment the script starts, and store the results locally in a script variable. Only if zones are updated or removed do we reload the data of those specific zones. We will add a last_updated column to the zones table to keep track of changes. With a system like this we only pay the ~34 seconds of response time at the start of the script, and from then on it has quick access to the zone data for months. A sketch of the schema change and refresh query follows below.
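A minimal sketch of this approach, assuming we can alter the zones table directly; the trigger and the refresh query are hypothetical illustrations, not the final implementation:

-- Track when a zone was last changed, so the script can refresh its local cache.
ALTER TABLE zones
  ADD COLUMN last_updated TIMESTAMPTZ NOT NULL DEFAULT NOW();

-- Bump last_updated automatically on every update (hypothetical trigger).
CREATE OR REPLACE FUNCTION touch_zone_last_updated() RETURNS trigger AS $$
BEGIN
  NEW.last_updated := NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER zones_touch_last_updated
  BEFORE UPDATE ON zones
  FOR EACH ROW EXECUTE FUNCTION touch_zone_last_updated();

-- Periodically reload only the zones that changed since the previous refresh.
SELECT zone_id, area, municipality
FROM zones
WHERE last_updated > '2022-10-26 00:00:00+00';  -- timestamp of the last refresh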


bartwr commented Oct 26, 2022

Now that we have all zones, we loop over them and get the number of vehicles in each. These counts will be stored in the database table with the aggregated zone stats.

At the moment, calculating the number of vehicles for 3800 zones takes 22 seconds:

for zone in zones:
    # Ask Tile38 for all vehicles currently inside this zone's geometry.
    result = await tile38.get_vehicles(zone.area)
    if result is None:
        continue
    # Count the vehicles per modality (e.g. bicycle).
    counts = count_modes(result=result)
    print(counts)

The reason it takes a while is that the zones are very detailed, with many points defining the area boundaries.

To improve the speed we could create a shadow GeoJSON object that is less detailed (has fewer points). There are libraries for this. If we do this, the aggregated stats become somewhat less accurate, but the calculation speed improves. It is mostly the municipality borders that are very detailed; custom zones are not so detailed, but this could change in the future. See the sketch below.
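Since the zone geometries already live in PostGIS, one option is to let the database do the simplification. A minimal sketch; the 0.0001-degree tolerance (roughly 10 m) is an assumption to tune:

-- Precompute a simplified shadow geometry per zone.
-- ST_SimplifyPreserveTopology drops points while keeping the polygon valid.
SELECT zone_id,
  ST_AsGeoJSON(
    ST_SimplifyPreserveTopology(area, 0.0001)  -- tolerance in degrees, ~10 m
  )::json AS area_simplified
FROM zones
WHERE zone_type = 'municipality';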

Another way to cope with the time it takes to calculate the aggregated zone data: we could aggregate the custom zones every minute, while the municipality zones get a precision of only 5 minutes. This way we have the best of both worlds: the highest accuracy where it matters, and detailed stats (1 to 5 minute precision).


bartwr commented Oct 26, 2022

We want to create a table for the aggregated zone stats.

We want to be able to filter on zone, on modality and on operator.

The aggregated zone stats table aggregated_zone_stats could look like this:

ID | timestamp | zone_id | provider_id | modality
---|-----------|---------|-------------|---------
 … | …         | 1       | …           | bicycle
 … | …         | …       | NULL        | NULL
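One possible reading of the NULL values (an assumption, not confirmed above): they mark roll-up rows that aggregate over all providers and/or modalities. Such rows can be produced with GROUP BY ROLLUP; a hypothetical sketch, where vehicles stands in for whatever source holds the current vehicle positions:

-- Hypothetical: per provider/modality counts plus roll-up rows, where
-- NULL provider_id/modality means "aggregated over all of them".
SELECT zone_id, provider_id, modality, COUNT(*) AS number_of_vehicles
FROM vehicles
GROUP BY zone_id, ROLLUP (provider_id, modality);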

TODO:

  • In the zone stats aggregator: store aggregated stats into the hypertable


bartwr commented Oct 26, 2022

CREATE TABLE stats_number_of_vehicles_parked (
  time TIMESTAMPTZ NOT NULL,
  zone_id INTEGER NOT NULL,
  system_id VARCHAR(30) NOT NULL,
  modality VARCHAR(20) NOT NULL,
  number_of_vehicles_parked INT NOT NULL
);

We choose TIMESTAMPTZ so that a human can easily check whether the data is correct.

SELECT create_hypertable('stats_number_of_vehicles_parked', 'time');

-- Keep the index name within PostgreSQL's 63-character identifier limit
-- (the longer original name would be silently truncated).
CREATE INDEX stats_parked_zone_system_modality_time_idx
  ON stats_number_of_vehicles_parked (zone_id, system_id, modality, time DESC);
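For reference, the zone stats aggregator would then write rows like this (a hypothetical example; the operator name and count are made up, the zone ID is taken from the query further below):

INSERT INTO stats_number_of_vehicles_parked
  (time, zone_id, system_id, modality, number_of_vehicles_parked)
VALUES
  (NOW(), 51748, 'some_operator', 'bicycle', 12);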

bartwr changed the title from "Store aggragated zone stats in a hypertable" to "Store aggregated zone stats in a hypertable" Oct 26, 2022

bartwr commented Oct 28, 2022

Example query for getting stats for minute level:

SELECT
  system_id, modality,
  time_bucket('1 minute', time) AS bucket,
  AVG(number_of_vehicles_parked) AS avg_vehicles_parked
FROM stats_number_of_vehicles_parked
WHERE time > NOW() - INTERVAL '1 hour'
  AND zone_id = 51748
GROUP BY bucket, system_id, modality
ORDER BY bucket ASC, system_id;

See database size:

SELECT hypertable_size('stats_number_of_vehicles_parked');
SELECT * FROM hypertable_detailed_size('stats_number_of_vehicles_parked') ORDER BY node_name;

Expected storage size based on first results:

  • ~1.5 GB per day for all zones.

We think we can make this more efficient. One hint we got:

  • Turn on TimescaleDB's native compression (see the sketch below).
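A minimal sketch of enabling native compression, assuming TimescaleDB 2.x; the segmentby/orderby columns and the 7-day policy are assumptions to tune:

-- Enable native compression on the hypertable.
ALTER TABLE stats_number_of_vehicles_parked SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'zone_id, system_id, modality',
  timescaledb.compress_orderby = 'time DESC'
);

-- Automatically compress chunks older than 7 days.
SELECT add_compression_policy('stats_number_of_vehicles_parked', INTERVAL '7 days');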


bartwr commented Oct 28, 2022

We expect the database to be at most 100 GB per year if we enable compression and aggregate 3000 zones every 15 minutes.

We will aggregate microhubs every 15 minutes, but city zones only every hour. So if there aren't many microhubs, the amount of storage space needed will (we expect) be less than 100 GB.

Things we want to do as next steps:

  1. Create a separate database for stats_number_of_vehicles_parked
  2. Set up compression
  3. Create an API endpoint for all kinds of data we want to request
  4. Create the user interface for detailed microhub analysis
