
How to integrate a timeseries database into the data model? #227

Open
rdmolony opened this issue Jul 6, 2023 · 5 comments
Labels
tool This is a new tool that could be worked on

Comments

rdmolony commented Jul 6, 2023

I mentioned TimescaleDB & discussed its pros/cons from an implementation perspective in #222 (comment) which was a little off-topic so I'm moving it here for further discussion

rdmolony (Author) commented Jul 6, 2023

@stephenholleran mentioned -

> based on the fact that Windographer will very soon be able to import the data model, why try and prepare the data i.e. calibrate it, before loading it into Windographer? In theory, Windographer should be able to apply the calibrations for you as the adjustments required should all be contained in the meta data.

This could be a good option. I've been holding out on this a while now.

A "platform" could ...

  1. Link each station to its raw files of generic readings
  2. Export the data model

End-users would have to open all raw data files in Windographer & then import the data model


On the flip side, calibrating & flagging readings in a "platform" enables it to be more agnostic about tooling. In our case we're interested in tracking data availability, which requires the calibrations & flags to have already been applied.
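For context, by "data availability" I mean the fraction of expected records that are actually present (after flagging) over a period. A minimal Python sketch, with hypothetical names and a 10-minute logging interval assumed:

```python
from datetime import datetime, timedelta

def availability(timestamps, start, end, interval=timedelta(minutes=10)):
    """Fraction of expected records actually present between start and end."""
    expected = int((end - start) / interval)
    present = sum(1 for t in timestamps if start <= t < end)
    return present / expected

start = datetime(2023, 11, 27)
end = start + timedelta(hours=1)  # 6 expected 10-minute records
recorded = [start + timedelta(minutes=10 * i) for i in range(5)]  # one record missing
print(availability(recorded, start, end))  # 5 of 6 expected records present
```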

abohara (Collaborator) commented Jul 7, 2023

Thanks @rdmolony. TimescaleDB sounds like it would be a good option for storing timeseries. Though recommending a db is not part of the core data model, we do have a postgres example like you noted. Would you be open to editing this example to include a TimescaleDB hypertable? Perhaps it can evolve into a blog on how the metadata could be joined with the timeseries.

rdmolony (Author) commented Jul 7, 2023

No problem @abohara, I could do that, or add a separate SQL file to tools/ with a few additional CREATE statements without much hassle. I'm not fussed.

The TimescaleDB team invited me to collaborate on a blog post on this experience to date, so I do intend to document all of this a bit more in the coming weeks/months.

abohara (Collaborator) commented Jul 14, 2023

Thanks @rdmolony . That would be great.

@stephenholleran stephenholleran added the tool This is a new tool that could be worked on label Aug 3, 2023
rdmolony (Author) commented Dec 11, 2023

Hi all,

I finally got around to writing up my experience of integrating a Postgres/TimescaleDB database into the Mainstream Renewable Power (MRP) data flow -

Struggling to sync sensors & databases | rowanmolony.com

My experience

I wanted to gain some experience running it smoothly in production (the switch was made in September) to be sure it actually worked before posting anything.

In the MRP context, Postgres/TimescaleDB was a good fit since it enabled reducing the complexity of the data flow system to make it more manageable by a small team of 1-3 people.

How?

It allowed us to use the web application database (i.e. the sensor metadata database) for storing & processing time series data rather than multiple tools. Two databases became one database¹ & pandas was ditched for processing timeseries². Adding data flow unit tests (code that checks other code) became possible since this system became far easier to test.

TimescaleDB is an extension to the Postgres database that enables working with timeseries readings from within the database. It provides a special table called a Hypertable which speeds up insert & query performance for timeseries tables³.


How does this relate to the IEA Task 43 data model?

I’m hesitant to recommend any next steps here since I have just left my job at Mainstream Renewable Power and don’t intend to adapt this data model for Postgres/TimescaleDB. However, I can make a few comments in case someone finds them interesting.

Postgres/TimescaleDB probably only makes sense as an “extension” of this data model since it couples this project with a specific technology.

How could it extend the data model?

It enables storing readings in a database table just like the metadata.

One could create one table per data source like ...

| timestamp | Wind Speed 10m 180deg | Wind Direction 10m 180deg |
| --- | --- | --- |
| 2023-11-27 00:00:00 | 5 | 60 |
| ... | ... | ... |

| timestamp | WS10.180 | WD10.180 |
| --- | --- | --- |
| 2023-11-27 00:00:00 | 4 | 70 |
| ... | ... | ... |

... or a single “long-format” table like …

| timestamp | mast_id | sensor_id | value |
| --- | --- | --- | --- |
| 2023-11-27 00:00:00 | 1 | 1 | 5 |
| 2023-11-27 00:00:00 | 1 | 2 | 60 |
| 2023-11-27 00:00:00 | 1 | 3 | 4 |
| 2023-11-27 00:00:00 | 1 | 4 | 70 |
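The wide rows above can be unpivoted into the long format mechanically. A minimal Python sketch, with the column-to-`sensor_id` mapping assumed for illustration:

```python
# Assumed mapping from each wide column to the sensor_id it represents.
SENSOR_IDS = {"Wind Speed 10m 180deg": 1, "Wind Direction 10m 180deg": 2}

def to_long(mast_id, wide_row):
    """Unpivot one wide row into (timestamp, mast_id, sensor_id, value) rows."""
    ts = wide_row["timestamp"]
    return [
        (ts, mast_id, SENSOR_IDS[col], value)
        for col, value in wide_row.items()
        if col != "timestamp"
    ]

row = {
    "timestamp": "2023-11-27 00:00:00",
    "Wind Speed 10m 180deg": 5,
    "Wind Direction 10m 180deg": 60,
}
print(to_long(1, row))
# [('2023-11-27 00:00:00', 1, 1, 5), ('2023-11-27 00:00:00', 1, 2, 60)]
```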

I opted for the "long-format" table since it is easily "mergeable" with metadata like …

| mast_id | sensor_id | sensor_name | data_type | height | magnetic_orientation |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | WS10.90 | Wind Speed | 10 | 90 |
| 1 | 2 | WS20.90 | Wind Speed | 20 | 90 |
| 1 | 3 | WS10.180 | Wind Speed | 10 | 180 |
| 1 | 4 | WS20.180 | Wind Speed | 20 | 180 |

… or flags of erroneous readings like …

| sensor_id | timestamp | flag_type |
| --- | --- | --- |
| 1 | 2023-11-27 00:00:00 | Fault |

… in SQL

     select *
       from sensor_reading_generic as reading
inner join sensor_metadata as meta
           on meta.sensor_id = reading.sensor_id
      where reading.mast_id = <ID>
        and timestamp between '2023-12-01' and now()
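The shape of that join can be exercised end-to-end with SQLite from the Python standard library (table & column names as above; the data rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sensor_metadata (
        mast_id integer, sensor_id integer, sensor_name text,
        data_type text, height real, magnetic_orientation real
    );
    create table sensor_reading_generic (
        timestamp text, mast_id integer, sensor_id integer, generic_reading real
    );
    insert into sensor_metadata values (1, 1, 'WS10.180', 'Wind Speed', 10, 180);
    insert into sensor_reading_generic values ('2023-11-27 00:00:00', 1, 1, 5.0);
""")
rows = conn.execute("""
    select reading.timestamp, meta.sensor_name, reading.generic_reading
      from sensor_reading_generic as reading
inner join sensor_metadata as meta
        on meta.sensor_id = reading.sensor_id
       and meta.mast_id = reading.mast_id
     where reading.mast_id = 1
""").fetchall()
print(rows)  # each reading now carries its sensor_name from the metadata
```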

Once the TimescaleDB extension has been installed in the database, one can create this “readings” table like …

      create table sensor_reading_generic (
          timestamp timestamp not null,
          mast_id integer not null,
          file_id integer not null,
          sensor_id integer not null,
          generic_reading float,
          primary key (timestamp, mast_id, sensor_id)
      );
      select create_hypertable('sensor_reading_generic', 'timestamp');
      alter table sensor_reading_generic set (
          timescaledb.compress,
          timescaledb.compress_orderby = 'timestamp DESC',
          timescaledb.compress_segmentby = 'mast_id, file_id, sensor_id'
      );
      select add_compression_policy('sensor_reading_generic', interval '1 week');

mast_id, file_id & sensor_id are foreign keys to other tables.

I’m not that familiar with this data model.

So I built it locally by …⁴

From this model, I can see that one could use -

  • column_name to link sensor readings to metadata
  • measurement_location to link to mast name

I don’t see -

  • A model for logger files?
  • Or erroneous sensor readings?

To make the above joins easier in the context of the MRP data model, I used views to merge the multiple tables of interest, so that things like mast_name & sensor_name are all accessible from a single "table-of-tables" which I can then merge with my readings.
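Such a view might look like the following (sketched with SQLite for portability; the table and column names here are hypothetical, not from the MRP schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mast (mast_id integer primary key, mast_name text);
    create table sensor (sensor_id integer primary key, mast_id integer, sensor_name text);
    insert into mast values (1, 'Mast A');
    insert into sensor values (1, 1, 'WS10.180');
    -- the "table-of-tables": one view exposing mast_name & sensor_name together
    create view sensor_with_mast as
        select sensor.sensor_id, sensor.sensor_name, mast.mast_name
          from sensor
    inner join mast on mast.mast_id = sensor.mast_id;
""")
print(conn.execute("select * from sensor_with_mast").fetchall())
```

Readings then only ever need to be joined against the one view rather than against each metadata table in turn.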

Let me know if you have any questions, or if you'd like to discuss in more detail over a virtual coffee.

Footnotes

  1. Previously one for timeseries, and one for everything else

  2. Linking timeseries readings to sensor metadata for re-calibrating & flagging erroneous readings

  3. As of 2023-11-20 TimescaleDB by default chunks readings in compressed hypertables into intervals of one week so Postgres doesn't need to read all rows for queries that only care about a particular period of time, see Hypertables

  4. I used nix -

    nix profile install --impure --expr 'with import <nixpkgs> {}; pkgs.postgresql.withPackages   (p: [ p.timescaledb ])'
    

    This works on Linux/OSX or on Windows via WSL
