
How to extend the data model to allow self-describing data files? #222

Open
AndyClifton opened this issue May 12, 2023 · 16 comments

Comments

@AndyClifton

Hi folks,

We're looking at using the data model as the basis for a single file that contains both metadata and measurements from a met mast or wind lidar.

The reason for this is to keep measurements together with metadata and so allow continuity when data are transferred from person to person, from process to process, or between organisations. In our experience, any approach that separates metadata from data runs the risk of losing the link. Our goal is to be able to create a single file that contains the data the user/process requires as an export from a file collection, database, web app or whatever.

As we understand it, currently the data model does not have a way to store time series data. (Thanks @stephenholleran for the patience / confirmation)

🤔 How would you extend / modify the data model to include time series data?

@stephenholleran
Collaborator

Thanks @AndyClifton for posing the question.

I understand the benefits of a self-describing file, so I am interested in exploring this.

I suppose I have two thoughts so far:

  1. JSON isn't the most efficient format for capturing time series data. It can be done, but it takes up space. Do you have any schema suggestion that could robustly handle table data and yet be space efficient? (A rough illustration of the space trade-off is below.)
  2. How do we link the table data, more specifically the column headings, back to the data model? If there are no configuration changes in the logger/lidar setup then the raw data files' column headings could be used, but if there are configuration changes then the measurement point names should be used. I would assume that the self-describing data file produced would be the assembled data from all the daily files and so would include configuration changes; therefore I would use the measurement point names.
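
For illustration only (not a proposal), the kind of difference I mean is between row-oriented and column-oriented JSON for the same table; the column names and values below are made up, not taken from the data model:

```python
import json

# Purely illustrative: the same three 10-minute records, once as
# row-oriented ("records") JSON and once as column-oriented JSON.
records = [
    {"timestamp": "2023-05-01T00:00:00", "WS_80m_avg": 7.1, "WS_80m_sd": 0.9},
    {"timestamp": "2023-05-01T00:10:00", "WS_80m_avg": 7.4, "WS_80m_sd": 1.0},
    {"timestamp": "2023-05-01T00:20:00", "WS_80m_avg": 6.8, "WS_80m_sd": 0.8},
]
columns = {
    "timestamp": ["2023-05-01T00:00:00", "2023-05-01T00:10:00", "2023-05-01T00:20:00"],
    "WS_80m_avg": [7.1, 7.4, 6.8],
    "WS_80m_sd": [0.9, 1.0, 0.8],
}

# Row-oriented JSON repeats every key on every row; column-oriented JSON
# states each key once, so it grows more slowly as rows are added.
print(len(json.dumps(records)), len(json.dumps(columns)))
```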

Cheers,

@AndyClifton
Author

Re your questions...

  1. We're not too worried about the space required, especially if the file is only created when required. Or are we missing something here?
  2. Agreed, the output data should account for configuration changes. So yes, it would make sense to use the measurement point names.

@stephenholleran
Collaborator

Sorry, what I mean when talking about space is not actual storage but transmitting the data, e.g. via a web API. Ideally you'd want it to be compact so it can be transferred quickly.

@abohara
Collaborator

abohara commented May 15, 2023

Formally, I don't believe we have any sort of recommendation on how to combine it into a single file. Prefixing the JSON header onto the CSV seems like one option, but parsing it would require a slightly more complicated parser than reading a CSV or JSON individually. This does seem to be the format many OEMs are using as well.

Regarding how the data model (in JSON format) ties into the table of measurements (e.g. in CSV format), the natural way to link them is through the data columns table (which also specifies the stat), as opposed to the measurement name, which does not specify the stat.

@AndyClifton
Author

AndyClifton commented May 22, 2023

@stephenholleran :

Sorry, what I mean when talking about space is not actual storage but transmitting the data, e.g. via a web API. Ideally you'd want it to be compact so it can be transferred quickly.

Understood. I think this depends on the use case. We only anticipate doing this bundling of data and metadata for specific purposes, and the main one is a user requesting an export of their data so they can use it in another application. Then it would be fine if a call was made and the user informed later that the data was ready for download. So, I don't think that speed is too much of a challenge. But I guess the problem could be that if the API call is fast enough, it would be tempting to use it more frequently.

Because this seems like a UI/UX issue, I would say it's worth setting out how the data model could include data, and worrying about the implementation challenges separately?

@AndyClifton
Author

@abohara: thanks for this.

Formally, I don't believe we have any sort of recommendation on how to combine it into a single file. Prefixing the JSON header onto the CSV seems like one option, but parsing it would require a slightly more complicated parser than reading a CSV or JSON individually. This does seem to be the format many OEMs are using as well.

Regarding how the data model (in JSON format) ties into the table of measurements (e.g. in CSV format), the natural way to link them is through the data columns table (which also specifies the stat), as opposed to the measurement name, which does not specify the stat.

Do you have an example of what this could look like?

@kersting
Collaborator

@AndyClifton I know you're talking about lidars, but the Kintech Orbit 360 logger for met masts is a good example of metadata as JSON in the header and time series in the file's body. Still, such files need to be parsed into the IEA Task 43 format.

@stephenholleran
Collaborator

stephenholleran commented Jun 1, 2023

Sorry for the delay, I was at WESC last week.

From @abohara:

Regarding how the data model (in JSON format) ties into the table of measurements (e.g. in CSV format), the natural way to link them is through the data columns table (which also specifies the stat), as opposed to the measurement name, which does not specify the stat.

This can work; however, it just makes working with the data a little trickier if there was a certain type of config change, for example an 80 m anemometer being moved to 100 m while still connected to Ch1. The Ch1 column in the data now means two different things (a wind speed at 80 m for the first part and then a wind speed at 100 m for the second part). The data model captures the config change, so all the info is there to break up that column and rename it something else. What we do is assemble the data taking these config changes into account and then use the measurement point name as the column heading. To get around the lack of a stat we just append the stat enum onto the end. I know it's not a great solution, but it works.

So there are two methods by which you can link the metadata to the timeseries data; the difference is just when the assembling of the data is done.
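
To make that concrete, a rough sketch of the assembled-column naming is below; the measurement point names, the stat enum values and the underscore separator are just examples, not something prescribed by the data model:

```python
# Sketch only: build an assembled column heading from a measurement point
# name plus a statistic enum (names and separator are illustrative).
def assembled_column_name(measurement_point_name: str, stat: str) -> str:
    return f"{measurement_point_name}_{stat}"

# The 80 m anemometer that moved to 100 m while staying on Ch1 then becomes
# two unambiguous columns rather than one ambiguous "Ch1" column:
print(assembled_column_name("WS_80m_NW", "avg"))   # -> WS_80m_NW_avg
print(assembled_column_name("WS_100m_NW", "avg"))  # -> WS_100m_NW_avg
```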


From @AndyClifton :

But I guess the problem could be that if the API call is fast enough, it would be tempting to use it more frequently.

Because this seems like a UI/UX issue, I would say it's worth setting out how the data model could include data, and worrying about the implementation challenges separately?

People are impatient, @AndyClifton; they'll want the data immediately 😄. There are several ways to capture tabular data in JSON Schema, and we should consider data transfer speeds when picking one.

Of the possible JSON formats for storing table data, I would probably lean towards one of the formats that pandas outputs (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html). I know we have used one of them in the past but I can't remember which. This would make it easy for the user to load the JSON back into a pandas DataFrame or something similar.
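
As a rough sketch of what that could look like (the "split" orient below is just one of the layouts pandas supports; I'm not claiming it's the one we used before):

```python
import io

import pandas as pd

# Illustrative 10-minute table keyed by assembled column names.
df = pd.DataFrame(
    {
        "WS_80m_NW_avg": [7.1, 7.4, 6.8],
        "WS_80m_NW_sd": [0.9, 1.0, 0.8],
    },
    index=pd.to_datetime(["2023-05-01 00:00", "2023-05-01 00:10", "2023-05-01 00:20"]),
)

# The "split" orient keeps index, columns and data separate, which is
# fairly compact and round-trips cleanly.
payload = df.to_json(orient="split", date_format="iso")

# ...and back into a DataFrame on the receiving side.
df_again = pd.read_json(io.StringIO(payload), orient="split")
print(df_again)
```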

@oriollacave

Filename and checksum would be possible options for keeping the original format while adding metadata. We've done that and it feels comfortable and keeps backward compatibility.

However, if the data is updated then the link gets broken and the checksum needs to be updated.

@rdmolony

rdmolony commented Jun 29, 2023

Hey there,

If you haven't encountered it yet, TimescaleDB might be useful as you have already linked a PostgreSQL schema.

It's an extension to PostgreSQL that enables storing timeseries data in your Postgres database alongside your relational database metadata in a special table format called a Hypertable.

We're currently migrating to it at Mainstream Renewable Power from our legacy multiple-database system (one for timeseries & one for metadata). We're using Django, Django Rest Framework & psycopg2 as glue for combining timeseries & metadata for exporting to various formats.
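
For anyone curious, a minimal sketch of what that looks like on the Postgres side, driven from Python with psycopg2; the table, columns and connection string are illustrative (not from the data model), and it assumes the timescaledb extension is already installed in the database:

```python
import psycopg2

# Illustrative connection string and schema.
conn = psycopg2.connect("dbname=resource_db user=postgres")
with conn, conn.cursor() as cur:
    # An ordinary Postgres table holding long-format readings...
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS reading (
            time      TIMESTAMPTZ      NOT NULL,
            sensor_id INTEGER          NOT NULL,
            value     DOUBLE PRECISION
        );
        """
    )
    # ...converted into a TimescaleDB hypertable, chunked on the time column,
    # so the timeseries lives alongside the relational metadata tables.
    cur.execute("SELECT create_hypertable('reading', 'time', if_not_exists => TRUE);")
```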

@stephenholleran
Collaborator

Thanks @rdmolony,

Yeah, that could be a useful extension of the SQL create table statements.

Just a note of caution with timeseries databases. They are mostly geared towards retrieving the most recent time-based data, not what wind analysts want, which is the whole dataset in one go. They shard the tables based on time, keeping the most recent data in memory for fast retrieval, which is great if you want the last 5 days for data monitoring but maybe not so fast for 2 years of data. It all depends on what hardware you can throw at it (i.e. cost) and how you configure it, and as you say TimescaleDB does give you extra benefits by being integrated into Postgres.

Thanks also for contributing to yesterday's workshop, and it is great to hear that Mainstream are further aligning with the data model.

@rdmolony

rdmolony commented Jul 5, 2023

There's no such thing as a free lunch! As always, it depends on your use case!

If you're interested in our specific quirks so far with this tech, some notes -

  • All of the data is in one place; sensor metadata & timeseries

Our goal in migrating was to simplify our systems. At the moment we are still storing our timeseries readings in Microsoft SQL Server & our metadata in PostgreSQL & combining the two in pandas. It's messy.

  • Querying 10y of data is fast; once the data is stored in a compressed hypertable.

Rather than storing row-by-row as is typical in a relational database, these tables store data in chunks & so are optimised for bulk reads. However, querying is not as fast as an in-memory DataFrame library like pandas because it doesn't load the entire query result into memory. For the same reason it does scale to very large datasets.

For 11 years of 50 sensor readings every 10 s (on our self-hosted Windows machine with 32 GB RAM and a 2.7 GHz Intel Xeon Platinum 8168 CPU), a simple select takes 24 s & calibrating + selecting takes 1 m 10 s.

  • Ingesting data into compressed hypertables is slow for "old" data; depending on if you've configured compression for all data older than the last month or week or day

Compressed hypertables assume that you're only ingesting recent data (the last month or week or day). To ingest an "older" data source you have to decompress chunks before insertion, which can be really slow when inserting, say, 10 years of data!

  • Calibrating data stored in compressed hypertables is slow; this involves reading millions of rows, linking each reading to our sensor metadata & recalibrating each reading using it.

Ideally I'd like to cache calibrated readings in another compressed hypertable; however, it's slow to update compressed hypertables, so this isn't useful yet.

  • Pivoting from long to wide format for Windographer exports is surprisingly hard in Postgres/Python

There is the built-in function crosstab; however, this is very slow when exporting calibrated readings on top of the existing computation. As an alternative we can also stream over the query result in Python & pivot chunk by chunk (a rough sketch is below) - a bit messy ...

In our case this compressed hypertable ingest constraint is "okay" for ingesting our logger data but not so convenient for 3rd party data sources. To work around the query slowness on calibrated readings we run periodic jobs to dump calibrated readings to files.
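
For what it's worth, the chunk-by-chunk pivot looks roughly like the sketch below (not our production code; the table, columns and connection URL are illustrative):

```python
import pandas as pd
import sqlalchemy

# Sketch only: stream long-format readings out of Postgres in chunks and
# pivot each chunk to wide format (one column per sensor) instead of crosstab.
engine = sqlalchemy.create_engine("postgresql+psycopg2://user:pass@host/resource_db")

chunks = pd.read_sql(
    "SELECT time, sensor_id, value FROM reading ORDER BY time",
    engine,
    chunksize=500_000,
)

wide_parts = [
    chunk.pivot_table(index="time", columns="sensor_id", values="value")
    for chunk in chunks
]

# A timestamp can straddle a chunk boundary, so merge any split rows.
wide = pd.concat(wide_parts).groupby(level=0).first().sort_index()
wide.to_csv("windographer_export.csv")
```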

EDIT 1: I looped in the TimescaleDB team for comment -

(@jonatas) Very interesting challenge! Thanks for mentioning TimescaleDB. Also, I see great potential for continuous aggregates to avoid touching old data at all. If data is computed to be accessed, a cagg can probably replace the raw data. Another advantage is that parallelization will always speed up the queries compared to regular PostgreSQL tables. Loving to learn from the thread.

EDIT 2: Calibrating generic readings approximately doubles the query time, as Postgres has to make two round trips for a SELECT *: one to fetch the generic readings & another to recalibrate these readings as (readings.generic_reading - meta.logger_offset) * (meta.certificate_slope / meta.logger_slope) + meta.certificate_offset (see the TimescaleDB Slack discussion above for more info).
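
Spelling that expression out as plain Python for clarity (field names as in the expression above; purely illustrative):

```python
# The same calibration expression as a plain Python function; "meta" holds
# the sensor's calibration metadata (field names are illustrative).
def calibrate(generic_reading: float, meta: dict) -> float:
    return (
        (generic_reading - meta["logger_offset"])
        * (meta["certificate_slope"] / meta["logger_slope"])
        + meta["certificate_offset"]
    )

print(calibrate(7.1, {
    "logger_offset": 0.2,
    "logger_slope": 0.05,
    "certificate_slope": 0.0498,
    "certificate_offset": 0.21,
}))
```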

@stephenholleran
Collaborator

stephenholleran commented Jul 5, 2023

Thanks @rdmolony, brilliant info there on how you are solving this.

One comment: given that Windographer will very soon be able to import the data model, why try to prepare the data, i.e. calibrate it, before loading it into Windographer? In theory, Windographer should be able to apply the calibrations for you, as the adjustments required should all be contained in the metadata. This could potentially save you the headache of trying to provide calibrated data. Just a thought, and we are digressing from the opening question about how the data model could accommodate the timeseries data. :) Sorry everyone else!

EDIT: This TimescaleDB discussion has now moved to #227.

@stephenholleran
Collaborator

Also, @rdmolony, if you are interested in joining our Task 43 wind resource data model working group we would love to have you. Experience from someone who is implementing the data model would be very useful, and as you are a software engineer your perspective on changes going forward could be valuable. We have a 1-hour call every second Thursday at 4 pm Irish time. There will be no homework if you don't want it; it is mostly turning up and contributing within that 1 hour.

@rdmolony

rdmolony commented Jul 6, 2023

Moving to #227 @stephenholleran!

@rdmolony

rdmolony commented Jul 6, 2023

Sure, add me to the email thread or however these things are organised :)
