feat: Add timeseries purge script #899
Conversation
When operating a server, the `Stage_timeseries` database can become quite big. In the case where only the `Stage_analysis_timeseries` is actually useful after the pipeline execution, the user's timeseries can be deleted to speed up the pipeline and gain some disk space.
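The deletion criterion discussed later in this thread (delete entries whose `metadata.write_ts` predates the last pipeline run) can be sketched without a live database. Plain dicts stand in for MongoDB documents here; the actual script would presumably issue the equivalent `delete_many` against `Stage_timeseries`.

```python
# Hypothetical sketch of the purge selection logic; plain dicts stand in
# for MongoDB documents so the logic can run stand-alone.

def purgeable(entries, last_ts_run):
    """Return entries whose metadata.write_ts is before the last pipeline run."""
    return [e for e in entries if e["metadata"]["write_ts"] < last_ts_run]

# Example: two entries written before the last run at T1=1000, one after.
entries = [
    {"_id": 1, "metadata": {"write_ts": 500}},
    {"_id": 2, "metadata": {"write_ts": 900}},
    {"_id": 3, "metadata": {"write_ts": 1500}},
]
to_delete = purgeable(entries, last_ts_run=1000)
print([e["_id"] for e in to_delete])  # -> [1, 2]
```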
In the OpenPATH data model, the timeseries, which consists of user inputs, cannot be recreated (since we can't go back in time), so it is extremely precious and immutable. So while running this script is clearly optional, the script itself is inherently contrary to the spirit of the project, and I am very reluctant to merge it. Regardless, for your better understanding of the pipeline:
Although I don't think we should delete the timeseries, I would support moving them to a separate long-term/archival storage after a while. This would still allow pipeline resets as long as issues were detected quickly (before the timeseries data is archived), and could also allow for improved algorithms by copying the data back from the archival storage.
You might want to see the related
Hey, here's some context related to this PR. The way we find use-cases in France is always linked to experimental projects, financed by local authorities. The caveat that comes with this kind of project is that all gathered data has a limited lifetime: at the end of the program, all user data has to be deleted. Given the duration of these projects (around a year, usually), the likelihood that we rerun timeseries analysis or that new use cases appear is very low. In practice, it has never happened. I'm wondering if we're an isolated case, or if other community members have this same requirement of limited data lifetime. If we're alone, I agree that this PR is of no use for the central repo; if not, maybe it should be part of the available toolkit.

Another element of context that is quite important is that this is the first step in our goal, with Cozy Cloud, of optional decentralization of openPATH. The goal of this PR is to provide a way of removing old data from the pipeline process. It doesn't exclude the idea of moving such data to long-term storage before that, although this would require another process not included in this PR. In a decentralized approach, such data would be stored in the personal cloud storage of each user, and users would be free to decide to save or delete it.

In the case where we're an isolated case, do you think this PR would be useful if it included some kind of long-term storage before deletion? Perhaps by exporting the data to files (targz'ed CSVs?) to begin with?
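The "targz'ed CSVs" idea above can be sketched with the Python standard library alone. This is a hypothetical helper, not code from this PR: it dumps entry dicts to a CSV and wraps that in a `.tar.gz` before any deletion would happen.

```python
import csv
import io
import os
import tarfile
import tempfile
import time

def archive_entries(entries, fieldnames, out_path):
    """Dump entries to a CSV and pack it into a .tar.gz (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(entries)
    data = buf.getvalue().encode("utf-8")
    with tarfile.open(out_path, "w:gz") as tar:
        info = tarfile.TarInfo(name="timeseries.csv")
        info.size = len(data)
        info.mtime = int(time.time())
        tar.addfile(info, io.BytesIO(data))

# Example: archive two entries, then read them back to check the round trip.
path = os.path.join(tempfile.mkdtemp(), "timeseries.tar.gz")
archive_entries(
    [{"user_id": "u1", "write_ts": 500}, {"user_id": "u1", "write_ts": 900}],
    ["user_id", "write_ts"],
    path,
)
with tarfile.open(path) as tar:
    rows = list(csv.DictReader(io.TextIOWrapper(tar.extractfile("timeseries.csv"))))
print(len(rows))  # -> 2
```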
Hello @shankari, In our decentralized vision, the data should only remain in each Cozy instance, one per user. In the long-term vision, we would strive for a decentralized openPATH, but for now, we would run a centralized server to compute the trips, and each instance will periodically request the
In our personal Cloud use-case, we do not need to keep the data in the centralized server, as long as we do not work on the pipeline execution itself, improve the algorithms, and so on. This is definitely not to be excluded in the future, but for now our short-term goal is to be able to operate a scalable openPATH server for thousands of users. So maybe something that would make more sense for both of us would be a script that purges all the data except for the user profile + usercache? In this perspective, we would just need a way to store a "retrieval date" for each user, set once the data has been imported into a Cozy instance, so that we know when it is safe to start the purge.
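The "retrieval date" gating proposed here could look like the following sketch. The `last_retrieved_ts` profile field and both helper functions are hypothetical illustrations, not existing e-mission APIs:

```python
def safe_purge_cutoff(profile):
    """Timestamp before which this user's data may be purged, or None.

    last_retrieved_ts is a hypothetical profile field that an importer
    (e.g. a Cozy connector) would set after successfully copying the data out.
    """
    return profile.get("last_retrieved_ts")

def entries_to_purge(profile, entries):
    """Select entries written before the retrieval date; purge nothing otherwise."""
    cutoff = safe_purge_cutoff(profile)
    if cutoff is None:  # data never retrieved: purging is not safe
        return []
    return [e for e in entries if e["metadata"]["write_ts"] < cutoff]

entries = [
    {"_id": 1, "metadata": {"write_ts": 500}},
    {"_id": 2, "metadata": {"write_ts": 1500}},
]
print([e["_id"] for e in entries_to_purge({"last_retrieved_ts": 1000}, entries)])  # -> [1]
print(entries_to_purge({}, entries))  # -> []
```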
I didn't see @TTalex's comment before posting, but it is actually quite complementary :)
@asiripanich @lgharib @ericafenyo do you have a sense of whether your use cases require/prefer retention of raw timeseries data? Please see the discussion above for additional context.

@paultranvan we have a monthly meeting scheduled at 9pm CET on the last Monday of each month, focused on technical coordination. Would you like to attend? If we don't hear back from the others by then, we can bring it up as a topic of discussion.

To summarize, I believe that the two options that have been proposed are:
@TTalex @paultranvan a couple of follow-ups to your insightful comments:

**The NREL hosted version of the OpenPATH server is already decentralized**

Each partner gets their own database, their own server instance, and their own public and admin dashboards. We support separate configs for each partner that are used to customize the platform to meet their needs. As a concrete example, consider the UPRM CIVIC vs. MassCEC deployments.

It is true that it is not decentralized at the level of individual users, but it is the same concept, only at a much more massive scale - e.g. see my recent issue e-mission/e-mission-docs#852

**NREL OpenPATH already supports anonymized trip usage analysis**

This is essentially the public dashboard, which aggregates trip information across multiple users. Note that you cannot anonymize fine-grained spatio-temporal information, particularly if it is linked to a particular user. We have monthly charts, and will likely add aggregate spatial analysis similar to the graphs below soon.

[Figure: scatter plot of start/end points (Blue: all trips, Green: e-bike trips, Red: car-like trips)]
[Figure: trajectories for e-bike trips ONLY at various zoom levels]

At this point, for the public dashboard, we plan to stay with coarse-grained static images to avoid issues with repeated queries. For the longer term, we plan to automate analyses as much as possible in a "code against data" scenario, so that we can understand the high-level impact without needing access to individual spatio-temporal data. Note that we also have a more sophisticated prototype that supports arbitrary code execution against user data (https://github.com/e-mission/e-mission-upc-aggregator) but it is not in our roadmap for now because:

Our plan is to write the public and admin dashboards, figure out how they are being used, and then figure out how to tighten down user spatio-temporal access to support those use cases. @paultranvan what did you have in mind for "distributed collaboration protocols to collectively learn about trips usages, while enforcing anonymity"?
Also, @paultranvan @TTalex, wrt
we are going to remove the

From my post-meeting notes:
If you want to copy data over periodically in a principled fashion, you probably want to look at
Thanks for the detailed replies, as always 🙂 !
Thank you for all the insights.
I was thinking about the work we started with a PhD student: https://www.theses.fr/2019SACLV067
That's very interesting and might eventually be useful for us in the future!
Well, I'm unsure as well :)
😄
Yes,
Yes, it is a call to

Having said all that, we are now (as part of the trip and place additions functionality for @asiripanich and @AlirezaRa94, and the upcoming diary and label unification) planning to precompute

We currently don't envision the composite trips being in geojson format, since we do have to implement trip2geojson on the phone anyway, but we could be persuaded to change that.

Again, if you are actively working on OpenPATH, I would strongly encourage you to attend the monthly developer meetings so that we can discuss and finalize high-level changes as a community. Please let me know if I should send you an invite.
For my studies, I use the
Ok, that's definitely something that would be interesting for us, particularly the GeoJSON format! This is a requirement for us, as we are using this format in our CoachCO2 app, with data imported from the server through the dedicated connector.
Unfortunately, I'll be off for a few weeks... Let's talk about this when I come back 🙏
Even if we return in a non-geojson format (which is what we are leaning towards now), you should be able to convert to geojson in your connector code. We will convert to geojson on the phone (display layer) before display. There is a tradeoff between the amount of information exchanged and the standardization; we can discuss the design further here.
We meet every month, so maybe you could attend the one in March? Let me know.
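A connector-side conversion along these lines is simple to sketch if the server returns trips with start/end coordinates. The field names below (`start_loc`, `end_loc`, `start_ts`, `end_ts`) are illustrative assumptions, not the actual OpenPATH response schema:

```python
def trip_to_geojson(trip):
    """Convert a trip dict into a GeoJSON Feature.

    Input field names are illustrative, not the real server schema.
    """
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            # GeoJSON coordinates are [longitude, latitude]
            "coordinates": [trip["start_loc"], trip["end_loc"]],
        },
        "properties": {"start_ts": trip["start_ts"], "end_ts": trip["end_ts"]},
    }

feature = trip_to_geojson(
    {"start_loc": [2.35, 48.85], "end_loc": [2.29, 48.86], "start_ts": 0, "end_ts": 600}
)
print(feature["geometry"]["type"])  # -> LineString
```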
When operating a server with a lot of users, the `Stage_timeseries` database can quickly become quite big. In the case where only the `Stage_analysis_timeseries` is actually useful after the pipeline execution, the user's timeseries can be deleted to speed up the pipeline and gain some disk space.

@shankari I have concerns about the handling of dates. As far as I understand, `metadata.write_ts` is written by the device for the docs coming from ios/android. If a user edits a trip, the date will change, right?

Now let's assume the following events:

- an edited trip has `metadata.write_ts = T0`
- the pipeline last ran at `last_ts_run = T1`
- the purge deletes docs with `metadata.write_ts < T1`

But in this scenario, the edited trip with T0 < T1 is not deleted, because it is in the usercache collection, right? It will be moved into the timeseries collection only at the first step of the pipeline. And if a problem occurs during the pipeline execution, the `last_ts_run` date won't be updated for the `CREATE_CONFIRMED_OBJECTS` step.

Hope I understood correctly, please correct me if I'm not!
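One conservative way to handle the failed-pipeline concern above (a sketch, not the PR's actual logic): compute the purge cutoff as the minimum last-run timestamp across all pipeline stages, so a stage whose timestamp was never updated automatically holds the purge back. The record shape below is a simplified stand-in for the server's pipeline-state documents.

```python
def purge_cutoff(pipeline_states):
    """Minimum last_ts_run over all stages; a stage that never ran blocks purging.

    pipeline_states: list of {"stage": str, "last_ts_run": float or None},
    a simplified stand-in for the server's pipeline-state records.
    """
    runs = [s["last_ts_run"] for s in pipeline_states]
    if not runs or any(r is None for r in runs):
        return None  # some stage never completed: do not purge anything
    return min(runs)

# Example: the confirmed-objects stage lags behind, so it bounds the cutoff.
states = [
    {"stage": "USERCACHE", "last_ts_run": 1200},
    {"stage": "CREATE_CONFIRMED_OBJECTS", "last_ts_run": 1000},
]
print(purge_cutoff(states))  # -> 1000
```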