Create database tables for teams, datasets, and account aliases #23
Conversation
Force-pushed from 29cace8 to 38f78ef
One thing that jumps out here is how we handle scrambling. We had talked some time ago about the complexities of aliasing/scrambling, and possibly different needs for database-dump scrambling versus online scrambling for sending user data to experiments, with the two possibilities being:

1. computing scrambled IDs on the fly wherever they're needed, with shared scrambling code; or
2. generating the alias mapping up front and storing the mapped IDs in the database.
At the risk of premature optimization, I'm concerned about the data size explosion of using (2) for dataset exports. I think (2) is definitely a good idea for online querying of experiment endpoints, because (1) requires the scrambling code to live in more places, but for dataset export should we consider (1), since export should be relatively self-contained? Raising this now because if we depend on storing the mapped IDs in the database from the beginning, we at least need to keep those historical logs for as long as we want to be able to regenerate or re-identify those data sets.
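For concreteness, option (1) could be as simple as a keyed hash, so any service holding the key can recompute an alias without a mapping table. A minimal sketch, assuming a shared secret (the key handling and function name here are illustrative, not part of this PR):

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a secrets manager,
# likely scoped per export or per experiment.
SCRAMBLE_KEY = b"per-export-secret"

def scramble_account_id(account_id: str, key: bytes = SCRAMBLE_KEY) -> str:
    """Deterministically map an account ID to an opaque alias.

    Option (1): nothing is stored; re-identification depends on keeping
    the key, and the scrambling code has to live wherever aliases are
    produced or checked.
    """
    return hmac.new(key, account_id.encode("utf-8"), hashlib.sha256).hexdigest()
```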
Our earlier proposed solution to the increasing size of the account alias data was to skip writing them to the database, store them in Parquet files, and only load the parts we needed (i.e. the aliases for accounts that are actually assigned to an experiment) in the rare case that we permit someone to re-use a set of aliases across a data export and a subsequent experiment. You may recall that you suggested we just store them in a table for now in order to get something working. 😄
I think the bones of a medium-to-long-term solution are there, though: for any experiment where we don't expect to re-use the aliases, we can write them to a Parquet file (for us) alongside the experiment results data export (for experimenters) and remove them from the database. If we ever do need to use them again, we can reload the mapping, or whichever parts we need.
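Roughly this, as a sketch (the file name, columns, and sample data below are placeholders, assuming pyarrow):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Illustrative data; in reality this comes from the alias-generation step.
mapping = pa.table({
    "account_id": ["a1", "a2", "a3"],  # original IDs (for us)
    "alias":      ["x9", "k4", "m7"],  # scrambled IDs (for experimenters)
})

# The Parquet file written alongside the export is the canonical copy.
pq.write_table(mapping, "account_aliases.parquet", compression="zstd")

# Rare case: reload only the aliases for accounts actually assigned
# to a subsequent experiment, instead of the whole mapping.
full = pq.read_table("account_aliases.parquet")
assigned = pa.array(["a2"])
subset = full.filter(pc.is_in(full["account_id"], value_set=assigned))
```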
I was actually thinking slightly further -- all dataset renaming tables are written to a file, and that file is the canonical version, but we cache whatever we want/need.
Should the experiment-to-team relationship be one-to-one or many-to-one (one team can have many experiments, in theory)?
Yeah, those should both be many-to-one.
I've updated the diagram to reflect many-to-one relationships in the places those were missing. I think the code already reflected that and I missed it in the picture.
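For anyone reviewing the schema, many-to-one here just means the foreign key lives on the experiment side. A hypothetical sketch in SQLAlchemy (table and column names are illustrative, not copied from the actual migration):

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Team(Base):
    __tablename__ = "teams"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    # One team can have many experiments.
    experiments = relationship("Experiment", back_populates="team")

class Experiment(Base):
    __tablename__ = "experiments"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    # Many-to-one: each experiment belongs to exactly one team.
    team_id = Column(Integer, ForeignKey("teams.id"), nullable=False)
    team = relationship("Team", back_populates="experiments")
```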
Force-pushed from 38f78ef to 451223e
Agree on the "save to Parquet" archive strategy, at least in the medium term. We can stash those in cold (or even Glacier) S3 storage, at least until such a time as there is so much activity that it becomes cost-prohibitive. Having an archive (with good compression) of each of the data sets we have generated is probably a good thing for the foreseeable future. Regarding the model, it looks sound to me except for a couple of tweaks on account aliases.
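Stashing an archive in a colder tier should be cheap to wire up; a sketch with boto3 (the bucket, key, and chosen storage class are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# zstd-compressed Parquet is already compact; a colder storage class keeps
# at-rest costs low for archives we rarely (if ever) read back.
s3.upload_file(
    "account_aliases.parquet",
    "our-experiment-archives",                 # placeholder bucket
    "experiment_123/account_aliases.parquet",  # placeholder key
    ExtraArgs={"StorageClass": "GLACIER"},     # or GLACIER_IR / DEEP_ARCHIVE
)
```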
Ah yeah, these are artifacts of me drawing the schema by hand and not translating the code faithfully into the picture. If you check the migration code, it already models these correctly.
Force-pushed from 2ec9ae9 to 060e302
There's corresponding work to set up repository classes etc. still to do, but I'm sharing this now so others can take a look at the schema and tell me if anything seems off.