Create database tables for teams, datasets, and account aliases #23

Merged
karlhigley merged 12 commits into main from karl/feature/account-aliases on Jul 30, 2024

Conversation

@karlhigley (Collaborator) commented Jul 26, 2024

There's corresponding work to set up repository classes etc. still to do, but I'm sharing this now so others can take a look at the schema and tell me if anything seems off.

[Screenshot: proposed database schema diagram (Screenshot-20240729145546-1382x799)]

@karlhigley requested a review from kluver on July 26, 2024 14:40
@karlhigley self-assigned this on Jul 26, 2024
@karlhigley force-pushed the karl/feature/account-aliases branch from 29cace8 to 38f78ef on July 26, 2024 14:44
@karlhigley requested a review from mdekstrand on July 26, 2024 14:44
@mdekstrand

One thing that jumps out here is how we handle scrambling. We had talked some time ago about the complexities of aliasing/scrambling, and the possibly different needs for database-dump scrambling versus online scrambling when sending user data to experiments, with the two possibilities being:

  1. deterministic scrambling with a key (encrypt user ids)
  2. store scrambled / reassigned user IDs

At the risk of premature optimization, I'm concerned about the data-size explosion of using (2) for dataset exports. I think (2) is definitely a good idea for online querying of experiment endpoints, because (1) requires the scrambling code to live in more places, but for dataset export should we consider (1), since an export should be relatively self-contained? Raising this now because if we depend on storing the mapped IDs in the database from the beginning, we at least need to keep those historical logs for as long as we want to be able to regenerate or re-identify those datasets.
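
For concreteness, option (1) could look something like the sketch below; the key handling and function name are assumptions for illustration, not anything in this PR.

```python
# Minimal sketch of deterministic scrambling with a key (option 1).
# SCRAMBLE_KEY and scramble_account_id are hypothetical names, not project code.
import hashlib
import hmac
import uuid

SCRAMBLE_KEY = b"per-export-secret-key"  # presumably stored per dataset export and kept private

def scramble_account_id(account_id: uuid.UUID, key: bytes = SCRAMBLE_KEY) -> uuid.UUID:
    """Derive a stable, pseudonymous ID from an account ID without storing a mapping."""
    digest = hmac.new(key, account_id.bytes, hashlib.sha256).digest()
    # Truncate the MAC to 128 bits and present it in UUID form.
    return uuid.UUID(bytes=digest[:16])
```

The same key regenerates the same aliases later, which is the property being weighed here against storing the mapping in the database.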

@karlhigley (Collaborator, Author)

Our earlier proposed solution to the growing size of the account-alias data was to skip writing the aliases to the database, store them in Parquet files, and only load the parts we need (i.e., the aliases for accounts that are actually assigned to an experiment) in the rare case that we permit someone to reuse a set of aliases across a data export and a subsequent experiment.

You may recall that you suggested that we just store them in a table for now in order to get something working. 😄

@karlhigley (Collaborator, Author)

I think the bones of a medium-to-long-term solution are there though: for any experiment where we don't expect to re-use the aliases, we can write them to a Parquet file (for us) alongside the experiment results data export (for experimenters) and remove them from the database. If we ever do need to use them again, we can reload the mapping or whichever parts we need.
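
A rough sketch of that archive-and-reload flow, assuming pyarrow and made-up file/column names (nothing here is from the actual codebase):

```python
# Hypothetical archival flow: write the alias mapping next to the export,
# then reload only the rows we need if the aliases ever have to be reused.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def archive_aliases(aliases: pa.Table, export_path: str) -> None:
    """Write the account-alias mapping alongside the experiment's data export."""
    pq.write_table(aliases, f"{export_path}/account_aliases.parquet", compression="zstd")

def reload_aliases(export_path: str, account_ids: list[str]) -> pa.Table:
    """Reload only the aliases for accounts that are actually assigned to an experiment."""
    table = pq.read_table(f"{export_path}/account_aliases.parquet")
    mask = pc.is_in(table["account_id"], value_set=pa.array(account_ids))
    return table.filter(mask)
```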

@kluver (Contributor) commented Jul 26, 2024

> I think the bones of a medium-to-long-term solution are there though: for any experiment where we don't expect to re-use the aliases, we can write them to a Parquet file (for us) alongside the experiment results data export (for experimenters) and remove them from the database. If we ever do need to use them again, we can reload the mapping or whichever parts we need.

I was actually thinking slightly further: all dataset renaming tables are written to a file, and that file is the canonical version, but we cache whatever we want/need.

@kluver (Contributor) commented Jul 26, 2024

Should the experiment–team relationship be 1-1, or should it be many-to-one (one team can have many experiments, in theory)?
Should the accounts–expt_assignments relationship be 1-1, or should it be many-to-one (one user can be assigned many times over time)?

@karlhigley (Collaborator, Author) commented Jul 29, 2024

> Should the experiment–team relationship be 1-1, or should it be many-to-one (one team can have many experiments, in theory)?
> Should the accounts–expt_assignments relationship be 1-1, or should it be many-to-one (one user can be assigned many times over time)?

Yeah, those should both be many-to-one.
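
To make the many-to-one direction concrete: the foreign key lives on the "many" side, with no uniqueness constraint on that column. A hypothetical SQLAlchemy-style sketch (table and column names are guesses, not the PR's actual models):

```python
import uuid
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Experiment(Base):
    __tablename__ = "experiments"
    experiment_id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    # Many experiments can point at the same team.
    team_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("teams.team_id"))

class ExptAssignment(Base):
    __tablename__ = "expt_assignments"
    assignment_id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    # The same account can be assigned many times over time.
    account_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("accounts.account_id"))
    experiment_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("experiments.experiment_id"))
```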

@karlhigley (Collaborator, Author)

I've updated the diagram to reflect many-to-one relationships in the places where they were missing. I think the code already reflected that and I just missed it in the picture.

@karlhigley marked this pull request as ready for review on July 29, 2024 17:33
@karlhigley force-pushed the karl/feature/account-aliases branch from 38f78ef to 451223e on July 29, 2024 17:41
@mdekstrand

Agree on the "save to Parquet" archive strategy, at least in the medium term. We can stash those in cold (or even Glacier) S3 storage, at least until there is so much activity that it becomes cost-prohibitive. Having an archive (with good compression) of each of the datasets we have generated is probably a good thing for the foreseeable future.
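
As a purely illustrative example of the cold-storage idea (bucket and object names are made up, and boto3 is just one way to do it):

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "exports/expt-123/account_aliases.parquet",     # hypothetical local archive file
    "example-archive-bucket",                       # assumed bucket name
    "experiments/expt-123/account_aliases.parquet", # assumed object key
    ExtraArgs={"StorageClass": "GLACIER"},          # or "DEEP_ARCHIVE" for even colder storage
)
```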

Regarding the model, it looks sound to me except for a couple of tweaks on account aliases.

  • It looks like the account-aliases table has the wrong PK in the diagram: we either need a synthetic PK, or a composite PK of account & dataset IDs.
  • If I am interpreting correctly, alias_id will be an integer that is returned as the user ID in the deidentified dataset? Definitely open to persuasion, but I think it would be better to use UUIDs for this too, for two reasons:
    • consistency with online environment (experimenter code doesn't have to handle both UUIDs and bigints depending on whether it's running with offline data or online)
    • platform data leakage (random and hash UUIDs don't leak anything about how many there are or in what order they were generated)

@karlhigley (Collaborator, Author)

Ah yeah, these are artifacts of me drawing the schema by hand and not translating the code faithfully into the picture. If you check the migration code, alias_id is a UUID and the primary key of that table. That table will also need a uniqueness constraint to enforce that each user only has one alias per dataset export.
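
For readers who don't have the migration code open, the shape described here is roughly the following (a sketch using SQLAlchemy-style declarations and guessed column names, not the actual migration):

```python
import uuid
from sqlalchemy import ForeignKey, UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class AccountAlias(Base):
    __tablename__ = "account_aliases"
    # alias_id is the primary key and the value exposed as the user ID in de-identified exports.
    alias_id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    account_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("accounts.account_id"))
    dataset_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("datasets.dataset_id"))

    # Each account gets at most one alias per dataset export.
    __table_args__ = (UniqueConstraint("account_id", "dataset_id"),)
```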

@karlhigley force-pushed the karl/feature/account-aliases branch from 2ec9ae9 to 060e302 on July 29, 2024 19:34
@karlhigley merged commit 3d72dff into main on Jul 30, 2024
4 checks passed