Create database tables for teams, datasets, and account aliases #23

Merged
karlhigley merged 12 commits into main from karl/feature/account-aliases on Jul 30, 2024

Conversation

@karlhigley (Collaborator) commented Jul 26, 2024

There's corresponding work to set up repository classes etc. still to do, but I'm sharing this now so others can take a look at the schema and tell me if anything seems off.

[Screenshot: proposed database schema diagram (Screenshot-20240729145546-1382x799)]

@karlhigley requested a review from kluver on July 26, 2024 14:40
@karlhigley self-assigned this on Jul 26, 2024
@karlhigley force-pushed the karl/feature/account-aliases branch from 29cace8 to 38f78ef on July 26, 2024 14:44
@karlhigley requested a review from mdekstrand on July 26, 2024 14:44
@mdekstrand

One thing that jumps out here is how we handle scrambling. We had talked some time ago about the complexities of aliasing/scrambling, and the possibly different needs for database-dump scrambling versus online scrambling when sending user data to experiments, with the two possibilities being:

  1. deterministic scrambling with a key (encrypt user ids)
  2. store scrambled / reassigned user IDs

At the risk of premature optimization, I'm concerned about the data-size explosion of using (2) for dataset exports. I think (2) is definitely a good idea for online querying of experiment endpoints, because (1) requires the scrambling code to live in more places, but for dataset export should we consider (1), since an export should be relatively self-contained? Raising this now because if we depend on storing the mapped IDs in the database from the beginning, we at least need to keep those historical logs for as long as we want to be able to regenerate or re-identify those datasets.
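
For concreteness, option (1) could look something like the sketch below; the key handling and function name are assumptions for illustration, not anything in this PR.

```python
# Minimal sketch of deterministic scrambling with a key (option 1).
# SCRAMBLE_KEY and scramble_account_id are hypothetical names, not project code.
import hashlib
import hmac
import uuid

SCRAMBLE_KEY = b"per-export-secret-key"  # presumably stored per dataset export and kept private

def scramble_account_id(account_id: uuid.UUID, key: bytes = SCRAMBLE_KEY) -> uuid.UUID:
    """Derive a stable, pseudonymous ID from an account ID without storing a mapping."""
    digest = hmac.new(key, account_id.bytes, hashlib.sha256).digest()
    # Truncate the MAC to 128 bits and present it in UUID form.
    return uuid.UUID(bytes=digest[:16])
```

The same key regenerates the same aliases later, which is the property being weighed here against storing the mapping in the database.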

@karlhigley (Collaborator, Author)

Our earlier proposed solution to the growing size of the account-alias data was to skip writing the aliases to the database, store them in Parquet files, and only load the parts we need (i.e., the aliases for accounts that are actually assigned to an experiment) in the rare case that we permit someone to reuse a set of aliases across a data export and a subsequent experiment.

You may recall that you suggested that we just store them in a table for now in order to get something working. 😄

@karlhigley (Collaborator, Author)

I think the bones of a medium-to-long-term solution are there though: for any experiment where we don't expect to re-use the aliases, we can write them to a Parquet file (for us) alongside the experiment results data export (for experimenters) and remove them from the database. If we ever do need to use them again, we can reload the mapping or whichever parts we need.
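
A rough sketch of that archive-and-reload flow, assuming pyarrow and made-up file/column names (nothing here is from the actual codebase):

```python
# Hypothetical archival flow: write the alias mapping next to the export,
# then reload only the rows we need if the aliases ever have to be reused.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def archive_aliases(aliases: pa.Table, export_path: str) -> None:
    """Write the account-alias mapping alongside the experiment's data export."""
    pq.write_table(aliases, f"{export_path}/account_aliases.parquet", compression="zstd")

def reload_aliases(export_path: str, account_ids: list[str]) -> pa.Table:
    """Reload only the aliases for accounts that are actually assigned to an experiment."""
    table = pq.read_table(f"{export_path}/account_aliases.parquet")
    mask = pc.is_in(table["account_id"], value_set=pa.array(account_ids))
    return table.filter(mask)
```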

@kluver (Contributor) commented Jul 26, 2024

> I think the bones of a medium-to-long-term solution are there though: for any experiment where we don't expect to re-use the aliases, we can write them to a Parquet file (for us) alongside the experiment results data export (for experimenters) and remove them from the database. If we ever do need to use them again, we can reload the mapping or whichever parts we need.

I was actually thinking slightly further: all dataset renaming tables are written to a file, and that file is the canonical version, but we cache whatever we want/need.

@kluver (Contributor) commented Jul 26, 2024

Should the experiment–team relationship be 1-1, or should it be many-to-one (one team can have many experiments, in theory)?
Should the accounts–expt_assignments relationship be 1-1, or should it be many-to-one (one user can be assigned many times over time)?

@karlhigley (Collaborator, Author) commented Jul 29, 2024

> Should the experiment–team relationship be 1-1, or should it be many-to-one (one team can have many experiments, in theory)?
> Should the accounts–expt_assignments relationship be 1-1, or should it be many-to-one (one user can be assigned many times over time)?

Yeah, those should both be many-to-one.
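
To make the many-to-one direction concrete: the foreign key lives on the "many" side, with no uniqueness constraint on that column. A hypothetical SQLAlchemy-style sketch (table and column names are guesses, not the PR's actual models):

```python
import uuid
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Experiment(Base):
    __tablename__ = "experiments"
    experiment_id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    # Many experiments can point at the same team.
    team_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("teams.team_id"))

class ExptAssignment(Base):
    __tablename__ = "expt_assignments"
    assignment_id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    # The same account can be assigned many times over time.
    account_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("accounts.account_id"))
    experiment_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("experiments.experiment_id"))
```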

@karlhigley (Collaborator, Author)

I've updated the diagram to reflect many-to-one relationships in the places where they were missing. I think the code already reflected that and I just missed it in the picture.

@karlhigley marked this pull request as ready for review on July 29, 2024 17:33
@karlhigley force-pushed the karl/feature/account-aliases branch from 38f78ef to 451223e on July 29, 2024 17:41
@mdekstrand

Agree on the "save to Parquet" archive strategy, at least in the medium term. We can stash those in cold (or even Glacier) S3 storage, at least until there is so much activity that it becomes cost-prohibitive. Having an archive (with good compression) of each of the datasets we have generated is probably a good thing for the foreseeable future.
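
As a purely illustrative example of the cold-storage idea (bucket and object names are made up, and boto3 is just one way to do it):

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "exports/expt-123/account_aliases.parquet",     # hypothetical local archive file
    "example-archive-bucket",                       # assumed bucket name
    "experiments/expt-123/account_aliases.parquet", # assumed object key
    ExtraArgs={"StorageClass": "GLACIER"},          # or "DEEP_ARCHIVE" for even colder storage
)
```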

Regarding the model, it looks sound to me except for a couple of tweaks on account aliases.

  • It looks like the account-aliases table has the wrong PK in the diagram: we either need a synthetic PK, or a composite PK of account & dataset IDs.
  • If I am interpreting correctly, alias_id will be an integer that is returned as the user ID in the deidentified dataset? Definitely open to persuasion, but I think it would be better to use UUIDs for this too, for two reasons:
    • consistency with online environment (experimenter code doesn't have to handle both UUIDs and bigints depending on whether it's running with offline data or online)
    • platform data leakage (random and hash UUIDs don't leak anything about how many there are or in what order they were generated)

@karlhigley (Collaborator, Author)

Ah yeah, these are artifacts of me drawing the schema by hand and not translating the code faithfully into the picture. If you check the migration code, alias_id is a UUID and the primary key of that table. That table will also need a uniqueness constraint to enforce that each user only has one alias per dataset export.
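
For readers who don't have the migration code open, the shape described here is roughly the following (a sketch using SQLAlchemy-style declarations and guessed column names, not the actual migration):

```python
import uuid
from sqlalchemy import ForeignKey, UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class AccountAlias(Base):
    __tablename__ = "account_aliases"
    # alias_id is the primary key and the value exposed as the user ID in de-identified exports.
    alias_id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    account_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("accounts.account_id"))
    dataset_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("datasets.dataset_id"))

    # Each account gets at most one alias per dataset export.
    __table_args__ = (UniqueConstraint("account_id", "dataset_id"),)
```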

@karlhigley force-pushed the karl/feature/account-aliases branch from 2ec9ae9 to 060e302 on July 29, 2024 19:34
@karlhigley merged commit 3d72dff into main on Jul 30, 2024
4 checks passed