Create teams, datasets, and account aliases when ingesting manifests #41

karlhigley · 2024-09-03T19:26:44Z

No description provided.

karlhigley · 2024-09-04T15:57:00Z

This does create account aliases (among other things), but the methods for fetching them don't exist yet, so the tests don't explicitly check them. I'm envisioning that as a separate set of PRs that integrates the new aliases into the recommendation request and dataset export processes.

kluver

A few things I want to check, but nothing that looked like a blocker to me.

kluver · 2024-09-05T15:37:21Z

src/poprox_storage/repositories/accounts.py

@@ -55,6 +55,21 @@ def fetch_account_by_email(self, email: str) -> Account | None:
            return accounts[0]
        return None

+    def store_account(self, account: Account) -> UUID | None:


Is it worth revising the web code that uses the following store_new_account to use this version instead?

I'm not totally sure, but maybe? I left the other method there so this wouldn't break anything, but I do kinda like having methods that accept domain objects instead of accepting the relevant info and creating domain objects internally
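
To make the trade-off above concrete, here is a minimal sketch of the two method styles side by side. Everything except the `store_account` signature is an assumption: the `Account` dataclass, the in-memory repository, and the `store_new_account(email)` signature are hypothetical stand-ins, not the project's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class Account:
    # Hypothetical stand-in for the real Account domain object.
    email: str
    account_id: UUID = field(default_factory=uuid4)


class InMemoryAccountRepository:
    """Toy repository for illustration; the real one writes to a database."""

    def __init__(self) -> None:
        self._accounts: dict[UUID, Account] = {}

    def store_account(self, account: Account) -> UUID | None:
        # Accepts a fully formed domain object built by the caller.
        self._accounts[account.account_id] = account
        return account.account_id

    def store_new_account(self, email: str) -> UUID | None:
        # Accepts raw fields (assumed signature), builds the domain object
        # internally, then delegates so both entry points share one code path.
        return self.store_account(Account(email=email))


repo = InMemoryAccountRepository()
first_id = repo.store_account(Account(email="a@example.com"))
second_id = repo.store_new_account("b@example.com")
```

Keeping the older method as a thin wrapper, as sketched here, is one way to avoid breaking existing callers while moving toward the domain-object style.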


kluver · 2024-09-05T15:38:52Z

...b/migrations/versions/2024_09_04_1125-8bf414e0ddfb_remove_name_column_from_datasets_table.py

+
+
+def upgrade() -> None:
+    op.drop_column("datasets", "dataset_name")


I don't have a problem with this change, but it does feel unconnected from the other changes, and I'm not immediately seeing the reasoning here.

Each experiment is owned by a team and corresponds to a dataset, which (among other things) is how we know which account aliases to use when exporting experiment data. Loading a manifest is currently the only way to create a dataset, and the manifest doesn't have a field for naming the dataset so it's not clear what to put in this column.

An alternative change would be to make this column nullable if we expect other ways of creating datasets that would allow us to provide reasonable names.

I think we're running into an overload of the term dataset. My gut says that outside experiment-team-tied datasets we may one day have "public" datasets.

I don't think that hypothetical is worth keeping a column you're unsure about -- if we hit the hypothetical future where we have "public" datasets, let's solve that problem later.

Public vs private is an interesting wrinkle too, but I was actually thinking of something a little different:

We've designed the database schema so that every experiment has a dataset, but not all datasets have experiments. That leaves us room to do things like export data for an experimenter so they can check properties of the data in order to find out if our platform makes sense for their experiment (like @sophiasun0515 has wanted to do re: domestic/international news bias). Exporting data that way should also create a dataset with associated account ID aliases, but it won't involve importing a manifest since it's pre-experiment.

kluver · 2024-09-05T15:41:18Z

src/poprox_storage/concepts/manifest.py

+    owner = Team(
+        team_id=manifest.owner.team_id,
+        team_name=manifest.owner.team_name,
+        members=manifest.owner.members,


Is the implication here that the manifest will contain a list of member UUIDs? Do we like this, or should we have the manifest list emails and then look up the UUIDs as part of ingress here?

Also -- from a type-safety standpoint (which might be uninteresting FWIW) would this be list[str] or list[UUID] at this point? I feel like it would be list[str] but the Team class has list[UUID]?

Not sure we care about that, but it occurred to me that it's worth checking.

> Is the implication here that the manifest will contain a list of member UUIDs?

Yes!

> Do we like this, or should we have the manifest list emails and then look up the UUIDs as part of ingress here?

No! I do not like this but it allowed me to punt on integrating account lookups, which complicate the picture as far as testing goes.

> would this be list[str] or list[UUID] at this point? I feel like it would be list[str] but the Team class has list[UUID]?

It should be list[UUID] but it's actually list[str]. I'll fix this!

I sit corrected: after checking with a breakpoint, it actually is list[UUID] here, so the types are correct. The conversion from string to UUID happens when Pydantic turns the manifest JSON into model objects.

> The conversion from string to UUID happens when Pydantic turns the manifest JSON into model objects.

Neat!
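
For anyone who wants to see that conversion in isolation, here is a small sketch of Pydantic coercing UUID strings into `UUID` objects during validation. The `ManifestTeam` model and the sample values are hypothetical; only the field names `team_id`, `team_name`, and `members` come from the diff above, and the project's real manifest models may look different.

```python
from uuid import UUID

from pydantic import BaseModel


class ManifestTeam(BaseModel):
    # Hypothetical model; field names mirror those in the diff above.
    team_id: UUID
    team_name: str
    members: list[UUID]


# Raw data as it might arrive from a parsed manifest: UUIDs are plain strings.
raw = {
    "team_id": "11111111-1111-1111-1111-111111111111",
    "team_name": "Example Team",
    "members": ["22222222-2222-2222-2222-222222222222"],
}

# Validation converts each string into a uuid.UUID instance.
team = ManifestTeam(**raw)
```

After validation, `team.members` holds `UUID` objects, which is why passing it straight to a class expecting `list[UUID]` type-checks at runtime.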

src/poprox_storage/paths.py

kluver · 2024-09-05T15:49:05Z

tests/concepts/test_manifest.py

+
+
+def test_load_manifest():
+    with open(project_root() / "tests" / "data" / "sample_manifest.toml") as f:


str / str is new syntax to me. Does it just put in OS-appropriate slashes?

Basically, yes! project_root() returns a pathlib.Path, which provides this syntax and applies the OS-appropriate magic you inferred.
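
A quick illustration of the `/` operator, using only the standard library. The `"project"` root below is a made-up placeholder for whatever `project_root()` returns. Note that the left operand has to be a path object: dividing two plain strings raises a `TypeError`.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Path / str returns a new Path using the separator of the current OS.
manifest_path = Path("project") / "tests" / "data" / "sample_manifest.toml"

# The "pure" path classes show what each OS flavor would produce,
# regardless of the machine this runs on.
posix = PurePosixPath("project") / "tests" / "data" / "sample_manifest.toml"
windows = PureWindowsPath("project") / "tests" / "data" / "sample_manifest.toml"

print(posix)    # project/tests/data/sample_manifest.toml
print(windows)  # project\tests\data\sample_manifest.toml
```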

karlhigley self-assigned this Sep 3, 2024

karlhigley marked this pull request as ready for review September 3, 2024 19:26

karlhigley added 3 commits September 4, 2024 09:17

Create a team when ingesting an experiment from a manifest file

4bacad3

Create a dataset when ingesting an experiment from a manifest file

14ec332

Create account aliases when ingesting an experiment from a manifest file

5331dd0

karlhigley force-pushed the karl/feature/team-dataset-aliases branch from 7ee2bf5 to 5331dd0 Compare September 4, 2024 13:17

karlhigley added 2 commits September 4, 2024 09:18

Apply formatting

4f4bb1b

Reorganize the tests

7908ac2

karlhigley force-pushed the karl/feature/team-dataset-aliases branch 2 times, most recently from 60abe68 to fcb2710 Compare September 4, 2024 13:48

Add a test for manifest parsing and conversion

101882d

karlhigley force-pushed the karl/feature/team-dataset-aliases branch from fcb2710 to 101882d Compare September 4, 2024 13:51

karlhigley added 6 commits September 4, 2024 09:54

Move sample manifest to a separate file

6c1a706

Add a simple test that exercises experiment storage

1412ddc

Apply injection decorator to a non-test function

e1f21fe

Remove injection from test

b258e6e

Add additional tables to DbExperimentRepository

380aafe

Fix method call

1ef6939

karlhigley force-pushed the karl/feature/team-dataset-aliases branch 3 times, most recently from 27c7095 to 621dbba Compare September 4, 2024 14:53

Create account for experiment owner

8a0c773

karlhigley force-pushed the karl/feature/team-dataset-aliases branch from 621dbba to 8a0c773 Compare September 4, 2024 14:55

Assert on store_account result

c7008bf

karlhigley force-pushed the karl/feature/team-dataset-aliases branch from f8317fd to c7008bf Compare September 4, 2024 15:07

karlhigley added 5 commits September 4, 2024 11:38

Remove dataset_name column from datasets table

5e8d43e

Assign UUIDs to groups etc when parsing manifest

1e602c0

Fix experiment and group columns for inserts

4e89d9c

Commit instead of rolling back before experiment insert

d77e60f

Create the experiment owner account if it doesn't already exist

1879e9a

karlhigley requested a review from kluver September 4, 2024 15:50

kluver approved these changes Sep 5, 2024


karlhigley added 4 commits September 5, 2024 14:41

Fix repo name in docstring

bb67589

Make dataset_name column nullable instead of removing it

4569770

Clear the relevant tables before running the experiment test

bef16dc

Apply clear_tables test helper in other tests

fcbf587

karlhigley force-pushed the karl/feature/team-dataset-aliases branch from ee11f25 to fcbf587 Compare September 6, 2024 13:13

karlhigley merged commit 4589de3 into main Sep 6, 2024
3 checks passed
