-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create teams, datasets, and account aliases when ingesting manifests #41
Changes from 19 commits
4bacad3
14ec332
5331dd0
4f4bb1b
7908ac2
101882d
6c1a706
1412ddc
e1f21fe
b258e6e
380aafe
1ef6939
8a0c773
c7008bf
5e8d43e
1e602c0
4e89d9c
d77e60f
1879e9a
bb67589
4569770
bef16dc
fcbf587
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
"""remove name column from datasets table | ||
|
||
Revision ID: 8bf414e0ddfb | ||
Revises: dd50d8e7777e | ||
Create Date: 2024-09-04 11:25:43.062068 | ||
|
||
""" | ||
|
||
from typing import Sequence, Union | ||
|
||
import sqlalchemy as sa | ||
from alembic import op | ||
|
||
# revision identifiers, used by Alembic. | ||
revision: str = "8bf414e0ddfb" | ||
down_revision: Union[str, None] = "dd50d8e7777e" | ||
branch_labels: Union[str, Sequence[str], None] = None | ||
depends_on: Union[str, Sequence[str], None] = None | ||
|
||
|
||
def upgrade() -> None: | ||
op.drop_column("datasets", "dataset_name") | ||
|
||
|
||
def downgrade() -> None: | ||
op.add_column( | ||
"datasets", | ||
sa.Column("dataset_name", sa.String), | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,15 +2,17 @@ | |
|
||
from copy import deepcopy | ||
from datetime import date, timedelta | ||
from uuid import UUID | ||
from uuid import UUID, uuid4 | ||
|
||
import tomli | ||
from pydantic import BaseModel, PositiveInt | ||
|
||
from poprox_storage.concepts.experiment import ( | ||
Experiment, | ||
Group, | ||
Phase, | ||
Recommender, | ||
Team, | ||
Treatment, | ||
) | ||
|
||
|
@@ -22,6 +24,7 @@ class ManifestFile(BaseModel): | |
""" | ||
|
||
experiment: ManifestExperiment | ||
owner: ManifestTeam | ||
users: ManifestGroupSpec | ||
recommenders: dict[str, ManifestRecommender] | ||
phases: ManifestPhases | ||
|
@@ -34,6 +37,12 @@ class ManifestExperiment(BaseModel): | |
start_date: date | None = None | ||
|
||
|
||
class ManifestTeam(BaseModel): | ||
team_id: UUID | ||
team_name: str | ||
members: list[UUID] | ||
|
||
|
||
class ManifestPhases(BaseModel): | ||
sequence: list[str] | ||
phases: dict[str, ManifestPhase] | ||
|
@@ -85,34 +94,43 @@ def manifest_to_experiment(manifest: ManifestFile) -> Experiment: | |
start_date = manifest.experiment.start_date or (date.today() + timedelta(days=1)) # noqa: DTZ011 | ||
end_date = start_date + convert_duration(manifest.experiment.duration) | ||
|
||
owner = Team( | ||
team_id=manifest.owner.team_id, | ||
team_name=manifest.owner.team_name, | ||
members=manifest.owner.members, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the implication here that the manifest will contain a list of member UUIDs? Do we like this, or should we have the manifest list emails and then look up the UUIDs as part of ingress here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also -- from a type-safety standpoint (which might be uninteresting FWIW) would this be Not sure we care about that, but it occurred to me that it's worth checking. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes!
No! I do not like this but it allowed me to punt on integrating account lookups, which complicate the picture as far as testing goes.
It should be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I sit corrected—after checking with a breakpoint it actually is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Neat! |
||
) | ||
|
||
experiment = Experiment( | ||
experiment_id=manifest.experiment.id, | ||
owner=owner, | ||
start_date=start_date, | ||
end_date=end_date, | ||
description=manifest.experiment.description, | ||
phases=[], | ||
) | ||
|
||
recommenders = { | ||
rec_name: Recommender(name=rec_name, endpoint_url=recommender.endpoint) | ||
rec_name: Recommender(recommender_id=uuid4(), name=rec_name, endpoint_url=recommender.endpoint) | ||
for rec_name, recommender in manifest.recommenders.items() | ||
} | ||
|
||
groups = {} | ||
for group_name, group in manifest.users.groups.items(): | ||
if group.identical_to: | ||
new_group = deepcopy(groups[group.identical_to]) | ||
new_group.group_id = uuid4() | ||
new_group.name = group_name | ||
groups[group_name] = new_group | ||
else: | ||
groups[group_name] = Group(name=group_name, minimum_size=group.minimum_size) | ||
groups[group_name] = Group(group_id=uuid4(), name=group_name, minimum_size=group.minimum_size) | ||
|
||
phase_start = start_date | ||
for phase_name in manifest.phases.sequence: | ||
manifest_phase = manifest.phases.phases[phase_name] | ||
duration = convert_duration(manifest_phase.duration) | ||
phase_start = start_date + sum([phase.duration for phase in experiment.phases], start=timedelta(0)) | ||
phase_end = phase_start + duration | ||
phase = Phase(name=phase_name, start_date=phase_start, end_date=phase_end, treatments=[]) | ||
phase = Phase(phase_id=uuid4(), name=phase_name, start_date=phase_start, end_date=phase_end, treatments=[]) | ||
for group_name, assignment in manifest_phase.assignments.items(): | ||
recommender_name = assignment.recommender | ||
phase.treatments.append( | ||
|
@@ -137,3 +155,15 @@ def convert_duration(duration: str) -> timedelta: | |
msg = f"Unsupported duration unit: {unit}" | ||
raise ValueError(msg) | ||
return duration | ||
|
||
|
||
def parse_manifest_toml(manifest_file: str): | ||
manifest_dict = tomli.loads(manifest_file) | ||
phases = {"sequence": manifest_dict["phases"]["sequence"], "phases": {}} | ||
for name, phase in manifest_dict["phases"].items(): | ||
if name != "sequence": | ||
phases["phases"][name] = phase | ||
|
||
manifest_dict["phases"] = phases | ||
|
||
return ManifestFile.model_validate(manifest_dict) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
# pyright: strict | ||
from __future__ import annotations | ||
|
||
import logging | ||
from pathlib import Path | ||
from typing import overload | ||
|
||
logger = logging.getLogger(__name__) | ||
_cached_root: Path | None = None | ||
|
||
|
||
@overload | ||
def project_root() -> Path: ... | ||
@overload | ||
def project_root(*, require: bool) -> Path | None: ... | ||
def project_root(*, require: bool = True) -> Path | None: | ||
""" | ||
Find the project root directory (when we are running in the project). | ||
|
||
This searches upwards from the **current working directory** to find the | ||
root of the project, which it identifies by the ``pyproject.toml`` file. If | ||
this function is called from a directory that is not within a checkout of | ||
the ``poprox-recommender`` repository, it will raise an error. | ||
karlhigley marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Args: | ||
require: | ||
Whether to fail when the project root is not found, or return | ||
``None``. If ``require=False`` this function will stil fail on a | ||
*defective* project root (contains an invalid ``pyproject.toml``). | ||
|
||
Returns: | ||
The full path to the project root directory. If the project root is | ||
not found and ``require=False``, returns ``None``. | ||
""" | ||
global _cached_root | ||
if _cached_root is None: | ||
cwd = Path(".").resolve() | ||
candidate = cwd | ||
logger.debug("searching for project root upwards from %s", candidate) | ||
while not _is_project_root(candidate): | ||
candidate = candidate.parent | ||
if not candidate or str(candidate) == "/": | ||
if require: | ||
msg = f"cannot find project root for {cwd}" | ||
raise RuntimeError(msg) | ||
else: | ||
# don't cache None | ||
return None | ||
|
||
logger.debug("found project root at %s", candidate) | ||
_cached_root = candidate | ||
|
||
return _cached_root | ||
|
||
|
||
def _is_project_root(path: Path) -> bool: | ||
tomlf = path / "pyproject.toml" | ||
if tomlf.exists(): | ||
return True | ||
else: | ||
return False |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,6 +55,21 @@ def fetch_account_by_email(self, email: str) -> Account | None: | |
return accounts[0] | ||
return None | ||
|
||
def store_account(self, account: Account) -> UUID | None: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it worth revising the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not totally sure, but maybe? I left the other method there so this wouldn't break anything, but I do kinda like having methods that accept domain objects instead of accepting the relevant info and creating domain objects internally
karlhigley marked this conversation as resolved.
Show resolved
Hide resolved
|
||
account_tbl = self.tables["accounts"] | ||
query = ( | ||
sqlalchemy.insert(account_tbl) | ||
.values( | ||
account_id=account.account_id, | ||
email=account.email, | ||
source=account.source, | ||
status="new_account", | ||
) | ||
.returning(account_tbl.c.account_id) | ||
) | ||
row = self.conn.execute(query).one_or_none() | ||
return row.account_id | ||
|
||
def store_new_account(self, email: str, source: str) -> Account: | ||
account_tbl = self.tables["accounts"] | ||
query = ( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a problem with this change, but it does feel unconnected from the other changes, and I'm not immediately seeing the reasoning here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each experiment is owned by a team and corresponds to a dataset, which (among other things) is how we know which account aliases to use when exporting experiment data. Loading a manifest is currently the only way to create a dataset, and the manifest doesn't have a field for naming the dataset so it's not clear what to put in this column.
An alternative change would be to make this column nullable if we expect other ways of creating datasets that would allow us to provide reasonable names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're running into an overload of the term dataset. My gut says that outside experiment-team-tied datasets we may one-day have "public" datasets.
I don't think that hypothetical is worth keeping a column you're unsure about -- if we hit the hypothetical future where we have "public" datasets, let's solve that problem later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public vs private is an interesting wrinkle too, but I was actually thinking of something a little different:
We've designed the database schema so that every experiment has a dataset, but not all datasets have experiments. That leaves us room to do things like export data for an experimenter so they can check properties of the data in order to find out if our platform makes sense for their experiment (like @sophiasun0515 has wanted to do re: domestic/international news bias.) Exporting data that way should also create a dataset with associated account id aliases but won't involve importing a manifest since it's pre-experiment.