
Promotion 2024-11-12 prod (#6699) #6711

Merged (40 commits, Nov 15, 2024)

Commits
4e99857
Clarify and enforce bundle FQID equality semantics (#6671)
nadove-ucsc Oct 22, 2024
7cb40d2
Use `SourcedBundleFQID` in `catalog_complete` subtest
nadove-ucsc Oct 22, 2024
787c416
Clarify and enforce bundle FQID equality semantics (#6671, PR #6672)
achave11-ucsc Nov 5, 2024
cad323c
Rename, relocate, and document method
nadove-ucsc Oct 22, 2024
644377f
Fix: Inaccurate module-level docstring in can_bundle script (#6669)
nadove-ucsc Oct 31, 2024
8da52fb
Fix: Faulty/brittle default value for source parameter in can_bundle …
nadove-ucsc Oct 31, 2024
2b0d7d8
Fix: Incorrect help message for version parameter in can_bundle scrip…
nadove-ucsc Oct 31, 2024
f0be49d
can_bundle script requires explicit table name for AnVIL sources (#6669)
nadove-ucsc Oct 19, 2024
85b36ea
Fix formatting
hannes-ucsc Nov 2, 2024
d53b159
Fix missing parameter in inactive code
hannes-ucsc Nov 2, 2024
bee4a60
Fix: Various problems in can_bundle script (#6669, PR #6670)
achave11-ucsc Nov 6, 2024
d206d2f
Eliminate old-style type annotations in integration tests
nadove-ucsc Oct 22, 2024
37cdfde
Refactor SourceRef prefix assignment
nadove-ucsc Oct 21, 2024
3998aba
Fix: Indexing integration test is complicated and inefficient (#6676)
nadove-ucsc Oct 25, 2024
cfe7acf
Fix: Indexing integration test is complicated and inefficient (#6676,…
achave11-ucsc Nov 7, 2024
f841bf7
Fix docstring formatting
nadove-ucsc Oct 23, 2024
58a671a
Populate empty/missing fields in canned tables
nadove-ucsc Oct 5, 2024
6c1c601
Replace `+` with f-string
nadove-ucsc Oct 5, 2024
96747c3
Make bundle-counting method public
nadove-ucsc Oct 25, 2024
ec2e7a2
Refactor list_bundles methods
nadove-ucsc Oct 31, 2024
d9280cc
Improve BigQuery logging
nadove-ucsc Oct 29, 2024
c8a084f
[r] Emit replicas for AnVIL entities not linked to files
nadove-ucsc Nov 3, 2024
104a73b
[r] Index orphaned replicas (#6626)
nadove-ucsc Nov 3, 2024
fc10972
Add FIXME (#6647)
nadove-ucsc Oct 23, 2024
9b6cf31
[r] Include parent dataset/project in hub IDs (#6626)
nadove-ucsc Sep 27, 2024
0b3b6b4
[r] Index orphaned replicas (#6626, PR #6627)
achave11-ucsc Nov 9, 2024
5909b1f
Refactor AnVIL JSONL manifest test
nadove-ucsc Oct 10, 2024
1c64ffe
Eliminate lazy iterator in AnVIL verbatim manifest
nadove-ucsc Oct 31, 2024
0688af0
Refactor replica schema composition
nadove-ucsc Oct 31, 2024
45b7bf9
Include orphans in manifest when filtering by only project/dataset (#…
nadove-ucsc Oct 10, 2024
039dc98
Document SpecialFields and friends
hannes-ucsc Nov 9, 2024
fb648ba
Refactor conditional weakening of access enforcement
hannes-ucsc Nov 9, 2024
dfb41a2
Refactor verbatim JSONL unit test
hannes-ucsc Nov 9, 2024
17dadfa
Include orphans in manifest when filtering by only project/dataset (#…
hannes-ucsc Nov 10, 2024
df02128
Fix: Inspector findings missing from report (#6692)
dsotirho-ucsc Nov 12, 2024
2bfc8ca
Fix: Inspector findings missing from report (#6692, PR #6700)
dsotirho-ucsc Nov 13, 2024
19ba5d9
[r] Add updated snapshot for project `6135382f` to `lm8` (#6689)
achave11-ucsc Nov 12, 2024
fba2e1b
[r] Add updated snapshot for project 6135382f to lm8 (#6689, PR #6701)
dsotirho-ucsc Nov 14, 2024
b053388
[r] Make anvil8 default catalog and remove anvil7 (#6707)
dsotirho-ucsc Nov 14, 2024
efe016f
[r] Make anvil8 default catalog and remove anvil7 (#6707, PR #6708)
hannes-ucsc Nov 15, 2024
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE/upgrade.md
@@ -107,7 +107,7 @@ Connected issue: #0000

### Operator

- [ ] At least one hour has passed since `anvildev.shared` was last deployed
- [ ] At least 24 hours have passed since `anvildev.shared` was last deployed
- [ ] Ran `script/export_inspector_findings.py` against `anvildev`, imported results to [Google Sheet](https://docs.google.com/spreadsheets/d/1RWF7g5wRKWPGovLw4jpJGX_XMi8aWLXLOvvE5rxqgH8) and posted screenshot of relevant<sup>1</sup> findings as a comment on the connected issue.
- [ ] Propagated the `deploy:shared`, `deploy:gitlab`, `deploy:runner` and `backup:gitlab` labels to the next promotion PRs <sub>or this PR carries none of these labels</sub>
- [ ] Propagated any specific instructions related to the `deploy:shared`, `deploy:gitlab`, `deploy:runner` and `backup:gitlab` labels, from the description of this PR to that of the next promotion PRs <sub>or this PR carries none of these labels</sub>
2 changes: 1 addition & 1 deletion .github/pull_request_template.md.template.py
@@ -958,7 +958,7 @@ def emit(t: T, target_branch: str):
*iif(t is T.upgrade, [
{
'type': 'cli',
'content': 'At least one hour has passed since `anvildev.shared` was last deployed'
'content': 'At least 24 hours have passed since `anvildev.shared` was last deployed'
},
{
'type': 'cli',
1 change: 0 additions & 1 deletion deployments/anvilprod/environment.py
@@ -975,7 +975,6 @@ def env() -> Mapping[str, Optional[str]]:
repository=dict(name='tdr_anvil')),
sources=list(filter(None, sources.values())))
for atlas, catalog, sources in [
('anvil', 'anvil7', anvil7_sources),
('anvil', 'anvil8', anvil8_sources),
]
for suffix, internal in [
1 change: 0 additions & 1 deletion deployments/hammerbox/environment.py
@@ -989,7 +989,6 @@ def env() -> Mapping[str, Optional[str]]:
repository=dict(name='tdr_anvil')),
sources=list(filter(None, sources.values())))
for atlas, catalog, sources in [
('anvil', 'anvil7', anvil7_sources),
('anvil', 'anvil8', anvil8_sources),
]
for suffix, internal in [
1 change: 1 addition & 0 deletions deployments/prod/environment.py
@@ -1181,6 +1181,7 @@ def mkdict(previous_catalog: dict[str, str],
mksrc('bigquery', 'datarepo-c9158593', 'lungmap_prod_20037472ea1d4ddb9cd356a11a6f0f76__20220307_20241002_lm8'),
mksrc('bigquery', 'datarepo-35a6d7ca', 'lungmap_prod_3a02d15f9c6a4ef7852b4ddec733b70b__20241001_20241002_lm8'),
mksrc('bigquery', 'datarepo-131a1234', 'lungmap_prod_4ae8c5c91520437198276935661f6c84__20231004_20241002_lm8'),
mksrc('bigquery', 'datarepo-936db385', 'lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20241106_lm8'),
mksrc('bigquery', 'datarepo-3c4905d2', 'lungmap_prod_834e0d1671b64425a8ab022b5000961c__20241001_20241002_lm8'),
mksrc('bigquery', 'datarepo-d7447983', 'lungmap_prod_f899709cae2c4bb988f0131142e6c7ec__20220310_20241002_lm8'),
mksrc('bigquery', 'datarepo-c11ef363', 'lungmap_prod_fdadee7e209745d5bf81cc280bd8348e__20240206_20241002_lm8'),
50 changes: 29 additions & 21 deletions scripts/can_bundle.py
@@ -1,6 +1,8 @@
"""
Download manifest and metadata for a given bundle from the given repository
source and store them as separate JSON files in the index test data directory.
Download the contents of a given bundle from the given repository source and
store it as a single JSON file. Users are expected to be familiar with the
structure of the bundle FQIDs for the given source and provide the appropriate
attributes.

Note: silently overwrites the destination file.
"""
@@ -15,10 +17,6 @@
import sys
import uuid

from more_itertools import (
one,
)

from azul import (
cache,
config,
@@ -45,41 +43,53 @@
from azul.types import (
AnyJSON,
AnyMutableJSON,
JSON,
)

log = logging.getLogger(__name__)


def main(argv):
parser = argparse.ArgumentParser(description=__doc__, formatter_class=AzulArgumentHelpFormatter)
default_catalog = config.default_catalog
plugin_cls = RepositoryPlugin.load(default_catalog)
plugin = plugin_cls.create(default_catalog)
if len(plugin.sources) == 1:
source_arg = {'default': str(one(plugin.sources))}
else:
source_arg = {'required': True}
parser.add_argument('--source', '-s',
**source_arg,
required=True,
help='The repository source containing the bundle')
parser.add_argument('--uuid', '-b',
required=True,
help='The UUID of the bundle to can.')
parser.add_argument('--version', '-v',
help='The version of the bundle to can (default: the latest version).')
help='The version of the bundle to can. Required for HCA, ignored for AnVIL.')
parser.add_argument('--table-name',
help='The BigQuery table of the bundle to can. Only applicable for AnVIL.')
parser.add_argument('--batch-prefix',
help='The batch prefix of the bundle to can. Only applicable for AnVIL. '
'Use "null" for non-batched bundle formats.')
parser.add_argument('--output-dir', '-O',
default=os.path.join(config.project_root, 'test', 'indexer', 'data'),
help='The path to the output directory (default: %(default)s).')
parser.add_argument('--redaction-key', '-K',
help='Provide a key to redact confidential or sensitive information from the output files')
args = parser.parse_args(argv)
bundle = fetch_bundle(args.source, args.uuid, args.version)
fqid_fields = parse_fqid_fields(args)
bundle = fetch_bundle(args.source, fqid_fields)
if args.redaction_key:
redact_bundle(bundle, args.redaction_key.encode())
save_bundle(bundle, args.output_dir)


def fetch_bundle(source: str, bundle_uuid: str, bundle_version: str) -> Bundle:
def parse_fqid_fields(args: argparse.Namespace) -> JSON:
fields = {'uuid': args.uuid, 'version': args.version}
if args.table_name is not None:
fields['table_name'] = args.table_name
batch_prefix = args.batch_prefix
if batch_prefix is not None:
if batch_prefix == 'null':
batch_prefix = None
fields['batch_prefix'] = batch_prefix
return fields


def fetch_bundle(source: str, fqid_args: JSON) -> Bundle:
for catalog in config.catalogs:
plugin = plugin_for(catalog)
try:
@@ -90,10 +100,8 @@ def fetch_bundle(source: str, bundle_uuid: str, bundle_version: str) -> Bundle:
log.debug('Searching for %r in catalog %r', source, catalog)
for plugin_source_spec in plugin.sources:
if source_ref.spec.eq_ignoring_prefix(plugin_source_spec):
fqid = SourcedBundleFQIDJSON(source=source_ref.to_json(),
uuid=bundle_uuid,
version=bundle_version)
fqid = plugin.resolve_bundle(fqid)
fqid = SourcedBundleFQIDJSON(source=source_ref.to_json(), **fqid_args)
fqid = plugin.bundle_fqid_from_json(fqid)
bundle = plugin.fetch_bundle(fqid)
log.info('Fetched bundle %r version %r from catalog %r.',
fqid.uuid, fqid.version, catalog)
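For context, the reworked argument handling in `can_bundle.py` can be exercised in isolation. The sketch below reproduces `parse_fqid_fields` from the diff against a plain `argparse.Namespace`; the UUID and table name in the example are hypothetical, not taken from a real AnVIL source.

```python
import argparse


def parse_fqid_fields(args: argparse.Namespace) -> dict:
    # Attributes shared by all bundle FQIDs
    fields = {'uuid': args.uuid, 'version': args.version}
    # AnVIL-specific attributes are added only when supplied
    if args.table_name is not None:
        fields['table_name'] = args.table_name
    batch_prefix = args.batch_prefix
    if batch_prefix is not None:
        # The literal string 'null' selects non-batched bundle formats
        if batch_prefix == 'null':
            batch_prefix = None
        fields['batch_prefix'] = batch_prefix
    return fields


# Hypothetical example values, for illustration only
args = argparse.Namespace(uuid='1509ef40', version=None,
                          table_name='anvil_biosample', batch_prefix='null')
print(parse_fqid_fields(args))
# {'uuid': '1509ef40', 'version': None, 'table_name': 'anvil_biosample', 'batch_prefix': None}
```

Note how `None` and the string `'null'` play distinct roles: an absent `--batch-prefix` omits the key entirely, while an explicit `null` records the key with a `None` value.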
8 changes: 4 additions & 4 deletions scripts/post_deploy_tdr.py
@@ -96,13 +96,13 @@ def verify_source(self,
plugin = self.repository_plugin(catalog)
ref = plugin.resolve_source(str(source_spec))
log.info('TDR client is authorized for API access to %s.', source_spec)
ref = plugin.partition_source(catalog, ref)
prefix = ref.spec.prefix
if config.deployment.is_main:
require(prefix.common == '', source_spec)
if source_spec.prefix is not None:
require(source_spec.prefix.common == '', source_spec)
self.tdr.check_bigquery_access(source_spec)
else:
subgraph_count = len(plugin.list_bundles(ref, prefix.common))
ref = plugin.partition_source(catalog, ref)
subgraph_count = plugin.count_bundles(ref.spec)
require(subgraph_count > 0, 'Common prefix is too long', ref.spec)
require(subgraph_count <= 512, 'Common prefix is too short', ref.spec)

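The reordered checks in `verify_source` reduce to a range requirement on the number of subgraphs per partition. A minimal sketch of that requirement, where `require` is a simplified stand-in for azul's helper of the same name and the bound of 512 is taken from the diff:

```python
def require(condition: bool, *args) -> None:
    # Simplified stand-in for azul's require(); raises on failure
    if not condition:
        raise RuntimeError(*args)


def check_partition_sizing(subgraph_count: int, spec: str) -> None:
    # Zero subgraphs means the common prefix filters everything out ...
    require(subgraph_count > 0, 'Common prefix is too long', spec)
    # ... while more than 512 means the partitions are too coarse
    require(subgraph_count <= 512, 'Common prefix is too short', spec)


check_partition_sizing(100, 'tdr:example:/0')  # passes silently
```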
6 changes: 5 additions & 1 deletion src/azul/azulclient.py
@@ -225,7 +225,7 @@ def list_bundles(self,
source = plugin.resolve_source(source)
else:
assert isinstance(source, SourceRef), source
return plugin.list_bundles(source, prefix)
log.info('Listing bundles with prefix %r in source %r.', prefix, source)
bundle_fqids = plugin.list_bundles(source, prefix)
log.info('There are %i bundle(s) with prefix %r in source %r.',
len(bundle_fqids), prefix, source)
return bundle_fqids

@property
def sqs(self):
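The `azulclient` change simply brackets the listing call with log statements using lazy `%`-style formatting, so the message is only interpolated if the log level is enabled. A self-contained sketch of the same pattern, with `list_bundles` as a hypothetical stand-in for the plugin call:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def list_bundles(source: str, prefix: str) -> list[str]:
    # Hypothetical stand-in for plugin.list_bundles(); returns fake FQIDs
    return [f'{prefix}{i:02x}' for i in range(3)]


def list_bundles_logged(source: str, prefix: str) -> list[str]:
    # Lazy %-formatting: arguments are passed, not interpolated, at call time
    log.info('Listing bundles with prefix %r in source %r.', prefix, source)
    bundle_fqids = list_bundles(source, prefix)
    log.info('There are %i bundle(s) with prefix %r in source %r.',
             len(bundle_fqids), prefix, source)
    return bundle_fqids


list_bundles_logged('tdr:example:/0', 'ab')
```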
133 changes: 115 additions & 18 deletions src/azul/indexer/__init__.py
@@ -2,6 +2,9 @@
ABCMeta,
abstractmethod,
)
from functools import (
total_ordering,
)
from itertools import (
product,
)
@@ -17,6 +20,7 @@
Self,
TypeVar,
TypedDict,
final,
)

import attrs
@@ -45,21 +49,115 @@
BundleVersion = str


@attrs.frozen(kw_only=True, order=True)
class BundleFQID(SupportsLessAndGreaterThan):
# PyCharm can't handle mixing `attrs` with `total_ordering` and falsely claims
# that comparison operators besides `__lt__` are not defined.
# noinspection PyDataclass
@attrs.frozen(kw_only=True, eq=False)
@total_ordering
class BundleFQID:
"""
>>> list(sorted([
... BundleFQID(uuid='d', version='e'),
... BundleFQID(uuid='a', version='c'),
... BundleFQID(uuid='a', version='b'),
... ]))
... # doctest: +NORMALIZE_WHITESPACE
[BundleFQID(uuid='a', version='b'),
BundleFQID(uuid='a', version='c'),
BundleFQID(uuid='d', version='e')]
A fully qualified bundle identifier. The attributes defined in this class
must always be sufficient to decide whether two instances of this class or
its subclasses identify the same bundle or not. Subclasses may define
additional attributes to help describe the bundle, but they are forbidden
from using these attributes in the implementations of their `__eq__` or
`__hash__` methods, either explicitly or in code generated by `attrs`.
"""
uuid: BundleUUID = attrs.field(order=str.lower)
version: BundleVersion = attrs.field(order=str.lower)
uuid: BundleUUID
version: BundleVersion

def _nucleus(self) -> tuple[str, str]:
return self.uuid.lower(), self.version.lower()

# We can't use attrs' generated implementation because it always
# considers operands with different types to be unequal, regardless of
# their inheritance relationships or how their attributes are annotated
# (e.g. specifying `eq=False` has no effect). We want instances of
# all subclasses to compare equal as long as `uuid` and `version` are
# equal. For the same reason, we can't use `typing.Self` in the signature
# because it would constrain the RHS to instances of subclasses of the LHS.
@final
def __eq__(self, other: 'BundleFQID') -> bool:
"""
>>> b1 = BundleFQID(uuid='a', version='b')
>>> b2 = BundleFQID(uuid='a', version='b')
>>> b1 == b2
True

>>> s1 = SourceRef(id='x', spec=SimpleSourceSpec.parse('y:/0'))
>>> sb1 = SourcedBundleFQID(uuid='a', version='b', source=s1)
>>> sb2 = SourcedBundleFQID(uuid='a', version='b', source=s1)
>>> sb1 == sb2
True

>>> b1 == sb1
True

>>> s2 = SourceRef(id='w', spec=SimpleSourceSpec.parse('z:/0'))
>>> sb3 = SourcedBundleFQID(uuid='a', version='b', source=s2)
>>> b1 == sb3
True

>>> sb1 == sb3
... # doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
AssertionError: (('a', 'b'),
SourceRef(id='x', spec=SimpleSourceSpec(prefix=Prefix(common='', partition=0), name='y')),
SourceRef(id='w', spec=SimpleSourceSpec(prefix=Prefix(common='', partition=0), name='z')))
"""
same_bundle = self._nucleus() == other._nucleus()
if (
same_bundle
and isinstance(self, SourcedBundleFQID)
and isinstance(other, SourcedBundleFQID)
):
assert self.source == other.source, (self._nucleus(), self.source, other.source)
return same_bundle

@final
def __hash__(self) -> int:
return hash(self._nucleus())

def __init_subclass__(cls, **kwargs):
"""
>>> @attrs.frozen(kw_only=True)
... class FooBundleFQID(SourcedBundleFQID):
... foo: str
Traceback (most recent call last):
...
AssertionError: <class 'azul.indexer.FooBundleFQID'>

>>> @attrs.frozen(kw_only=True, eq=False)
... class FooBundleFQID(SourcedBundleFQID):
... foo: str
"""
super().__init_subclass__(**kwargs)
assert cls.__eq__ is BundleFQID.__eq__, cls
assert cls.__hash__ is BundleFQID.__hash__, cls

# attrs doesn't allow `order=True` when `eq=False`
def __lt__(self, other: 'BundleFQID') -> bool:
"""
>>> aa = BundleFQID(uuid='a', version='a')
>>> ab = BundleFQID(uuid='a', version='b')
>>> ba = BundleFQID(uuid='b', version='a')
>>> aa < ab < ba
True

>>> ba > ab > aa
True

>>> aa <= ab <= ba
True

>>> ba >= ab >= aa
True

>>> aa != ab != ba
True
"""
return self._nucleus() < other._nucleus()

def to_json(self) -> MutableJSON:
return attrs.asdict(self, recurse=False)
@@ -454,6 +552,9 @@ def spec_cls(cls) -> type[SourceSpec]:
spec_cls, ref_cls = get_generic_type_params(cls, SourceSpec, SourceRef)
return spec_cls

def with_prefix(self, prefix: Prefix) -> Self:
return attrs.evolve(self, spec=attrs.evolve(self.spec, prefix=prefix))


class SourcedBundleFQIDJSON(BundleFQIDJSON):
source: SourceJSON
Expand All @@ -462,7 +563,7 @@ class SourcedBundleFQIDJSON(BundleFQIDJSON):
BUNDLE_FQID = TypeVar('BUNDLE_FQID', bound='SourcedBundleFQID')


@attrs.frozen(kw_only=True, order=True)
@attrs.frozen(kw_only=True, eq=False)
class SourcedBundleFQID(BundleFQID, Generic[SOURCE_REF]):
"""
>>> spec = SimpleSourceSpec(name='', prefix=(Prefix(partition=0)))
Expand Down Expand Up @@ -493,10 +594,6 @@ def from_json(cls, json: SourcedBundleFQIDJSON) -> 'SourcedBundleFQID':
source = cls.source_ref_cls().from_json(json.pop('source'))
return cls(source=source, **json)

def upcast(self) -> BundleFQID:
return BundleFQID(uuid=self.uuid,
version=self.version)

def to_json(self) -> SourcedBundleFQIDJSON:
return dict(super().to_json(),
source=self.source.to_json())
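The pattern this diff introduces — `attrs` classes with `eq=False` plus `functools.total_ordering`, all comparisons delegated to a case-folded "nucleus" tuple — can be illustrated with plain dataclasses. This is a simplified sketch of the idea, not the azul classes themselves: `FQID` and `SourcedFQID` are hypothetical names.

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass(frozen=True, eq=False)
class FQID:
    uuid: str
    version: str

    def _nucleus(self) -> tuple[str, str]:
        # Equality, hashing and ordering all derive from this tuple alone
        return self.uuid.lower(), self.version.lower()

    def __eq__(self, other: 'FQID') -> bool:
        return self._nucleus() == other._nucleus()

    def __hash__(self) -> int:
        return hash(self._nucleus())

    def __lt__(self, other: 'FQID') -> bool:
        # total_ordering derives <=, > and >= from __lt__ and __eq__
        return self._nucleus() < other._nucleus()


@dataclass(frozen=True, eq=False)
class SourcedFQID(FQID):
    # Extra attribute that must NOT participate in equality or hashing
    source: str = ''


base = FQID(uuid='A', version='b')
sourced = SourcedFQID(uuid='a', version='B', source='x')
print(base == sourced)  # True: case-folded nucleus matches, source ignored
print(FQID(uuid='a', version='a') < base)  # True: ('a', 'a') < ('a', 'b')
```

The key design choice mirrors the diff: generated `__eq__` implementations consider operands of different types unequal, so cross-subclass equality only works when every class in the hierarchy defers to a single hand-written `__eq__` on the base.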
30 changes: 6 additions & 24 deletions src/azul/indexer/document.py
@@ -504,13 +504,12 @@ class ContributionCoordinates(DocumentCoordinates[E], Generic[E]):
subgraph ("bundle") and represent either the addition of metadata to an
entity or the removal of metadata from an entity.

Contributions produced by
transformers don't specify a catalog, the catalog is supplied when the
contributions are written to the index and it is guaranteed to be the same
for all contributions produced in response to one notification. When
contributions are read back during aggregation, they specify a catalog, the
catalog they were read from. Because of that duality this class has to be
generic in E, the type of EntityReference.
Contributions produced by transformers don't specify a catalog. The catalog
is supplied when the contributions are written to the index and it is
guaranteed to be the same for all contributions produced in response to one
notification. When contributions are read back during aggregation, they
specify a catalog, the catalog they were read from. Because of that duality
this class has to be generic in E, the type of EntityReference.
"""

doc_type: ClassVar[DocumentType] = DocumentType.contribution
Expand All @@ -519,23 +518,6 @@ class ContributionCoordinates(DocumentCoordinates[E], Generic[E]):

deleted: bool

def __attrs_post_init__(self):
# If we were to allow instances of subclasses, we'd risk breaking
# equality and hashing semantics. It is impossible to correctly
# implement the transitivity property of equality between instances of
# type and subtype without ignoring the additional attributes added by
# the subtype. Consider types T and S where S is a subtype of T.
# Transitivity requires that s1 == s2 for any two instances s1 and s2
# of S for which s1 == t and s2 == t, where t is any instance of T. The
# subtype instances s1 and s2 can only be equal to t if they match in
# all attributes that T defines. The additional attributes introduced
# by S must be ignored, even when comparing s1 and s2, otherwise s1 and
# s2 might turn out to be unequal. In this particular case that is not
# desirable: we do want to be able to compare instances of subtypes of
# BundleFQID without ignoring any of their attributes.
concrete_type = type(self.bundle)
assert concrete_type is BundleFQID, concrete_type

@property
def document_id(self) -> str:
return '_'.join((
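The assertion removed from `ContributionCoordinates` was justified by a transitivity argument: if subtype instances compare equal to a base instance by the base attributes alone, but to each other by their extra attributes as well, equality stops being transitive. A small sketch makes that concrete; `T` and `S` are hypothetical types, not classes from azul.

```python
class T:

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        # Base equality considers only the base attribute
        return self.x == other.x


class S(T):

    def __init__(self, x, y):
        super().__init__(x)
        self.y = y

    def __eq__(self, other):
        # Naive subtype equality also considers the extra attribute
        if isinstance(other, S):
            return (self.x, self.y) == (other.x, other.y)
        return super().__eq__(other)


t = T(1)
s1 = S(1, 'a')
s2 = S(1, 'b')
print(s1 == t, s2 == t, s1 == s2)  # True True False — transitivity violated
```

This is exactly why the new `BundleFQID.__eq__` ignores subclass attributes entirely and merely asserts that matching bundles also agree on their source.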