Commit f1a2165

fixup! Add SOP for schema changes (DataBiosphere/azul#2884)
Removed schema change embargo from spec and added reference to SOP.
1 parent edbf400 commit f1a2165

1 file changed: +23 -19 lines changed

docs/dcp2_system_design.rst

@@ -37,9 +37,13 @@ EBI Ingest (the primary channel for incorporating projects into the DCP/2), an
3737
adapter for processing analysis (meta)data from Terra workspaces, and adapters
3838
for high-level matrix data from a range of sources.
3939

40-
All metadata is in JSON format and complies with the HCA Metadata Schema.
41-
Aside from minor schema changes that were necessary for processing the staging
42-
areas, the evolution of the schema is currently on hold.
40+
All metadata is in JSON format and complies with the `HCA Metadata Schema`_.
41+
Changes to that schema are made according to standard `DCP/2 operating
42+
procedures`_.
43+
44+
.. _HCA Metadata Schema: https://github.com/HumanCellAtlas/metadata-schema
45+
46+
.. _DCP/2 operating procedures: dcp2_operating_procedures.rst
4347

4448
The DCP/2 only contains public (meta)data (not controlled access).
4549

@@ -309,7 +313,7 @@ follows:
309313
UUID. Pick the row with the highest version.
310314

311315
ii. read the ``inputs``, ``outputs`` and ``protocols`` properties (they're
312-
all lists).
316+
all lists).
313317

314318
For each input, output and protocol, extract the schema type and
315319
entity ID. Query the TDR table that corresponds to the schema type and
@@ -453,7 +457,7 @@ mapping between the two. Similarly, instead of allocating a random UUIDv4 for
453457
the descriptor ``file_id`` one could also derive a UUIDv5 from the SHA-1 or
454458
SHA-256 hashes of the data file's content.
455459

456-
.. [#]
460+
.. [#]
457461
If a file is referenced by multiple bundles using different file names, the
458462
DSS adapter stages multiple objects with the same content. This case occurs
459463
in the wild, but is of negligible impact (< 1% in volume, zarr store
@@ -503,7 +507,7 @@ files to an ``analysis_process`` in the ``links`` table (`metadata-schema
503507
Naming datasets and snapshots
504508
-----------------------------
505509

506-
|nn|
510+
|nn|
507511

508512
This section contains specific details that anticipate that the DCP/2
509513
will soon need to support multiple snapshots per catalog, at least one per
@@ -528,7 +532,7 @@ snapshots:
528532
labelling, sorting and filtering are available when listing datasets and
529533
snapshots using the TDR API. Additionally, IDs are hard to read to the
530534
human eye, and hard to distinguish visually, so as long as we manually
531-
confer them between teams, names are preferred.
535+
confer them between teams, names are preferred.
532536

533537
|ne|
534538

@@ -655,7 +659,7 @@ descriptors, one for metadata files and one for ``links.json`` files.
655659

656660
where
657661

658-
``entity_type``
662+
``entity_type``
659663
is the `HCA schema entity type`_ such as ``cell_suspension``.
660664

661665
``entity_id``
@@ -695,7 +699,7 @@ descriptors, one for metadata files and one for ``links.json`` files.
695699

696700
where
697701

698-
``file_name``
702+
``file_name``
699703
is the ``file_name`` property from the file descriptor object for this
700704
data file.
701705

@@ -705,7 +709,7 @@ descriptors, one for metadata files and one for ``links.json`` files.
705709

706710
where
707711

708-
``links_id``
712+
``links_id``
709713
is a UUID that uniquely identifies the subgraph. The DSS adapter uses the
710714
bundle UUID.
711715

@@ -1061,19 +1065,19 @@ EBNF/Regex, starting at the ``strata`` non-terminal::
10611065
strata = "" | stratum , { "\n" , stratum }
10621066

10631067
stratum = point , { ";" , point }
1064-
1068+
10651069
point = dimension , "=" , values
1066-
1070+
10671071
dimension = "genusSpecies" | "organ" | "developmentStage" | "libraryConstructionApproach"
1068-
1072+
10691073
values = value , { "," , value }
1070-
1074+
10711075
value = [^\n;=,]+
10721076

10731077
Examples:
10741078

1075-
- Not stratified::
1076-
1079+
- Not stratified::
1080+
10771081
""
10781082

10791083
- Stratified::
@@ -1231,8 +1235,8 @@ information about CGMs.
12311235
CGM in the deprecated mechanism (`Describing CGMs as supplementary files`_)
12321236

12331237
- the ``analysis_protocol`` contains an optional ``matrix`` module schema
1234-
containing the properties ``data_normalization_methods`` and
1235-
``derivation_process``
1238+
containing the properties ``data_normalization_methods`` and
1239+
``derivation_process``
12361240

12371241
Traversing the approximate CGM subgraphs, the Azul indexer infers a
12381242
stratification tree of exactly the same structure as the one it derives from
@@ -1242,7 +1246,7 @@ mechanism (`Describing CGMs as supplementary files`_). The Data Browser
12421246
exposes that tree in the same manner on the project details page. The inferral
12431247
algorithm is identical to the one used for ``DCP/2-generated matrices`` with
12441248
the one distinction that the subgraphs in the latter are exact, not
1245-
approximate.
1249+
approximate.
12461250

12471251
Additionally, the CGM analysis files are listed on the Files tab of the Data
12481252
Browser.
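
The ``strata`` grammar shown in the hunk at line 1061 is small enough to check by hand. The following is a minimal sketch of a parser for it in Python, not the Azul implementation. The function name ``parse_strata``, the return shape (one dict per stratum) and the rejection of a dimension repeated within a stratum are assumptions made for illustration; the grammar itself does not forbid repeats. The sample values in the usage note are also hypothetical.

```python
from typing import Dict, List

# Dimension names are taken verbatim from the grammar's `dimension` rule.
DIMENSIONS = {
    "genusSpecies",
    "organ",
    "developmentStage",
    "libraryConstructionApproach",
}


def parse_strata(text: str) -> List[Dict[str, List[str]]]:
    """Parse a `strata` string into a list of {dimension: values} dicts."""
    # strata = "" | stratum , { "\n" , stratum }
    if text == "":
        return []
    strata = []
    for line in text.split("\n"):
        stratum: Dict[str, List[str]] = {}
        # stratum = point , { ";" , point }
        for point in line.split(";"):
            # point = dimension , "=" , values
            dimension, sep, raw_values = point.partition("=")
            if not sep or dimension not in DIMENSIONS:
                raise ValueError(f"invalid point: {point!r}")
            # Assumption (not in the grammar): a dimension may appear
            # only once per stratum.
            if dimension in stratum:
                raise ValueError(f"repeated dimension: {dimension!r}")
            # values = value , { "," , value } ; value = [^\n;=,]+
            values = raw_values.split(",")
            if any(v == "" or "=" in v for v in values):
                raise ValueError(f"invalid values: {raw_values!r}")
            stratum[dimension] = values
        strata.append(stratum)
    return strata
```

Under these assumptions, ``parse_strata("")`` returns an empty list (the "not stratified" case), while ``parse_strata("genusSpecies=Homo sapiens;organ=blood,brain")`` yields a single stratum with two dimensions.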
