
Commit

Merge pull request #628 from monarch-initiative/externalclingenmedgenefo
Upgrade externally managed content
twhetzel authored Nov 22, 2024
2 parents 4156066 + cd99892 commit df7daf6
Showing 49 changed files with 2,624,127 additions and 44 deletions.
33 changes: 32 additions & 1 deletion docs/externally-managed-content.md
@@ -1,4 +1,4 @@
## Externally managed content
## Externally managed content (EMC)

Externally managed content is content that is provided by trusted providers and is merged in _unreviewed_. Currently, we support three types of externally managed content:

@@ -13,6 +13,37 @@ Externally managed content is content that is provided by trusted providers and
1. External provider provides a TSV. (Ideally they use the same template that NORD uses - see `src/ontology/external/nord.robot.tsv`).
2. We pull it in and turn it into a ROBOT template and transform it to owl.

### QC system

We have implemented a full QC system for EMC here: https://github.com/monarch-initiative/mondo-ingest/pull/628.

It works like this:

1. A check is added to `src/ontology/config/robot_report_external_content.txt` (this is a standard ROBOT `profile.txt`). This check _must_ conform to the ROBOT report formatting requirements, e.g. return exactly three variables (`?entity`, `?property`, `?value`).
2. The QC system first transforms the externally managed content to OWL, then tests it against the ROBOT report, then removes _any_ ID-value combination that the QC identifies as problematic. Note: Right now all EMCs are checked at once, so the QC system cannot know for sure which source a flagged value originally came from; it assumes the value is faulty wherever it appears. This is usually a reasonable assumption, but may occasionally lead to false positives, i.e. a value that is correct in one source is removed because it was flagged in another. The most likely scenario is that two sources say "X" is a synonym, but source A says "narrow" and source B says "broad": the QC system will remove both, as it cannot tell from the check alone which is the offending one.
3. After the QC system has removed all of the faulty values, it writes them to a report for the stakeholders, which is shared in the `src/ontology/external` directory for each EMC source. It also produces a variant of the EMC file which is labelled "processed".
4. The "processed" variant should be used by the Mondo pipeline when updating EMC.
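
The removal step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/scripts/post_process_externally_managed_content.py`: the `Assertion` type, the function name `split_by_qc_failures`, the source ids, and the sample values are all hypothetical. It shows why a value flagged once is removed from every source: the QC result only identifies an ID-value combination, not where it came from.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Assertion(NamedTuple):
    """One value contributed by an EMC source (hypothetical shape)."""
    source: str   # hypothetical EMC id, e.g. "source-a"
    entity: str   # term ID, corresponds to ?entity
    prop: str     # annotation property, corresponds to ?property
    value: str    # annotation value, corresponds to ?value


def split_by_qc_failures(
    assertions: Iterable[Assertion],
    failures: Iterable[Tuple[str, str]],
) -> Tuple[List[Assertion], List[Assertion]]:
    """Split assertions into (kept, removed).

    An assertion is removed when its (entity, value) pair was flagged,
    regardless of which source or property it was asserted with.
    """
    flagged = set(failures)
    kept: List[Assertion] = []
    removed: List[Assertion] = []
    for assertion in assertions:
        key = (assertion.entity, assertion.value)
        (removed if key in flagged else kept).append(assertion)
    return kept, removed


# Two sources disagree on the scope of synonym "X"; a third value is fine.
assertions = [
    Assertion("source-a", "MONDO:0000001", "oboInOwl:hasNarrowSynonym", "X"),
    Assertion("source-b", "MONDO:0000001", "oboInOwl:hasBroadSynonym", "X"),
    Assertion("source-a", "MONDO:0000001", "oboInOwl:hasExactSynonym", "Y"),
]
failures = [("MONDO:0000001", "X")]

kept, removed = split_by_qc_failures(assertions, failures)
```

In this sketch both "X" synonyms end up in `removed` (these would go into the per-source failure report), and only "Y" survives into the "processed" variant.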

### Adding a new EMC to Mondo ingest

1. Add the id of the EMC to the `EXTERNAL_FILES` variable, e.g.
```
EXTERNAL_FILES = \
efo-proxy-merges \
mondo-new
```
2. Add a goal that handles the update (add a comment noting where the information is pulled from, for future souls), e.g.
```
###### ClinGen #########
# Managed in Google Sheets:
# https://docs.google.com/spreadsheets/d/1JAgySABpRkmXl-8lu5Yrxd9yjTGNbH8aoDcMlHqpssQ/edit?gid=637121472#gid=637121472
$(EXTERNAL_CONTENT_DIR)/mondo-clingen.robot.tsv:
	wget "https://docs.google.com/spreadsheets/d/e/2PACX-1vRiYDV1n1nDuJOgnlFx6DsYGyIGlbgI1HeDzI740OgmOKYy2RCCyBqLHiBh-IMadYXjVglsxDPypArh/pub?gid=637121472&single=true&output=tsv" -O $@
```
3. Add the active columns in that file to `src/scripts/post_process_externally_managed_content.py`, in the function `_get_column_of_external_source_related_to_qc_failure(qc_failure, erroneous_row, external)`. Only add the columns which actually contain values, like xrefs and synonyms, not the columns which contain provenance information.
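
To give a feel for what step 3 involves, here is a hypothetical sketch of a per-source column lookup. It is not the body of `_get_column_of_external_source_related_to_qc_failure`: the function name `find_value_column`, the `VALUE_COLUMNS` mapping, the column names, the shape of `qc_failure`, and the `|` separator for multi-valued cells are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical mapping: EMC id -> columns that hold values (xrefs, synonyms).
# Provenance columns are deliberately left out.
VALUE_COLUMNS: Dict[str, List[str]] = {
    "mondo-clingen": ["xref", "exact_synonym"],
}


def find_value_column(
    qc_failure: Dict[str, str],
    erroneous_row: Dict[str, str],
    external: str,
) -> Optional[str]:
    """Return the value column of `erroneous_row` that holds the failed value.

    Returns None if the source is unknown or no listed column matches.
    """
    failed_value = qc_failure["value"]
    for column in VALUE_COLUMNS.get(external, []):
        cell = erroneous_row.get(column) or ""
        # Assumption: multi-valued cells are separated by "|".
        values = [part.strip() for part in cell.split("|")]
        if failed_value in values:
            return column
    return None


# Hypothetical row from a template; "source" is provenance and is ignored.
row = {
    "mondo_id": "MONDO:0000001",
    "xref": "DOID:1|ORCID:1",
    "exact_synonym": "X",
    "source": "X",
}
column = find_value_column({"value": "X"}, row, "mondo-clingen")
```

Because only value columns are listed, the provenance column `source` is never matched even though it happens to contain the same string.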

### Related issues and PRs:

- [Issue: Represent Externally Managed Content in the Mondo Ingest](https://github.com/monarch-initiative/mondo-ingest/issues/439)
2 changes: 1 addition & 1 deletion src/mappings/mondo-nando.sssom.tsv
@@ -15,7 +15,7 @@
# mapping_provider: MONDO:NANDO
# mapping_set_description: This mapping set is manually curated by the NANDO team at
# nanbyodata.jp.
# mapping_set_id: https://w3id.org/sssom/mappings/204f6b4a-3c0d-47c7-998e-1668ceaa99da
# mapping_set_id: https://w3id.org/sssom/mappings/056ce54c-b213-4c8b-80e2-1f9d7add42e0
# mapping_set_title: NANDO - Mondo mappings provided by nanbyodata.jp
subject_id subject_label predicate_id object_id object_label mapping_justification
MONDO:0000050 isolated congenital growth hormone deficiency skos:closeMatch NANDO:2200317 Congenital growth hormone deficiency semapv:MappingInversion
9 changes: 9 additions & 0 deletions src/ontology/config/robot_report_external_content.txt
@@ -0,0 +1,9 @@
ERROR file:tmp/mondo/src/sparql/qc/general/qc-syn-source-not-xref.sparql
ERROR invalid_xref
ERROR duplicate_scoped_synonym
ERROR file:tmp/mondo/src/sparql/qc/general/qc-duplicate-exact-synonym-no-abbrev.sparql
ERROR file:tmp/mondo/src/sparql/qc/mondo/qc-proxy-merges.sparql
ERROR file:tmp/mondo/src/sparql/qc/general/qc-related-exact-synonym.sparql
ERROR file:tmp/mondo/src/sparql/qc/mondo/qc-animal-disease-rare.sparql
ERROR file:tmp/mondo/src/sparql/qc/general/qc-trailing-whitespace.sparql
ERROR file:tmp/mondo/src/sparql/qc/mondo/qc-ordo-subset-exact-mapping.sparql
27 changes: 27 additions & 0 deletions src/ontology/external/efo-proxy-merges.robot.owl
@@ -0,0 +1,27 @@
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@base <http://www.w3.org/2002/07/owl#> .

[ rdf:type owl:Ontology
] .

#################################################################
# Annotation properties
#################################################################

### http://www.geneontology.org/formats/oboInOwl#hasDbXref
<http://www.geneontology.org/formats/oboInOwl#hasDbXref> rdf:type owl:AnnotationProperty .


### http://www.geneontology.org/formats/oboInOwl#source
<http://www.geneontology.org/formats/oboInOwl#source> rdf:type owl:AnnotationProperty .


#################################################################
# Classes
#################################################################

### Generated by the OWL API (version 4.5.29) https://github.com/owlcs/owlapi
4 changes: 4 additions & 0 deletions src/ontology/external/gard-qc-failures.md
@@ -0,0 +1,4 @@

# QC Report for gard

No QC failures found.