Review normalization of terms #157

bgyori · 2024-02-12T03:37:36Z

I found that

MATCH (n:BioEntity {name: "leukemia"}) RETURN n.id LIMIT 5

yields "doid:1240", "efo:0000565", "mondo:0005059", and in addition, we have two nodes called Leukemia, "hp:0001909", and "mesh:D007938".

These nodes might appear just due to ontology imports (which would be fine), but I suspect that they are actually involved in distinct relations without being normalized, leading to fragmentation.


cthoyt · 2024-02-14T15:42:03Z

MATCH p=(n:BioEntity)-[r]-() 
WHERE toLower(n.name) = 'leukemia' 
RETURN n.id, r.source, count(r.source)

gives

n.id	r.source	count(r.source)
"hp:0001909"	null	0
"hp:0001909"	"gilda"	1
"doid:1240"	"disgenet"	61
"doid:1240"	null	0
"doid:1240"	"gilda"	1
"efo:0000565"	null	0
"efo:0000565"	"gilda"	1
"mesh:D007938"	null	0
"mesh:D007938"	"sider_side_effects"	29
"mesh:D007938"	"gilda"	2
"mesh:D007938"	"chembl"	262
"mondo:0005059"	null	0

I double-checked that all of the "null" sources for doid, efo, and mondo come from the ontology hierarchies (it would be good to require source annotations for all edges as well). The "gilda" edges are xrefs.
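To make the fragmentation easier to see, here is a minimal sketch that filters the result rows above down to edges from actual data sources. It assumes, per the observation above, that edges with a null source are ontology hierarchy edges and that "gilda" edges are xrefs; the function name `data_sources_by_node` is just illustrative and not part of indra_cogex.

```python
from collections import defaultdict

# Rows copied from the query result above: (node id, edge source, count).
ROWS = [
    ("hp:0001909", None, 0),
    ("hp:0001909", "gilda", 1),
    ("doid:1240", "disgenet", 61),
    ("doid:1240", None, 0),
    ("doid:1240", "gilda", 1),
    ("efo:0000565", None, 0),
    ("efo:0000565", "gilda", 1),
    ("mesh:D007938", None, 0),
    ("mesh:D007938", "sider_side_effects", 29),
    ("mesh:D007938", "gilda", 2),
    ("mesh:D007938", "chembl", 262),
    ("mondo:0005059", None, 0),
]

# Assumption: a null source means an ontology hierarchy edge and
# "gilda" means an xref, so neither counts as a data relation.
STRUCTURAL_SOURCES = {None, "gilda"}


def data_sources_by_node(rows):
    """Map each node id to {source: count} for non-structural edges only."""
    result = defaultdict(dict)
    for node_id, source, count in rows:
        if source not in STRUCTURAL_SOURCES:
            result[node_id][source] = count
    return dict(result)


fragmented = data_sources_by_node(ROWS)
for node_id, sources in fragmented.items():
    print(node_id, sources)
```

With the rows above, only doid:1240 (disgenet) and mesh:D007938 (sider_side_effects, chembl) remain, so the same disease concept carries data relations under two different identifiers.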

bgyori · 2024-02-14T16:26:06Z

I see, so we don't have any issues with EFO and MONDO but almost certainly, mesh:D007938 will show up from other sources (the name there is capitalized as Leukemia), so we could run a query for that as well.

cthoyt · 2024-02-14T16:37:40Z

I just updated the chart above. We'll want to follow up by checking that sider, chembl, and disgenet are all standardized the same way.

bgyori · 2024-02-14T17:45:50Z

Great, so this reveals that we have an issue in normalizing between DOID and MeSH. Since MeSH is higher in the default priority order, standardization should typically map to it (assuming we have all the right xrefs). It looks like we already standardize nodes from Disgenet here:
https://github.com/gyorilab/indra_cogex/blob/main/src/indra_cogex/sources/disgenet/__init__.py#L65-L68
So there are two possible issues: (1) the processor was not re-run with the latest code, or (2) we are missing xrefs.

The standardization code seems to be working:

>>> from indra_cogex.representation import Node
>>> Node.standardized(db_ns='DOID', db_id='DOID:1240', labels=['BioEntity'])
(:BioEntity { id:'MESH:D007938', name:'Leukemia' })

so I suspect the issue is that the processor wasn't re-run, or another possibility (which would be fun) is if the processor calls

Node.standardized(db_ns='DOID', db_id='1240', labels=['BioEntity'])

without the DOID: prefix for the ID, which doesn't standardize correctly.
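If the missing prefix does turn out to be the cause, one way to guard against it would be to normalize the ID before it is passed to standardization. The following is only a sketch: `ensure_curie_prefix` is a hypothetical helper, not an existing indra_cogex function, and it assumes the namespace in question (DOID here) is one whose IDs are expected to embed the prefix. It should not be applied blindly to namespaces such as MESH whose IDs do not.

```python
def ensure_curie_prefix(db_ns: str, db_id: str) -> str:
    """Return db_id with the 'NS:' prefix embedded, adding it if missing.

    Hypothetical helper for namespaces (e.g., DOID) whose IDs are
    assumed to embed the prefix; the check is case-insensitive so
    'doid:1240' is also rewritten to 'DOID:1240'.
    """
    prefix = f"{db_ns}:"
    if db_id.upper().startswith(prefix.upper()):
        # Prefix already present: keep the local part, canonicalize the prefix.
        return prefix + db_id[len(prefix):]
    return prefix + db_id


print(ensure_curie_prefix("DOID", "1240"))
print(ensure_curie_prefix("DOID", "DOID:1240"))
```

Both calls yield the same `DOID:1240` string, so the processor could pass either form without changing what gets standardized.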
