Review normalization of terms #157

Open
bgyori opened this issue Feb 12, 2024 · 4 comments

@bgyori
Member

bgyori commented Feb 12, 2024

I found that

MATCH (n:BioEntity {name: "leukemia"}) RETURN n.id LIMIT 5

yields "doid:1240", "efo:0000565", and "mondo:0005059". In addition, we have two nodes named "Leukemia" (capitalized): "hp:0001909" and "mesh:D007938".

These nodes might appear just due to ontology imports (which would be fine), but I suspect that they are actually involved in distinct relations without being normalized, leading to fragmentation.

@cthoyt
Member

cthoyt commented Feb 14, 2024

MATCH p=(n:BioEntity)-[r]-() 
WHERE toLower(n.name) = 'leukemia' 
RETURN n.id, r.source, count(r.source)

gives

| n.id | r.source | count(r.source) |
|---|---|---|
| "hp:0001909" | null | 0 |
| "hp:0001909" | "gilda" | 1 |
| "doid:1240" | "disgenet" | 61 |
| "doid:1240" | null | 0 |
| "doid:1240" | "gilda" | 1 |
| "efo:0000565" | null | 0 |
| "efo:0000565" | "gilda" | 1 |
| "mesh:D007938" | null | 0 |
| "mesh:D007938" | "sider_side_effects" | 29 |
| "mesh:D007938" | "gilda" | 2 |
| "mesh:D007938" | "chembl" | 262 |
| "mondo:0005059" | null | 0 |

I double-checked that all of the "null" sources for doid, efo, and mondo come from the ontology hierarchies (it would be good to require source annotations for all edges as well). The "gilda" edges are xrefs.
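To make the fragmentation easier to see, here is a rough sketch that takes the rows from the query result above and keeps only the edges contributed by data sources. The rows are copied from the result; the helper name `data_sources_by_node` and the choice to treat "gilda" as xref-only are assumptions for illustration, not part of indra_cogex.

```python
from collections import defaultdict

# (node id, edge source, edge count) rows copied from the query result above.
# Rows with a null source (ontology hierarchy edges) are left out.
ROWS = [
    ("hp:0001909", "gilda", 1),
    ("doid:1240", "disgenet", 61),
    ("doid:1240", "gilda", 1),
    ("efo:0000565", "gilda", 1),
    ("mesh:D007938", "sider_side_effects", 29),
    ("mesh:D007938", "gilda", 2),
    ("mesh:D007938", "chembl", 262),
]

# Assumption: "gilda" edges are xrefs only, so they do not count as data usage.
XREF_SOURCES = {"gilda"}


def data_sources_by_node(rows):
    """Group non-xref edge counts by node id (hypothetical helper)."""
    usage = defaultdict(dict)
    for node_id, source, count in rows:
        if source not in XREF_SOURCES:
            usage[node_id][source] = count
    return dict(usage)


usage = data_sources_by_node(ROWS)
for node_id, sources in sorted(usage.items()):
    print(node_id, sources)
```

If more than one node id shows up here for the same concept, data edges are split across identifiers. With these rows, that is the case for doid:1240 and mesh:D007938.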

@bgyori
Member Author

bgyori commented Feb 14, 2024

I see, so we don't have any issues with EFO and MONDO. But mesh:D007938 will almost certainly show up from other sources (the name there is capitalized as Leukemia), so we could run a query for that as well.

@cthoyt
Member

cthoyt commented Feb 14, 2024

I just updated the chart above. We'll want to follow up by checking that sider, chembl, and disgenet are all standardized the same way.

@bgyori
Member Author

bgyori commented Feb 14, 2024

Great, so this reveals that we have an issue in normalizing between DOID and MeSH. Since MeSH is higher in the default priority order, standardization should typically map to it (assuming we have all the right xrefs). It looks like here:
https://github.com/gyorilab/indra_cogex/blob/main/src/indra_cogex/sources/disgenet/__init__.py#L65-L68 we already standardize nodes from Disgenet. So there are two possible issues: (1) the processor was not re-run with the latest code, or (2) we are missing xrefs.
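As a rough illustration of the priority-order idea and of how issue (2) would show up, here is a minimal sketch. It is not the actual standardization code: `PRIORITY`, `XREFS`, and `standardize` are hypothetical names, and the only facts taken from this thread are that MeSH outranks DOID and that DOID:1240 corresponds to MeSH D007938.

```python
# Hypothetical priority order: earlier namespaces win. Only "MeSH outranks
# DOID" comes from the discussion; everything else is illustrative.
PRIORITY = ["MESH", "DOID"]

# Hypothetical xref table mapping an identifier to its known equivalents.
XREFS = {
    ("DOID", "DOID:1240"): [("MESH", "D007938")],
}


def standardize(db_ns, db_id, xrefs=XREFS, priority=PRIORITY):
    """Pick the highest-priority identifier among the input and its xrefs."""
    candidates = [(db_ns, db_id)] + xrefs.get((db_ns, db_id), [])

    def rank(pair):
        ns = pair[0]
        # Namespaces not in the priority list rank last.
        return priority.index(ns) if ns in priority else len(priority)

    return min(candidates, key=rank)


print(standardize("DOID", "DOID:1240"))            # ('MESH', 'D007938')
print(standardize("DOID", "DOID:1240", xrefs={}))  # ('DOID', 'DOID:1240')
```

The second call shows what a missing xref would look like under this scheme: the node silently keeps its DOID identifier instead of being mapped to MeSH.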

The standardization code seems to be working:

> from indra_cogex.representation import Node
> Node.standardized(db_ns='DOID', db_id='DOID:1240', labels=['BioEntity'])
(:BioEntity { id:'MESH:D007938', name:'Leukemia' })

so I suspect the issue is that the processor wasn't re-run. Another possibility (which would be fun) is that the processor calls

Node.standardized(db_ns='DOID', db_id='1240', labels=['BioEntity'])

without the DOID: prefix for the ID, which doesn't standardize correctly.
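The prefix failure mode can be sketched like this, assuming a lookup keyed on the prefixed form of the ID. `DOID_TO_MESH`, `ensure_doid_prefix`, and `to_mesh` are hypothetical and only stand in for whatever mapping the real standardization uses.

```python
# Hypothetical lookup table keyed on the prefixed form of the DOID identifier.
DOID_TO_MESH = {"DOID:1240": "D007938"}


def ensure_doid_prefix(db_id):
    """Return the identifier with a leading 'DOID:' prefix, adding it if absent."""
    return db_id if db_id.startswith("DOID:") else f"DOID:{db_id}"


def to_mesh(db_id, normalize_prefix=False):
    """Map a DOID identifier to MeSH; return None when no mapping is found."""
    key = ensure_doid_prefix(db_id) if normalize_prefix else db_id
    return DOID_TO_MESH.get(key)


print(to_mesh("DOID:1240"))                    # D007938
print(to_mesh("1240"))                         # None: the bare ID misses the lookup
print(to_mesh("1240", normalize_prefix=True))  # D007938
```

If the Disgenet processor passes bare IDs, a guard like `ensure_doid_prefix` before standardizing would be one way to rule this out.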
