Identical gene names in different official gene sets lead to an omission of these genes #37

Open
cmayer opened this issue Mar 18, 2021 · 0 comments


Especially in vertebrate genome projects, it is common for the same gene name to be used in different species, but Orthograph does not seem to handle this.

When a first OGS is loaded into Orthograph, Orthograph reports the number of sequences it read successfully. For the first set this is always the expected number. If a second set is then loaded that contains gene names identical to gene names in the first set, the number of sequences Orthograph reports as entered into the database is smaller than the total number of genes present in the second set.

It seems that Orthograph does not include these genes and silently ignores them. This would be fatal for the functionality. There should at least be a prominent warning when this happens.
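Until such a warning exists, the collision can be checked outside Orthograph before loading. The following is a minimal sketch, not part of Orthograph: `fasta_ids` and `shared_ids` are hypothetical helper names, and it assumes the gene name is the first whitespace-delimited token of each FASTA header (adjust this if the whole header line is used as the identifier).

```python
from collections import defaultdict


def fasta_ids(lines):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                yield header.split()[0]


def shared_ids(gene_sets):
    """Return {gene_id: [set names]} for IDs that occur in more than one set.

    gene_sets maps a label for each OGS (hypothetical, e.g. a species code)
    to an iterable of lines, such as an open file handle.
    """
    seen = defaultdict(set)
    for name, lines in gene_sets.items():
        for gene_id in fasta_ids(lines):
            seen[gene_id].add(name)
    return {g: sorted(names) for g, names in seen.items() if len(names) > 1}
```

Passing open file handles for each OGS file (for example `shared_ids({"HSAP": open(path_a), "MMUS": open(path_b)})`, with hypothetical labels and paths) returns the gene names that would collide. An empty result means the sets can be loaded as they are.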

The problem could be solved by adding an OGS identifier to the gene names used by Orthograph internally.

Workaround:
For genome projects in which gene names might be identical across official gene sets, rename the genes in the OGS files and in the corresponding tab-delimited files, e.g. by prepending a species identifier to every gene name, so that the gene names are unique across the whole set of OGSs.
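The renaming can be scripted. The following is a rough sketch under stated assumptions, not an Orthograph tool: `prefix_fasta` and `prefix_table` are hypothetical helpers, the `_` separator is arbitrary, and the column positions of the gene name and taxon name in the tab-delimited file (`gene_col`, `taxon_col`) are assumptions that must be checked against the actual file.

```python
def prefix_fasta(lines, prefix, sep="_"):
    """Prepend prefix + sep to every FASTA header; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + sep + line[1:]
        else:
            yield line


def prefix_table(lines, prefixes, gene_col=1, taxon_col=2, sep="_"):
    """Prepend a per-taxon prefix to the gene-name column of a tab-delimited file.

    prefixes maps taxon name -> prefix. Rows whose taxon is not in the mapping,
    or that have too few columns, are written back unchanged.
    The default column indices (0-based) are assumptions.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > max(gene_col, taxon_col):
            prefix = prefixes.get(fields[taxon_col])
            if prefix is not None:
                fields[gene_col] = prefix + sep + fields[gene_col]
        yield "\t".join(fields) + "\n"
```

The same prefix must be used for a species in both its OGS file and the table, otherwise the gene names in the two files no longer match.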

This problem should affect at least all vertebrate OGSs.
