Improve chebi mapping #5
I'm thinking of https://github.com/glasgowcompbio/pyMultiOmics/blob/main/pyMultiOmics/mapping.py#L42-L45: add a parameter there.
I'm not sure how this should be implemented. If the user gives KEGG IDs for the compounds, do you want to find the ChEBI ids for the KEGG IDs and then all the related ChEBI ids, or should this only run when ChEBI IDs are provided?
Working on the related_chebi branch, I added this parameter. Eventually that flag is passed to the Reactome mapping function. If set to True, then we should find all the related ChEBI ids that are linked to the input ChEBI ids (pyMultiOmics/functions.py, lines 72 to 74 in e5f64e3).
Could you help to add the implementation inside pyMultiOmics/functions.py, lines 308 to 310 in e5f64e3?
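A minimal sketch of how the flag could be threaded down to the mapping call. The names `include_related_chebi` and `_expand_with_related` are illustrative, not the actual pyMultiOmics API:

```python
import pandas as pd

def _expand_with_related(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: real code would look up related ChEBI ids here
    # and append extra rows for them.
    return df

def reactome_mapping(df: pd.DataFrame, include_related_chebi: bool = False) -> pd.DataFrame:
    # When the flag is set, expand the input rows before mapping.
    if include_related_chebi:
        df = _expand_with_related(df)
    return df  # real code would continue with the Reactome query

df = pd.DataFrame({'Identifier': ['CHEBI:15377'], 'peak_1': [10.0]})
mapped = reactome_mapping(df, include_related_chebi=True)
```

Defaulting the flag to False keeps existing callers unaffected.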
The code currently finds all rows whose chebi_ids have related chebi_ids. For each such chebi_id, all matching rows are selected and copied, and the Identifier is replaced with the related ChEBI id. The code then checks whether these new rows are already in the original DF; if they are not, the rows are appended to the original DF (same intensity rows, new related-chebi identifiers). For a large DF such as the one in FlyMet, which has duplicate chebi_id rows (same compound, different peak data) and 30,000 rows, this is a slow process. This procedure would not be run on the FlyMet DF, since all related chebi_ids have already been added there, but it is a useful test case. If this method is okay, I will implement it.
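The procedure above could be sketched as follows, assuming a `related_chebi` column holding comma-separated related ids alongside the `Identifier` column (both column names are assumptions):

```python
import pandas as pd

def add_related_rows(df: pd.DataFrame) -> pd.DataFrame:
    new_rows = []
    # rows that actually carry related ChEBI ids
    has_related = df[df['related_chebi'].notna() & (df['related_chebi'] != '')]
    for _, row in has_related.iterrows():
        for rel in str(row['related_chebi']).split(','):
            rel = rel.strip()
            # only append if this identifier is not already present
            if rel not in df['Identifier'].values:
                copy = row.copy()
                copy['Identifier'] = rel  # same intensities, new identifier
                new_rows.append(copy)
    if new_rows:
        df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
    return df

df = pd.DataFrame({
    'Identifier': ['CHEBI:15377', 'CHEBI:17234'],
    'related_chebi': ['CHEBI:5585', ''],
    'peak_1': [10.0, 20.0],
})
expanded = add_related_rows(df)
```

The row-by-row `iterrows` loop mirrors the slowness described above; a vectorized version (splitting `related_chebi` and using `DataFrame.explode`) would scale better to 30,000-row frames.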
Added the notebook at https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/notebooks/analysis_zebrafish-related_chebi.ipynb. I did an alternative version of the implementation above.
I put the duplicate check into a separate function that can be called after the above (or even separately if we want). It groups rows in the dataframe by the 'Identifier' column. If there are multiple rows in the same group (same compound, different peak data), then we select the row with the largest sum of intensities across the entire row and delete the rest. I'm not sure selecting on this criterion is the best thing to do, but let's go with it for now.
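The duplicate check described above could look something like this sketch, where the choice of numeric columns as the intensity columns is an assumption:

```python
import pandas as pd

def drop_duplicate_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    # Treat all numeric columns as intensities (an assumption).
    intensity_cols = df.select_dtypes('number').columns
    totals = df[intensity_cols].sum(axis=1)
    # Within each 'Identifier' group, idxmax picks the row whose
    # intensities sum highest; all other rows in the group are dropped.
    keep = totals.groupby(df['Identifier']).idxmax()
    return df.loc[sorted(keep)].reset_index(drop=True)

df = pd.DataFrame({
    'Identifier': ['CHEBI:15377', 'CHEBI:15377', 'CHEBI:17234'],
    'peak_1': [5.0, 50.0, 7.0],
    'peak_2': [1.0, 2.0, 3.0],
})
deduped = drop_duplicate_identifiers(df)  # keeps the 50.0 row for CHEBI:15377
```

Keeping it as a standalone function, as suggested, means it can be applied independently of the related-ChEBI expansion.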
I'm confused about why you are removing these duplicates. PALS was set up to use all of the peaks that could potentially be a compound. Choosing the one with the largest sum of intensities doesn't make sense to me at all: why is that the most likely to be the actual compound?
Users are expected to provide clean input (one compound per row). The duplicate-removal check is only there for when users fail to do that.
Done by @kmcluskey |
Related to glasgowcompbio/GraphOmics#70
@kmcluskey has implemented some code to pull related ChEBI ids from the CSV. We should incorporate it into this package (as an option when mapping) to get more hits.