
Improve chebi mapping #5

Closed
joewandy opened this issue Feb 4, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@joewandy
Member

joewandy commented Feb 4, 2021

Related to glasgowcompbio/GraphOmics#70

@kmcluskey has implemented some code to pull related ChEBI IDs from the CSV. We should incorporate it into this package (as an option when mapping) to get more hits.
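
For context, a rough sketch of what building a related-ChEBI lookup from such a file could look like, assuming a tab-separated relation table with INIT_ID/FINAL_ID columns along the lines of ChEBI's relation.tsv flat file (the actual file and column names used in @kmcluskey's code are not shown in this thread):

from collections import defaultdict

import pandas as pd

def load_chebi_relations(relation_file):
    # build a symmetric lookup: each ChEBI ID -> the IDs it is directly related to
    rel = pd.read_csv(relation_file, sep='\t', dtype=str)
    related = defaultdict(set)
    for init_id, final_id in zip(rel['INIT_ID'], rel['FINAL_ID']):
        related[init_id].add(final_id)
        related[final_id].add(init_id)
    return {chebi_id: sorted(ids) for chebi_id, ids in related.items()}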

joewandy added the enhancement label on Feb 4, 2021
@joewandy
Member Author

joewandy commented Mar 11, 2021

I'm thinking of https://github.com/glasgowcompbio/pyMultiOmics/blob/main/pyMultiOmics/mapping.py#L42-L45: add a parameter related=True. If True, run the code to retrieve related compounds and set the result as self.compound_df in place of the original df.
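
A hypothetical sketch of that proposal (the real constructor in mapping.py takes more arguments than shown here, and get_related_chebi is the helper fleshed out later in this thread):

def get_related_chebi(df):
    # placeholder: filled in later in the thread to pull related ChEBI IDs
    return df

class Mapper:
    def __init__(self, compound_df, related=False):
        self.compound_df = compound_df
        if related:
            # replace the original compound table with one extended by related ChEBI IDs
            self.compound_df = get_related_chebi(compound_df)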

@kmcluskey
Contributor

kmcluskey commented Mar 15, 2021

I'm not sure how this should be implemented. If the user gives KEGG IDs for the compounds, do you want to find the ChEBI ID for each KEGG ID and then all the related ChEBI IDs, or should this only run when ChEBI IDs are provided?

@joewandy
Member Author

joewandy commented Mar 23, 2021

On the related_chebi branch, I added an include_related_chebi parameter to the constructor of the Mapper class: https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/pyMultiOmics/mapping.py#L18.

Eventually that flag is passed to the Reactome mapping function. If set to True, we should find all the related ChEBI IDs linked to the input ChEBI IDs:

# get related chebi ids if necessary so we get more hits when mapping compounds
if include_related_chebi:
    observed_compound_df = get_related_chebi(observed_compound_df)

Could you help add the implementation inside the get_related_chebi method, please? Thanks. It's here:

def get_related_chebi(df):
    # TODO: replace df with another where related chebi ids have been pulled
    return df

@kmcluskey
Contributor

The code currently finds all rows whose chebi_ids have related chebi_ids. For each such chebi_id, all of its rows are selected, copied, and the Identifier replaced with the related ChEBI ID. The code then checks whether these new rows are already in the original DF; if not, they are appended to the original DF (same intensity rows, new related_chebi identifiers). For a large DF such as the one in FlyMet, which has duplicate rows of chebi_ids (same compound, different peak data) and 30,000 rows, this is a slow process. This procedure would not be run on the FlyMet DF, since all related chebi_ids have already been added there, but it is a useful test case. If this method is okay, I will implement it.
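
A minimal sketch of that row-copying procedure, assuming the relation lookup is a plain dict mapping each ChEBI ID to its related IDs (the get_chebi_relation_dict helper and the 'Identifier' index come from later comments in this thread; add_related_rows is just an illustrative name):

import pandas as pd

def add_related_rows(df, chebi_rel_dict):
    # df is indexed by ChEBI 'Identifier'; chebi_rel_dict maps an ID to a list of related IDs
    existing = set(df.index)
    new_rows = []
    for identifier in df.index.unique():
        for related_id in chebi_rel_dict.get(identifier, []):
            # only append rows whose identifier is not already in the original DF
            if related_id in existing:
                continue
            # copy all peak rows for this compound and swap in the related identifier
            copies = df.loc[[identifier]].copy()
            copies.index = pd.Index([related_id] * len(copies), name=df.index.name)
            new_rows.append(copies)
    if new_rows:
        df = pd.concat([df] + new_rows)
    return df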

@joewandy
Member Author

Added the notebook to https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/notebooks/analysis_zebrafish-related_chebi.ipynb

I did an alternative version, get_related_chebi_data_v2. What do you think? I guess functionally it's the same as yours. It loops through each row and checks whether there are related ChEBI IDs. If there are, it adds them to the results, but only if they don't already exist in the original dataframe.

def get_related_chebi_data_v2(cmpd_data):
    cmpd_data = cmpd_data.copy()
    
    # ensure index type is set to string, since get_chebi_relation_dict also returns string as the keys
    cmpd_data.index = cmpd_data.index.map(str)
    cmpd_data = cmpd_data.reset_index()
    original_cmpds = set(cmpd_data['Identifier']) # used for checking later

    # construct the related chebi dict
    chebi_rel_dict = get_chebi_relation_dict()    

    # loop through each row in cmpd_data
    with_related_data = []
    for ix, row in cmpd_data.iterrows():   
        
        # add the current row we're looping
        current_identifier = row['Identifier']
        with_related_data.append(row)

        # check if there are related compounds to add
        if current_identifier in chebi_rel_dict:

            # if yes, get the related compounds
            chebi_list = chebi_rel_dict[current_identifier]        
            for c in chebi_list:

                # add the related chebi, but only if it's not already present among the original compounds
                if c not in original_cmpds:
                    current_row = row.copy()
                    current_row['Identifier'] = c
                    with_related_data.append(current_row)

    # combine all the rows into a single dataframe
    df = pd.concat(with_related_data, axis=1).T
    df = df.set_index('Identifier')
    logger.info('Inserted %d related compounds' % (len(df) - len(cmpd_data)))    
    return df

I added the duplicate check as a separate function that can be called after the above (or even on its own if we want). It groups the rows of the dataframe by the 'Identifier' column. If there are multiple rows in the same group (same compound, different peak data), we keep the row with the largest sum of intensities across the entire row and delete the rest. Not sure if selecting on this criterion is the best thing to do, but let's go with it for now.

def remove_dupes(df):    
    df = df.reset_index()

    # group df by the 'Identifier' column
    to_delete = []
    grouped = df.groupby(df['Identifier'])
    for identifier, group_df in grouped:
        
        # if there are multiple rows sharing the same identifier
        if len(group_df) > 1: 

            # remove 'Identifier' column from the grouped df since it can't be summed
            group_df = group_df.drop('Identifier', axis=1)

            # find the row with the largest sum across the row in the group
            idxmax = group_df.sum(axis=1).idxmax()

            # mark all the rows in the group for deletion, except the one with the largest sum
            temp = group_df.index.tolist()
            temp.remove(idxmax)
            to_delete.extend(temp)

    # actually do the deletion here
    logger.info('Removing %d rows with duplicate identifiers' % (len(to_delete)))
    df = df.drop(to_delete)
    df = df.set_index('Identifier')
    return df
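
A quick sanity check of remove_dupes on a toy frame (assuming pandas and a module-level logger as in the snippet above; the ChEBI IDs here are just examples):

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# two peaks share the same ChEBI identifier; only the more intense one should survive
toy = pd.DataFrame({
    'Identifier': ['17234', '17234', '16828'],
    'sample_A': [100.0, 900.0, 50.0],
    'sample_B': [120.0, 800.0, 60.0],
}).set_index('Identifier')

deduped = remove_dupes(toy)
print(deduped)
# expected: one row for '17234' (the 900/800 peak) and one row for '16828'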

@kmcluskey
Contributor

I'm confused why you are removing these duplicates. PALS was set up to use all of the peaks that could potentially be a compound. Choosing the one with the largest sum of intensities doesn't make sense to me at all - why is that most likely to be the actual compound?

@joewandy
Member Author

joewandy commented Apr 1, 2021

Users are expected to provide clean input (one compound per row). The duplicate-removal check is only there as a fallback for when they fail to do that.

@joewandy
Member Author

Done by @kmcluskey
