
Improve chebi mapping #5

Closed
joewandy opened this issue Feb 4, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@joewandy
Member

joewandy commented Feb 4, 2021

Related to glasgowcompbio/GraphOmics#70

@kmcluskey has implemented some code to pull related ChEBI IDs from the CSV. We should incorporate it into this package (as an option when mapping) to get more hits.
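
For context, a rough sketch of what building a related-ChEBI lookup from such a file could look like, assuming a tab-separated relation table with INIT_ID/FINAL_ID columns along the lines of ChEBI's relation.tsv flat file (the actual file and column names used in @kmcluskey's code are not shown in this thread):

from collections import defaultdict

import pandas as pd

def load_chebi_relations(relation_file):
    # build a symmetric lookup: each ChEBI ID -> the IDs it is directly related to
    rel = pd.read_csv(relation_file, sep='\t', dtype=str)
    related = defaultdict(set)
    for init_id, final_id in zip(rel['INIT_ID'], rel['FINAL_ID']):
        related[init_id].add(final_id)
        related[final_id].add(init_id)
    return {chebi_id: sorted(ids) for chebi_id, ids in related.items()}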

joewandy added the enhancement label on Feb 4, 2021
@joewandy
Member Author

joewandy commented Mar 11, 2021

I'm thinking of https://github.com/glasgowcompbio/pyMultiOmics/blob/main/pyMultiOmics/mapping.py#L42-L45: add a parameter related=True. If True, run the code to retrieve related compounds and set the result as self.compound_df in place of the original df.
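
A hypothetical sketch of that proposal (the real constructor in mapping.py takes more arguments than shown here, and get_related_chebi is the helper fleshed out later in this thread):

def get_related_chebi(df):
    # placeholder: filled in later in the thread to pull related ChEBI IDs
    return df

class Mapper:
    def __init__(self, compound_df, related=False):
        self.compound_df = compound_df
        if related:
            # replace the original compound table with one extended by related ChEBI IDs
            self.compound_df = get_related_chebi(compound_df)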

@kmcluskey
Contributor

kmcluskey commented Mar 15, 2021

I'm not sure how this should be implemented. If the user gives KEGG IDs for the compounds, do you want to find the ChEBI ID for each KEGG ID and then all the related ChEBI IDs, or should this only run when ChEBI IDs are provided?

@joewandy
Member Author

joewandy commented Mar 23, 2021

On the related_chebi branch, I added an include_related_chebi parameter to the constructor of the Mapper class: https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/pyMultiOmics/mapping.py#L18.

Eventually that flag is passed to the Reactome mapping function. If set to True, we should find all the related ChEBI IDs linked to the input ChEBI IDs:

# get related chebi ids if necessary so we get more hits when mapping compounds
if include_related_chebi:
    observed_compound_df = get_related_chebi(observed_compound_df)

Could you help add the implementation inside the get_related_chebi method, please? Thanks. It's here:

def get_related_chebi(df):
    # TODO: replace df with another where related chebi ids have been pulled
    return df

@kmcluskey
Contributor

The code currently finds all rows whose chebi_ids have related chebi_ids. For each such chebi_id, all of its rows are selected, copied, and the Identifier replaced with the related ChEBI ID. The code then checks whether these new rows are already in the original DF; if not, they are appended to the original DF (same intensity rows, new related_chebi identifiers). For a large DF such as the one in FlyMet, which has duplicate rows of chebi_ids (same compound, different peak data) and 30,000 rows, this is a slow process. This procedure would not be run on the FlyMet DF, since all related chebi_ids have already been added there, but it is a useful test case. If this method is okay, I will implement it.
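
A minimal sketch of that row-copying procedure, assuming the relation lookup is a plain dict mapping each ChEBI ID to its related IDs (the get_chebi_relation_dict helper and the 'Identifier' index come from later comments in this thread; add_related_rows is just an illustrative name):

import pandas as pd

def add_related_rows(df, chebi_rel_dict):
    # df is indexed by ChEBI 'Identifier'; chebi_rel_dict maps an ID to a list of related IDs
    existing = set(df.index)
    new_rows = []
    for identifier in df.index.unique():
        for related_id in chebi_rel_dict.get(identifier, []):
            # only append rows whose identifier is not already in the original DF
            if related_id in existing:
                continue
            # copy all peak rows for this compound and swap in the related identifier
            copies = df.loc[[identifier]].copy()
            copies.index = pd.Index([related_id] * len(copies), name=df.index.name)
            new_rows.append(copies)
    if new_rows:
        df = pd.concat([df] + new_rows)
    return df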

@joewandy
Member Author

Added the notebook to https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/notebooks/analysis_zebrafish-related_chebi.ipynb

I did an alternative version, get_related_chebi_data_v2. What do you think? I guess functionally it's the same as yours. It loops through each row and checks whether there are related ChEBI IDs. If there are, it adds them to the results, but only if they don't already exist in the original dataframe.

def get_related_chebi_data_v2(cmpd_data):
    cmpd_data = cmpd_data.copy()
    
    # ensure index type is set to string, since get_chebi_relation_dict also returns string as the keys
    cmpd_data.index = cmpd_data.index.map(str)
    cmpd_data = cmpd_data.reset_index()
    original_cmpds = set(cmpd_data['Identifier']) # used for checking later

    # construct the related chebi dict
    chebi_rel_dict = get_chebi_relation_dict()    

    # loop through each row in cmpd_data
    with_related_data = []
    for ix, row in cmpd_data.iterrows():   
        
        # add the current row we're looping
        current_identifier = row['Identifier']
        with_related_data.append(row)

        # check if there are related compounds to add
        if current_identifier in chebi_rel_dict:

            # if yes, get the related compounds
            chebi_list = chebi_rel_dict[current_identifier]        
            for c in chebi_list:

                # add the related chebi, but only if it's not already present among the original compounds
                if c not in original_cmpds:
                    current_row = row.copy()
                    current_row['Identifier'] = c
                    with_related_data.append(current_row)

    # combine all the rows into a single dataframe
    df = pd.concat(with_related_data, axis=1).T
    df = df.set_index('Identifier')
    logger.info('Inserted %d related compounds' % (len(df) - len(cmpd_data)))    
    return df

I added the duplicate check as a separate function that can be called after the above (or even on its own if we want). It groups the rows of the dataframe by the 'Identifier' column. If there are multiple rows in the same group (same compound, different peak data), we keep the row with the largest sum of intensities across the entire row and delete the rest. Not sure if selecting on this criterion is the best thing to do, but let's go with it for now.

def remove_dupes(df):    
    df = df.reset_index()

    # group df by the 'Identifier' column
    to_delete = []
    grouped = df.groupby(df['Identifier'])
    for identifier, group_df in grouped:
        
        # if there are multiple rows sharing the same identifier
        if len(group_df) > 1: 

            # remove 'Identifier' column from the grouped df since it can't be summed
            group_df = group_df.drop('Identifier', axis=1)

            # find the row with the largest sum across the row in the group
            idxmax = group_df.sum(axis=1).idxmax()

            # mark all the rows in the group for deletion, except the one with the largest sum
            temp = group_df.index.tolist()
            temp.remove(idxmax)
            to_delete.extend(temp)

    # actually do the deletion here
    logger.info('Removing %d rows with duplicate identifiers' % (len(to_delete)))
    df = df.drop(to_delete)
    df = df.set_index('Identifier')
    return df
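
A quick sanity check of remove_dupes on a toy frame (assuming pandas and a module-level logger as in the snippet above; the ChEBI IDs here are just examples):

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# two peaks share the same ChEBI identifier; only the more intense one should survive
toy = pd.DataFrame({
    'Identifier': ['17234', '17234', '16828'],
    'sample_A': [100.0, 900.0, 50.0],
    'sample_B': [120.0, 800.0, 60.0],
}).set_index('Identifier')

deduped = remove_dupes(toy)
print(deduped)
# expected: one row for '17234' (the 900/800 peak) and one row for '16828'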

@kmcluskey
Contributor

I'm confused why you are removing these duplicates. PALS was set up to use all of the peaks that could potentially be a compound. Choosing the one with the largest sum of intensities doesn't make sense to me at all - why is that most likely to be the actual compound?

@joewandy
Member Author

joewandy commented Apr 1, 2021

Users are expected to provide clean input (one compound per row). The duplicate-removal check is only there as a fallback for when they fail to do that.

@joewandy
Member Author

Done by @kmcluskey
