Remove highly co-linear (or equivalent confounds) #312

Open
jdkent opened this issue Apr 18, 2020 · 0 comments

Is your feature request related to a problem? Please describe.
It can look concerning to users when the design matrix is reported as "singular" and needs to be "regularized".
Equivalent confounds can be removed beforehand, so that a "singular" matrix would only refer to (near) equivalence between task regressors, or between a task regressor and a confound.
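
To illustrate the problem, here is a minimal sketch with a made-up design matrix (the column values and variable names are assumptions for illustration only). Two identical confound columns make the matrix rank-deficient, and dropping one of them restores full column rank without losing information:

```python
import numpy as np

# Toy design matrix: one task regressor, one confound, and an intercept.
# All values are made up for illustration.
task = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
confound = np.array([0.1, 0.3, 0.2, 0.5, 0.4, 0.6])
intercept = np.ones(6)

# The same confound entered twice versus entered once.
with_duplicate = np.column_stack([task, confound, confound, intercept])
without_duplicate = np.column_stack([task, confound, intercept])

rank_with = np.linalg.matrix_rank(with_duplicate)
rank_without = np.linalg.matrix_rank(without_duplicate)

# A matrix is rank-deficient ("singular" in the sense above) when its
# rank is smaller than its number of columns.
print(with_duplicate.shape[1], rank_with)
print(without_duplicate.shape[1], rank_without)
```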

Describe the solution you'd like
Automatic detection and deletion of duplicate columns (with a helpful warning raised)
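
A rough sketch of what this could look like, assuming the design matrix is a pandas DataFrame of numeric columns with unique names. The function name `drop_duplicate_columns`, the `protected` parameter, and the example column names are hypothetical and not part of any existing API:

```python
import warnings

import numpy as np
import pandas as pd


def drop_duplicate_columns(design, protected=()):
    """Drop columns that are numerically equal to an earlier kept column.

    Columns named in `protected` (e.g. task regressors) are never dropped.
    Returns the reduced DataFrame and a {dropped: kept} mapping.
    """
    keep = []
    dropped = {}
    for name in design.columns:
        # Find the first kept column this one matches, if any.
        # Assumption: NaNs in the same positions count as equal.
        match = next(
            (k for k in keep
             if np.allclose(design[k], design[name], equal_nan=True)),
            None,
        )
        if match is not None and name not in protected:
            dropped[name] = match
        else:
            keep.append(name)
    if dropped:
        pairs = ", ".join(f"{dup} (same as {orig})"
                          for dup, orig in dropped.items())
        warnings.warn(f"Dropping duplicate design matrix columns: {pairs}")
    return design[keep], dropped


# Made-up example data.
design = pd.DataFrame({
    "task": [0.0, 1.0, 0.0, 1.0],
    "csf": [0.1, 0.2, 0.3, 0.4],
    "csf_copy": [0.1, 0.2, 0.3, 0.4],
})
cleaned, dropped = drop_duplicate_columns(design, protected=["task"])
print(list(cleaned.columns), dropped)
```

Because protected columns are always kept, a task regressor that duplicates a confound would still leave the matrix singular, which matches the intent that the remaining "singular" cases are the ones worth flagging.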

Describe alternatives you've considered
Keep all columns or allow user to decide

Additional context
The following code should be useful for implementing this feature:

import numpy as np


def get_duplicate_columns(df):
    '''
    Get a list of duplicate columns.
    Iterates over all the columns in the dataframe and finds the columns
    whose contents duplicate an earlier column. Columns are compared with
    np.isclose, so numeric data is assumed.
    :param df: Dataframe object
    :return: List of columns whose contents are duplicates.
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns in dataframe
    for x in range(df.shape[1]):
        # Select column at xth index.
        col = df.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index.
            otherCol = df.iloc[:, y]
            # Check if the two columns at the x and y indices are (nearly) equal
            if np.all(np.isclose(col, otherCol)):
                # Only the later column of each matching pair is reported
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)
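
The function above only catches columns that are (nearly) identical. Since the title also mentions highly co-linear confounds, here is a hedged sketch of one way to flag those, using pairwise Pearson correlation. The function name `get_collinear_columns`, the `threshold` default, and the example column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd


def get_collinear_columns(df, threshold=0.999):
    """Return names of columns whose absolute Pearson correlation with
    an earlier column is at or above `threshold`."""
    # Constant columns (e.g. an intercept) have zero variance and yield
    # NaN correlations; silence the resulting floating-point warnings.
    with np.errstate(all="ignore"):
        corr = np.abs(np.corrcoef(df.to_numpy(dtype=float), rowvar=False))
    flagged = []
    for y in range(1, df.shape[1]):
        # NaN >= threshold is False, so constant columns are never flagged.
        if np.any(corr[:y, y] >= threshold):
            flagged.append(df.columns[y])
    return flagged


# Made-up example: the second column is a scaled copy of the first.
df = pd.DataFrame({
    "trans_x": [0.1, 0.4, 0.2, 0.8, 0.5],
    "trans_x_scaled": [0.2, 0.8, 0.4, 1.6, 1.0],
    "rot_z": [0.3, 0.1, 0.9, 0.2, 0.7],
})
print(get_collinear_columns(df))
```

This is a pairwise check only: a column that is a linear combination of two or more other columns would not be caught, and a rank-based test would be needed for that case.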

source
