
Dask integration #8

Open
bendruitt opened this issue Nov 8, 2017 · 3 comments

@bendruitt

Much like your idea for pyspark integration, I would like to see similar support for passing in a dask client, as the dask-xgboost library supports. I have had initial success reducing high-dimensional data with the BoostARoota library, but the bottleneck is the initial load of the parquet file repository. I'll offer what assistance I can with this work.

Ben.

@chasedehan
Owner

I have never used dask before, but have been wanting to look into it. This gives me a reason to! I'll start looking into setup and usage, but might reach back out to you for assistance. Feel free to send me an email: chasedehan at yahoo dot com

@chasedehan
Owner

Just an update: I have gotten dask and dask-xgboost working both locally and on a cluster, but I will need to do some work on the shadow feature creation. I thought I would be able to just drop in dxgb.train() along with Client(), but I am doing all the feature work under the hood with pandas. The dask dataframe is slightly different; it doesn't look too hard, but it might take me a few days to work out how it will fit in with the rest of the package. (I really want to avoid bloat in the main functionality.)

For example, this is one of the helper functions I need to rework:

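import numpy as np
import pandas as pd

def _create_shadow(x_train):
    # Copy the features so the original frame is left untouched
    x_shadow = x_train.copy()
    # Shuffle each column independently; assigning a permutation back avoids
    # relying on in-place mutation of the array returned by .values
    for c in x_shadow.columns:
        x_shadow[c] = np.random.permutation(x_shadow[c].values)
    # rename the shadow
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # Combine to make one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names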

chasedehan self-assigned this Nov 15, 2017
@jonimatix

Hello,
Is there any update on this feature? Would be great as it would speed up processing even more.
