This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Can we use array extensions to speed up string processing with dask #36

Open
birdsarah opened this issue Mar 11, 2019 · 13 comments
Labels
good first issue Good for newcomers research question Outstanding questions that have not been investigated yet.

Comments

@birdsarah
Contributor

Much of the analysis done on this dataset uses dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.

Unfortunately, array extensions can't be serialized (see pandas-dev/pandas#20612).

https://github.com/xhochy/fletcher is an array extension that adds string processing functionality.

Using some standard tasks for this dataset, such as collecting all domains or counting the number of script domains per location domain, compare the performance of Spark, dask, and dask with extension arrays.
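For concreteness, the two tasks named above can be sketched in plain Python as a reference for what each engine would need to compute. This is only an illustration under assumptions: the column names `script_url` and `location`, the sample rows, and the use of the URL hostname as the "domain" are not taken from the dataset itself, and "all domains" is read here as all script domains.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample rows; column names and values are assumptions for illustration.
rows = [
    {"location": "https://news.example.com/story", "script_url": "https://cdn.example.net/a.js"},
    {"location": "https://news.example.com/about", "script_url": "https://ads.example.org/t.js"},
    {"location": "https://shop.example.com/cart", "script_url": "https://cdn.example.net/b.js"},
]


def hostname(url):
    """Return the lowercase hostname of a URL, or "" if it has none."""
    return (urlparse(url).hostname or "").lower()


def collect_domains(rows):
    """Task 1: the set of all script hostnames."""
    return {hostname(r["script_url"]) for r in rows}


def script_domains_per_location(rows):
    """Task 2: number of distinct script hostnames per location hostname."""
    seen = defaultdict(set)
    for r in rows:
        seen[hostname(r["location"])].add(hostname(r["script_url"]))
    return {loc: len(scripts) for loc, scripts in seen.items()}
```

Each engine in the comparison would run the equivalent of `collect_domains` and `script_domains_per_location` over the full dataset, so timings are comparable only if they produce the same results as this reference.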

@birdsarah birdsarah added good first issue Good for newcomers research question Outstanding questions that have not been investigated yet. labels Mar 11, 2019
@Muskan284

@birdsarah, I am an Outreachy applicant. Can I work on this issue?
I have done text pre-processing using Python and pandas for my earlier projects and would love to contribute to this issue.

@birdsarah
Contributor Author

Of course! No need to ask. I will upload a notebook tomorrow that contains the kind of text processing I'm trying to speed up.

@birdsarah
Contributor Author

I have added a notebook here: https://github.com/mozilla/overscripted/blob/master/analyses/issue_36.ipynb with some examples of string processing that I often find myself doing.

@Muskan284

Thanks @birdsarah, looking into it.

@shruthi0898

Hi,
Can I contribute to this as well?
Shruthi

@srujana121

@birdsarah dask is performing much worse than numpy. Any idea why?

@srujana121

@birdsarah also where should I send you the notebooks?

Aimaanhasan pushed a commit to Aimaanhasan/overscripted that referenced this issue Mar 18, 2019
Analysis on efficiency and usage of extension arrays in dask

Issue mozilla#36
@birdsarah
Contributor Author

@srujana121

> @birdsarah also where should I send you the notebooks?

Instructions for submitting a PR are on the landing page: https://github.com/mozilla/overscripted - please ask questions like this on the chat and keep issue questions

> @birdsarah dask is performing much worse than numpy. Any idea why?

I could speculate wildly but that wouldn't be productive. I suggest a PR with concrete examples documenting your observation of "dask performing much worse than numpy."

@birdsarah
Contributor Author

@shruthi0898 there's no need to ask permission to work on an issue.

@mozilla mozilla deleted a comment from nandinibansal2000 Mar 20, 2019
@mozilla mozilla deleted a comment from nandinibansal2000 Mar 20, 2019
Aimaanhasan added a commit to Aimaanhasan/overscripted that referenced this issue Mar 20, 2019
Added link to the fletcher docs and gave an example for usability. 
Relocated the analyses for readability

Issue mozilla#36
@SubbulakshmiRS

@birdsarah Can I also make a commit, or is it almost over? I did some research and found that, along with fletcher, we can also use map_partitions and similar methods to improve dask's performance.
https://medium.com/mindorks/speeding-up-text-pre-processing-using-dask-45cc3ede1366

@birdsarah
Contributor Author

The deadline is April 2. Please join on chat https://gitter.im/overscripted-discuss/community to read about this kind of information, and to ask questions. I look forward to seeing your contribution @SubbulakshmiRS

@birdsarah
Contributor Author

Some notes from elsewhere:

To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out. Here's some pseudocode for how I think you change a dask string column to a dask fletcher column:

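import fletcher as fr  # assumed alias; this import was missing from the snippet
import pyarrow as pa

# df is a dask DataFrame and col is the name of one of its string columns
fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[col] = df[col].astype(fletcher_string_dtype)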

I've just been reading this which gives some context for fletcher so I thought I'd share: https://www.dataschool.io/future-of-pandas/ and the linked video: https://www.youtube.com/watch?v=tvmX8YAFK80

Aimaanhasan pushed a commit to Aimaanhasan/overscripted that referenced this issue Apr 2, 2019
@xhochy

xhochy commented May 10, 2019

@birdsarah Yes, my Fletcher documentation is very sparse, but always feel free to open questions as issues on the repo or reach out in another way. I'm happy to help with its progress. I'm currently missing a use case to work on Fletcher, but I'm happy to support anyone who has a use case and wants to work on it.

6 participants