Can we use array extensions to speed up string processing with dask #36
Comments
@birdsarah, I am an Outreachy applicant. Can I work on this issue? |
Of course! No need to ask. I will upload a notebook tomorrow that contains the kind of text processing I'm trying to speed up. |
I have added a notebook here: https://github.com/mozilla/overscripted/blob/master/analyses/issue_36.ipynb with some examples of string processing that I often find myself doing. |
Thanks @birdsarah, looking into it. |
Hi, |
@birdsarah dask is performing much worse than numpy. Any idea why? |
@birdsarah also where should I send you the notebooks? |
Analysis on efficiency and usage of extension arrays in dask Issue mozilla#36
instructions for submitting a PR are on the landing page: https://github.com/mozilla/overscripted - please ask questions like this on the chat and keep issue questions
I could speculate wildly but that wouldn't be productive. I suggest a PR with concrete examples documenting your observation of "dask performing much worse than numpy." |
@shruthi0898 there's no need to ask permission to work on an issue. |
Added link to the fletcher docs and gave an example for usability. Relocated the analyses for readability Issue mozilla#36
@birdsarah Can I also make a commit, or is it almost over? I had researched and found that with fletcher we can also use map_partitions etc. to improve dask itself. |
The deadline is April 2. Please join on chat https://gitter.im/overscripted-discuss/community to read about this kind of information, and to ask questions. I look forward to seeing your contribution @SubbulakshmiRS |
Some notes from elsewhere: To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out. Here's some pseudocode for how I think you change a dask string column to a dask fletcher column:

    import fletcher as fr
    import pyarrow as pa

    # Build a fletcher extension dtype backed by Arrow strings
    fletcher_string_dtype = fr.FletcherDtype(pa.string())
    # Cast the existing string column to the fletcher dtype
    df[col] = df[col].astype(fletcher_string_dtype)

I've just been reading this, which gives some context for fletcher, so I thought I'd share: https://www.dataschool.io/future-of-pandas/ and the linked video: https://www.youtube.com/watch?v=tvmX8YAFK80 |
@birdsarah Yes, my Fletcher documentation is very sparse, but always feel free to open questions as issues on the repo or reach out in another way. I'm happy to help with the progress of it. I'm currently missing a use case to work on Fletcher, but I'm happy to support anyone who has a use case and wants to work on it. |
Much of the analysis done on this dataset uses dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.
Array extensions unfortunately can't be serialized: see pandas-dev/pandas#20612
https://github.com/xhochy/fletcher is an array extension that adds string processing functionality.
Using some standard tasks for this dataset, such as collecting all domains or counting the number of script domains per location domain, compare the performance of Spark, Dask, and Dask with extension arrays.
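
To make the two example tasks concrete, here is a minimal plain-Python sketch of what each one computes. It is only a reference for checking results, not a benchmark, and it is not taken from the project notebooks. The sample rows and the `location` / `script_url` field names are assumptions for illustration, and "domain" here simply means the URL's host as returned by `urllib.parse`, not the registered domain.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample rows of (location, script_url); the real dataset
# is far larger and these field names are assumed for illustration.
rows = [
    ("https://example.com/index.html", "https://cdn.example.net/a.js"),
    ("https://example.com/about", "https://ads.example.org/b.js"),
    ("https://example.com/about", "https://cdn.example.net/c.js"),
    ("https://news.example.org/", "https://cdn.example.net/a.js"),
]


def domain(url):
    """Return the host part of a URL, lowercased."""
    return urlparse(url).netloc.lower()


# Task 1: collect all domains seen in either column.
all_domains = {domain(url) for row in rows for url in row}

# Task 2: number of distinct script domains per location domain.
script_domains = defaultdict(set)
for location, script_url in rows:
    script_domains[domain(location)].add(domain(script_url))
counts = {loc: len(found) for loc, found in script_domains.items()}

print(sorted(all_domains))
print(counts)
```

Whichever engine is being timed (Spark, Dask, or Dask with extension arrays) should produce the same sets and counts on the same input, which gives a simple correctness check alongside the timings.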