Can we use array extensions to speed up string processing with dask #36

birdsarah · 2019-03-11T20:31:06Z

Much of the analysis done on this dataset uses dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.

Unfortunately, array extensions can't be serialized (see pandas-dev/pandas#20612).

https://github.com/xhochy/fletcher is an array extension that adds string processing functionality.

Using some standard tasks for this dataset, such as collecting all domains or counting the number of script domains per location domain, compare the performance of Spark, Dask, and Dask with extension arrays.
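To pin down what the two tasks mean, here is a minimal plain-Python sketch that could serve as a correctness baseline before timing any framework. The sample rows and the field names `location` and `script_url` are assumptions made for illustration, not taken from the dataset schema.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample rows: (location, script_url) pairs.
rows = [
    ("https://news.example.com/home", "https://cdn.example.net/a.js"),
    ("https://news.example.com/sport", "https://ads.example.org/b.js"),
    ("https://shop.example.com/", "https://cdn.example.net/a.js"),
]


def domain(url: str) -> str:
    """Return the network location (host) part of a URL."""
    return urlparse(url).netloc


# Task 1: collect every distinct domain seen in either column.
all_domains = {domain(url) for pair in rows for url in pair}

# Task 2: count distinct script domains per location domain.
script_domains = defaultdict(set)
for location, script_url in rows:
    script_domains[domain(location)].add(domain(script_url))

counts = {loc: len(scripts) for loc, scripts in script_domains.items()}
```

Whatever Spark, Dask, or Dask with extension arrays produce for the same input should agree with `all_domains` and `counts`.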

Muskan284 · 2019-03-13T18:44:51Z

@birdsarah, I am an Outreachy applicant. Can I work on this issue?
I have done text pre-processing using Python and pandas for my earlier projects and would love to contribute to this issue.

birdsarah · 2019-03-14T10:00:17Z

Of course! No need to ask. I will upload a notebook tomorrow that contains the kind of text processing I'm trying to speed up.

birdsarah · 2019-03-15T10:10:55Z

I have added a notebook here: https://github.com/mozilla/overscripted/blob/master/analyses/issue_36.ipynb with some examples of string processing that I often find myself doing.

Muskan284 · 2019-03-15T15:52:21Z

Thanks @birdsarah, looking into it.

shruthi0898 · 2019-03-16T09:10:27Z

Hi,
Can I contribute to this as well?
Shruthi

srujana121 · 2019-03-17T15:16:01Z

@birdsarah dask is performing much worse than numpy. Any idea why?

srujana121 · 2019-03-17T15:16:36Z

@birdsarah also where should I send you the notebooks?

birdsarah · 2019-03-19T00:22:15Z

@srujana121

> @birdsarah also where should I send you the notebooks?

Instructions for submitting a PR are on the landing page: https://github.com/mozilla/overscripted - please ask questions like this on the chat and keep issue questions

> @birdsarah dask is performing much worse than numpy. Any idea why?

I could speculate wildly but that wouldn't be productive. I suggest a PR with concrete examples documenting your observation of "dask performing much worse than numpy."

birdsarah · 2019-03-19T00:23:11Z

@shruthi0898 there's no need to ask permission to work on an issue.

SubbulakshmiRS · 2019-03-21T11:46:52Z

@birdsarah Can I also make a commit, or is it almost over? I did some research and found that, with fletcher, we can also use map_partitions and similar methods to improve dask itself.
https://medium.com/mindorks/speeding-up-text-pre-processing-using-dask-45cc3ede1366

birdsarah · 2019-03-21T16:45:21Z

The deadline is April 2. Please join the chat at https://gitter.im/overscripted-discuss/community to read about this kind of information and to ask questions. I look forward to seeing your contribution @SubbulakshmiRS

birdsarah · 2019-03-22T06:34:40Z

Some notes from elsewhere:

To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out. Here's some pseudocode for how I think you change a dask string column to a dask fletcher column:

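import fletcher as fr
import pyarrow as pa
# Wrap Arrow's string type in fletcher's pandas extension dtype
fletcher_string_dtype = fr.FletcherDtype(pa.string())
# df is an existing dask DataFrame; col is the name of a string column
df[col] = df[col].astype(fletcher_string_dtype)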

I've just been reading this which gives some context for fletcher so I thought I'd share: https://www.dataschool.io/future-of-pandas/ and the linked video: https://www.youtube.com/watch?v=tvmX8YAFK80

xhochy · 2019-05-10T07:50:04Z

@birdsarah Yes, my Fletcher documentation is very sparse, but always feel free to open questions as issues on the repo or reach out in another way. I'm happy to help with the progress of it. I'm currently missing a use case to work on Fletcher, but I'm happy to support anyone who has a use case and wants to work on it.

birdsarah added the "good first issue" (Good for newcomers) and "research question" (Outstanding questions that have not been investigated yet) labels Mar 11, 2019

Aimaanhasan pushed a commit to Aimaanhasan/overscripted that referenced this issue Mar 18, 2019

Analysis on issue mozilla#36

f81c97a

Analysis on efficiency and usage of extension arrays in dask Issue mozilla#36

Aimaanhasan mentioned this issue Mar 18, 2019

Analysis on issue #36 #71

Closed

mozilla deleted a comment from nandinibansal2000 Mar 20, 2019

Aimaanhasan added a commit to Aimaanhasan/overscripted that referenced this issue Mar 20, 2019

Added Fletcher Docs and restructured analysis

f587fde

Added link to the fletcher docs and gave an example for usability. Relocated the analyses for readability Issue mozilla#36

Aimaanhasan pushed a commit to Aimaanhasan/overscripted that referenced this issue Apr 2, 2019

Worked on script_url and arguments for issue # 36

ceb2d09

Issue mozilla#36
