-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dask query-planning support #139
Changes from all commits
980cc87
66384e1
4200ea0
77682e9
57628c9
e38ec1a
50c9eff
8ba6e7a
51e54ff
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -155,7 +155,6 @@ def from_pandas( | |
npartitions=npartitions, | ||
chunksize=chunksize, | ||
sort=sort, | ||
name=name, | ||
) | ||
) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,12 +18,10 @@ | |
import numpy as np | ||
import pandas as pd | ||
from dask.base import tokenize | ||
from dask.dataframe.core import new_dd_object | ||
from dask.dataframe.shuffle import partitioning_index | ||
from dask.highlevelgraph import HighLevelGraph | ||
from dask.utils import M | ||
|
||
from nemo_curator._compat import DASK_SHUFFLE_CAST_DTYPE | ||
from nemo_curator._compat import DASK_SHUFFLE_CAST_DTYPE, query_planning_enabled | ||
|
||
|
||
def _split_part(part, nsplits): | ||
|
@@ -36,6 +34,16 @@ def _split_part(part, nsplits): | |
|
||
|
||
def text_bytes_aware_merge(text_df, right_df, broadcast=True, *, on): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this function currently used anywhere? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see any usage of this function. cc: @VibhuJawa in case you recall any background and if this is needed or not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Dont think this is needed anymore. We previously added it while trying to be more conservative IIRC |
||
|
||
if query_planning_enabled(): | ||
raise NotImplementedError( | ||
"The text_bytes_aware_merge function is not supported when " | ||
"query-planning is enabled." | ||
) | ||
|
||
from dask.dataframe.core import new_dd_object | ||
from dask.highlevelgraph import HighLevelGraph | ||
|
||
if not isinstance(on, list): | ||
on = [on] | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the
name
argument doesn't work for 24.08 (the corresponding dask-expr version has a small bug). Althoughname
is supported for later dask-expr versions, I'm not sure how important it is for curator to support this argument? (I don't see it used internally anywhere, but could be wrong)