PlaceholderArray encountered in BitMaskedArray.to_ByteMaskedArray when it shouldn't be
#524
Another indicator: passing
Is it expected to be possible to do concatenate on typetracers? It would be needed, since we need to know the columns to select from both input layers independently: we have no mechanism to carry the columns found to be needed for one layer across to another. So far I have found these edges:
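To illustrate the constraint described above, here is a minimal pure-Python sketch (not dask-awkward's actual implementation) of why each input to a concatenate must independently report the columns it touches, with no information carried from one input to the other. All names here (`TracingColumn`, `concatenate`, the `"pt"`/`"eta"` columns) are hypothetical:

```python
class TracingColumn:
    """Stands in for a typetracer buffer: records that it was read."""
    def __init__(self, name, touched):
        self.name = name
        self._touched = touched  # shared set of touched column names

    def read(self):
        self._touched.add(self.name)
        return None  # typetracers carry no real data


def concatenate(inputs, needed):
    """Toy concatenate: each input must discover its required columns on
    its own. `needed` is the column the downstream computation selects."""
    required = []  # one touched-set per input layer
    for columns in inputs:
        touched = set()
        # wrap each column so reads are recorded for THIS input only
        traced = {n: TracingColumn(n, touched) for n in columns}
        traced[needed].read()  # simulate the downstream selection
        required.append(touched)
    return required


# Both input layers must independently learn that "pt" is required; the
# set discovered for the first is never propagated to the second.
required = concatenate([["pt", "eta"], ["pt", "eta"]], needed="pt")
```

The point of the sketch is the per-input `touched` set: if tracing were only run on the first input, the second would report no required columns at all, which is the failure mode described below.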
So indeed, the second partition is receiving
I can confirm that the one-pass branch successfully computes the second failing case, but only the first time. Subsequent computes fail; the failure mode is that no required columns are passed to parquet at all. Given that #526 has a variant of this same problem, is it time to dust off the one-pass PR?
(I should say that a trivial, but not great, workaround for the issue here is to touch all inputs to a concatenate, which is effectively what the other linked issue ended up doing, presumably because of axis=1.)
Hi @jpivarski / @martindurant, just to revive this: I've run into the PlaceholderArray/PyCPointerType issue once again, though this time concatenate doesn't seem to be to blame. I have somehow created a set of parquet files that reliably trigger this placeholder trouble. If I try to do any manipulation on them (say, divide), I get the following error:
Backtracking a little further, even just reading the arrays from a parquet file seems problematic. To be specific, if I do
Everything is fine. But when reading via delayed, I get errors. Doing
fails with the error. Do these perhaps stem from the same place? I had never noticed this before because I never had a reason to read these files with dask-awkward/delayed. I can provide example parquet files if that would be helpful. I'll attach the extended errors.
Ah, so this means that the parquet dak-array doesn't know to materialize all the data. Can you try
Quick question: you mention "delayed" here, but I don't see any calls to it. Simply failing to compute data loaded from parquet, without any further operations, would be bad. Can you share the file?
Indeed @lgray, it may be that the single-layer case is an edge we hadn't considered.
Sorry, looks like I got my terminology mixed up. I did mean lazy: reading the parquet files lazily with
@lgray, I tried reversing the order as you suggested, and while the error is gone, it just produces an empty array:
which is odd, definitely incorrect, and not what happens when I read eagerly. I've uploaded the example parquet file to Google Drive; it should be accessible to anyone with the link: https://drive.google.com/drive/folders/1548z0m6IYuIKUA1EzfgG3_OylpPK8T3U?usp=sharing
That's certainly weird. We'll try to get back to you soon on it.
In the meantime, can you use dask-dataframe or dask-array, if it's just a bunch of flat lists of floats?
Yeah, I should be able to convert for now. Though it may also be helpful to point out that, as Jim mentioned to me on Mattermost a few months ago (and above), passing
works as intended. I forgot he had mentioned that until I went digging for it.
So the problem appears to be the keys/record labels. I went through and changed all the keys (removing the caret and the period), and now the problem is gone:
and
now work as intended, without having to give
@pfackeldey, so this sounds like a mapper issue already in the v1 ("two-pass") column optimization. Maybe it works with v2? Field names containing "." are allowed in the Parquet spec, although they are unusual, and other frameworks also use "field.subfield" as shorthand for ["field", "subfield"]. At some point, @jpivarski suggested using syntax like "field.
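To make the ambiguity concrete, here is a toy resolver in pure Python (an illustration of the "field.subfield" shorthand convention mentioned above, not awkward's or dask-awkward's actual code; the names `resolve`, `nested`, and `flat` are hypothetical). Splitting a column spec on "." works for genuinely nested fields but cannot reach a literal field name that itself contains a dot:

```python
def resolve(record, spec):
    """Toy column resolver: treats "a.b" as nested access a -> b,
    following the shorthand convention described above."""
    out = record
    for part in spec.split("."):
        out = out[part]
    return out


nested = {"jet": {"pt": 1.0}}
flat = {"jet.pt": 2.0}  # legal in Parquet: a field name containing "."

# The shorthand works for a genuinely nested field:
value = resolve(nested, "jet.pt")

# ...but the same spec cannot reach the literal "jet.pt" key, because
# splitting on "." makes the resolver look up flat["jet"] first:
try:
    resolve(flat, "jet.pt")
    reached_flat = True
except KeyError:
    reached_flat = False
```

This is consistent with the observation above that renaming the keys to remove the period makes the problem disappear: the column mapper can no longer mistake a literal key for a nested path.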
Here's a reproducer:
files.tar.gz
succeeds but
fails with
Going into more detail, the troublemaker is self._mask.data, which is a PlaceholderArray. The rehydration must be saying that this buffer is not needed, but it is needed: the concatenation needs to know which array elements are missing.
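To see why the mask buffer cannot be a placeholder here, consider what a BitMaskedArray-to-ByteMaskedArray conversion has to do: expand a packed validity bitmask (one bit per element) into one byte per element. A simplified NumPy sketch, assuming an LSB-first bit order as in Arrow/Parquet and valid_when=True (this is an illustration of the operation, not awkward's actual implementation; `bitmask_to_bytemask` is a hypothetical name):

```python
import numpy as np

def bitmask_to_bytemask(bitmask, length):
    """Expand a packed validity bitmask (1 bit per element, LSB-first)
    into one boolean byte per element. This needs the real mask bytes;
    a PlaceholderArray carries no data, hence the error above."""
    bits = np.unpackbits(bitmask, bitorder="little")
    return bits[:length].astype(np.bool_)


# 0b00000101 -> elements 0 and 2 are valid, element 1 is missing
mask = bitmask_to_bytemask(np.array([0b00000101], dtype=np.uint8), length=3)
```

Every output byte depends on an actual bit of the input, so the buffer genuinely must be materialized for the conversion (and the downstream concatenation) to know which elements are missing.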