
PlaceholderArray encountered in BitMaskedArray.to_ByteMaskedArray when it shouldn't be #524

Open
jpivarski opened this issue Jul 17, 2024 · 15 comments


@jpivarski
Collaborator

Here's a reproducer:

files.tar.gz

import awkward as ak
ak.concatenate([
    ak.from_parquet("one.parquet")["goodjets"],
    ak.from_parquet("two.parquet")["goodjets"],
])

succeeds but

import awkward as ak
import dask_awkward as dak
ak.concatenate([
    dak.from_parquet("one.parquet")["goodjets"],
    dak.from_parquet("two.parquet")["goodjets"],
]).compute()

fails with

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask/base.py", line 664, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask/base.py", line 664, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
                   ^^^^^^^^
  File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask_awkward/lib/core.py", line 830, in _finalize_array
    return ak.concatenate(results)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
    next(gen_or_result)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 64, in concatenate
    return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 160, in _impl
    contents = [ak._do.mergemany(b) for b in batches]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 160, in <listcomp>
    contents = [ak._do.mergemany(b) for b in batches]
                ^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/_do.py", line 218, in mergemany
    return contents[0]._mergemany(contents[1:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listoffsetarray.py", line 808, in _mergemany
    out = listarray._mergemany(others)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listarray.py", line 1128, in _mergemany
    nextcontent = contents[0]._mergemany(tail_contents)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 723, in _mergemany
    trimmed = field[0 : array.length]
              ~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 523, in _getitem
    return self._getitem_range(start, stop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/bitmaskedarray.py", line 493, in _getitem_range
    return self.to_ByteMaskedArray()._getitem_range(start, stop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/bitmaskedarray.py", line 384, in to_ByteMaskedArray
    self._backend[
  File "/home/jpivarski/irishep/awkward/src/awkward/_kernels.py", line 91, in __call__
    return self._impl(
           ^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/_kernels.py", line 92, in <genexpr>
    *(self._cast(x, t) for x, t in zip(args, self._impl.argtypes))
      ^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/_kernels.py", line 82, in _cast
    raise AssertionError(
AssertionError: Only NumPy buffers should be passed to Numpy Kernels, received PyCPointerType

Going into more detail, the troublemaker is self._mask.data, which is a PlaceholderArray. The rehydration must be deciding that this buffer is not needed, but it is needed: the concatenation has to know which array elements are missing.
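As a diagnostic, the placeholder can be found by walking the layout with the public ak.transform; a minimal sketch, assuming the internal import path for PlaceholderArray (which may move between versions), and noting that traversing a layout that still contains placeholders may itself fail:

import awkward as ak
from awkward._nplikes.placeholder import PlaceholderArray  # internal path: an assumption

def placeholder_nodes(array):
    # collect the layout node types whose mask buffer is still a PlaceholderArray
    found = []

    def visit(layout, **kwargs):
        mask = getattr(layout, "mask", None)
        if mask is not None and isinstance(mask.data, PlaceholderArray):
            found.append(type(layout).__name__)

    ak.transform(visit, array, return_array=False)  # visit for side effects only
    return found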

@jpivarski
Collaborator Author

Another indicator: passing optimize_graph=False to compute makes it work, so the column optimization is to blame.
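For example, the failing reproducer above passes unchanged if the optimizer is skipped:

ak.concatenate([
    dak.from_parquet("one.parquet")["goodjets"],
    dak.from_parquet("two.parquet")["goodjets"],
]).compute(optimize_graph=False)  # no graph optimization, so no column projection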

@martindurant
Collaborator

Is it expected to be possible to call concatenate on typetracers (a sketch of what that would mean follows the list below)? It would need to be, since we must know which columns to select from each input layer independently; we have no mechanism to carry the columns found to be needed for one layer across to another.

So far I have found these edge cases:

  • in concatenate, we were always going down the "enforce form" route even when the inputs have the same form, which is not really a problem and is easy to fix
  • the concatenated dak object has two partitions, so it is effectively a list of typetracers rather than a combined typetracer; (I think) we only ever touch the first, which may cause the following problem
  • the output has real data in partition 0 and a typetracer in partition 1, causing the exception.
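To make the first point concrete, this is what "concatenate on typetracers" would mean; a sketch using layout.to_typetracer from awkward v2, where whether the last line is supposed to succeed is exactly the open question:

import awkward as ak

# strip two small arrays down to type-only (typetracer) layouts
tt1 = ak.Array([{"x": 1.0}]).layout.to_typetracer(forget_length=True)
tt2 = ak.Array([{"x": 2.0}]).layout.to_typetracer(forget_length=True)

# if this is supported, the result is a typetracer with unknown length
merged = ak.concatenate([tt1, tt2], highlevel=False)
print(merged.form.type)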

@martindurant
Collaborator

So indeed, the second partition is receiving columns=[] (nothing to load), and unproject_layout is turning all of the missing columns into typetracers.
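For reference, the per-layer column decisions can be inspected before computing; a sketch assuming dask-awkward's report_necessary_columns (the name of this inspection function has varied across releases):

import awkward as ak
import dask_awkward as dak

lazy = ak.concatenate([
    dak.from_parquet("one.parquet")["goodjets"],
    dak.from_parquet("two.parquet")["goodjets"],
])
# from the observation above: one input layer should report its columns,
# the other an empty selection
print(dak.report_necessary_columns(lazy))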

@martindurant
Collaborator

I can confirm that the one-pass branch successfully computes the second failing case, but only the first time; subsequent computes fail. The failure mode is that no required columns are passed to parquet at all. Calling dak.core.dak_cache.clear() makes it pass again, so we have a good hint of where the problem is.
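In other words, the observed behavior on that branch looks like this, reusing the lazy concatenation from the earlier sketch (dak.core.dak_cache is an internal cache, named as in the paragraph above):

import dask_awkward as dak

result1 = lazy.compute()    # first compute succeeds
result2 = lazy.compute()    # second compute fails: no columns passed to parquet
dak.core.dak_cache.clear()  # reset the cached optimization
result3 = lazy.compute()    # succeeds again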

Given that #526 has a variant of this same problem, is it time to dust off the one-pass PR?

@martindurant
Collaborator

(I should say that a trivial, but not great, workaround for the issue here is to touch all inputs to a concatenate, which is what the other linked issue ended up doing, presumably because of axis=1.)
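Spelled out, "touching all inputs" amounts to something like the following; a sketch only, using awkward's public typetracer helper rather than whatever dask-awkward does internally:

import awkward as ak

def touch_all(typetracer_arrays):
    # mark every buffer of every input as necessary, so the column
    # optimizer cannot drop any of them
    for arr in typetracer_arrays:
        ak.typetracer.touch_data(arr)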

@cmoore24-24

cmoore24-24 commented Jan 13, 2025

Hi @jpivarski / @martindurant, just to revive this:

I've run into the PlaceholderArray/PyCPointerType error once again, though this time concatenate doesn't seem to be to blame. Somehow I have created a set of parquet files that really propagate this placeholder trouble. If I try any manipulation on them (say, divide), I get the following error:

AssertionError: Only NumPy buffers should be passed to Numpy Kernels, received PyCPointerType

Backtracking a little further, even just reading the arrays from a parquet file seems problematic. Specifically, if I do

file = ak.from_parquet(path)
print(file['ungroomed_ecfs'])

everything is fine. But when reading in from delayed, I get errors.

Doing

file = dak.from_parquet(path)
print(file.compute()['ungroomed_ecfs'])

fails with the error TypeError: PlaceholderArray supports only trivial slices, not int.

Do these perhaps stem from the same place? I had never noticed this before because I never had a reason to read these files with dask-awkward. I can provide example parquet files if that would be helpful; extended error logs are attached.
pyc_error.txt
TypeError.txt

@lgray
Collaborator

lgray commented Jan 13, 2025

Ah, so this means that the parquet dak-array doesn't know it needs to materialize all the data.

Can you try file["ungroomed_ecfs"].compute() instead?

@martindurant
Collaborator

Quick question: you mention "delayed" here, but I don't see any calls to delayed. Do you mean lazy operations, or did I miss something?

Simply failing to compute data loaded from parquet, with no further operations applied, would be bad. Can you share the file?

@martindurant
Collaborator

Indeed @lgray, it may be that the single-layer case is an edge we hadn't considered.

@cmoore24-24

cmoore24-24 commented Jan 13, 2025

Sorry, it looks like I got my terminology mixed up. I did mean lazy: reading the parquet files lazily with dak.from_parquet(*,*,compute=False).

@lgray I tried reversing the order as you suggested, and while the error is gone, it just produces an empty array:

>>> file["ungroomed_ecfs"].compute()

-------------------------
type: 0 * {
    "1e2^0.5": ?float64,
    "1e2^1.0": ?float64,
    "1e2^1.5": ?float64,
    "1e2^2.0": ?float64,
    "1e2^2.5": ?float64,
    "1e2^3.0": ?float64,
    "1e2^3.5": ?float64,
    .
    .
    .

This is odd, definitely incorrect, and not what happens when I read eagerly.

I've uploaded the example parquet file to Google Drive; it should be accessible to anyone with the link: https://drive.google.com/drive/folders/1548z0m6IYuIKUA1EzfgG3_OylpPK8T3U?usp=sharing

@lgray
Collaborator

lgray commented Jan 13, 2025

That's certainly weird. We'll try to get back to you soon on it.

@lgray
Collaborator

lgray commented Jan 13, 2025

In the meantime, can you use dask-dataframe or dask-array, if it's just a bunch of flat lists of floats?

@cmoore24-24

cmoore24-24 commented Jan 13, 2025

Yeah, I should be able to convert for now. It may also be helpful to point out that Jim mentioned to me on Mattermost a few months ago (and above) that passing optimize_graph=False is a workaround, and that is still true here. If I do

>>>file["ungroomed_ecfs"].compute(optimize_graph=False)
[{'1e2^0.5': 0.15, '1e2^1.0': 0.0632, '1e2^1.5': 0.0305, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.199, '1e2^1.0': 0.116, '1e2^1.5': 0.0754, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.24, '1e2^1.0': 0.143, '1e2^1.5': 0.0939, '1e2^2.0': 0.0656, ...},
 {'1e2^0.5': 0.221, '1e2^1.0': 0.135, '1e2^1.5': 0.0912, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.26, '1e2^1.0': 0.166, '1e2^1.5': 0.114, '1e2^2.0': 0.0822, ...},
 {'1e2^0.5': 0.223, '1e2^1.0': 0.136, '1e2^1.5': 0.0924, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.267, '1e2^1.0': 0.169, '1e2^1.5': 0.116, '1e2^2.0': 0.0833, ...},
 {'1e2^0.5': 0.249, '1e2^1.0': 0.148, '1e2^1.5': 0.0957, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.265, '1e2^1.0': 0.174, '1e2^1.5': 0.124, '1e2^2.0': 0.0932, ...},
 {'1e2^0.5': 0.186, '1e2^1.0': 0.0837, '1e2^1.5': 0.0404, '1e2^2.0': ..., ...},
 ...,
 {'1e2^0.5': 0.212, '1e2^1.0': 0.112, '1e2^1.5': 0.0649, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.256, '1e2^1.0': 0.158, '1e2^1.5': 0.106, '1e2^2.0': 0.0755, ...},
 {'1e2^0.5': 0.26, '1e2^1.0': 0.185, '1e2^1.5': 0.143, '1e2^2.0': 0.115, ...},
 {'1e2^0.5': 0.229, '1e2^1.0': 0.152, '1e2^1.5': 0.116, '1e2^2.0': 0.0944, ...},
 {'1e2^0.5': 0.201, '1e2^1.0': 0.123, '1e2^1.5': 0.09, '1e2^2.0': 0.0718, ...},
 {'1e2^0.5': 0.17, '1e2^1.0': 0.107, '1e2^1.5': 0.0754, '1e2^2.0': 0.055, ...},
 {'1e2^0.5': 0.202, '1e2^1.0': 0.102, '1e2^1.5': 0.0562, '1e2^2.0': ..., ...},
 {'1e2^0.5': 0.256, '1e2^1.0': 0.159, '1e2^1.5': 0.107, '1e2^2.0': 0.0751, ...},
 {'1e2^0.5': 0.234, '1e2^1.0': 0.148, '1e2^1.5': 0.104, '1e2^2.0': 0.0763, ...}]
--------------------------------------------------------------------------------
type: 13050 * {
    "1e2^0.5": ?float64,
    "1e2^1.0": ?float64,
    "1e2^1.5": ?float64,
    "1e2^2.0": ?float64,
    "1e2^2.5": ?float64,
    .
    .
    .

it works as intended. I forgot he had mentioned that until I went digging for it.

@cmoore24-24

So the problem appears to be the keys/record labels. I went through and changed all the keys (removing the caret and the period), and now the problem is gone:

>>>file.compute()['ungroomed_ecfs']
{'1e205': 0.256, '1e210': 0.159, '1e215': 0.107, '1e220': 0.0751, ...},
 {'1e205': 0.234, '1e210': 0.148, '1e215': 0.104, '1e220': 0.0763, ...}]
--------------------------------------------------------------------------
type: 13050 * {
    "1e205": float64,
    .
    .
    .

and

>>>file["ungroomed_ecfs"].compute()
{'1e205': 0.256, '1e210': 0.159, '1e215': 0.107, '1e220': 0.0751, ...},
 {'1e205': 0.234, '1e210': 0.148, '1e215': 0.104, '1e220': 0.0763, ...}]
--------------------------------------------------------------------------
type: 13050 * {
    "1e205": float64,
    .
    .
    .

now work as intended, without having to pass optimize_graph=False.
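For anyone hitting the same thing before a proper fix lands, a sketch of the renaming described above, built from public awkward operations (the helper, regex, and file names are mine, not from any library):

import re
import awkward as ak

def sanitize_fields(array):
    # rebuild the records with "." and "^" stripped from field names
    return ak.zip(
        {re.sub(r"[.^]", "", name): array[name] for name in array.fields},
        depth_limit=1,  # zip only at the record level; inner structure unchanged
    )

records = ak.from_parquet("ungroomed.parquet")["ungroomed_ecfs"]  # hypothetical file
ak.to_parquet(sanitize_fields(records), "ungroomed_clean.parquet")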

@martindurant
Collaborator

@pfackeldey, so this sounds like a mapper issue already present in the v1 ("two-pass") column optimization. Maybe it works with v2? Field names containing "." are allowed in the parquet spec, although they are unusual, and other frameworks also use "field.subfield" as a shorthand for ["field", "subfield"]. At some point, @jpivarski suggested using syntax like "field.subfield.with.dots" for the unusual but general case. To me, this feels unwieldy, but I have no better suggestion except the list form.
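The ambiguity is easy to state: once column paths are dot-joined strings, a dot inside a field name is indistinguishable from a nesting level. Illustrative only; this is not dask-awkward's actual internal representation:

path_a = ".".join(["ungroomed_ecfs", "1e2^0.5"])     # one field whose name contains "."
path_b = ".".join(["ungroomed_ecfs", "1e2^0", "5"])  # two levels of nesting
assert path_a == path_b  # both yield the same string: the split is ambiguous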
