Parquet structured column maps to JSONBOID by default which causes error on scan #178

Open
2 tasks done
tucnak opened this issue Nov 22, 2024 · 0 comments
Labels
bug Something isn't working

tucnak commented Nov 22, 2024

What happens?

[XX000] ERROR: Column messages has Arrow data type List(Field { name: "l", data_type: Struct([Field { name: "content", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "role", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) but is mapped to the BuiltIn(JSONBOID) type in Postgres, which are incompatible. If you believe this conversion should be supported, please submit a request at https://github.com/paradedb/paradedb/issues.

So does ParadeDB actually support jsonb for these structured columns, or does it not?

To Reproduce

See the dataset on Hugging Face; it is split into a handful of Parquet files, which may or may not be relevant. We work with many HF datasets this way and have had no problems so far. I was under the assumption that pg_analytics supported JSON natively, but the scan fails unless the conversion is specified explicitly:

-- fails with the aforementioned error
CREATE FOREIGN TABLE tulu_3_sft_mixture ()
SERVER parquet
OPTIONS (
    files 'https://huggingface.co/api/datasets/allenai/tulu-3-sft-mixture/parquet/default/train/0.parquet'
);

-- doesn't fail
CREATE FOREIGN TABLE tulu_3_sft_mixture ()
SERVER parquet
OPTIONS (
    files 'https://huggingface.co/api/datasets/allenai/tulu-3-sft-mixture/parquet/default/train/0.parquet',
    select 'messages::json AS messages' -- jsonb fails, too!
);
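
To confirm which Postgres types were inferred for the empty column list `()`, the standard information_schema catalog can be queried after creating the table. This is a plain Postgres sketch, not a pg_analytics-specific feature; if several schemas contain a table of this name, a table_schema filter would be needed as well:

```sql
-- List the inferred columns and their Postgres types for the foreign table.
SELECT column_name, data_type, udt_name
FROM information_schema.columns
WHERE table_name = 'tulu_3_sft_mixture'
ORDER BY ordinal_position;
```

In the failing case, the messages row would be expected to show jsonb, matching the BuiltIn(JSONBOID) named in the error.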

Is it possible to similarly override columns with user-defined domains? We have a chat domain, which is a jsonb with multiple constraints, casts, and helper functions defined over it. However, when I tried to cast to it in the select option, pg_analytics could not recognise the type:

SELECT null::chat; -- no issue

CREATE FOREIGN TABLE tulu_3_sft_mixture ()
SERVER parquet
OPTIONS (
    files 'https://huggingface.co/api/datasets/allenai/tulu-3-sft-mixture/parquet/default/train/0.parquet',
    select 'messages::chat AS messages'
);
-- [XX000] ERROR: Catalog Error: Type with name chat does not exist!
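
The "Catalog Error" wording is the style DuckDB uses for its own errors, which suggests the select option is evaluated by DuckDB, where Postgres domains are not visible. A possible workaround, sketched below, is to keep the `messages::json` cast in the select option (as in the working example above) and apply the domain cast in the outer Postgres query instead. The domain definition here is a hypothetical stand-in for our real one, and whether pg_analytics evaluates the outer cast in Postgres rather than pushing it down to DuckDB is an assumption that has not been verified:

```sql
-- Hypothetical stand-in for the chat domain; the real one has more constraints.
-- Skip this statement if the domain already exists.
CREATE DOMAIN chat AS jsonb
    CHECK (jsonb_typeof(VALUE) = 'array');

-- Cast to the domain in the query, not in the foreign table's select option.
-- Assumption (unverified): this cast is evaluated by Postgres, not DuckDB.
SELECT messages::jsonb::chat AS messages
FROM tulu_3_sft_mixture
LIMIT 10;
```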

OS:

ppc64el

ParadeDB Version:

v0.2.1

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_analytics Extension

Full Name:

Ilya Kowalewski

Affiliation:

The Stone Cross Foundation of Ukraine

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have
tucnak added the bug label Nov 22, 2024