Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading data from a Parquet file without the fuss #49

Open
sfkeller opened this issue Oct 21, 2024 · 0 comments
Open

Reading data from a Parquet file without the fuss #49

sfkeller opened this issue Oct 21, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@sfkeller
Copy link

sfkeller commented Oct 21, 2024

How about reading data from a Parquet file without the fuss (inspired by https://duckdb.org/docs/guides/file_formats/parquet_import )?

Below a very sketchy, untested script how to implement such a thing. I actually could only guess what parquet.schema returns, since there's no docs yet about this (see #48).

The resulting table product_example2 from reading product_example.parquet (as defined in https://github.com/CrunchyData/pg_parquet) should be created without having created table product_example2 before hand:

CREATE OR REPLACE FUNCTION import_foreign_schema_from_parquet(uri TEXT, table_name TEXT DEFAULT 'parquet_table') RETURNS VOID AS $$
DECLARE
    parquet_info RECORD;
    columns TEXT := '';
BEGIN
    -- Loop through the metadata returned by parquet.metadata function
    FOR parquet_info IN EXECUTE 'SELECT column_name, type_name FROM parquet.metadata(' || quote_literal(uri) || ')' LOOP
        -- Append column name and type to the columns string
        columns := columns || quote_ident(parquet_info.column_name) || ' ' || 
                   CASE parquet_info.type_name
                       WHEN 'INT32' THEN 'INTEGER'
                       WHEN 'INT64' THEN 'BIGINT'
                       WHEN 'FLOAT' THEN 'REAL'
                       WHEN 'DOUBLE' THEN 'DOUBLE PRECISION'
                       WHEN 'BOOLEAN' THEN 'BOOLEAN'
                       WHEN 'UTF8' THEN 'TEXT'
                       WHEN 'LIST' THEN 'ANYELEMENT[]' 
                       WHEN 'TIMESTAMP_MICROS' THEN 'TIMESTAMP'
                       WHEN 'TIMESTAMP_MILLIS' THEN 'TIMESTAMPTZ'
                       ELSE 'TEXT'  -- Default to TEXT for unknown types
                   END || ', ';
    END LOOP;

    -- Remove trailing comma and space
    columns := RTRIM(columns, ', ');

    -- Execute the CREATE TABLE statement with the generated columns
    EXECUTE 'CREATE TABLE ' || quote_ident(table_name) || ' (' || columns || ');';
END;
$$ LANGUAGE plpgsql;

-- Import foreign "schema" (i.e. table definition) from Parquet - creates table "product_example2":
SELECT import_foreign_schema_from_parquet('/tmp/product_example.parquet', 'product_example2');

-- Copy the Parquet file to the newly created table product_example2: 
COPY product_example2 FROM '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- Show table product_example2 - should have the same structure as the original table product_example:
SELECT * FROM product_example2
UNION ALL
SELECT * FROM product_example;
@marcoslot marcoslot added the enhancement New feature or request label Oct 22, 2024
@sfkeller sfkeller changed the title Reading data from a Parquet file without the fuss (feature request) Reading data from a Parquet file without the fuss Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants