Reading data from a Parquet file without the fuss #49

sfkeller · 2024-10-21T22:02:47Z

How about reading data from a Parquet file without the fuss (inspired by https://duckdb.org/docs/guides/file_formats/parquet_import )?

Below a very sketchy, untested script how to implement such a thing. I actually could only guess what parquet.schema returns, since there's no docs yet about this (see #48).

The resulting table product_example2 from reading product_example.parquet (as defined in https://github.com/CrunchyData/pg_parquet) should be created without having created table product_example2 before hand:

CREATE OR REPLACE FUNCTION import_foreign_schema_from_parquet(uri TEXT, table_name TEXT DEFAULT 'parquet_table') RETURNS VOID AS $$
DECLARE
    parquet_info RECORD;
    columns TEXT := '';
BEGIN
    -- Loop through the metadata returned by parquet.metadata function
    FOR parquet_info IN EXECUTE 'SELECT column_name, type_name FROM parquet.metadata(' || quote_literal(uri) || ')' LOOP
        -- Append column name and type to the columns string
        columns := columns || quote_ident(parquet_info.column_name) || ' ' || 
                   CASE parquet_info.type_name
                       WHEN 'INT32' THEN 'INTEGER'
                       WHEN 'INT64' THEN 'BIGINT'
                       WHEN 'FLOAT' THEN 'REAL'
                       WHEN 'DOUBLE' THEN 'DOUBLE PRECISION'
                       WHEN 'BOOLEAN' THEN 'BOOLEAN'
                       WHEN 'UTF8' THEN 'TEXT'
                       WHEN 'LIST' THEN 'ANYELEMENT[]' 
                       WHEN 'TIMESTAMP_MICROS' THEN 'TIMESTAMP'
                       WHEN 'TIMESTAMP_MILLIS' THEN 'TIMESTAMPTZ'
                       ELSE 'TEXT'  -- Default to TEXT for unknown types
                   END || ', ';
    END LOOP;

    -- Remove trailing comma and space
    columns := RTRIM(columns, ', ');

    -- Execute the CREATE TABLE statement with the generated columns
    EXECUTE 'CREATE TABLE ' || quote_ident(table_name) || ' (' || columns || ');';
END;
$$ LANGUAGE plpgsql;

-- Import foreign "schema" (i.e. table definition) from Parquet - creates table "product_example2":
SELECT import_foreign_schema_from_parquet('/tmp/product_example.parquet', 'product_example2');

-- Copy the Parquet file to the newly created table product_example2: 
COPY product_example2 FROM '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- Show table product_example2 - should have the same structure as the original table product_example:
SELECT * FROM product_example2
UNION ALL
SELECT * FROM product_example;

The text was updated successfully, but these errors were encountered:

marcoslot added the enhancement New feature or request label Oct 22, 2024

sfkeller changed the title ~~Reading data from a Parquet file without the fuss (feature request)~~ Reading data from a Parquet file without the fuss Oct 22, 2024

aykut-bozkurt mentioned this issue Nov 15, 2024

Coerce types on read #76

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading data from a Parquet file without the fuss #49

Reading data from a Parquet file without the fuss #49

sfkeller commented Oct 21, 2024 •

edited

Loading

Reading data from a Parquet file without the fuss #49

Reading data from a Parquet file without the fuss #49

Comments

sfkeller commented Oct 21, 2024 • edited Loading

sfkeller commented Oct 21, 2024 •

edited

Loading