Commit

Adds support for COPY TO/FROM Azure Blob Storage
Only Azure Blob uris of the form `https://{account}.blob.core.windows.net/{container}/key` are supported.

The Azure Blob client can be configured with the environment variables `AZURE_STORAGE_ACCOUNT_NAME` or `AZURE_STORAGE_SAS_TOKEN`.

Additionally, this PR supports the following S3 uri forms:
- `s3(a)://{bucket}/key`
- `https://s3.amazonaws.com/{bucket}/key`
- `https://{bucket}.s3.amazonaws.com/key`

Closes #50
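
The uri forms listed in this message can be illustrated with a small pattern match. This is only a sketch in shell, not the extension's actual parsing code (which lives in Rust sources not shown on this page); the function name `classify_uri` is hypothetical, and the patterns cover nothing beyond the forms named above.

```shell
#!/bin/bash

# Hypothetical helper: report which object store a uri would be routed to,
# using only the uri forms named in the commit message.
classify_uri() {
  case "$1" in
    s3://*|s3a://*)                    echo "s3" ;;
    https://s3.amazonaws.com/*)        echo "s3" ;;     # path-style: bucket is the first path segment
    https://*.s3.amazonaws.com/*)      echo "s3" ;;     # virtual-hosted-style: bucket is in the host name
    https://*.blob.core.windows.net/*) echo "azure" ;;  # account in the host name, container in the path
    *)                                 echo "unsupported" ;;
  esac
}

classify_uri "s3a://testbucket/data.parquet"                                               # prints s3
classify_uri "https://devstoreaccount1.blob.core.windows.net/testcontainer/data.parquet"   # prints azure
```

Shell `case` patterns let `*` match `/` as well, so this is looser than a real uri parser; it is meant only to show which forms map to which store.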
aykut-bozkurt committed Oct 23, 2024
1 parent 0bfc8b6 commit 4dc228c
Showing 12 changed files with 392 additions and 109 deletions.
25 changes: 13 additions & 12 deletions .devcontainer/Dockerfile
@@ -6,10 +6,11 @@ ENV TZ="Europe/Istanbul"
 ARG PG_MAJOR=17

 # install deps
-RUN apt-get update && apt-get -y install build-essential libreadline-dev zlib1g-dev \
-flex bison libxml2-dev libxslt-dev libssl-dev \
-libxml2-utils xsltproc ccache pkg-config wget \
-curl lsb-release sudo nano net-tools git awscli
+RUN apt-get update && apt-get -y install build-essential libreadline-dev zlib1g-dev \
+flex bison libxml2-dev libxslt-dev libssl-dev \
+libxml2-utils xsltproc ccache pkg-config wget \
+curl lsb-release ca-certificates gnupg sudo git \
+nano net-tools awscli

 # install Postgres
 RUN sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
@@ -19,6 +20,14 @@ RUN apt-get update && apt-get -y install postgresql-${PG_MAJOR}-postgis-3 \
 postgresql-client-${PG_MAJOR} \
 libpq-dev

+# install azure-cli and azurite
+RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
+RUN apt-get update && apt-get install -y nodejs
+RUN curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null
+RUN echo "deb [arch=`dpkg --print-architecture` signed-by=/etc/apt/trusted.gpg.d/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ `lsb_release -cs` main" | tee /etc/apt/sources.list.d/azure-cli.list
+RUN apt-get update && apt-get install -y azure-cli
+RUN npm install -g azurite
+
 # download and install MinIO server and client
 RUN wget https://dl.min.io/server/minio/release/linux-amd64/minio
 RUN chmod +x minio
@@ -58,11 +67,3 @@ ARG PGRX_VERSION=0.12.6
 RUN cargo install --locked cargo-pgrx@${PGRX_VERSION}
 RUN cargo pgrx init --pg${PG_MAJOR} $(which pg_config)
 RUN echo "shared_preload_libraries = 'pg_parquet'" >> $HOME/.pgrx/data-${PG_MAJOR}/postgresql.conf
-
-ENV MINIO_ROOT_USER=admin
-ENV MINIO_ROOT_PASSWORD=admin123
-ENV AWS_S3_TEST_BUCKET=testbucket
-ENV AWS_REGION=us-east-1
-ENV AWS_ACCESS_KEY_ID=admin
-ENV AWS_SECRET_ACCESS_KEY=admin123
-ENV PG_PARQUET_TEST=true
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
@@ -15,7 +15,7 @@
 ]
 }
 },
-"postStartCommand": "bash .devcontainer/scripts/setup-minio.sh",
+"postStartCommand": "bash .devcontainer/scripts/setup_minio.sh && bash .devcontainer/scripts/setup_azurite.sh",
 "forwardPorts": [
 5432
 ],
7 changes: 7 additions & 0 deletions .devcontainer/scripts/setup_azurite.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+source setup_test_envs.sh
+
+nohup azurite --location /tmp/azurite-storage > /dev/null 2>&1 &
+
+az storage container create --name "${AZURE_TEST_CONTAINER_NAME}" --public off --connection-string "$AZURE_STORAGE_CONNECTION_STRING"
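
The script above starts Azurite in the background and then creates the test container right away. If the emulator has not finished binding its port yet, that `az` call may fail. One possible guard is a small retry loop; the sketch below is an assumption about how that could look, and the `retry` helper is hypothetical rather than part of this commit.

```shell
#!/bin/bash

# Hypothetical helper: run a command until it succeeds.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep after a failed attempt
#   remaining arguments = the command to run
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

With this helper, the container creation could be written as `retry 10 1 az storage container create --name "${AZURE_TEST_CONTAINER_NAME}" --public off --connection-string "$AZURE_STORAGE_CONNECTION_STRING"`, where the attempt count and delay are illustrative values.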
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
 #!/bin/bash

+source setup_test_envs.sh
+
 nohup minio server /tmp/minio-storage > /dev/null 2>&1 &

 mc alias set local http://localhost:9000 $MINIO_ROOT_USER $MINIO_ROOT_PASSWORD
15 changes: 15 additions & 0 deletions .devcontainer/scripts/setup_test_envs.sh
@@ -0,0 +1,15 @@
+# S3 tests
+export AWS_ACCESS_KEY_ID=admin
+export AWS_SECRET_ACCESS_KEY=admin123
+export AWS_REGION=us-east-1
+export AWS_S3_TEST_BUCKET=testbucket
+export MINIO_ROOT_USER=admin
+export MINIO_ROOT_PASSWORD=admin123
+
+# Azure Blob tests
+export AZURE_TEST_CONTAINER_NAME=testcontainer
+export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://localhost:10000/devstoreaccount1;"
+
+# Other
+export PG_PARQUET_TEST=true
+export RUST_TEST_THREADS=1
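
The connection string above packs several `key=value` fields into one `;`-separated value. As a rough illustration of that layout, the sketch below pulls a single field out of such a string. The `conn_field` helper is hypothetical (it is not part of this commit), and the sample string is abbreviated: it omits the `AccountKey` field.

```shell
#!/bin/bash

# Hypothetical helper: print the value of one field of a ';'-separated
# connection string.
#   $1 = connection string
#   $2 = field name, e.g. AccountName
# Returns 1 if the field is not present.
conn_field() {
  local rest="$1" key="$2" part
  while [ -n "$rest" ]; do
    part="${rest%%;*}"              # text before the first ';'
    case "$rest" in
      *\;*) rest="${rest#*;}" ;;    # drop that part and its ';'
      *)    rest="" ;;
    esac
    case "$part" in
      # strip only up to the first '=', so '=' padding inside a value survives
      "$key"=*) printf '%s\n' "${part#*=}"; return 0 ;;
    esac
  done
  return 1
}

conn="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;BlobEndpoint=http://localhost:10000/devstoreaccount1;"
conn_field "$conn" AccountName     # prints devstoreaccount1
conn_field "$conn" BlobEndpoint    # prints http://localhost:10000/devstoreaccount1
```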
5 changes: 0 additions & 5 deletions .env_sample

This file was deleted.

47 changes: 23 additions & 24 deletions .github/workflows/ci.yml
@@ -1,8 +1,9 @@
 name: CI lints and tests
 on:
 push:
-branches:
-- "*"
+branches: [ "main" ]
+pull_request:
+branches: [ "main" ]

 concurrency:
 group: ${{ github.ref }}
@@ -69,12 +70,23 @@ jobs:
 sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
 wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
 sudo apt-get update
-sudo apt-get install build-essential libreadline-dev zlib1g-dev flex bison libxml2-dev libxslt-dev libssl-dev libxml2-utils xsltproc ccache pkg-config
+sudo apt-get -y install build-essential libreadline-dev zlib1g-dev flex bison libxml2-dev \
+libxslt-dev libssl-dev libxml2-utils xsltproc ccache pkg-config \
+gnupg ca-certificates
 sudo apt-get -y install postgresql-${{ env.PG_MAJOR }}-postgis-3 \
 postgresql-server-dev-${{ env.PG_MAJOR }} \
 postgresql-client-${{ env.PG_MAJOR }} \
 libpq-dev
+- name: Install Azurite
+run: |
+curl -fsSL https://deb.nodesource.com/setup_20.x | sudo bash -
+sudo apt-get update && sudo apt-get install -y nodejs
+curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null
+echo "deb [arch=`dpkg --print-architecture` signed-by=/etc/apt/trusted.gpg.d/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ `lsb_release -cs` main" | sudo tee /etc/apt/sources.list.d/azure-cli.list
+sudo apt-get update && sudo apt-get install -y azure-cli
+npm install -g azurite
 - name: Install MinIO
 run: |
 # Download and install MinIO server and client
@@ -107,23 +119,14 @@ jobs:
 $(pg_config --sharedir)/extension \
 /var/run/postgresql/
-# pgrx tests with runas argument ignores environment variables, so
-# we read env vars from .env file in tests (https://github.com/pgcentralfoundation/pgrx/pull/1674)
-touch /tmp/.env
-echo AWS_ACCESS_KEY_ID=${{ env.AWS_ACCESS_KEY_ID }} >> /tmp/.env
-echo AWS_SECRET_ACCESS_KEY=${{ env.AWS_SECRET_ACCESS_KEY }} >> /tmp/.env
-echo AWS_S3_TEST_BUCKET=${{ env.AWS_S3_TEST_BUCKET }} >> /tmp/.env
-echo AWS_REGION=${{ env.AWS_REGION }} >> /tmp/.env
-echo PG_PARQUET_TEST=${{ env.PG_PARQUET_TEST }} >> /tmp/.env
+# Set up test environments
+source .devcontainer/scripts/setup_test_envs.sh
 # Start MinIO server
-export MINIO_ROOT_USER=${{ env.AWS_ACCESS_KEY_ID }}
-export MINIO_ROOT_PASSWORD=${{ env.AWS_SECRET_ACCESS_KEY }}
-minio server /tmp/minio-storage > /dev/null 2>&1 &
+bash .devcontainer/scripts/setup_minio.sh
-# Set access key and create test bucket
-mc alias set local http://localhost:9000 ${{ env.AWS_ACCESS_KEY_ID }} ${{ env.AWS_SECRET_ACCESS_KEY }}
-aws --endpoint-url http://localhost:9000 s3 mb s3://${{ env.AWS_S3_TEST_BUCKET }}
+# Start Azurite server
+bash .devcontainer/scripts/setup_azurite.sh
 # Run tests with coverage tool
 source <(cargo llvm-cov show-env --export-prefix)
@@ -134,13 +137,9 @@ jobs:
 # Stop MinIO server
 pkill -9 minio
-env:
-RUST_TEST_THREADS: 1
-AWS_ACCESS_KEY_ID: test_secret_access_key
-AWS_SECRET_ACCESS_KEY: test_access_key_id
-AWS_REGION: us-east-1
-AWS_S3_TEST_BUCKET: testbucket
-PG_PARQUET_TEST: true
+# Stop Azurite server
+pkill -9 node
 - name: Upload coverage report to Codecov
 if: ${{ env.PG_MAJOR }} == 17
7 changes: 0 additions & 7 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 1 addition & 2 deletions Cargo.toml
@@ -22,9 +22,8 @@ arrow = {version = "53", default-features = false}
 arrow-schema = {version = "53", default-features = false}
 aws-config = { version = "1.5", default-features = false, features = ["rustls"]}
 aws-credential-types = {version = "1.2", default-features = false}
-dotenvy = "0.15"
 futures = "0.3"
-object_store = {version = "0.11", default-features = false, features = ["aws"]}
+object_store = {version = "0.11", default-features = false, features = ["aws", "azure"]}
 once_cell = "1"
 parquet = {version = "53", default-features = false, features = [
 "arrow",
25 changes: 21 additions & 4 deletions README.md
@@ -106,7 +106,13 @@ You can call `SELECT * FROM parquet.file_metadata(<uri>)` to discover file level
 You can call `SELECT * FROM parquet.kv_metadata(<uri>)` to query custom key-value metadata of the Parquet file at given uri.

 ## Object Store Support
-`pg_parquet` supports reading and writing Parquet files from/to `S3` object store. Only the uris with `s3://` scheme is supported.
+`pg_parquet` supports reading and writing Parquet files from/to `S3` and `Azure Blob Storage` object stores.
+
+> [!NOTE]
+> To be able to write into a object store location, you need to grant `parquet_object_store_write` role to your current postgres user.
+> Similarly, to read from an object store location, you need to grant `parquet_object_store_read` role to your current postgres user.
+#### S3 Storage

 The simplest way to configure object storage is by creating the standard `~/.aws/credentials` and `~/.aws/config` files:

@@ -129,9 +135,20 @@ Alternatively, you can use the following environment variables when starting pos
 - `AWS_CONFIG_FILE`: an alternative location for the config file
 - `AWS_PROFILE`: the name of the profile from the credentials and config file (default profile name is `default`)

-> [!NOTE]
-> To be able to write into a object store location, you need to grant `parquet_object_store_write` role to your current postgres user.
-> Similarly, to read from an object store location, you need to grant `parquet_object_store_read` role to your current postgres user.
+Supported S3 uri formats are shown below:
+- s3:// \<bucket\> / \<path\>
+- s3a:// \<bucket\> / \<path\>
+- https:// \<bucket\>.s3.amazonaws.com / \<path\>
+- https:// s3.amazonaws.com / \<bucket\> / \<path\>
+
+#### Azure Blob Storage
+
+You can use the following environment variables when starting postgres to configure the Azure Blob Storage client:
+- `AZURE_STORAGE_ACCOUNT_KEY`: the storage account key of the Azure Blob
+- `AZURE_STORAGE_SAS_TOKEN`: the storage SAS token for the Azure Blob
+
+Supported Azure Blob Storage uri formats are shown below:
+- https:// \<account\>.blob.core.windows.net / \<container\> / \<path\>

 ## Copy Options
 `pg_parquet` supports the following options in the `COPY TO` command: