diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 469123d87..a323797fe 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -40,7 +40,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | sts\_endpoint | Custom endpoint for the STS API. | None |
 | profile | Option to specify an AWS Profile for credentials. | default |
 | canned\_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
-| compression | Compression type for S3 objects. 'gzip' is currently the only supported value by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. | None |
+| compression | Compression type for S3 objects. 'gzip' and 'parquet' are the only values supported by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the [columnify](https://github.com/reproio/columnify) command is installed, records can also be compressed in parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. Invalid combinations trigger a configuration error. If no value is set, no compression is applied. | None |
 | content\_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. | None |
 | send\_content\_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
 | auto\_retry\_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | true |
@@ -49,6 +49,14 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | storage\_class | Specify the [storage class](https://docs.aws.amazon.com/AmazonS3/latest/API/API\_PutObject.html#AmazonS3-PutObject-request-header-StorageClass) for S3 objects. If this option is not specified, objects will be stored with the default 'STANDARD' storage class. | None |
 | retry\_limit | Integer value to set the maximum number of retries allowed. Note: this configuration is released since version 1.9.10 and 2.0.1. For previous version, the number of retries is 5 and is not configurable. | 1 |
 | external\_id | Specify an external ID for the STS API, can be used with the role\_arn parameter if your role requires an external ID. | None |
+| parquet.compression | Compression type used inside parquet objects. 'uncompressed', 'snappy', 'gzip', and 'zstd' are supported; 'lzo', 'brotli', and 'lz4' are not supported yet. Values may be specified in lower case and are converted to upper case automatically. | SNAPPY |
+| parquet.pagesize | Page size of the parquet format. Defaults to 8192 bytes (8KiB). | 8192 bytes |
+| parquet.row\_group\_size | Row group size of the parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 bytes |
+| parquet.record\_type | Format type of records in the parquet format. Defaults to json. | json |
+| parquet.schema\_type | Format type of the schema used for parquet conversion. Defaults to avro. | avro |
+| parquet.schema\_file | Specify the path to the schema file used for parquet compression. | None |
+| parquet.process\_dir | Specify a temporary directory for processing parquet objects. This parameter is effective on non-Windows platforms. | Windows: %TMP_DIR%\parquet\s3, Linux/macOS: /tmp/parquet/s3 |
+
 
 ## TLS / SSL
 
@@ -282,6 +290,44 @@ Example:
 
 Then, the records will be stored into the MinIO server.
 
+## Usage for Parquet Compression
+
+For parquet compression, the [columnify](https://github.com/reproio/columnify) command must be installed on the running system or in the container at runtime.
+
+Once that command is available, out_s3 can handle parquet compression:
+
+```
+[OUTPUT]
+    Name s3
+    Match *
+    bucket your-bucket
+    use_put_object true
+    compression parquet
+    parquet.schema_file /path/to/your-schema.avsc
+    parquet.compression snappy
+```
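+
+With the default `parquet.schema_type` of avro, the file referenced by `parquet.schema_file` is an Avro schema describing the record layout. As a minimal sketch, assuming records that only carry `timestamp` and `message` fields (illustrative names, not required by the plugin), `your-schema.avsc` could look like:
+
+```
+{
+  "type": "record",
+  "name": "Log",
+  "fields": [
+    { "name": "timestamp", "type": "string" },
+    { "name": "message", "type": "string" }
+  ]
+}
+```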
+
+### Build columnify Command
+
+To build the columnify command, use a Golang development container as the builder stage of a multi-stage image:
+
+```
+# Always refers to the latest golang:1-alpine image
+FROM golang:1-alpine as builder
+
+ENV ROOT=/go/src/cmd
+WORKDIR ${ROOT}
+
+RUN apk update && \
+    apk add git
+
+RUN go install github.com/reproio/columnify/cmd/columnify@latest
+
+FROM debian:bullseye-slim as production
+
+# Put the columnify command on the PATH.
+COPY --from=builder /go/bin/columnify /usr/bin/columnify
+```
+
 ## Getting Started
 
 In order to send records into Amazon S3, you can run the plugin from the command line or through the configuration file.