From e1543090e5be89be5b981c114b42540eada01580 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 15:32:15 +0900
Subject: [PATCH 01/12] out_s3: Add descriptions for parquet compression related parameters

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 469123d87..05c17f117 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -40,7 +40,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | sts\_endpoint | Custom endpoint for the STS API. | None |
 | profile | Option to specify an AWS Profile for credentials. | default |
 | canned\_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
-| compression | Compression type for S3 objects. 'gzip' is currently the only supported value by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. | None |
+| compression | Compression type for S3 objects. 'gzip' and 'parquet' are currently the only supported values by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the columnify command is installed, you can also compress in parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. | None |
 | content\_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. | None |
 | send\_content\_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
 | auto\_retry\_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | true |
@@ -49,6 +49,13 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | storage\_class | Specify the [storage class](https://docs.aws.amazon.com/AmazonS3/latest/API/API\_PutObject.html#AmazonS3-PutObject-request-header-StorageClass) for S3 objects. If this option is not specified, objects will be stored with the default 'STANDARD' storage class. | None |
 | retry\_limit | Integer value to set the maximum number of retries allowed. Note: this configuration is released since version 1.9.10 and 2.0.1. For previous version, the number of retries is 5 and is not configurable. | 1 |
 | external\_id | Specify an external ID for the STS API, can be used with the role\_arn parameter if your role requires an external ID. | None |
+| parquet.compression | Compression type for parquet. 'uncompressed', 'snappy', 'gzip', 'zstd' are the supported values by default. 'lzo', 'brotli', 'lz4' are not supported for now. | SNAPPY |
+| parquet.pagesize | Page size of parquet format. Defaults to 8192 bytes (8KiB). | 8192 |
+| parquet.row\_group\_size | Row group size of parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 |
+| parquet.record\_type | Format type of records on parquet format. Defaults to json. | json |
+| parquet.schema\_type | Format type of schema on parquet format. Defaults to avro. | avro |
+| parquet.schema\_file | Speficy path to schema file for parquet compression. | None |
+
 ## TLS / SSL

From b8727990b5aa23c71c56268a7100d7c88839d8eb Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 15:45:23 +0900
Subject: [PATCH 02/12] out_s3: Add an example configuration for parquet

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 05c17f117..40bf0d09e 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -289,6 +289,23 @@ Example:
 
 Then, the records will be stored into the MinIO server.
 
+## Usage for Parquet Compression
+
+For parquet compression, it needs to install [columnify](https://github.com/reproio/columnify) at runtime.
+
+After installing that command, out_s3 can handle parquet compression:
+
+```
+[OUTPUT]
+  Name s3
+  Match *
+  bucket your-bucket
+  Use_Put_object true
+  compression parquet
+  parquet.schema_file /path/to/your-schema.avsc
+  parquet.compression snappy
+```
+
 ## Getting Started
 
 In order to send records into Amazon S3, you can run the plugin from the command line or through the configuration file.

From 9b411e4ada35506a98539112cea8a13b97340f66 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 20:01:12 +0900
Subject: [PATCH 03/12] out_s3: Fix a typo

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 40bf0d09e..613d58f65 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -54,7 +54,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | parquet.row\_group\_size | Row group size of parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 |
 | parquet.record\_type | Format type of records on parquet format. Defaults to json. | json |
 | parquet.schema\_type | Format type of schema on parquet format. Defaults to avro. | avro |
-| parquet.schema\_file | Speficy path to schema file for parquet compression. | None |
+| parquet.schema\_file | Specify path to schema file for parquet compression. | None |
 
 ## TLS / SSL

From 0b6238b03597107d8727c82557635a77a57e03db Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 20:12:03 +0900
Subject: [PATCH 04/12] out_s3: Add more concrete description for columnify requirements

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 613d58f65..f85423222 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -291,7 +291,7 @@ Then, the records will be stored into the MinIO server.
 
 ## Usage for Parquet Compression
 
-For parquet compression, it needs to install [columnify](https://github.com/reproio/columnify) at runtime.
+For parquet compression, it needs to install [columnify](https://github.com/reproio/columnify) in the running system or container at runtime.
 
 After installing that command, out_s3 can handle parquet compression:

From 073b167c396a6425d9efed706a9d94ba64e9f987 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 20:29:32 +0900
Subject: [PATCH 05/12] Update pipeline/outputs/s3.md

Co-authored-by: Pat
Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index f85423222..bcfbbeca8 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -40,7 +40,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | sts\_endpoint | Custom endpoint for the STS API. | None |
 | profile | Option to specify an AWS Profile for credentials. | default |
 | canned\_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
-| compression | Compression type for S3 objects. 'gzip' and 'parquet' are currently the only supported values by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the columnify command is installed, you can also compress in parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. | None |
+| compression | Compression type for S3 objects. 'gzip' and 'parquet' are currently the only supported values by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the columnify command is installed, you can also compress in parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. A configuration error will be triggered for invalid combinations. The default is no compression. | |
 | content\_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. | None |
 | send\_content\_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
 | auto\_retry\_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | true |

From 38a32d2eabf5b4b34af0a982d541846e314235b9 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 20:33:18 +0900
Subject: [PATCH 06/12] out_s3: Add bytes unit explicitly

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index bcfbbeca8..7f6d1a820 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -50,8 +50,8 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | storage\_class | Specify the [storage class](https://docs.aws.amazon.com/AmazonS3/latest/API/API\_PutObject.html#AmazonS3-PutObject-request-header-StorageClass) for S3 objects. If this option is not specified, objects will be stored with the default 'STANDARD' storage class. | None |
 | retry\_limit | Integer value to set the maximum number of retries allowed. Note: this configuration is released since version 1.9.10 and 2.0.1. For previous version, the number of retries is 5 and is not configurable. | 1 |
 | external\_id | Specify an external ID for the STS API, can be used with the role\_arn parameter if your role requires an external ID. | None |
 | parquet.compression | Compression type for parquet. 'uncompressed', 'snappy', 'gzip', 'zstd' are the supported values by default. 'lzo', 'brotli', 'lz4' are not supported for now. | SNAPPY |
-| parquet.pagesize | Page size of parquet format. Defaults to 8192 bytes (8KiB). | 8192 |
-| parquet.row\_group\_size | Row group size of parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 |
+| parquet.pagesize | Page size of parquet format. Defaults to 8192 bytes (8KiB). | 8192 bytes |
+| parquet.row\_group\_size | Row group size of parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 bytes |
 | parquet.record\_type | Format type of records on parquet format. Defaults to json. | json |
 | parquet.schema\_type | Format type of schema on parquet format. Defaults to avro. | avro |
 | parquet.schema\_file | Specify path to schema file for parquet compression. | None |

From 362d632ad71318e27959b3d1877d0901775b80b3 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 May 2024 20:37:30 +0900
Subject: [PATCH 07/12] out_s3: Add more descriptions for parquet.compression parameter which indicates that lower case is also accepted

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 7f6d1a820..93fb64ec0 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -49,7 +49,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | storage\_class | Specify the [storage class](https://docs.aws.amazon.com/AmazonS3/latest/API/API\_PutObject.html#AmazonS3-PutObject-request-header-StorageClass) for S3 objects. If this option is not specified, objects will be stored with the default 'STANDARD' storage class. | None |
 | retry\_limit | Integer value to set the maximum number of retries allowed. Note: this configuration is released since version 1.9.10 and 2.0.1. For previous version, the number of retries is 5 and is not configurable. | 1 |
 | external\_id | Specify an external ID for the STS API, can be used with the role\_arn parameter if your role requires an external ID. | None |
-| parquet.compression | Compression type for parquet. 'uncompressed', 'snappy', 'gzip', 'zstd' are the supported values by default. 'lzo', 'brotli', 'lz4' are not supported for now. | SNAPPY |
+| parquet.compression | Compression type for parquet. 'uncompressed', 'snappy', 'gzip', 'zstd' are the supported values by default. 'lzo', 'brotli', 'lz4' are not supported for now. The default value is SNAPPY (upper case). Lower-case values are also accepted and will be converted to upper case automatically. | SNAPPY |
 | parquet.pagesize | Page size of parquet format. Defaults to 8192 bytes (8KiB). | 8192 bytes |
 | parquet.row\_group\_size | Row group size of parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 bytes |
 | parquet.record\_type | Format type of records on parquet format. Defaults to json. | json |
From ccdf7e12549ba26006bcba5a629197a692609010 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Mon, 27 May 2024 13:57:36 +0900
Subject: [PATCH 08/12] out_s3: Add a link for columnify

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 93fb64ec0..d5ede8fee 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -40,7 +40,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | sts\_endpoint | Custom endpoint for the STS API. | None |
 | profile | Option to specify an AWS Profile for credentials. | default |
 | canned\_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
-| compression | Compression type for S3 objects. 'gzip' and 'parquet' are currently the only supported values by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the columnify command is installed, you can also compress in parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. A configuration error will be triggered for invalid combinations. The default is no compression. | |
+| compression | Compression type for S3 objects. 'gzip' and 'parquet' are currently the only supported values by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the [columnify](https://github.com/reproio/columnify) command is installed, you can also compress in parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. A configuration error will be triggered for invalid combinations. The default is no compression. | |
 | content\_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. | None |
 | send\_content\_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
 | auto\_retry\_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | true |

From 55ab7ac4adbbf2e65a7513d4c5c03383bd4c75af Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Mon, 27 May 2024 13:59:15 +0900
Subject: [PATCH 09/12] out_s3: Mention target people explicitly

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index d5ede8fee..6299a542d 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -291,7 +291,7 @@ Then, the records will be stored into the MinIO server.
 
 ## Usage for Parquet Compression
 
-For parquet compression, it needs to install [columnify](https://github.com/reproio/columnify) in the running system or container at runtime.
+For parquet compression, users need to install [columnify](https://github.com/reproio/columnify) in the running system or container at runtime.
 
 After installing that command, out_s3 can handle parquet compression:

From 48cf583c704ec5d5e4f227ab27d9721ed6a5e487 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Mon, 3 Jun 2024 19:44:22 +0900
Subject: [PATCH 10/12] out_s3: Add more description for configurable parquet.process_dir parameter

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 6299a542d..ef775ce5b 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -55,6 +55,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | parquet.record\_type | Format type of records on parquet format. Defaults to json. | json |
 | parquet.schema\_type | Format type of schema on parquet format. Defaults to avro. | avro |
 | parquet.schema\_file | Specify path to schema file for parquet compression. | None |
+| parquet.process\_dir | Specify a temporary directory for processing parquet objects. This parameter is effective on non-Windows platforms. | /tmp |
 
 ## TLS / SSL

From 94838f393644cce15c6883bc63a064661b33496f Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Mon, 3 Jun 2024 19:53:55 +0900
Subject: [PATCH 11/12] out_s3: Add an example for building columnify command inside containers

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index ef775ce5b..38c0fc9d8 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -307,6 +307,27 @@ After installing that command, out_s3 can handle parquet compression:
   parquet.compression snappy
 ```
 
+### Build columnify Command
+
+To build the columnify command, users can use a Golang development container as a build stage:
+
+```
+# Always refer to the latest golang:1-alpine image.
+FROM golang:1-alpine as builder
+
+ENV ROOT=/go/src/cmd
+WORKDIR ${ROOT}
+
+RUN apk update && \
+    apk add git
+
+RUN go install github.com/reproio/columnify/cmd/columnify@latest
+
+FROM debian:bullseye-slim as production
+
+# Put the columnify command inside the PATH.
+COPY --from=builder /go/bin/columnify /usr/bin/columnify
+```
+
 ## Getting Started
 
 In order to send records into Amazon S3, you can run the plugin from the command line or through the configuration file.

From 08a802dd218daf215eeba0ca0ef34d3fcb008168 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Tue, 4 Jun 2024 18:25:09 +0900
Subject: [PATCH 12/12] out_s3: Update a default value for parquet.process_dir parameter

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 38c0fc9d8..a323797fe 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -55,7 +55,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | parquet.record\_type | Format type of records on parquet format. Defaults to json. | json |
 | parquet.schema\_type | Format type of schema on parquet format. Defaults to avro. | avro |
 | parquet.schema\_file | Specify path to schema file for parquet compression. | None |
-| parquet.process\_dir | Specify a temporary directory for processing parquet objects. This parameter is effective on non-Windows platforms. | /tmp |
+| parquet.process\_dir | Specify a temporary directory for processing parquet objects. This parameter is effective on non-Windows platforms. | Windows: %TMP_DIR%\parquet\s3 Linux/macOS: /tmp/parquet/s3 |
 
 ## TLS / SSL
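For reference, the configuration example added in patch 02 points `parquet.schema_file` at an Avro schema file (`your-schema.avsc`), which is what columnify consumes when `parquet.schema_type` is `avro`. A minimal schema might look like the following sketch; the record name and fields here are purely illustrative assumptions and must be adapted to the shape of the records actually being buffered:

```
{
  "type": "record",
  "name": "Log",
  "fields": [
    {"name": "timestamp", "type": "string"},
    {"name": "message", "type": "string"}
  ]
}
```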