Add parquet compressor using columnify #338

Merged
ashie merged 5 commits into fluent:master from the add-parquet-compressor branch on Apr 8, 2021

Conversation

okkez
Contributor

@okkez okkez commented Jun 30, 2020

How to use:

Install columnify.

Sample configuration for the parquet compressor:

<match>
  @id s3-parquet
  @type s3

  s3_region ap-northeast-1
  s3_bucket xxx
  store_as parquet
  <compress>
    parquet_compression_codec snappy
    record_type msgpack
    schema_type avro
    schema_file /path/to/log.avsc
  </compress>

  <format>
    @type msgpack
  </format>
</match>

log.avsc looks like the following:

{
  "name": "AccessLog",
  "type": "record",
  "fields": [
    { "name": "container_id", "type": "string" },
    { "name": "method", "type": "string" },...
  ]
}

See https://avro.apache.org/docs/current/spec.html for more details about Avro schemas.
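As a quick sanity check before pointing schema_file at a schema, the following sketch parses an Avro schema as JSON and lists its field names. It is only an illustration: the helper name avro_record_fields and the constant SAMPLE_SCHEMA are hypothetical, the sample keeps just the two fields shown above, and the check covers only the top-level record shape, not full Avro validation.

```ruby
require "json"

# Hypothetical helper (not part of fluent-plugin-s3 or columnify).
# Returns the field names of an Avro record schema given as a JSON string.
# Raises ArgumentError when the top-level schema is not a record.
def avro_record_fields(schema_json)
  schema = JSON.parse(schema_json)
  unless schema.is_a?(Hash) && schema["type"] == "record"
    raise ArgumentError, "top-level Avro schema must be a record"
  end
  fields = schema["fields"]
  raise ArgumentError, "record schema needs a fields array" unless fields.is_a?(Array)
  fields.map { |field| field.fetch("name") }
end

# Illustrative schema containing only the two fields shown in the sample.
SAMPLE_SCHEMA = <<~JSON
  {
    "name": "AccessLog",
    "type": "record",
    "fields": [
      { "name": "container_id", "type": "string" },
      { "name": "method", "type": "string" }
    ]
  }
JSON

puts avro_record_fields(SAMPLE_SCHEMA).inspect
```

In practice you would read the real file instead, for example avro_record_fields(File.read("/path/to/log.avsc")).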

Notice:
columnify's memory usage is proportional to the input file size. For example, columnify consumes about 750 MB of memory (RSS) while processing a 128 MB msgpack file.
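For sizing buffer chunks, a back-of-the-envelope estimate can be derived from that single measurement. The sketch below assumes memory scales linearly with input size, which is an extrapolation from one data point rather than a measured guarantee; the helper name estimated_rss_mb and the constants are hypothetical.

```ruby
# Single data point from the notice above:
# about 750 MB RSS while processing a 128 MB msgpack input.
OBSERVED_INPUT_MB = 128.0
OBSERVED_RSS_MB = 750.0

# Rough RSS estimate in MB, assuming memory grows linearly with input size.
def estimated_rss_mb(input_mb)
  (input_mb * OBSERVED_RSS_MB / OBSERVED_INPUT_MB).round
end

[64, 128, 256].each do |size|
  puts "#{size} MB input -> roughly #{estimated_rss_mb(size)} MB RSS"
end
```

Treat the numbers as a starting point and measure with your own data, since record shape and codec may change the ratio.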

See also #221

@okkez okkez force-pushed the add-parquet-compressor branch from 32b4c97 to 07542d8 Compare June 30, 2020 05:05
@okkez okkez force-pushed the add-parquet-compressor branch from 07542d8 to 16d2ba1 Compare June 30, 2020 05:06
@repeatedly
Member

Looks good to me.
Need feedback from other users.

@repeatedly repeatedly self-assigned this Jun 30, 2020
@scrwr

scrwr commented Jul 5, 2020

I'd really love to test it, but we use Bundler in our build process to install all plugins, and Fluentd unfortunately stopped importing Bundler-installed git sources, so I end up with "Unknown output plugin 's3'" if I add this to our Gemfile:

gem 'fluent-plugin-s3', git: 'https://github.com/okkez/fluent-plugin-s3.git', branch: 'add-parquet-compressor'

Can you help get this working without completely working around bundle install?

@lsudo

lsudo commented Sep 29, 2020

Thank you for this, I really appreciate it.

@okkez
Contributor Author

okkez commented Mar 31, 2021

This parquet compressor has worked fine on our system for the past 9 months.

Can I merge this PR?

@okkez okkez force-pushed the add-parquet-compressor branch from 3f8aefe to e6cdf15 Compare March 31, 2021 08:53
@okkez
Contributor Author

okkez commented Apr 1, 2021

@repeatedly @ganmacs @kenhys @ashie Can you merge this PR and release the new version?

@ashie
Member

ashie commented Apr 1, 2021

I'll merge & release it after waiting for comments from other maintainers for a few days.

@ashie ashie self-assigned this Apr 2, 2021
Member

@ashie ashie left a comment

The sample configuration for parquet compressor.

<match>
  @id s3-parquet
  @type s3

  s3_region ap-northeast-1
  s3_bucket xxx
  <compress>
    parquet_compression_codec snappy
    record_type msgpack
    schema_type avro
    schema_file /path/to/log.avsc
  </compress>

  <format>
    @type msgpack
  </format>
</match>

@okkez store_as parquet is also required, isn't it?

README.md (resolved)
@okkez
Contributor Author

okkez commented Apr 6, 2021

@okkez store_as parquet is also required, isn't it?

Yes. It is required.

@okkez okkez force-pushed the add-parquet-compressor branch from 7920ef4 to 527cb7a Compare April 7, 2021 02:30
lib/fluent/plugin/s3_compressor_parquet.rb (outdated, resolved)
README.md (outdated, resolved)
okkez added 2 commits April 7, 2021 17:11
Signed-off-by: Kenji Okimoto <[email protected]>
@ashie
Member

ashie commented Apr 8, 2021

We'll refine the documentation in #373.

@ashie ashie merged commit ab61912 into fluent:master Apr 8, 2021
@ashie
Member

ashie commented Apr 8, 2021

Thank you for your work!

@sunnymeska-w

The compressor doesn't support the list type in the schema file. Please add support for list input types.
8 participants