batching policy confusion #2967

pmak-852 · 2024-10-28T16:09:39Z

I have a simple pipeline that pulls messages from Kafka in batches and persists them to S3. The input batch size is 50 messages, while the output batch size is 20. I expected batch_size() at the output to report 20, but it reports 50. When I remove the input batching policy, the output batch size is 20 as expected.

I hope I have understood the concept of batching in Redpanda Connect correctly. My understanding is:

Say the input has 100 records; Redpanda Connect cuts them evenly into 2 batches of 50 in this case.
Each batch goes through the pipeline processors, after which we still have 2 batches of 50 messages each.
When a batch of 50 messages arrives at the output, it is cut further into smaller batches of 20 messages.
The processors in the output batching policy handle 20 messages at a time, so batch_size() should return 20 instead of 50.

Many thanks!

My connect.yaml is as follows:

input:
  kafka:
    addresses: [ "redpanda-0:9092" ]
    topics: [ "example" ]
    consumer_group: "test4"
    checkpoint_limit: 1024
    auto_replay_nacks: true
    batching:
      count: 50
      period: "5s"



pipeline:
  processors:
    - mapping: |
        root.data = this
        root.meta = metadata()



output:
  label: "unittest"
  aws_s3:
    bucket: "raw-zone-general"
    path: ${!meta("kafka_topic")}-${!meta("kafka_partition")}-${!meta("kafka_offset").from(-1)}-${!timestamp_unix_nano()}.zip
    endpoint: "http://minio:9000"
    region: "local"
    force_path_style_urls: true
    batching:
      count: 10
      period: "10s"
      processors:
        - log:
            level: INFO
            fields_mapping: |
              root.kafka_topic = meta("kafka_offset").from(-1)
              root.sz = batch_size()
        - archive:
            format: tar
            path: ${!meta("kafka_partition")}-${!meta("kafka_offset")}

    credentials:
      id: "testing_account"
      secret: "testing_pwd"

Answered by mihaitodor

mihaitodor (Collaborator) · 2024-10-31T18:44:32Z

Hey @pmak-852 👋

> The input batch size is 50 messages while the output is 20. i am expecting the output batch size is 20, which is not the case, it says 50. When i remove the input batching policy, the output batch size is 20 as expected.

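This is by design. Connect never shrinks batches; it only merges them. For example, if you have small batches that you want to combine into larger ones, they get concatenated. Since your input already emits batches of 50, the output batching policy has nothing to merge and passes each batch through unchanged, which is why batch_size() reports 50. There's usually no reason to configure batching at both the input and the output level.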

If you wish to shrink batches, you can use a processor such as group_by, group_by_value, or split.
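
As a rough sketch of the split approach (not taken from the thread; the size of 20 simply mirrors the number mentioned in the question), a split processor in the pipeline section breaks each incoming batch into batches of at most 20 messages. Since Connect merges batches, an output batching policy with its own count may still combine these smaller batches again, so treat this as a starting point to test rather than a definitive config:

```yaml
pipeline:
  processors:
    - mapping: |
        root.data = this
        root.meta = metadata()
    # Assumed addition: break each incoming batch (e.g. 50 messages)
    # into batches of at most 20 messages before they reach the output.
    - split:
        size: 20
```

The split is placed in the pipeline rather than under the output's batching processors because it needs to run before the output batching policy sees the messages.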

2 replies

pmak-852 Nov 1, 2024
Author

@mihaitodor got it!

Just one additional question: what if I want to handle each Redpanda topic differently, e.g. topic A is persisted to S3, while topic B goes through some data transformation and is persisted to a SQL database? Is it recommended to run a single Redpanda Connect instance and use the switch processor, or to run multiple Redpanda Connect instances that handle the topics separately?

mihaitodor Nov 2, 2024
Collaborator

In general, I think the switch output should be OK, but ultimately you should test it and see how it works for your use case.
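
For illustration only, a switch output that routes on the kafka_topic metadata might look roughly like the sketch below. The topic names, table, columns, DSN, and the transformation mapping are all hypothetical placeholders rather than anything from this thread, and the input is assumed to consume both topics:

```yaml
output:
  switch:
    cases:
      # Hypothetical topic name: messages from topic_a go to S3 as-is.
      - check: meta("kafka_topic") == "topic_a"
        output:
          aws_s3:
            bucket: "raw-zone-general"
            path: ${!meta("kafka_topic")}-${!meta("kafka_partition")}-${!meta("kafka_offset")}.json
      # Hypothetical topic name: messages from topic_b are transformed,
      # then inserted into a SQL table.
      - check: meta("kafka_topic") == "topic_b"
        output:
          sql_insert:
            driver: postgres
            dsn: postgres://user:password@localhost:5432/exampledb?sslmode=disable  # placeholder DSN
            table: events                # placeholder table
            columns: [ id, payload ]     # placeholder columns
            args_mapping: 'root = [ this.id, this.payload ]'
          processors:
            # Placeholder transformation applied only to this branch.
            - mapping: |
                root.id = this.id
                root.payload = this.format_json()
```

Keeping both routes in one instance means a single config to deploy, while separate instances let each topic be scaled and restarted independently; as noted above, it is worth testing against the actual workload.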
