CDC log stream state (cdc$time) persisted via connect topic `connect-offsets` #16

hartmut-co-uk · 2021-12-02T14:37:01Z

Hi, when looking at the data published to connect-offsets table I noticed the latest window state is tracked by

'connector name'
array (per table..?) of tuple4
- keyspace_name
- table_name
- vnode_id
- generation_start

Why is this at the vnode_id level and where does this information come from?
When querying the table the vnode_id is not used as a query condition, right?

Further implication (maybe?):
The topic connect-offsets is created by kafka connect (not the scylla connector) and is not a compacted topic.
While running a simple test (scylla.query.time.window.size: 2000) for 1 connector, 1 task, 1 table - resulted in ~1M messages on the docker-connect-offsets topic.
@pkgonan may I ask if you've got numbers to confirm this for a more comprehensive setup?

@haaawk how is this topic consumed upon connector (re)start / task/consumer rebalancing? From beginning?

Update 2021-12-15:

ℹ️ For reference: the part on connect-offsets already has been well described and addressed in a section in the repo README:

scylla-cdc-source-connector/README.md

Lines 601 to 605 in ecbeb1d

    
           #### Offset (progress) storage 
        
           Scylla CDC Source Connector reads the CDC log by quering on [Vnode](https://docs.scylladb.com/architecture/ringarchitecture/) granularity level. It uses Kafka Connect to store current progress (offset) for each Vnode. By default, there are 256 Vnodes per each Scylla node. Kafka Connect stores those offsets in its `connect-offsets` internal topic, but it could grow large in case of big Scylla clusters. You can minimize this topic size, by adjusting the following configuration options on this topic: 
        
           1. `segment.bytes` or `segment.ms` - lowering them will make the compaction process trigger more often. 
        
           2. `cleanup.policy=delete` and setting `retention.ms` to at least the TTL value of your Scylla CDC table (in milliseconds; Scylla default is 24 hours). Using this configuration, older offsets will be deleted. By setting `retention.ms` to at least the TTL value of your Scylla CDC table, we make sure to delete only those offsets that have already expired in the source Scylla CDC table.

Offset (progress) storage

Scylla CDC Source Connector reads the CDC log by quering on Vnode granularity level. It uses Kafka Connect to store current progress (offset) for each Vnode. By default, there are 256 Vnodes per each Scylla node. Kafka Connect stores those offsets in its connect-offsets internal topic, but it could grow large in case of big Scylla clusters. You can minimize this topic size, by adjusting the following configuration options on this topic:

segment.bytes or segment.ms - lowering them will make the compaction process trigger more often.

cleanup.policy=delete and setting retention.ms to at least the TTL value of your Scylla CDC table (in milliseconds; Scylla default is 24 hours). Using this configuration, older offsets will be deleted. By setting retention.ms to at least the TTL value of your Scylla CDC table, we make sure to delete only those offsets that have already expired in the source Scylla CDC table.

The text was updated successfully, but these errors were encountered:

haaawk · 2021-12-02T15:55:29Z

In a single query, the connector queries all the streams that belong to a given vnode. That's why the offset is tracked by vnode_id.

Does that answer your question @hartmut-co-uk ?

hartmut-co-uk · 2021-12-02T17:37:56Z

thanks!
How is this topic consumed upon connector (re)start / task/consumer rebalancing? From beginning?

How do 'generation_start' and streams relate?
Are there Scylla system topics where all of this is maintained?

haaawk · 2021-12-02T17:59:13Z

@avelanarius Could you please answer with details here?

hartmut-co-uk · 2021-12-14T01:20:28Z

I've been playing with the TableBackedProgressManager of scylla-cdc-go and I think it might be a good alternative candidate on how to persist current CDC log stream state (cdc$time)...

Are there plans to add similar functionality to either this repo or scylla-cdc-java?

hartmut-co-uk mentioned this issue Dec 2, 2021

Detected performance impact query #15

Open

hartmut-co-uk changed the title ~~Why are offsets tracked by vnode_id (docker-connect-offsets)?~~ CDC log stream state (cdc$time) persisted via connect topic docker-connect-offsets Dec 14, 2021

hartmut-co-uk changed the title ~~CDC log stream state (cdc$time) persisted via connect topic docker-connect-offsets~~ CDC log stream state (cdc$time) persisted via connect topic connect-offsets Dec 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CDC log stream state (cdc$time) persisted via connect topic `connect-offsets` #16

CDC log stream state (cdc$time) persisted via connect topic `connect-offsets` #16

hartmut-co-uk commented Dec 2, 2021 •

edited

Loading

Offset (progress) storage

haaawk commented Dec 2, 2021

hartmut-co-uk commented Dec 2, 2021

haaawk commented Dec 2, 2021

hartmut-co-uk commented Dec 14, 2021

CDC log stream state (cdc$time) persisted via connect topic connect-offsets #16

CDC log stream state (cdc$time) persisted via connect topic connect-offsets #16

Comments

hartmut-co-uk commented Dec 2, 2021 • edited Loading

Update 2021-12-15:

Offset (progress) storage

haaawk commented Dec 2, 2021

hartmut-co-uk commented Dec 2, 2021

haaawk commented Dec 2, 2021

hartmut-co-uk commented Dec 14, 2021

CDC log stream state (cdc$time) persisted via connect topic `connect-offsets` #16

CDC log stream state (cdc$time) persisted via connect topic `connect-offsets` #16

hartmut-co-uk commented Dec 2, 2021 •

edited

Loading