Implement a dlt resource that retrieves messages from Google Cloud Pub/Sub.
The challenge here is that dlt processes messages (data) in batches going through extract (getting messages from Pub/Sub), normalize (creating the relational schema) and load (e.g. to BigQuery). Pub/Sub is a simple message broker where each message is acknowledged within a certain deadline. That makes exactly-once delivery - where no messages are lost but there are also no duplicates - quite hard.
Note that our Kinesis and Kafka sources use offsets for robust exactly-once delivery. The offsets from the previous extract are committed from dlt state just before the next extract, and that state is stored together with the data - when we move the offset we have a guarantee that the previous package is fully stored in the destination. This behavior may be hard to replicate with a message ack mechanism.
Nevertheless, we need a good idea of how to achieve this for Pub/Sub. Such simple brokers are quite popular and we should give our users the best possible way to read from them.
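For comparison, here is a minimal sketch of the offset-commit pattern described above. `fetch_batch` and the message fields are placeholders for a broker client; only `dlt.current.resource_state()` is a real dlt API here:

```python
import dlt


@dlt.resource(name="events")
def read_topic(fetch_batch):
    # offset that was committed together with the previously loaded package
    state = dlt.current.resource_state()
    offset = state.setdefault("last_offset", 0)
    for message in fetch_batch(start_offset=offset):  # fetch_batch is a placeholder
        yield message.payload
        # the new offset lives in dlt state, which is written to the destination
        # with the data, so it only advances once the package is fully loaded
        state["last_offset"] = message.offset + 1
```

With Pub/Sub there is no offset to re-read from, so the equivalent guarantee would have to come from acking messages only after the load succeeds.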
Current Status
I plan to write it
What source does/will do
The typical use case is a stream of telemetry messages with JSON-based payloads that are delivered to a Pub/Sub subscription and then pulled from a Cloud Function, normalized and stored in a relational database.
Typically, messages are dispatched to different tables depending on the message type (each type has a different schema) and data contracts (e.g. Pydantic models) are used to filter out "bad data", as sketched below.
Typical message feeds:
GitHub events
dlt anonymous telemetry tracker
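A rough sketch of that dispatch pattern, assuming dlt's support for a callable `table_name` and Pydantic models as data contracts; the `GithubEvent` model, the `event_type` field and the `telemetry_messages` name are made up for illustration:

```python
from typing import Any, Dict, Iterator

import dlt
from pydantic import BaseModel


class GithubEvent(BaseModel):
    # hypothetical data contract; each message type would get its own model
    event_type: str
    repo: str
    payload: Dict[str, Any]


# each message lands in a table named after its type; items violating the
# Pydantic contract are handled according to the configured schema contract
@dlt.resource(
    table_name=lambda item: item["event_type"],
    columns=GithubEvent,
)
def telemetry_messages(messages: Iterator[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    yield from messages
```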
Test account / test data
I'd like dltHub to create a test account before we start
Additional context
Implementation Requirements
must be able to work inside and outside of Google Cloud (authentication via GcpCredentials as in e.g. the Google Ads / Google Sheets / BigQuery sources and destinations)
suitable for running on a Cloud Function, where the max run time and local storage are limited, so a max consume time and a max message count should be observed
a "no more messages" situation must be detected - we do not want to wait for the max consume time
implement a mechanism as close to exactly-once delivery as possible by manipulating ack deadlines and the time spent consuming vs. the total time; duplicated messages are better than lost ones
Pub/Sub metadata should be attached to the message payload data, like we do for Kinesis / Kafka (the user can always remove it via a dlt transform)
look at the Kinesis/Kafka and Scrapy sources; we may need helper / wrapper functions to run the dlt pipeline in a way that acks messages (see the sketch after this list)
like for any other source: demo pipelines and documentation must be provided
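A possible shape for the resource and the ack-after-load wrapper mentioned above. This is a sketch only; names such as `pubsub_messages`, `max_messages`, `max_consume_time_seconds` and `run_and_ack` are proposals rather than an agreed API:

```python
import time
from typing import Any, Dict, Iterator, List

import dlt
from dlt.sources.credentials import GcpServiceAccountCredentials
from google.cloud import pubsub_v1


@dlt.resource(name="pubsub_messages")
def pubsub_messages(
    ack_ids: List[str],
    subscription_id: str = dlt.config.value,
    credentials: GcpServiceAccountCredentials = dlt.secrets.value,
    max_messages: int = 1000,
    max_consume_time_seconds: float = 300.0,
) -> Iterator[Dict[str, Any]]:
    """Pull messages from a Pub/Sub subscription and collect their ack ids."""
    subscriber = pubsub_v1.SubscriberClient(
        credentials=credentials.to_native_credentials()
    )
    subscription_path = subscriber.subscription_path(
        credentials.project_id, subscription_id
    )
    consumed = 0
    started_at = time.monotonic()
    # stop on max messages, max consume time or an empty pull ("no more messages")
    while consumed < max_messages:
        if time.monotonic() - started_at > max_consume_time_seconds:
            break
        response = subscriber.pull(
            request={"subscription": subscription_path, "max_messages": 100},
            timeout=10.0,
        )
        if not response.received_messages:
            break
        for received in response.received_messages:
            msg = received.message
            yield {
                "data": msg.data.decode("utf-8"),
                # Pub/Sub metadata attached to the payload like in Kinesis / Kafka
                "_pubsub_message_id": msg.message_id,
                "_pubsub_publish_time": msg.publish_time,
                "_pubsub_attributes": dict(msg.attributes),
            }
            ack_ids.append(received.ack_id)
            consumed += 1


def run_and_ack(
    pipeline: dlt.Pipeline,
    subscriber: pubsub_v1.SubscriberClient,
    subscription_path: str,
    subscription_id: str,
) -> None:
    """Ack only after the load package is fully stored in the destination."""
    ack_ids: List[str] = []
    info = pipeline.run(pubsub_messages(ack_ids, subscription_id=subscription_id))
    info.raise_on_failed_jobs()
    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```

The trade-off discussed above shows up here directly: if the ack deadline expires before the load finishes, messages are redelivered and we get duplicates rather than losses; extending the deadline during the run (e.g. with `modify_ack_deadline`) is one way to narrow that window.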
Test Requirements
tests should create a unique subscription and test messages and drop it all at the end (see Kafka / Kinesis, and the sketch at the end of this section)
mind that many test runs may happen at the same time on CI, so tests must be isolated
both a local Pub/Sub emulator and the real service are an option
We have community-provided consumers for Pub/Sub which may be a starting point for this implementation.
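A minimal sketch of the isolation approach, assuming tests run against a real project or the emulator; the project id, fixture name and naming scheme are placeholders:

```python
import uuid

import pytest
from google.cloud import pubsub_v1

PROJECT_ID = "dlt-pubsub-tests"  # placeholder test project


@pytest.fixture
def pubsub_subscription():
    """Create a uniquely named topic + subscription and drop both after the test."""
    suffix = uuid.uuid4().hex
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(PROJECT_ID, f"test_topic_{suffix}")
    subscription_path = subscriber.subscription_path(PROJECT_ID, f"test_sub_{suffix}")
    publisher.create_topic(request={"name": topic_path})
    subscriber.create_subscription(
        request={"name": subscription_path, "topic": topic_path}
    )
    try:
        yield topic_path, subscription_path
    finally:
        subscriber.delete_subscription(request={"subscription": subscription_path})
        publisher.delete_topic(request={"topic": topic_path})
```

Because every run gets its own uuid-suffixed topic and subscription, parallel CI runs never touch each other's messages.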