google cloud pubsub verified source #489

Open
rudolfix opened this issue Jun 9, 2024 · 0 comments
Labels
verified source dlt source with tests and demos

rudolfix commented Jun 9, 2024

Quick source info

  • Name of the source: google_cloud_pubsub

Implement a dlt resource that retrieves messages from Google Cloud Pub/Sub.

The challenge here is that dlt processes messages (data) in batches that go through extract (getting messages from Pub/Sub), normalize (creating a relational schema) and load (e.g. to BigQuery). Pub/Sub is a simple message broker where each message must be acknowledged within a certain deadline. That makes exactly-once delivery - where no messages are lost and no duplicates are created - quite hard to achieve.

Note that our Kinesis and Kafka sources use offsets for robust exactly-once delivery. The offsets from the previous extract are committed from dlt state just before the next extract, and that state is stored together with the data - so when we move the offset we have a guarantee that the previous package is fully stored in the destination. This behavior may be hard to replicate with a message ack mechanism.

Nevertheless, we need a good idea of how to achieve this for Pub/Sub. Such simple brokers are quite popular and we should give our users the best possible way to read from them.
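
For illustration only, a minimal sketch of one way to approach this with the standard google-cloud-pubsub client (the subscription path, dataset and table names are made up): pull a batch, run the pipeline, and ack only after the load succeeds. This gives at-least-once rather than exactly-once delivery - a failed run redelivers messages instead of losing them.

```python
import dlt
from google.cloud import pubsub_v1

# Hypothetical subscription path; a real source would take this from config / GcpCredentials.
SUBSCRIPTION = "projects/my-project/subscriptions/telemetry-sub"


def run_with_ack(pipeline: dlt.Pipeline, max_messages: int = 500) -> None:
    subscriber = pubsub_v1.SubscriberClient()
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": max_messages}
    )
    if not response.received_messages:
        return  # nothing to do
    ack_ids = [m.ack_id for m in response.received_messages]
    rows = [
        {
            "message_id": m.message.message_id,
            "publish_time": m.message.publish_time,
            "data": m.message.data.decode("utf-8"),
        }
        for m in response.received_messages
    ]
    # run extract -> normalize -> load first ...
    pipeline.run(rows, table_name="pubsub_messages")
    # ... and ack only after the package is stored. If the run fails, the ack
    # deadline expires and messages are redelivered: duplicates are possible,
    # lost messages are not.
    subscriber.acknowledge(
        request={"subscription": SUBSCRIPTION, "ack_ids": ack_ids}
    )


run_with_ack(dlt.pipeline(destination="bigquery", dataset_name="telemetry"))
```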

Current Status

  • I plan to write it

What source does/will do

The typical use case is a stream of telemetry messages with JSON-based payloads that are delivered to a Pub/Sub subscription and then pulled from a Cloud Function, normalized and stored in a relational database.

Typically, messages are dispatched to different tables depending on the message type (each type has a different schema), and data contracts (e.g. Pydantic models) are used to filter out "bad data".
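
As a rough sketch of that dispatch pattern (the GithubEvent model, its fields and the resource name are made up for illustration), this assumes dlt's support for a callable table_name and a Pydantic model as the columns contract:

```python
import dlt
from pydantic import BaseModel


# Hypothetical data contract for one message type; records that do not match
# can be filtered or rejected depending on the configured schema contract.
class GithubEvent(BaseModel):
    id: str
    type: str
    created_at: str


@dlt.resource(
    table_name=lambda event: event["type"],  # one table per message type
    columns=GithubEvent,                     # Pydantic model as data contract
)
def events(messages):
    yield from messages


pipeline = dlt.pipeline(destination="bigquery", dataset_name="events")
pipeline.run(events([{"id": "1", "type": "push_event", "created_at": "2024-06-09"}]))
```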

Typical message feeds:

  • GitHub events
  • dlt anonymous telemetry tracker

Test account / test data

  • I'd like dltHub to create a test account before we start

Additional context

Implementation Requirements

  1. Must work both inside and outside of Google Cloud (authentication via GcpCredentials, as in e.g. the google ads / google sheets / bigquery sources and destinations).
  2. Must be suitable for running on a Cloud Function, where the max run time and local storage are limited, so a max consume time and a max message count should be observed.
  3. The "no more messages" situation must be detected - we do not want to wait out the max consume time.
  4. Implement a mechanism as close to exactly-once delivery as possible by manipulating ack deadlines and balancing time spent consuming vs. total time; duplicated messages are better than lost ones (a rough consume-loop sketch follows this list).
  5. Pub/Sub metadata should be attached to the message payload data, as we do for Kinesis / Kafka (the user can always remove it via a dlt transform).
  6. Look at the Kinesis / Kafka and Scrapy sources. We may need helper / wrapper functions to run the dlt pipeline in a way that acks messages.
  7. As for any other source: demo pipelines and documentation must be provided.
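
One possible shape for the consume loop behind items 2-4 - a sketch only, with assumed names and timeouts - using synchronous pull, early exit on an empty pull, and modify_ack_deadline to keep pulled messages alive until the load finishes:

```python
import time

from google.api_core.exceptions import DeadlineExceeded
from google.cloud import pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/telemetry-sub"  # assumed name


def consume(max_consume_time: float = 60.0, max_messages: int = 1000):
    """Pull until max_consume_time or max_messages is reached, or the subscription is drained."""
    subscriber = pubsub_v1.SubscriberClient()
    stop_at = time.monotonic() + max_consume_time
    messages, ack_ids = [], []
    while time.monotonic() < stop_at and len(messages) < max_messages:
        try:
            response = subscriber.pull(
                request={
                    "subscription": SUBSCRIPTION,
                    "max_messages": min(100, max_messages - len(messages)),
                },
                timeout=10.0,
            )
        except DeadlineExceeded:
            break  # nothing arrived within the pull timeout - treat as "no more messages"
        if not response.received_messages:
            break  # "no more messages" - do not wait out max_consume_time
        for received in response.received_messages:
            messages.append(received.message)
            ack_ids.append(received.ack_id)
        # push the ack deadline out so messages are not redelivered while
        # normalize / load is still running; acking happens only after the load
        subscriber.modify_ack_deadline(
            request={
                "subscription": SUBSCRIPTION,
                "ack_ids": ack_ids,
                "ack_deadline_seconds": 600,
            }
        )
    return messages, ack_ids
```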

Test Requirements

  1. Tests should create a unique subscription and test messages and drop it all at the end (see Kafka / Kinesis); a sketch follows this list.
  2. Mind that many test runs may happen at the same time on CI, so tests must be isolated.
  3. Both the local Pub/Sub emulator and the real service are options.
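
A rough sketch of the isolation pattern from items 1 and 2, with made-up project and resource names:

```python
import uuid

from google.cloud import pubsub_v1

PROJECT = "my-test-project"  # assumed test project


def make_isolated_subscription():
    """Create a unique topic + subscription per test run so parallel CI runs do not collide."""
    suffix = uuid.uuid4().hex[:8]
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(PROJECT, f"dlt-test-topic-{suffix}")
    sub_path = subscriber.subscription_path(PROJECT, f"dlt-test-sub-{suffix}")
    publisher.create_topic(request={"name": topic_path})
    subscriber.create_subscription(request={"name": sub_path, "topic": topic_path})
    return publisher, subscriber, topic_path, sub_path


def teardown(publisher, subscriber, topic_path, sub_path):
    # drop everything at the end, as the Kafka / Kinesis tests do
    subscriber.delete_subscription(request={"subscription": sub_path})
    publisher.delete_topic(request={"topic": topic_path})
```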

We have community-provided consumers for Pub/Sub which may be a starting point for this implementation.
