google cloud pubsub verified source #489

Open
rudolfix opened this issue Jun 9, 2024 · 0 comments
Labels
verified source dlt source with tests and demos

rudolfix commented Jun 9, 2024

Quick source info

  • Name of the source: google_cloud_pubsub

Implement a dlt resource that retrieves messages from Google Cloud Pub/Sub.

The challenge here is that dlt processes messages (data) in batches that go through extract (getting messages from Pub/Sub), normalize (creating a relational schema) and load (e.g. to BigQuery). Pub/Sub is a simple message broker where each message must be acknowledged within a certain deadline. That makes exactly-once delivery - where no messages are lost and no duplicates are created - quite hard to achieve.

Note that our Kinesis and Kafka sources use offsets for robust exactly-once delivery. The offsets from the previous extract are committed from dlt state just before the next extract, and that state is stored together with the data - so when we move the offset we have a guarantee that the previous package is fully stored in the destination. This behavior may be hard to replicate with a message ack mechanism.

Nevertheless, we need a good idea of how to achieve this for Pub/Sub. Such simple brokers are quite popular and we should give our users the best possible way to read from them.
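
For illustration only, a minimal sketch of one way to approach this with the standard google-cloud-pubsub client (the subscription path, dataset and table names are made up): pull a batch, run the pipeline, and ack only after the load succeeds. This gives at-least-once rather than exactly-once delivery - a failed run redelivers messages instead of losing them.

```python
import dlt
from google.cloud import pubsub_v1

# Hypothetical subscription path; a real source would take this from config / GcpCredentials.
SUBSCRIPTION = "projects/my-project/subscriptions/telemetry-sub"


def run_with_ack(pipeline: dlt.Pipeline, max_messages: int = 500) -> None:
    subscriber = pubsub_v1.SubscriberClient()
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": max_messages}
    )
    if not response.received_messages:
        return  # nothing to do
    ack_ids = [m.ack_id for m in response.received_messages]
    rows = [
        {
            "message_id": m.message.message_id,
            "publish_time": m.message.publish_time,
            "data": m.message.data.decode("utf-8"),
        }
        for m in response.received_messages
    ]
    # run extract -> normalize -> load first ...
    pipeline.run(rows, table_name="pubsub_messages")
    # ... and ack only after the package is stored. If the run fails, the ack
    # deadline expires and messages are redelivered: duplicates are possible,
    # lost messages are not.
    subscriber.acknowledge(
        request={"subscription": SUBSCRIPTION, "ack_ids": ack_ids}
    )


run_with_ack(dlt.pipeline(destination="bigquery", dataset_name="telemetry"))
```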

Current Status

  • I plan to write it

What source does/will do

The typical use case is a stream of telemetry messages with JSON-based payloads that are delivered to a Pub/Sub subscription and then pulled from a Cloud Function, normalized and stored in a relational database.

Typically, messages are dispatched to different tables depending on the message type (each type has a different schema), and data contracts (e.g. Pydantic models) are used to filter out "bad data".
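
As a rough sketch of that dispatch pattern (the GithubEvent model, its fields and the resource name are made up for illustration), this assumes dlt's support for a callable table_name and a Pydantic model as the columns contract:

```python
import dlt
from pydantic import BaseModel


# Hypothetical data contract for one message type; records that do not match
# can be filtered or rejected depending on the configured schema contract.
class GithubEvent(BaseModel):
    id: str
    type: str
    created_at: str


@dlt.resource(
    table_name=lambda event: event["type"],  # one table per message type
    columns=GithubEvent,                     # Pydantic model as data contract
)
def events(messages):
    yield from messages


pipeline = dlt.pipeline(destination="bigquery", dataset_name="events")
pipeline.run(events([{"id": "1", "type": "push_event", "created_at": "2024-06-09"}]))
```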

Typical message feeds:

  • GitHub events
  • dlt anonymous telemetry tracker

Test account / test data

  • I'd like dltHub to create a test account before we start

Additional context

Implementation Requirements

  1. Must work both inside and outside of Google Cloud (authentication via GcpCredentials, as in e.g. the google ads / google sheets / bigquery sources and destinations).
  2. Must be suitable for running on a Cloud Function, where the max run time and local storage are limited, so a max consume time and a max message count should be observed.
  3. The "no more messages" situation must be detected - we do not want to wait out the max consume time.
  4. Implement a mechanism as close to exactly-once delivery as possible by manipulating ack deadlines and balancing time spent consuming vs. total time; duplicated messages are better than lost ones (a rough consume-loop sketch follows this list).
  5. Pub/Sub metadata should be attached to the message payload data, as we do for Kinesis / Kafka (the user can always remove it via a dlt transform).
  6. Look at the Kinesis / Kafka and Scrapy sources. We may need helper / wrapper functions to run the dlt pipeline in a way that acks messages.
  7. As for any other source: demo pipelines and documentation must be provided.
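
One possible shape for the consume loop behind items 2-4 - a sketch only, with assumed names and timeouts - using synchronous pull, early exit on an empty pull, and modify_ack_deadline to keep pulled messages alive until the load finishes:

```python
import time

from google.api_core.exceptions import DeadlineExceeded
from google.cloud import pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/telemetry-sub"  # assumed name


def consume(max_consume_time: float = 60.0, max_messages: int = 1000):
    """Pull until max_consume_time or max_messages is reached, or the subscription is drained."""
    subscriber = pubsub_v1.SubscriberClient()
    stop_at = time.monotonic() + max_consume_time
    messages, ack_ids = [], []
    while time.monotonic() < stop_at and len(messages) < max_messages:
        try:
            response = subscriber.pull(
                request={
                    "subscription": SUBSCRIPTION,
                    "max_messages": min(100, max_messages - len(messages)),
                },
                timeout=10.0,
            )
        except DeadlineExceeded:
            break  # nothing arrived within the pull timeout - treat as "no more messages"
        if not response.received_messages:
            break  # "no more messages" - do not wait out max_consume_time
        for received in response.received_messages:
            messages.append(received.message)
            ack_ids.append(received.ack_id)
        # push the ack deadline out so messages are not redelivered while
        # normalize / load is still running; acking happens only after the load
        subscriber.modify_ack_deadline(
            request={
                "subscription": SUBSCRIPTION,
                "ack_ids": ack_ids,
                "ack_deadline_seconds": 600,
            }
        )
    return messages, ack_ids
```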

Test Requirements

  1. Tests should create a unique subscription and test messages and drop it all at the end (see Kafka / Kinesis); a sketch follows this list.
  2. Mind that many test runs may happen at the same time on CI, so tests must be isolated.
  3. Both the local Pub/Sub emulator and the real service are options.
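
A rough sketch of the isolation pattern from items 1 and 2, with made-up project and resource names:

```python
import uuid

from google.cloud import pubsub_v1

PROJECT = "my-test-project"  # assumed test project


def make_isolated_subscription():
    """Create a unique topic + subscription per test run so parallel CI runs do not collide."""
    suffix = uuid.uuid4().hex[:8]
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(PROJECT, f"dlt-test-topic-{suffix}")
    sub_path = subscriber.subscription_path(PROJECT, f"dlt-test-sub-{suffix}")
    publisher.create_topic(request={"name": topic_path})
    subscriber.create_subscription(request={"name": sub_path, "topic": topic_path})
    return publisher, subscriber, topic_path, sub_path


def teardown(publisher, subscriber, topic_path, sub_path):
    # drop everything at the end, as the Kafka / Kinesis tests do
    subscriber.delete_subscription(request={"subscription": sub_path})
    publisher.delete_topic(request={"topic": topic_path})
```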

We have community-provided consumers for Pub/Sub which may be a starting point for this implementation.
