Support streaming data sources #1198

Open
rcrichton opened this issue Sep 27, 2024 · 2 comments

Comments

@rcrichton

As it stands, the pipeline is configured to support batches and achieves incremental mode using micro-batches. True support for streaming data into the pipeline would allow more real-time ingestion and would also support unbounded data sources such as Kafka (or any event, messaging, or subscription system).

@bashir2
Collaborator

bashir2 commented Jan 28, 2025

Thanks @rcrichton for filing this issue and for PR #1243, which shows how KafkaIO can be used to integrate Kafka sources. Below, I try to summarize some of the discussions we have had in other places:

I think the core obstacle to a truly streaming approach is the Parquet file structure itself. Parquet by design has a page-based structure, which makes it hard to produce a truly streaming Parquet output. Also, Parquet pages that are too small can cause performance problems later at query time.

I think that in PR #1243, too, new resources do not appear instantly in the output Parquet files. This is because in streaming mode we flush the content of the Parquet files every secondsToFlush, which I think is 10 minutes by default.
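
To illustrate why new resources lag behind, here is a minimal Python sketch of a time-based flush. It is not the pipeline's code: the class FlushingBuffer, its fields, and the injectable clock are hypothetical names chosen for illustration, and only the idea of a seconds-to-flush interval comes from the discussion above.

```python
import time
from typing import Callable, List


class FlushingBuffer:
    """Hypothetical buffer: records become visible only when a flush happens."""

    def __init__(self, seconds_to_flush: float,
                 clock: Callable[[], float] = time.monotonic):
        self.seconds_to_flush = seconds_to_flush
        self.clock = clock
        self.pending: List[dict] = []        # received but not yet visible
        self.flushed: List[List[dict]] = []  # each entry stands in for one closed file
        self.last_flush = clock()

    def add(self, record: dict) -> None:
        """Buffer a record and flush if the interval has elapsed."""
        self.pending.append(record)
        if self.clock() - self.last_flush >= self.seconds_to_flush:
            self.flush()

    def flush(self) -> None:
        """Move all pending records into a new flushed group."""
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []
        self.last_flush = self.clock()
```

With an interval of 600 seconds, a record added right after a flush stays in pending, and so stays invisible to queries, for up to 10 minutes. Note that this sketch only checks the clock when a record arrives; a real implementation would more likely flush from a timer.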

I am wondering whether it makes sense, even for Kafka sources, to add micro-batch-based support to our pipelines instead of a truly streaming one.
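
As a rough picture of what micro-batching an unbounded source means, the Python sketch below cuts an endless iterator into bounded lists. The function name micro_batches and the max_size parameter are assumptions made for this example; they are not part of the pipeline or of KafkaIO.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(source: Iterable[T], max_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most max_size items from source.

    The source may be endless; batches are produced lazily, one at a time.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    it = iter(source)
    while True:
        batch = list(islice(it, max_size))
        if not batch:
            return  # source exhausted (never reached for an endless source)
        yield batch
```

Each yielded batch could then be handled like a small bounded input. This sketch bounds batches by size only, so on a slow source it waits until max_size items arrive; a practical version would probably also close a batch after a time limit.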

@bashir2
Collaborator

bashir2 commented Jan 28, 2025

As a related note, we would like to support the FHIR Subscription API at some point in the future. That will have similar challenges and trade-offs.
