As it stands, the pipeline is configured to support batches and achieves the incremental mode using micro-batches. True support for streaming data into the pipeline would allow for more real-time ingestion and would also enable unbounded data sources such as Kafka (or any event/messaging/subscription system).
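For illustration only, a minimal Apache Beam sketch of consuming an unbounded Kafka source with KafkaIO could look like the following; the broker address, topic name, and string deserializers are placeholders, not the pipeline's actual configuration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaStreamingSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Unbounded read: the pipeline keeps consuming for as long as it runs.
    PCollection<KV<String, String>> messages =
        pipeline.apply(
            "ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("localhost:9092") // placeholder broker address
                .withTopic("fhir-resources")            // placeholder topic name
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata());                    // drop Kafka metadata, keep key/value pairs

    // Downstream transforms (parsing resources, writing Parquet, etc.) would go here.

    pipeline.run().waitUntilFinish();
  }
}
```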
Thanks @rcrichton for filing this issue and also for PR #1243, which shows how KafkaIO can be used to integrate Kafka sources. Below, I try to summarize some of the discussions we have had in other places:
I think the core issue with a truly streaming approach is the Parquet file structure itself. Parquet by design has a page-based structure, which makes it hard to produce truly streaming Parquet output. Also, having Parquet pages that are too small can cause performance problems later at query time.
I think in PR #1243, too, new resources do not appear instantly in the output Parquet files. This is because in streaming mode we flush the content of Parquet files every `secondsToFlush`, which I think defaults to 10 minutes.
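To make the flush behaviour concrete, here is a hedged sketch (not the project's actual code) of windowing an unbounded stream into fixed intervals before a Parquet write, so that one batch of files materializes per interval. The 10-minute duration mirrors the default mentioned above; the Avro schema and output path are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedParquetWriteSketch {
  // `records` is an unbounded stream of Avro GenericRecords (e.g. converted resources)
  // whose coder is assumed to be set by the caller.
  static void writeEveryTenMinutes(PCollection<GenericRecord> records, Schema schema) {
    records
        // Group the stream into 10-minute windows; files only materialize when a window
        // closes, which is why new resources do not show up in the output immediately.
        .apply(Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to("/tmp/parquet-output")  // placeholder output directory
                .withSuffix(".parquet")
                .withNumShards(1));         // explicit sharding needed for unbounded input
  }
}
```

The trade-off the comment above describes is visible here: shortening the window makes data appear sooner but produces more, smaller Parquet files (and pages), which hurts query performance later.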
I am wondering whether it makes sense, even for Kafka sources, to add micro-batch based support to our pipelines instead of a truly streaming one.
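If the micro-batch route is taken, one possible (purely illustrative) approach is to bound each Kafka read so that every pipeline run consumes a finite slice of the topic; KafkaIO's `withMaxReadTime`/`withMaxNumRecords` turn the unbounded source into a bounded one. The limits below are placeholders:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Bounded micro-batch read: each pipeline run drains at most 5 minutes' worth of messages
// (or 100k records, whichever comes first) and then terminates, so the existing
// batch-oriented Parquet writing path could be reused unchanged.
KafkaIO.Read<String, String> microBatchRead =
    KafkaIO.<String, String>read()
        .withBootstrapServers("localhost:9092")        // placeholder broker address
        .withTopic("fhir-resources")                   // placeholder topic name
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withMaxReadTime(Duration.standardMinutes(5))  // bound the read by time
        .withMaxNumRecords(100_000);                   // and/or by record count
```

Offsets would also have to be tracked between runs (e.g. via Kafka consumer-group commits) so that each micro-batch resumes where the previous one ended.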