As it stands, the pipeline is configured to support batches and achieves the incremental mode using micro-batches. True support for streaming data into the pipeline would allow for more real-time ingestion and would also enable unbounded data sources such as Kafka (or any event/messaging/subscription system).
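For illustration only, a minimal Apache Beam sketch of consuming an unbounded Kafka source with KafkaIO could look like the following; the broker address, topic name, and string deserializers are placeholders, not the pipeline's actual configuration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaStreamingSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Unbounded read: the pipeline keeps consuming for as long as it runs.
    PCollection<KV<String, String>> messages =
        pipeline.apply(
            "ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("localhost:9092") // placeholder broker address
                .withTopic("fhir-resources")            // placeholder topic name
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata());                    // drop Kafka metadata, keep key/value pairs

    // Downstream transforms (parsing resources, writing Parquet, etc.) would go here.

    pipeline.run().waitUntilFinish();
  }
}
```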
Thanks @rcrichton for filing this issue and also for PR #1243, which shows how KafkaIO can be used to integrate Kafka sources. Below, I try to summarize some of the discussions we have had in other places:
I think the core issue with a truly streaming approach is the Parquet file structure itself. Parquet by design has a page-based structure, which makes it hard to produce truly streaming Parquet output. Also, having Parquet pages that are too small can cause performance problems later at query time.
I think in PR #1243, too, new resources do not appear instantly in the output Parquet files. This is because in streaming mode we flush the content of Parquet files every `secondsToFlush`, which I think defaults to 10 minutes.
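To make the flush behaviour concrete, here is a hedged sketch (not the project's actual code) of windowing an unbounded stream into fixed intervals before a Parquet write, so that one batch of files materializes per interval. The 10-minute duration mirrors the default mentioned above; the Avro schema and output path are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedParquetWriteSketch {
  // `records` is an unbounded stream of Avro GenericRecords (e.g. converted resources)
  // whose coder is assumed to be set by the caller.
  static void writeEveryTenMinutes(PCollection<GenericRecord> records, Schema schema) {
    records
        // Group the stream into 10-minute windows; files only materialize when a window
        // closes, which is why new resources do not show up in the output immediately.
        .apply(Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to("/tmp/parquet-output")  // placeholder output directory
                .withSuffix(".parquet")
                .withNumShards(1));         // explicit sharding needed for unbounded input
  }
}
```

The trade-off the comment above describes is visible here: shortening the window makes data appear sooner but produces more, smaller Parquet files (and pages), which hurts query performance later.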
I am wondering whether it makes sense, even for Kafka sources, to add micro-batch based support to our pipelines instead of a truly streaming one.
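If the micro-batch route is taken, one possible (purely illustrative) approach is to bound each Kafka read so that every pipeline run consumes a finite slice of the topic; KafkaIO's `withMaxReadTime`/`withMaxNumRecords` turn the unbounded source into a bounded one. The limits below are placeholders:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Bounded micro-batch read: each pipeline run drains at most 5 minutes' worth of messages
// (or 100k records, whichever comes first) and then terminates, so the existing
// batch-oriented Parquet writing path could be reused unchanged.
KafkaIO.Read<String, String> microBatchRead =
    KafkaIO.<String, String>read()
        .withBootstrapServers("localhost:9092")        // placeholder broker address
        .withTopic("fhir-resources")                   // placeholder topic name
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withMaxReadTime(Duration.standardMinutes(5))  // bound the read by time
        .withMaxNumRecords(100_000);                   // and/or by record count
```

Offsets would also have to be tracked between runs (e.g. via Kafka consumer-group commits) so that each micro-batch resumes where the previous one ended.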