Support parquet columnar format in the aws_s3 sink #1374
Comments
One issue with that parquet library is that it requires nightly Rust (https://github.com/sunchao/parquet-rs/blob/master/src/lib.rs#L123 and https://github.com/apache/arrow/blob/master/rust/parquet/src/lib.rs#L18); we can't use those features because we are on the stable compiler.
Any idea what the timeline is for those to be on stable?
Well really
I think that it still requires nightly, but just throwing this into the mix: https://github.com/apache/arrow/tree/master/rust/parquet
Progress in the Apache Arrow project:
IIRC Apache Arrow now works on stable Rust 🎉 (unless the SIMD feature is used). Would it be possible to re-evaluate this one?
Hi @Alexx-G. This hasn't been scheduled yet, but if you wanted to pick it up, I'd be happy to advise/review!
I've checked the
Is there an update on this? I'm very interested in JSON input to Parquet output to S3.
Any update on this? I am also interested in JSON input and Parquet output to S3.
Any chance this can be looked at?
This codec isn't currently on our roadmap, but we could help a community PR get merged 👍
Proposal

Add a parquet codec. This can be achieved with the official Rust crate. Each batch would be transformed into a single Parquet file with a single row group. With that, the batch configuration can be used to define the desired size of the row group/Parquet file.

Schema

Since Parquet requires a schema, we need to derive one. Each passing event has its own implicit schema, and by joining them we get a unified one. This unified schema can be:

While option 2 can still result in exported schemas that differ from batch to batch, they would tend to change less than with option 1. This is relevant for streams that have events with varying schemas, while for consistent ones both options behave the same. When joining schemas we can run into conflicts; when multiple types are used for the same field/column, some resolution is needed:

In my opinion, option 2 is better. It's more reliable and we can document this behavior.

Options

Add options:

By default, plain encoding and no compression are used.

Alternatives

We can expose an option for users to define their own static schema for passing events, which would try to cast or filter out conflicting values.
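For illustration, here is a rough sketch of what "one batch → one Parquet file with a single row group" could look like with the Apache Arrow Rust crates. This is not the proposed implementation: the column names and values are made up, and a real codec would derive the schema from the events as described above.

```rust
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema that would normally be derived by joining the events' implicit schemas.
    let schema = Arc::new(Schema::new(vec![
        Field::new("message", DataType::Utf8, true),
        Field::new("host", DataType::Utf8, true),
    ]));

    // Stand-in for one Vector batch of events, already columnarized.
    let messages = StringArray::from(vec![Some("login ok"), Some("login failed")]);
    let hosts = StringArray::from(vec![Some("web-1"), Some("web-2")]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(messages), Arc::new(hosts)])?;

    // Plain encoding / no compression as the proposed defaults; the compression and
    // encoding options would map onto these writer properties.
    let props = WriterProperties::builder()
        .set_compression(Compression::UNCOMPRESSED)
        .build();

    // One batch -> one complete Parquet file in memory; a small batch written in a
    // single call ends up as a single row group.
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?; // writes the footer; `buf` now holds a readable Parquet file

    println!("wrote {} bytes of Parquet", buf.len());
    Ok(())
}
```

Since a Parquet file is only complete once its footer is written, buffering the finished file per batch and handing the bytes to the sink's request path seems like the natural fit for this batch-per-file model.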
Hey @ktff. @spencergilbert and I discussed this a bit. I don't think anyone on the team is really an expert in Parquet, so we have a couple of questions.
Hey @fuchsnj. Regarding question 1: no, a Parquet batch can't be built from already encoded events. It's necessary to intercept them before that, or process them in a way suitable for Parquet. Fortunately there is a way to do that, by implementing the trait at vector/src/sinks/util/encoding.rs (line 9 in 018637a) for Vec<Event>.
There are similar cases:
The current code at vector/src/sinks/util/encoding.rs (line 18 in 018637a) would then use the ParquetEncoder when it's configured.
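To make the shape of that idea concrete, here is a minimal sketch of a whole-batch encoder. The trait, the Event struct, and the ParquetEncoder body below are illustrative stand-ins, not Vector's actual API; a real implementation would build Arrow arrays and write Parquet as in the sketch above.

```rust
use std::io::{self, Write};

// Illustrative stand-in for Vector's event type.
struct Event {
    message: String,
}

// Hypothetical trait that hands the encoder the whole batch at once, which is what
// a Parquet writer needs: the file (and its single row group) is built per batch.
trait BatchEncoder {
    fn encode_batch(&self, events: Vec<Event>, writer: &mut dyn Write) -> io::Result<usize>;
}

struct ParquetEncoder;

impl BatchEncoder for ParquetEncoder {
    fn encode_batch(&self, events: Vec<Event>, writer: &mut dyn Write) -> io::Result<usize> {
        // A real ParquetEncoder would derive a schema from `events`, build Arrow
        // arrays, and write a single row group. Here we just serialize the messages
        // so the example runs end to end.
        let mut written = 0;
        for event in events {
            let line = format!("{}\n", event.message);
            writer.write_all(line.as_bytes())?;
            written += line.len();
        }
        Ok(written)
    }
}

fn main() -> io::Result<()> {
    let events = vec![
        Event { message: "login ok".into() },
        Event { message: "login failed".into() },
    ];
    let mut buf = Vec::new();
    let n = ParquetEncoder.encode_batch(events, &mut buf)?;
    println!("encoded {n} bytes");
    Ok(())
}
```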
Regarding question 2: Parquet is a file format that is usually the final/long-term destination for storing data that can later be read for queries, by, for example, Amazon Athena. Such systems will/do encounter differences in schema over time, especially if they are performing federated queries, so they usually have the means to reconcile different schemas. That said, I came to this conclusion via research, so comments from those with experience would be much appreciated. My main argument is that it's better to have the event reach its destination and leave it to the user to transform the event into the desired schema before or after the sink if they wish/need to, than to require configuration of a fixed schema that drops events, which would force those who do have events with varying schemas to define the schema yet again in some transform before this sink, either to transform the events or to change the fixed schema.
For point 1, I'm in agreement with @fuchsnj. It'd be really nice if we could update the codec model to do the batch encoding in a generic way that works across all sinks using codecs. The lack of this came up recently with the
Adding support for "batched" codecs is also discussed here: #16637
For point 1, while implementation of … Going by #16637, for … If batch encoding was already implemented, that would simplify things. Going by … For point 2, @fuchsnj can you provide some docs for …?
We currently collect logs and store them as parquet-on-S3, using fluentd and an internal plugin we manage for the parquet conversion on our aggregators. We are very interested in migrating to Vector, and this issue is currently the remaining blocker. FWIW, in our setup we configure a schema to use for each type of log, and we would hope that the Vector implementation would at least support an option for specifying schemas.
@fuchsnj I found the … Instead of determining the schema at runtime, add an option to specify a schema for passing events.
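As a rough illustration of that alternative (a user-specified static schema, with non-conforming values cast or dropped), here is a small sketch using the Arrow crate. The field names and the cast-to-null policy are assumptions for the example, not a committed design.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Hypothetical fixed schema a user might configure for one log type.
    let schema = Arc::new(Schema::new(vec![
        Field::new("message", DataType::Utf8, true),
        Field::new("status", DataType::Int64, true),
    ]));

    // Incoming values arrive as strings; cast them to the configured column type.
    // With the default (safe) cast options, values that don't fit become null
    // instead of failing the whole batch.
    let raw_status: ArrayRef = Arc::new(StringArray::from(vec![Some("200"), Some("not-a-number")]));
    let status = cast(&raw_status, schema.field(1).data_type())?;

    println!("{:?}", status); // the second value comes out as null
    Ok(())
}
```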
Happy to see this discussion being continued. We are currently using Vector for Kubernetes cluster log collection, but would have other use cases with much higher throughput where Vector is an interesting option. I would like to add some thoughts on what would be mandatory for us:
Yeah, that's interesting - I certainly appreciate wanting to make migrations easier. I imagine that could be "manually" done today with
Definitely an interesting feature to keep in mind; we'd definitely want to get the basic implementation in before adding additional tooling to it, though 👍
It's been a while, but I feel as though I remember seeing that discussion in the past; if it's not currently the behavior, I expect there are arguments that it could be. @ktff is this still something you're planning to work on?
@spencergilbert I do. I was on vacation, hence the silence. The first point raised by @fuchsnj remains unresolved. Simplified, there are two ways forward:
In both cases, my plan is to implement the codec first and, in the case of option 2, submit it once batched codecs have landed.
Hope you had a nice vacation! Sounds good to me.
@ktff I happened to find a similar solution. If you need help anywhere, I'd be happy to contribute together! If I can figure it out, that is. 🚀
@Kikkon the draft contains functioning
I'm currently not in a position to add support for batched codecs, hence the limbo state of #17395. If this is something you feel confident adding to Vector, then reach out to @jszwedko. Once that lands, I'll be able to finish the PR and get it merged.
@ktff I have some experience using Parquet, but am not familiar with Vector. The issue with this PR is that Parquet does not support append writes; appending may require merging new and old files, which has a performance cost. However, a batch codecs implementation does not yet exist in Vector, so for now the PR is pending. Is that right? @jszwedko Does the Vector community have plans for a proposal to support batch codecs? Perhaps once I become more familiar with Vector's architecture, I can discuss with everyone how we could add batch codecs.
Hey! Yes, we would like to add the concept of batches to codecs but haven't been able to prioritize it on our side just yet. We'd be happy to help guide a contribution for it. I believe @lukesteensen would have some thoughts about what it could look like and would also be able to answer questions.
Can you give me some advice? @lukesteensen 🫡
@Kikkon it's not doable with regular codecs, but if it were, then yes, it would have a performance cost. Also, one thing to note: currently the PR does add limited support for batch codecs in order to support Parquet, but it's hacky and so not something that can be merged. The goal is to replace this with proper support.
Got it, thank you!
@Kikkon that's correct, you can't really do this right now with the current design of our codecs crate. It's something I would love to enable, but I haven't had time to figure out a good path forward. It will likely take quite a bit of refactoring and design work. I'm hoping we'll be able to tackle that before long, as I agree this is a feature we should have!
@lukesteensen Is there a corresponding roadmap currently? Are there any parts I can participate in?
This is a really big blocker for us. Any plans to make this happen?
My understanding from the discussion above (and others may correct me if I'm over-simplifying) is that @ktff's patch only works for S3, and only works in a certain way that isn't necessarily applicable to other output destinations. Because it can't be generalized to other destinations, it's not really 'production ready'.
@rstml We have not been able to prioritize codec batches at this point, nor is it likely in the near future, unfortunately.
I am also very interested in this feature. Is it going to be on the roadmap anytime soon?
@jszwedko - any possibility of getting this on the roadmap?
Any update on this feature?
We are aware there is a lot of demand for this feature, but it's not something we currently have on our roadmap. We would be happy to review a community contribution.
Similar to #1373, we should support the Parquet format. The parquet format is a columnar format that enables faster and more efficient data access schemes such as column selection and indexing.

Implementation

Unfortunately, I do not have deep experience with this format as I do with ORC, but like everything else, we should start very simple. Fortunately, there appears to be a Rust library that supports basic writing of this data.
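To illustrate the column-selection benefit mentioned above (this example is not part of the original issue), a downstream reader can project a single column with the Apache Arrow parquet crate; the file name here is made up.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file produced by the sink.
    let file = File::open("events.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the first column; thanks to the columnar layout, the pages of
    // every other column are never read or decoded.
    let mask = ProjectionMask::roots(builder.parquet_schema(), [0]);
    let reader = builder.with_projection(mask).build()?;

    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```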