Add a FileExporter to allow export to arrow or feather (or parquet)??? #250
Comments
Yes, implementing such an exporter was part of our plan. Storing the Arrow records as they are would be easy, but in my opinion, it wouldn’t be ideal or optimal. The current Arrow schema used in the protocol has been optimized for transport, taking into account the need to regularly close streams (and therefore reset states on the receiver side) to make the protocol more load-balancer-friendly. Some changes or transformations need to be applied first to optimize records for long-term storage. Unfortunately, I don’t have an ETA to provide at this time. |
Thanks for the reply. There's strong perceived value here in OTLP-flavored data stored as Arrow/Feather to allow quick tactical queries via DuckDB and similar tools. Any thoughts you have on how to implement your preferred solution, or a mostly durable tactical one, would be very much appreciated, if you have the cycles. Happy to try to help with either or both. |
I've started progress on something similar here (https://github.com/abhiaagarwal/otelarrow-treasury), with the idea of loading the otel data directly into a DuckDB database via its arrow-native functions for zero-copy. I don't think I'll have much time to work on it in the immediate future, but contributions are welcome! |
Hiya Abhi,
Great thoughts and a great start... and exactly what I was thinking (albeit in Golang)...
My concerns with such an approach (while still being a bit new to DuckDB and Delta Lake) were:
- Historical data is fine in Parquet, but I think real-time data needs Delta Lake.
- How do you manage which DuckDB queries go to which of these sources?
- When/how do Parquet files get written/updated?
- DuckDB concurrency <https://duckdb.org/docs/connect/concurrency.html> allows only EITHER multiple readers OR a single writer/reader process.
- If using Parquet, I think creating a "New" table (e.g. "new_traces") and swapping it for the "Current" table (e.g. "current_traces") works while still allowing non-blocking DuckDB readers.
- If using Delta Lake, how do you prevent blocking of readers (e.g. for Grafana or just DuckDB queries) while updates are happening?
Thanks!
Rick
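One way to picture the "new"/"current" swap idea above at the file level is a minimal Python sketch like the following. It is only an illustration under stated assumptions, not how any project in this thread does it: the function name `publish_snapshot` and the file name `current_traces.parquet` are hypothetical, the bytes written are a stand-in for real Parquet content, and the atomic-rename guarantee assumed here is the POSIX one (on Windows the swap can fail while a reader holds the target open).

```python
import os
import tempfile


def publish_snapshot(data: bytes, current_path: str) -> None:
    """Write `data` to a temporary "new" file, then swap it in as `current_path`.

    Hypothetical helper: readers that open `current_path` see either the old
    snapshot or the new one, never a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(current_path))
    # The temp file is created next to the target so that the rename stays on
    # one filesystem, which is what makes the swap atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before publishing
        os.replace(tmp_path, current_path)  # the "new" -> "current" swap
    except BaseException:
        # If anything failed before the swap, do not leave the temp file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A writer process would call `publish_snapshot(payload, "current_traces.parquet")` each time a fresh batch is ready. A reader that already has the old file open keeps reading the old snapshot until it reopens the path, which is the non-blocking behavior the swap is after.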
|
@rickFanta Go is probably the better choice here; I'm just much better at Rust. To be clear, the data is stored in a DuckDB database, which doesn't use Parquet at all. Delta Lake is certainly an option, but it would require maintaining 10+ different tables, or normalizing it down to 3 tables for each otel type (which I think would lose some of the benefit). |
OK. The DuckDB concurrency issue (allowing only EITHER multiple readers OR a single writer/reader process) sure seems like a big issue.
I'm reading through your code but not yet seeing how you solve this. Namely, how do you avoid locks on read-only processes while another process is writing? [I *want* to be missing something, but I don't see it yet and am hitting the locks quite a bit.]
|
Might it be possible to add a FileExporter to this to allow export to arrow or feather (or parquet)???
Or is there an obvious way to do this that I'm missing?