Spike: Historical Data Lake Ideation #4186

Closed
sydneynotthecity opened this issue Jan 18, 2022 · 2 comments
Comments

sydneynotthecity commented Jan 18, 2022

Overview

Create an initial proposal, or set of options, for what we want to provide.

  • Complete a research pass of other chains.
  • What is a reasonable thing to provide? At one extreme: nothing. At the other: immediate API responses to any question of interest.
  • Pursue the idea of TxMeta as the raw form (could we have an unstructured database?)
  • Prototype the data lake and a Dataflow pipeline that moves data into the data warehouse. Try to build a framework that makes the prototype repeatable by driving it from configuration (see the sketch after this list)
  • Consumption/usage model - we want to make this easy for people in the community to run on their own (sub-question: based on usage, what would a pricing model look like?)
  • Research dependencies - with our solution we would want stellar-core as a dependency but potentially no other code (Horizon not needed?)
  • Come up with a list of modules/frameworks/tools that SDF or the community could contribute back to open source
  • Pub/Sub model? What would guarantees look like?
  • Data validation - what does this look like? This is the harder part of the problem; follow up with Graydon about the idea of downloading hashes and comparing them against stellar-core. Come up with options for a solution, but it does not have to be completely solved for the prototype
  • File format? JSON, Avro? Partitioning method?
  • File size tradeoffs? Write small files frequently, or wait and write larger files?
  • Processing times for querying and running Dataflow jobs? Experiment: answer both a single point-in-time question and an analysis/aggregation question
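
To make the config-based framework idea concrete, here is a minimal sketch of a config-driven lake layout. Everything in it (the bucket name, the ledgers-per-file batch size, the JSON-lines serialization, writing locally instead of to object storage) is a hypothetical placeholder for illustration, not a decided design.

```python
import json
from dataclasses import dataclass

@dataclass
class LakeConfig:
    # All values are hypothetical placeholders, not decided settings.
    bucket: str = "stellar-historical-lake"  # assumed bucket name
    file_format: str = "json"                # JSON vs. Avro is still an open question
    ledgers_per_file: int = 64               # batch size per object; tradeoff under discussion


def object_path(config: LakeConfig, ledger_seq: int) -> str:
    """Compute a partitioned object key for the file containing this ledger.

    Partitioning by ledger range keeps a point-in-time lookup to a single object.
    """
    start = (ledger_seq // config.ledgers_per_file) * config.ledgers_per_file
    end = start + config.ledgers_per_file - 1
    return f"{config.bucket}/ledgers/{start}-{end}.{config.file_format}"


def write_batch(config: LakeConfig, ledger_seq: int, tx_meta_records: list) -> str:
    """Write a batch of TxMeta records (already decoded to dicts here) as JSON lines.

    A real pipeline would write raw XDR or Avro to object storage; the local
    file write below is only a stand-in.
    """
    path = object_path(config, ledger_seq)
    with open(path.replace("/", "_"), "w") as out:
        for record in tx_meta_records:
            out.write(json.dumps(record) + "\n")
    return path


if __name__ == "__main__":
    cfg = LakeConfig()
    print(object_path(cfg, ledger_seq=40_000_123))
```

Driving the bucket, format, and batch size from a config object like this is what would let the same pipeline be re-run against different layouts while we measure the file-size and query-time tradeoffs.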

Timeline:

Week 1:

  • Finish research on other chains - Tamir to research Polygon; Syd to research Polkadot and Filecoin
  • Write the proposal for the prototype design - work on the doc together asynchronously
  • Include file details (save in XDR? JSON? size of files saved?)

Week 2:

  • Build a prototype for storing raw data in the lake
  • One pipeline
  • Figure out the point-in-time question (what are all the transactions that occurred at a given ledger sequence number?) - see the query sketch after this timeline

Week 3:

  • Figure out the aggregation question (fee stats, trading volume for USDC)
  • Define assumptions, pros/cons, and risks of the prototype as it stands today
  • Recommendation for how to proceed
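
A rough sketch of the two experiment queries from Weeks 2 and 3, assuming the JSON-lines layout from the earlier sketch and hypothetical field names (ledger_seq, fee_charged). The real experiment would likely run these as Dataflow jobs or warehouse SQL; this only shows the shape of each question.

```python
import json
from statistics import mean


def load_partition(path: str) -> list:
    """Read one JSON-lines lake file into memory."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def transactions_at_sequence(records: list, ledger_seq: int) -> list:
    """Point-in-time question: all transactions that occurred at one ledger sequence."""
    return [r for r in records if r.get("ledger_seq") == ledger_seq]


def fee_stats(records: list) -> dict:
    """Aggregation question: simple fee statistics over a set of transactions."""
    fees = [r["fee_charged"] for r in records if "fee_charged" in r]
    if not fees:
        return {}
    return {"count": len(fees), "min": min(fees), "max": max(fees), "avg": mean(fees)}
```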

Out of Scope, but cool

  • Write a library like blockchain-etl that is available in commonly used ETL languages

Research Doc for Off Chain Analysis

@sydneynotthecity (Author)

Proposal Doc. Note: other research is linked from the proposal.

@mollykarcher (Contributor)

Closing this in favor of new ingestion work. Please reopen if you disagree.
