Spike: Historical Data Lake Ideation #4186

Closed
sydneynotthecity opened this issue Jan 18, 2022 · 2 comments
Comments

sydneynotthecity commented Jan 18, 2022

Overview

Create an initial proposal, or set of options, for what we want to provide.

  • Complete a research pass of other chains.
  • What is a reasonable thing to provide? At one extreme: nothing. At the other: immediate API responses to any question of interest.
  • Pursue the idea of TxMeta as the raw form (could we have an unstructured database?)
  • Prototype the data lake and a Dataflow pipeline that moves data into the data warehouse. Try to build a framework that makes the prototype repeatable by driving it from configuration (see the sketch after this list)
  • Consumption/usage model - we want to make this easy for people in the community to run on their own (sub-question: based on usage, what would a pricing model look like?)
  • Research dependencies - with our solution we would want stellar-core as a dependency but potentially no other code (Horizon not needed?)
  • Come up with a list of modules/frameworks/tools that SDF or the community could contribute back to open source
  • Pub/Sub model? What would guarantees look like?
  • Data validation - what does this look like? This is the harder part of the problem; follow up with Graydon about the idea of downloading hashes and comparing them against stellar-core. Come up with options for a solution, but it does not have to be completely solved for the prototype
  • File format? JSON, Avro? Partitioning method?
  • File size tradeoffs? Write small files frequently, or wait and write larger files?
  • Processing times for querying and running Dataflow jobs? Experiment: answer both a single point-in-time question and an analysis/aggregation question
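
To make the config-based framework idea concrete, here is a minimal sketch of a config-driven lake layout. Everything in it (the bucket name, the ledgers-per-file batch size, the JSON-lines serialization, writing locally instead of to object storage) is a hypothetical placeholder for illustration, not a decided design.

```python
import json
from dataclasses import dataclass

@dataclass
class LakeConfig:
    # All values are hypothetical placeholders, not decided settings.
    bucket: str = "stellar-historical-lake"  # assumed bucket name
    file_format: str = "json"                # JSON vs. Avro is still an open question
    ledgers_per_file: int = 64               # batch size per object; tradeoff under discussion


def object_path(config: LakeConfig, ledger_seq: int) -> str:
    """Compute a partitioned object key for the file containing this ledger.

    Partitioning by ledger range keeps a point-in-time lookup to a single object.
    """
    start = (ledger_seq // config.ledgers_per_file) * config.ledgers_per_file
    end = start + config.ledgers_per_file - 1
    return f"{config.bucket}/ledgers/{start}-{end}.{config.file_format}"


def write_batch(config: LakeConfig, ledger_seq: int, tx_meta_records: list) -> str:
    """Write a batch of TxMeta records (already decoded to dicts here) as JSON lines.

    A real pipeline would write raw XDR or Avro to object storage; the local
    file write below is only a stand-in.
    """
    path = object_path(config, ledger_seq)
    with open(path.replace("/", "_"), "w") as out:
        for record in tx_meta_records:
            out.write(json.dumps(record) + "\n")
    return path


if __name__ == "__main__":
    cfg = LakeConfig()
    print(object_path(cfg, ledger_seq=40_000_123))
```

Driving the bucket, format, and batch size from a config object like this is what would let the same pipeline be re-run against different layouts while we measure the file-size and query-time tradeoffs.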

Timeline:

Week 1:

  • Finish research on other chains - Tamir to research Polygon; Syd to research Polkadot and Filecoin
  • Write the proposal for the prototype design - work on the doc together asynchronously
  • Include file details (save in XDR? JSON? size of files saved?)

Week 2:

  • Build a prototype for storing raw data in the lake
  • One pipeline
  • Figure out the point-in-time question (what are all the transactions that occurred at a given ledger sequence number?) - see the query sketch after this timeline

Week 3:

  • Figure out the aggregation question (fee stats, trading volume for USDC)
  • Define assumptions, pros/cons, and risks of the prototype as it stands today
  • Recommendation for how to proceed
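
A rough sketch of the two experiment queries from Weeks 2 and 3, assuming the JSON-lines layout from the earlier sketch and hypothetical field names (ledger_seq, fee_charged). The real experiment would likely run these as Dataflow jobs or warehouse SQL; this only shows the shape of each question.

```python
import json
from statistics import mean


def load_partition(path: str) -> list:
    """Read one JSON-lines lake file into memory."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def transactions_at_sequence(records: list, ledger_seq: int) -> list:
    """Point-in-time question: all transactions that occurred at one ledger sequence."""
    return [r for r in records if r.get("ledger_seq") == ledger_seq]


def fee_stats(records: list) -> dict:
    """Aggregation question: simple fee statistics over a set of transactions."""
    fees = [r["fee_charged"] for r in records if "fee_charged" in r]
    if not fees:
        return {}
    return {"count": len(fees), "min": min(fees), "max": max(fees), "avg": mean(fees)}
```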

Out of Scope, but cool

  • Write a library like blockchain-etl that is available in commonly used ETL languages

Research Doc for Off Chain Analysis

@sydneynotthecity (Author)

Proposal Doc. Note: other research is linked from the proposal.

@mollykarcher (Contributor)

Closing this in favor of new ingestion work. Please reopen if you disagree.
