Large-scale time series modeling of Ethereum transaction fees with AWS and PySpark ML

mingxuan-he/eth-fee-time-series

 
 


Author: Mingxuan (Ming) He

Introduction

With the rising popularity of blockchain technologies and cryptocurrencies, there has been increasing interest across the social sciences in studying the dynamics of blockchain ecosystems. As one of the core mechanisms of the blockchain economic system, the transaction fee is a significant economic indicator: it reflects the supply of and demand for block transactions and hints at the underlying costs and benefits of using cryptocurrencies as a medium of exchange. I choose Ethereum, the largest proof-of-stake blockchain today and the second largest overall, as the main subject of analysis.

Blockchain transaction data features both large volume and high velocity. Ethereum, for example, generates about 7,200 blocks documenting ~1 million new public transactions every day. Because of this large, always-on character, traditional social scientific methods are inadequate for analyzing this dataset.

In this project, I build a large-scale computing and machine learning pipeline using AWS and PySpark ML to conduct time series modeling for transaction fees on the Ethereum blockchain. Large-scale and cloud computing tools enable me to analyze this immense dataset in a timely and cost-effective manner. I use AWS EMR clusters for scalable computing power, and the PySpark framework for distributed processing and reproducibility.

This project is a milestone towards my thesis research, which focuses on dynamic economic modeling for optimal cryptoeconomic policies. In particular, the time series model parameters estimated here will enter my calibration process for a larger model of the macro-cryptoeconomy, and help dynamically optimize protocol policies in the form of staking and burning (more details on my personal website).

Data

Blockchain transaction data is publicly available from many sources. I choose the AWS Public Blockchain Data available in the AWS Public Data Catalog. One benefit of this data source is that it integrates well with AWS tools. All data is accessible as compressed parquet files stored in a public S3 bucket (s3://aws-public-blockchain), which can be read into Spark directly. In addition, the database allows for querying data on different blockchains (Bitcoin and Ethereum) in the same environment, which is often difficult with raw blockchain data.

The datasets are partitioned by date. I select data from 2021-09-01 (the first full month after the implementation of EIP-1559, the fee market change) to 2023-05-01*.
* Note: This period covers "the Merge", Ethereum's transition from proof-of-work to proof-of-stake in September 2022.
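
Because the data is partitioned by date, the selected window can be read by listing one path per day. The sketch below is only illustrative: the `v1.0/eth/transactions/date=YYYY-MM-DD` layout, the constant `BASE`, and the helper names `partition_paths` and `load_transactions` are assumptions, not taken from the notebooks, and the layout should be checked against the bucket before use.

```python
from datetime import date, timedelta
from typing import List

# Assumed partition layout inside the public bucket (verify before relying on it).
BASE = "s3://aws-public-blockchain/v1.0/eth/transactions"


def partition_paths(start: date, end: date, base: str = BASE) -> List[str]:
    """Return one path per daily partition in [start, end], both ends inclusive."""
    n_days = (end - start).days + 1
    return [
        f"{base}/date={(start + timedelta(days=i)).isoformat()}"
        for i in range(n_days)
    ]


def load_transactions(spark, start: date, end: date):
    """Read the selected partitions into one Spark DataFrame.

    Needs a live SparkSession with S3 access. Setting `basePath` keeps the
    `date` partition column in the resulting DataFrame.
    """
    return spark.read.option("basePath", BASE).parquet(*partition_paths(start, end))
```

On a cluster, `load_transactions(spark, date(2021, 9, 1), date(2023, 5, 1))` would cover the study period described above.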

Implementation details

Packages Used

pyspark (ml, sql), numpy, pandas, matplotlib, scipy, statsmodels, graphviz
For a complete list of dependencies with version numbers, see the Spark Configuration section of Spark_EDA_TSTests.ipynb

Part 1: Set up S3 and EMR Cluster:

See AWS_setup.ipynb

  • Initialize an EMR Cluster with 4 m5.xlarge EC2 instances and an S3 bucket for storage
  • Configure the cluster to work with PySpark in the JupyterHub environment.
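
As a rough illustration of the setup above, the following sketch builds a request for boto3's EMR `run_job_flow` call. It is not the content of AWS_setup.ipynb: the cluster name, the release label, the split into one primary and three core nodes, and the default IAM role names are all assumptions made for this example.

```python
def emr_cluster_request(n_instances: int = 4, instance_type: str = "m5.xlarge") -> dict:
    """Build keyword arguments for boto3's EMR `run_job_flow` call."""
    if n_instances < 2:
        raise ValueError("need at least one primary and one core node")
    return {
        "Name": "eth-fee-time-series",   # hypothetical cluster name
        "ReleaseLabel": "emr-6.10.0",    # assumed release; choose one that ships Spark and JupyterHub
        "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}, {"Name": "Livy"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": n_instances - 1},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for notebook work
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def launch_cluster(request: dict, region: str = "us-east-1") -> str:
    """Submit the request and return the cluster id (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Scaling up for a heavier stage only requires changing `n_instances`, for example `emr_cluster_request(n_instances=8)`.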

Part 2: Exploratory analysis and time series tests:

See Spark_EDA_TSTests.ipynb

  • Preprocessing: Load data from parquet, compute transaction fees from gas and gas price
  • Exploratory analysis: plotting time series for transaction fee and gas price
  • Time series tests: stationarity test and serial correlation test (functions implemented in time_series_tests.py)
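
The fee computation in the preprocessing step amounts to multiplying a gas quantity by a gas price. A minimal sketch, assuming prices are quoted in wei per unit of gas; the function names and the default column names `gas` and `gas_price` are hypothetical, and whether the notebook uses the gas limit or the gas actually consumed is not stated here.

```python
WEI_PER_GWEI = 10**9
WEI_PER_ETH = 10**18


def txn_fee_wei(gas: int, gas_price_wei: int) -> int:
    """Fee in wei: units of gas times price per unit of gas (exact integer math)."""
    return gas * gas_price_wei


def wei_to_eth(amount_wei: int) -> float:
    """Convert wei to ETH for reporting."""
    return amount_wei / WEI_PER_ETH


def add_txn_fee(df, gas_col: str = "gas", price_col: str = "gas_price"):
    """Spark version: add a `txn_fee` column (column names are assumptions)."""
    from pyspark.sql import functions as F  # lazy import; needs pyspark installed

    return df.withColumn("txn_fee", F.col(gas_col) * F.col(price_col))


# A plain transfer of 21,000 gas at 30 gwei per unit of gas.
fee = txn_fee_wei(21_000, 30 * WEI_PER_GWEI)
```

Keeping the fee as an integer number of wei until the final conversion avoids floating-point rounding on large values.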

Part 3: Time series modeling pipeline with Spark ML:

See Spark_Pipeline_Modeling.ipynb
(For this section I increased the EMR Cluster size to 8 m5.xlarge instances)

3.1 Build a custom pipeline for preprocessing & model estimation

  • Pipeline structure: see directed acyclic graph below. Contains 3 custom transformers and 3 transformers/estimators from the ml library

  • The model: $AR(p)$ model with non-zero constant:
    $\varphi_t = c + \sum_{j=1}^p \rho_j \cdot \varphi_{t-j} + \varepsilon_t,\quad \varepsilon_t \sim N(0,\sigma_\varepsilon^2),$
    where $\varphi_t$ is the transaction fee (txn_fee) for period $t$, and
    $\Theta:=\{(c,\rho_1,\dots,\rho_p, \sigma_\varepsilon^2): c\in\mathbb{R},\ \rho_j\in\mathbb{R} \ \forall\ 1\leq j\leq p,\ \sigma_\varepsilon>0\}$
    is the parameter space, with $\sigma_\varepsilon^2$ as a nuisance parameter.

  • Directed acyclic graph for the parameterized pipeline:
    [Figure: pipeline]
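
Estimating the $AR(p)$ model as a linear regression requires turning the series into a design matrix of lagged values. The sketch below shows one way to do that; `make_lagged` and `add_lags` are hypothetical helpers written for illustration and are not the custom transformers used in the pipeline.

```python
from typing import List, Sequence, Tuple


def make_lagged(series: Sequence[float], p: int) -> Tuple[List[List[float]], List[float]]:
    """Return (X, y) where each row of X holds [y_{t-1}, ..., y_{t-p}] for target y_t."""
    if p < 1 or p >= len(series):
        raise ValueError("p must satisfy 1 <= p < len(series)")
    X = [[series[t - j] for j in range(1, p + 1)] for t in range(p, len(series))]
    y = list(series[p:])
    return X, y


def add_lags(df, col: str = "txn_fee", p: int = 5, order_col: str = "date"):
    """Spark version: add columns `<col>_lag1..p` and drop rows with missing lags."""
    from pyspark.sql import Window, functions as F  # lazy import; needs pyspark

    w = Window.orderBy(order_col)  # unpartitioned window: fine for a short daily series
    for j in range(1, p + 1):
        df = df.withColumn(f"{col}_lag{j}", F.lag(col, j).over(w))
    return df.dropna()


X, y = make_lagged([1.0, 2.0, 3.0, 4.0, 5.0], p=2)
```

The first `p` observations have no complete set of lags, so they appear only as regressors and never as targets.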

3.2 Hyperparameter tuning with grid search

Since Spark ML's CrossValidator and TrainValidationSplit split the data randomly, which is not appropriate for time series data, I built my own custom functions:

  • TimeSeriesSplit: Custom index-based train-test split function suitable for time series data
  • LRTuning: Custom grid search function for model evaluation/tuning

Hyperparameters fixed: maxIter=10, lags=5
Hyperparameters tuned: regParam $\lambda\in[0.01,0.1]$, elasticNetParam $\alpha\in\{0,1\}$
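
The idea behind an index-based split and a small grid search can be sketched as follows. This is not the code of TimeSeriesSplit or LRTuning: the function names, the 80/20 split, and the use of the two endpoints of the regParam range as grid values are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def time_series_split(n_obs: int, train_frac: float = 0.8) -> Tuple[range, range]:
    """Split indices in time order: the first block trains, the remainder tests."""
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must lie strictly between 0 and 1")
    cut = int(n_obs * train_frac)
    return range(0, cut), range(cut, n_obs)


def grid_search(fit_and_score: Callable[..., float],
                grid: Dict[str, Sequence]) -> Tuple[dict, float]:
    """Try every parameter combination and return (best_params, best_score).

    `fit_and_score` is any callable that fits on the training block and
    returns an error metric on the test block; lower is better.
    """
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)
        if best is None or score < best[1]:
            best = (params, score)
    return best


train_idx, test_idx = time_series_split(10, train_frac=0.8)
grid = {"regParam": [0.01, 0.1], "elasticNetParam": [0, 1]}  # illustrative grid values
```

Because the split never shuffles, every test observation comes after every training observation, which avoids leaking future information into the fit.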

3.3 Forecasting
I run sample forecasts using the tuned hyperparameters, and compare the results with a baseline OLS model.
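
For reference, a one-step-ahead forecast from fitted $AR(p)$ coefficients and the RMSE used to compare models can be written in a few lines. `ar_forecast` and `rmse` are hypothetical helper names; the notebook itself relies on Spark ML for prediction and evaluation.

```python
from math import sqrt
from typing import Sequence


def ar_forecast(history: Sequence[float], c: float, rhos: Sequence[float]) -> float:
    """One-step-ahead AR(p) forecast: c + sum_j rho_j * y_{t-j}.

    `history` is in time order, so history[-1] is the most recent value.
    """
    if len(history) < len(rhos):
        raise ValueError("need at least p past observations")
    return c + sum(rho * history[-j] for j, rho in enumerate(rhos, start=1))


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual))
```

For example, `ar_forecast([1.0, 2.0, 3.0], c=0.0, rhos=[1.0, 0.5])` combines the latest value with half of the one before it.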

Numerical results (more details in notebook)

Time series plot: transaction fees by block and date

[Figures: txn_fee_by_block, txn_fee_by_date]

Average gas price by date (in wei)

[Figure: gas_price_by_date]

Test for stationarity

Augmented Dickey-Fuller test results (frequency=daily):

| Statistic | Value |
| --- | --- |
| ADF Statistic | -2.293400 |
| p-value | 0.174105 |
| Critical Value (1%) | -3.4423174665535385 |
| Critical Value (5%) | -2.866818952732754 |
| Critical Value (10%) | -2.569581505602171 |
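
The ADF null hypothesis is that the series has a unit root, and it is rejected only when the test statistic falls below the critical value. The short sketch below applies that rule to the numbers reported above; `adf_rejections` is a hypothetical helper, not part of time_series_tests.py.

```python
from typing import Dict


def adf_rejections(stat: float, critical_values: Dict[str, float]) -> Dict[str, bool]:
    """Map each significance level to True if the unit-root null is rejected."""
    return {level: stat < cv for level, cv in critical_values.items()}


# Values copied from the daily-frequency results above.
decisions = adf_rejections(
    -2.293400,
    {"1%": -3.4423174665535385, "5%": -2.866818952732754, "10%": -2.569581505602171},
)
```

Here the statistic lies above all three critical values, so the unit-root null is not rejected at the 1%, 5%, or 10% level, in line with the reported p-value of 0.174.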

Test for serial correlation

Sample correlogram (frequency=daily)
[Figure: correlogram]

Ljung-Box Q test results (frequency=daily):

| Lag (order) | Ljung-Box Q Statistic | p-value |
| --- | --- | --- |
| 1 | 212.34017225797706 | 0.00 |
| 5 | 890.9232986158576 | 0.00 |
| 10 | 1661.1058361378193 | 0.00 |
| 20 | 2920.6317981596794 | 0.00 |
| 30 | 4003.365183837403 | 0.00 |
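
For readers unfamiliar with the statistic, the Ljung-Box Q at lag $h$ is $Q = n(n+2)\sum_{k=1}^{h} \hat\rho_k^2/(n-k)$, where $\hat\rho_k$ is the sample autocorrelation at lag $k$. A minimal pure-Python sketch follows; the helper names are hypothetical, and in practice a library routine such as the one in statsmodels would also supply the chi-squared p-value.

```python
from typing import Sequence


def autocorr(x: Sequence[float], k: int) -> float:
    """Sample autocorrelation of x at lag k (biased estimator, common denominator)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(x: Sequence[float], h: int) -> float:
    """Ljung-Box Q statistic using autocorrelations at lags 1..h."""
    n = len(x)
    if not 1 <= h < n:
        raise ValueError("h must satisfy 1 <= h < len(x)")
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, h + 1))
```

Under the null of no serial correlation, Q is asymptotically chi-squared with $h$ degrees of freedom, so large values like those in the table lead to rejection.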

Model training & forecasting:

Baseline model: AR(1) with OLS (frequency=daily)
RMSE: 1673
Model coefficients:

| Coefficient | Value |
| --- | --- |
| $c$ | 2631.3209 |
| $\rho_1$ | 0.5940 |

[Figure: forecast_baseline]

Tuned model: 5-lag Lasso model (frequency=daily)
RMSE: 1210
Hyperparameters: regParam=0.01, elasticNetParam=1
Model coefficients:

| Coefficient | Value |
| --- | --- |
| $c$ | 1033.6 |
| $\rho_1$ | 0.2953 |
| $\rho_2$ | 0.1669 |
| $\rho_3$ | 0.1208 |
| $\rho_4$ | 0.1256 |
| $\rho_5$ | 0.1240 |

[Figure: forecast_tuned]

A side-by-side comparison of the baseline OLS model and the tuned Lasso model: [Figure: forecast_comparison]

Future work

  • Improving computational performance for conducting time series tests:
    Since the test functions were written to work with numpy arrays, the Spark DataFrames currently must be converted to pandas DataFrames before testing, which slows performance and causes out-of-memory issues when aggregating at the block level. One possible solution is to migrate the testing step from Spark to Dask for better performance with numpy functions.
  • Accounting for serial correlation in OLS estimation with Gordin's Central Limit Theorem:
    Since the sample exhibits serial correlation, it is more appropriate to estimate the asymptotic variance of the OLS estimator (and therefore the standard errors) using the Gordin conditions (see Hayashi (2011) Ch. 6.5-6.6). Because Spark's ML library does not provide a Generalized Method of Moments (GMM) estimator, this would require a new model class built on top of PySpark's LinearRegression class, which is beyond the scope of this project.
  • Parallelize hyperparameter tuning function:
    Currently not a serious bottleneck since the grid size is small and functions within each loop are parallel, but might benefit from further parallelization similar to Spark's CrossValidator.


Miscellaneous

This project is submitted as the final project for MACS-30123 Large Scale Computing for the Social Sciences, taught by Prof. Jon Clindaniel at UChicago in Spring 2023. Huge thanks to Jon and the TAs for all the help related to class material and this project.
