“The 2022 DEBS Grand Challenge focuses on real-time complex event processing of real-world high-volume tick data provided by Infront Financial Technology (https://www.infrontfinance.com/). In the data set, about 5000+ financial instruments are being traded on three major exchanges over the course of a week. The goal of the challenge is to efficiently compute specific trend indicators and detect patterns resembling those used by real-life traders to decide on buying or selling on the financial markets” (https://2022.debs.org/call-for-grand-challenge-solutions/)
Given the problem of analyzing market data in real-time, a streaming system is the obvious choice. We will be using Flink for the data analysis and Kafka for the PubSub part of the problem. For data visualization, we will look into solutions such as Python or Javascript due to the vast array of pre-built libraries for visualization.
A real-time data analysis and visualization system that could be used in the real world. The primary metric, EMA, could easily be expanded or replaced with different ones, making our system modular, extensible, and horizontally scalable.
The DEBS challenge provides a test dataset along with a benchmarking platform to test our performance. The DEBS Challenge platform also provides VMs for deployment of our code and testing there.
Per DEBS, “The data set contains 289 million tick data events covering 5504 equities and indices that are traded on three European exchanges: Paris (FR), Amsterdam (NL), and Frankfurt/Xetra (ETR). All tick data events for security types equities (e.g., Royal Dutch Shell, Siemens Healthineers) and indices are contained as they have been captured on these days. However, some event notifications appear to come with no payload. This is due to the fact that this challenge requires only a small subset of attributes to be evaluated; other attributes have been eliminated from the data set to minimize its overall size while keeping the amount of events to process unchanged.”
This data set was captured by Infront Financial Technology for a complete week from 11/8/2021-11/14/2021.
The project’s success should first be measured by correctness. Later, we want to challenge ourselves by making the throughput higher and latency lower. Latency would be measured by how long our system takes to process a single window of data. Throughput would be the size of the window the system can process.
Implementation of an operator that calculates the EMA for a given symbol
Implementation of an operator that uses Query 1 to track 2 EMAs for a given symbol and detect patterns (bullish/bearish) in crossovers of EMA values
Everytime a bullish or bearish breakout pattern is detected for a given symbol, reply to the subscriber and notify
Here is a tentative checklist of assignments for this project. We will be meeting on at least a weekly basis physically.
There are two main parts to this project:
- Getting VMs working
- Example code running
- Ask for sample correct output
- EMA calculation through Flink
- Breakout pattern detection
- Subscription handling, filter the query 2 breakout events to be only for the symbols that are subscribed to
- Testing and Benchmarking
- Work on optimizing query 1
- Work on optimizing query 2
- Sample data visualization drafts
- EMA visualization
- Breakout visualization
- Documentation
- Testing and Benchmarking
- Short Paper draft
Here is an initial task distribution
- EMA calculation through Flink
- Ask organizers for sample input and output
Breakout pattern detection
Breakout pattern detection
EMA calculation through Flink
Example code running
keyBy()
→ we will partition the data by symbol, which is uniquewindow()
→ takes in a keyed stream and allows tumbling windows, our exact applicaiton. We also need to operate on the entire window once it is completed and retain until trigger.
var windowedStream = dataStream
.keyBy(value -> value.symbolID)
.window(TumblingEventTimeWindows.of(Time.seconds(300))); // 300 = 5mins
// some function that incrementally updates the latest price seen for the window
// if at a trigger/window end, calculate the EMA and throw away last price seen and update previous EMA
// ??? unknown function ??? We are not sure which function we will be using here, yet.
- EMAs for every symbol/window, we keep this until the next window’s EMA is calculated.
- We need to keep two EMAs for each symbol and the latest price seen in a window for a symbol until that EMA is calculated.
- We must do this EMA calculation for all stocks.
There are two things to update:
- We want to update the last price observed for a symbol on addition to a window incrementally iff it is later than the latest event-time seen for the window (not system-time, when you receive the event).
- For the actual EMA calculation, you must wait until window is complete (triggered), the last price is actually seen for the window.
No, EMA calculations are unique to a window, but you use the last window’s EMA value to calculate the current.
Because the symbol’s EMA values are independent of each other, these calculations can be parallelized. Flink allows any keys to be parallelized. Because we are keying by symbol, we are parallelizing by them as well.
Since we have 5,000 unique symbols, we could have as much parallelization as 5,000 at once. However, it may be better to “categorize” or group symbols in a different way so the parallel processes are running batches of symbols. This is because each thread causes lots of overhead and only calculating one symbol.