Skip to content

Latest commit

 

History

History
100 lines (79 loc) · 6.23 KB

DESIGN.org

File metadata and controls

100 lines (79 loc) · 6.23 KB

Problem Statement

“The 2022 DEBS Grand Challenge focuses on real-time complex event processing of real-world high-volume tick data provided by Infront Financial Technology (https://www.infrontfinance.com/). In the data set, about 5000+ financial instruments are being traded on three major exchanges over the course of a week. The goal of the challenge is to efficiently compute specific trend indicators and detect patterns resembling those used by real-life traders to decide on buying or selling on the financial markets” (https://2022.debs.org/call-for-grand-challenge-solutions/)

Proposed Solution

Given the problem of analyzing market data in real-time, a streaming system is the obvious choice. We will be using Flink for the data analysis and Kafka for the PubSub part of the problem. For data visualization, we will look into solutions such as Python or Javascript due to the vast array of pre-built libraries for visualization.

Expectations

A real-time data analysis and visualization system that could be used in the real world. The primary metric, EMA, could easily be expanded or replaced with different ones, making our system modular, extensible, and horizontally scalable.

Experimental Plan

The DEBS challenge provides a test dataset along with a benchmarking platform to test our performance. The DEBS Challenge platform also provides VMs for deployment of our code and testing there.

Per DEBS, “The data set contains 289 million tick data events covering 5504 equities and indices that are traded on three European exchanges: Paris (FR), Amsterdam (NL), and Frankfurt/Xetra (ETR). All tick data events for security types equities (e.g., Royal Dutch Shell, Siemens Healthineers) and indices are contained as they have been captured on these days. However, some event notifications appear to come with no payload. This is due to the fact that this challenge requires only a small subset of attributes to be evaluated; other attributes have been eliminated from the data set to minimize its overall size while keeping the amount of events to process unchanged.”

This data set was captured by Infront Financial Technology for a complete week from 11/8/2021-11/14/2021.

Success Indicators

The project’s success should first be measured by correctness. Later, we want to challenge ourselves by making the throughput higher and latency lower. Latency would be measured by how long our system takes to process a single window of data. Throughput would be the size of the window the system can process.

Deliverables are:

Query 1: Quantitative Indicators Per Symbol

Implementation of an operator that calculates the EMA for a given symbol

Query 2: Breakout Patterns via Crossovers

Implementation of an operator that uses Query 1 to track 2 EMAs for a given symbol and detect patterns (bullish/bearish) in crossovers of EMA values

Reply to subscriptions

Everytime a bullish or bearish breakout pattern is detected for a given symbol, reply to the subscriber and notify

Task Assignments

Here is a tentative checklist of assignments for this project. We will be meeting on at least a weekly basis physically.

There are two main parts to this project:

Initial Proof of Concept

  • Getting VMs working
  • Example code running
  • Ask for sample correct output
  • EMA calculation through Flink
  • Breakout pattern detection
  • Subscription handling, filter the query 2 breakout events to be only for the symbols that are subscribed to
  • Testing and Benchmarking

Optimizing

  • Work on optimizing query 1
  • Work on optimizing query 2
  • Sample data visualization drafts
  • EMA visualization
  • Breakout visualization
  • Documentation
  • Testing and Benchmarking
  • Short Paper draft

Task Distribution

Here is an initial task distribution

Xavier

  • EMA calculation through Flink
  • Ask organizers for sample input and output

Quang

Breakout pattern detection

Quan

Breakout pattern detection

Shekhar

EMA calculation through Flink

Ryte

Example code running

Q&A

Which flink operators are you going to use to implement windows?

  • keyBy() → we will partition the data by symbol, which is unique
  • window() → takes in a keyed stream and allows tumbling windows, our exact applicaiton. We also need to operate on the entire window once it is completed and retain until trigger.
var windowedStream = dataStream
    .keyBy(value -> value.symbolID)
    .window(TumblingEventTimeWindows.of(Time.seconds(300)));  // 300 = 5mins


// some function that incrementally updates the latest price seen for the window
// if at a trigger/window end, calculate the EMA and throw away last price seen and update previous EMA
// ??? unknown function ??? We are not sure which function we will be using here, yet.

What state do you need to keep and for how long?

  • EMAs for every symbol/window, we keep this until the next window’s EMA is calculated.
  • We need to keep two EMAs for each symbol and the latest price seen in a window for a symbol until that EMA is calculated.
  • We must do this EMA calculation for all stocks.

Are you going to compute the window aggregate incrementally or on trigger?

There are two things to update:

  • We want to update the last price observed for a symbol on addition to a window incrementally iff it is later than the latest event-time seen for the window (not system-time, when you receive the event).
  • For the actual EMA calculation, you must wait until window is complete (triggered), the last price is actually seen for the window.

Can you share any partial results across windows?

No, EMA calculations are unique to a window, but you use the last window’s EMA value to calculate the current.

How are you going to parallelize the computation? Can you even parallelize the computation?

Because the symbol’s EMA values are independent of each other, these calculations can be parallelized. Flink allows any keys to be parallelized. Because we are keying by symbol, we are parallelizing by them as well.

Since we have 5,000 unique symbols, we could have as much parallelization as 5,000 at once. However, it may be better to “categorize” or group symbols in a different way so the parallel processes are running batches of symbols. This is because each thread causes lots of overhead and only calculating one symbol.