This repository contains anonymized versions of six datasets taken from IBM's DataStage™ production systems and used for frequent subgraph mining in the paper Refactoring ETL Flows in The Wild.
If you use these datasets in publications, please cite:
Dolev Adas, Ohad Eytan, Guy Khazma, Josep Sampé, and Paula Ta-Shma.
"Refactoring ETL Flows in The Wild."
In 2023 IEEE International Conference on Big Data (BigData), pp. 1581-1590. IEEE, 2023.
We also have a companion blog post to the paper, and a blog post describing the dataset creation and motivation.
Similar to the format used here and here, each dataset is a text file, where each line contains one of three options:
- `t # n` - represents the start of flow number `n`.
- `v x l` - represents a vertex with id `x` and label `l` (below we explain the process of deriving `l`).
- `e x y l` - represents an edge from vertex `x` to vertex `y` with label `l` (in our case, all the edge labels are `1`).
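The line format above can be read with a few lines of code. The following is a minimal sketch of such a reader, not part of this repository: the names `Flow` and `parse_flows` are hypothetical, and it assumes that vertex ids, labels, and flow numbers are all integers separated by whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    """One flow: a labeled directed graph (hypothetical helper type)."""
    flow_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (source id, target id, label)


def parse_flows(lines):
    """Parse an iterable of text lines in the 't / v / e' format into Flow objects."""
    flows = []
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # "t # n": start of flow number n
            current = Flow(flow_id=int(parts[2]))
            flows.append(current)
        elif kind == "v":
            # "v x l": vertex x with label l
            if current is None:
                raise ValueError("vertex line before any 't' line")
            current.vertices[int(parts[1])] = int(parts[2])
        elif kind == "e":
            # "e x y l": edge from vertex x to vertex y with label l
            if current is None:
                raise ValueError("edge line before any 't' line")
            current.edges.append((int(parts[1]), int(parts[2]), int(parts[3])))
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return flows
```

Because `parse_flows` accepts any iterable of lines, an open file object can be passed directly, for example `with open(path) as f: flows = parse_flows(f)`, where `path` points to one of the dataset files.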
As explained in more detail in the paper, each stage in a DataStage™ flow has parameters, and there are different options for deriving the label of a stage, depending on which patterns we are looking for. We call this process lifting.
Here, we are publishing two types of lifting:
- Simple: We take only the stage type as the label. This can be used to find general patterns and to build tools that assist flow authoring.
- Detailed: We also take stage parameters into account, aiming to include every parameter that matters when refactoring flows to use common subflows. Note that this is a work-in-progress prototype, so the selection may not be fully accurate (e.g., we may include parameters that we shouldn't, or omit ones that we should).
These values are hashed into unique integers to preserve our users' anonymity while keeping the structure of the flows and the ability to find common subgraphs using FSM algorithms. The hashes are not consistent between different datasets.
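To illustrate the idea of lifting followed by anonymization, here is a minimal sketch. It is not the pipeline used to produce the published files: the stage layout (a dict with a `type` key and an optional `params` dict), the function names `lift` and `anonymize`, and the use of a shuffled integer assignment in place of a real hash are all assumptions made for the example.

```python
import random


def lift(stage, detailed=False):
    """Derive a label value from a stage (hypothetical dict layout).

    Simple lifting keeps only the stage type; detailed lifting also
    includes the stage parameters, sorted by name so that parameter
    order does not affect the label.
    """
    if not detailed:
        return (stage["type"],)
    params = tuple(sorted(stage.get("params", {}).items()))
    return (stage["type"],) + params


def anonymize(label_values, seed=None):
    """Map each distinct label value to a unique integer.

    The assignment is shuffled, so the integers reveal nothing about the
    original values and differ from one dataset (or seed) to another.
    """
    distinct = list(dict.fromkeys(label_values))  # de-duplicate, keep order
    ids = list(range(len(distinct)))
    random.Random(seed).shuffle(ids)
    return dict(zip(distinct, ids))
```

With this sketch, two stages of the same type but with different parameters share a label under simple lifting and get distinct labels under detailed lifting, which is why the detailed datasets expose fewer but more specific common subgraphs.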
We thank the DataStage™ team for providing us the data and allowing us to share it with the community.
Although this GitHub repository is under the Apache-2.0 license, the actual datasets are released under the CDLA-Sharing-1.0 license. By downloading or using them, you agree to the terms of this license.