Simulation code for the study "The Upper Bound of Information Diffusion in Code Review"
The underlying idea of our in-silico experiment is simple: We simulate an artificial information diffusion process in empirical communication networks emerging from code review and measure the minimal paths between all participants, which give the upper bound of information diffusion. The cardinality of reachable participants indicates how far (RQ 1) and the minimal distances between participants indicate how fast (RQ 2) information can spread through the communication channels that code review provides under best-case assumptions.
Yet, since communication, and therefore information diffusion, is (1) inherently a time-dependent process that is (2) not necessarily bilateral (often more than two participants exchange information in a code review), traditional graphs cannot render information diffusion without dramatically overestimating it (Dorner et al. 2022). Therefore, we use time-varying hypergraphs to model the communication network and measure the minimal paths between all vertices. Since a hypergraph is a generalization of a traditional graph, traditional graph algorithms (e.g., Dijkstra's algorithm) for determining minimal distances between vertices can be used.
The connotation of minimal is two-fold in time-varying hypergraphs: A distance between two vertices can be topological or temporal. This means a minimal path in a time-varying hypergraph can be the shortest (fewest hops), the fastest (shortest duration), or the foremost (earliest arrival) path between two vertices. These different notions of a minimal path enable us to understand how fast and how far information can spread through code review.
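To make the temporal notion concrete, here is a minimal Python sketch that computes foremost (earliest-arrival) times in a toy time-varying hypergraph with a Dijkstra-style search. It is not the implementation used in this repository: the function name foremost_arrival, the representation of a code review as a (timestamp, participants) pair, and the example data are assumptions made for illustration only.

```python
import heapq
from math import inf

# Toy data (hypothetical): each code review is a hyperedge with one timestamp
# and the set of participants who can exchange information at that time.
reviews = [
    (1, frozenset({"alice", "bob"})),
    (2, frozenset({"bob", "carol", "dave"})),
    (1, frozenset({"dave", "erin"})),
]


def foremost_arrival(hyperedges, source, start=0):
    """Earliest time at which information from `source` can reach each vertex."""
    # Index the hyperedges by participant for quick lookup.
    incident = {}
    for time, members in hyperedges:
        for vertex in members:
            incident.setdefault(vertex, []).append((time, members))

    arrival = {source: start}
    queue = [(start, source)]
    while queue:
        now, vertex = heapq.heappop(queue)
        if now > arrival[vertex]:
            continue  # stale queue entry, a better arrival time is known
        for time, members in incident.get(vertex, []):
            if time < now:
                continue  # the review happened before the information arrived
            for other in members:
                if time < arrival.get(other, inf):
                    arrival[other] = time
                    heapq.heappush(queue, (time, other))
    return arrival


arrival = foremost_arrival(reviews, "alice")
print(arrival)
```

In this toy example, erin is not reached although dave and erin share a review: their review took place before the information arrived at dave. A static graph would ignore this ordering and report erin as reachable, which illustrates why traditional graphs overestimate information diffusion.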
For more details on time-varying hypergraphs in general and on modelling communication networks that emerge from code review with time-varying hypergraphs, have a look at Dorner et al. 2022.
The simulation requires Python 3.10 or higher. Due to the significant performance improvements in Python 3.11 and the heavy CPU workload of the simulation, Python 3.11 is highly recommended!
The project depends on two external libraries: tqdm and pandas. Install them via
python3 -m pip install -r requirements.txt
For a faster initial loading of the communication network, you can optionally install orjson via pip:
python3 -m pip install orjson
If orjson is not installed, Python's built-in json module is used instead.
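Such an optional dependency is commonly handled with a guarded import. The following is a minimal sketch of that pattern, not the repository's actual loading code; the helper names parse_json and load_json_file are hypothetical.

```python
import json

try:
    import orjson  # optional, faster third-party JSON library
except ImportError:
    orjson = None  # fall back to the standard library


def parse_json(raw: bytes):
    """Parse JSON bytes with orjson if available, otherwise with json."""
    if orjson is not None:
        return orjson.loads(raw)
    return json.loads(raw)


def load_json_file(path):
    """Read a file in binary mode and parse it (hypothetical helper)."""
    with open(path, "rb") as file:
        return parse_json(file.read())
```

Both libraries accept bytes as input, so the file can be read in binary mode regardless of which parser ends up being used.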
To run the full simulation, use
python3 -m simulation.run
Please note that, depending on your hardware, the complete simulation may run for several days and max out the CPU. On an Apple MacBook with an M1 Max, it takes about three full days to complete. The simulation is highly parallelized, which means: the more cores, the faster. We also recommend at least 64 GB of RAM and at least 12 GB of available storage for the results.
The simulation provides the following options:
- --select <name 1> <name 2> ... to select a subset of the available code review networks,
- --vertex_dijkstra to use a vertex-based implementation of Dijkstra's algorithm (which tends to be slower),
- --num_processes to limit the number of processes.
For an overview of all options, use python3 -m simulation.run --help.
The code review communication networks are in the subfolder data/networks; the simulation results are stored in data/minimal_paths.
So far, the simulation provides only a rudimentary test setup. You can run all tests via
python -m unittest discover
The tests also run via GitHub Actions.
To verify your results against ours, compare the MD5 hashes of your results (for example, via md5 ./data/minimal_distances/*.bz2 on macOS or md5sum ./data/minimal_distances/*.bz2 on Linux) with the following MD5 hashes.
trivago.pickle.bz2 64c97c8ddb1e67cb70bfe297ad81c4ed
trivago.csv.bz2 a5e1a6d5230ac8c1888a711bd91f0420
spotify.pickle.bz2 c434b887fcf449dc7195cc428260b35c
spotify.csv.bz2 259532c46779df2702bcff0fa6c7932f
microsoft.pickle.bz2 f5b0beb747705fe3fcf4a84191bba937
microsoft.csv.bz2 08e93558473fb2b0a00de90e608901a3
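If neither md5 nor md5sum is at hand, the comparison can also be done with Python's standard library. The sketch below is one possible way to do this and not part of the repository; the function names md5_of_file and verify are hypothetical, and the results directory is left as a parameter because it depends on where your result files are stored.

```python
import hashlib
from pathlib import Path

# Expected MD5 hashes, copied from the list above.
EXPECTED = {
    "trivago.pickle.bz2": "64c97c8ddb1e67cb70bfe297ad81c4ed",
    "trivago.csv.bz2": "a5e1a6d5230ac8c1888a711bd91f0420",
    "spotify.pickle.bz2": "c434b887fcf449dc7195cc428260b35c",
    "spotify.csv.bz2": "259532c46779df2702bcff0fa6c7932f",
    "microsoft.pickle.bz2": "f5b0beb747705fe3fcf4a84191bba937",
    "microsoft.csv.bz2": "08e93558473fb2b0a00de90e608901a3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED):
    """Map each expected file name to True if its hash matches, else False.

    Missing files are reported as False.
    """
    results = {}
    for name, expected_hash in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected_hash
    return results
```

Calling verify with the path of your results folder returns a dictionary that shows at a glance which files match. Reading in chunks keeps memory usage low even for large result files.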
We also provide a minimal unit test that compares your hashes with those from Zenodo. It requires requests (install via pip3 install requests) and a Zenodo access token. Run the unit test with the following commands:
export ZENODO_TOKEN=<insert token here>
python3 -m unittest tests/test_results.py
Please note: This simulation uses pickle protocol version 5. Future protocol versions may produce different hashes if the internals change. The .csv files, however, must always produce the same hashes.
Because of the long runtime of the simulation, we provide precomputed results via Zenodo. You can download the results and place the .pickle and .csv files in the subfolder data/minimal_paths. Consider verifying the .pickle and .csv files (see Verification).
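Outside the notebooks, a bz2-compressed pickle file can be opened with the standard library alone. The sketch below assumes that the result files are bz2-compressed pickles, as the file names in the verification list suggest; the helper name load_pickle_bz2 is hypothetical, and the structure of the unpickled object is not specified here.

```python
import bz2
import pickle


def load_pickle_bz2(path):
    """Decompress a .pickle.bz2 file and return the unpickled object.

    Only unpickle files from sources you trust (and have verified),
    because unpickling can execute arbitrary code.
    """
    with bz2.open(path, "rb") as file:
        return pickle.load(file)
```

The .csv.bz2 files can be read in a similar way by opening them with bz2.open in text mode and passing the file object to a CSV reader.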
To visualize the results and reproduce the tables and figures of the publication, see the Jupyter notebooks in the subfolder notebooks/.
Thanks a lot
- Andreas Bauer for your valuable feedback in countless discussions.
- Students of the course Software Testing in 2023 for their extraordinary efforts in developing a test suite for this project.
Copyright © 2023 Michael Dorner
This work is licensed under the MIT license.