-
Notifications
You must be signed in to change notification settings - Fork 2
Initial development of Amdahl's Law plotting #28
Conversation
May consider just using Numpy, keep it simple https://matplotlib.org/stable/gallery/lines_bars_and_markers/simple_plot.html#sphx-glr-gallery-lines-bars-and-markers-simple-plot-py |
I can go either way on that. If the plotting code is intended to be black box like the Amdahl script is, then I don't mind making it more legible and showing the theoretical speedup line. If they're expected to code it up on their own, then a simpler plotting code is fine. One other reason I added the extra line was because my timing results were so far off from the expected values. Not sure if that's due to conda's mpirun or something present on both my Mac and my HPC. Adding the theoretical line could lead to discussions of "yeah, and your results are right on that line because this is a somewhat synthetic problem," "yeah, your results are off the line because measurements aren't perfect," or others rather than explaining why different learners' graphs look visibly different. |
It's possible, but I believe it would be more complicated to build the NumPy arrays than it is to build the Pandas DataFrame. The input data for the plot exists as a number of JSON files written by the compute nodes to disk, essentially one "row" per file. The process of building the DataFrame, as Mike programmed it, is relatively straight-forward to explain. If you can come up with a competitive solution with NumPy, I'd certainly be interested to see it. The plotting script is intended to be a gray box: learners can open the file and see what's going on inside, but it is not necessary to do so, and we will certainly not spend time in a Snakemake lesson on writing a Python plotting script. |
Possibly writing out a csv file rather than json would make it easier, but if Python will be required for this, then using Pandas is ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good! I added some suggestions, I hope they are helpful. The config.yaml
looks great with the comments.
code/Snakefile
Outdated
conda: | ||
"amdahl" | ||
shell: | ||
"python3 plot_amdahl.py {input} > {output}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snakemake does have a specific feature to handle scripts, so I think may be worth mentioning that. I'm not saying that this rule should be change to use the script
directive, but rather that users should be aware of that, maybe a callout box would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! I'm not sure how script
interacts with cluster
, though, and we want the plotting to be done on a compute node.
@bkmgit if you would like to offer a complete solution using NumPy, we will consider it, but that is outside the scope of this PR. |
Done tinkering with this for now. Here's what the scaling plot currently looks like. There's also a PR on amdahl that improves the random jitter code (to accept a jitter proportion on the CLI and only increase runtime, unless the proportion is negative, in which case only decrease runtime). If you're in a reviewing mood, please take a look! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I might be biased.
I think it's useful to get a version of this code into the repo. In the spirit of not letting the perfect be the enemy of the good, I favor merging. |
For #13. May still need some work on this, but here's a start for the Snakefile and Python code to convert the run*.log files into a speedup graph like the one shown below. Other items of note:
mpirun
, especially in an HPC environment, asmpirun
may query the scheduler to know how many processes it "should" run.mpirun
is deficient or if I'm missing something else. Some runs have been better than others, but for a synthetic example, I'd have expected my values to be practically perfect on every run. -- Followup: Trevor's results were basically perfect, so it's possibly a local problem. Will check later with current Amdahl code to verify.