CI Performance Tracking for v0.5 #13893
Comments
Will this include tracking performance on all three major operating systems (mac, linux distro, windows), to identify possible regressions affecting only one system? |
No, performance tracking will be limited to running on Ubuntu Linux (similar to Travis). We don't have the manpower / resources to do cross platform performance testing, at least initially. Part of the goal is to make all this testing infrastructure lightweight / modular enough that it can be run on a user's computer with minimal effort (another reason to use Julia and not something like CodeSpeed). This way volunteers (or organizations) could plug gaps in systems we don't support through automated CI while using the same benchmarking stack. |
+1 I think I can use this infrastructure to track OpenBLAS performance, too. |
In the future, if we had something like a periodic (e.g. weekly) benchmark pass separate from the "on-demand" CI cycle being discussed in this issue, we might consider doing a full OS sweep on occasion (not per-commit, though). But as @jakebolewski pointed out, we're not really concerned with cross-platform tracking at the moment, especially given our current resource limitations. |
Let's get it working on Linux. Running it regularly on other OSes can be a later goal. |
We can turn the webhook on now if you know how it'll need to be configured yet. |
I'm not sure yet what the payload URL is going to be, but I'll keep you posted with the details once we figure it out. |
Hey guys, sorry to be MIA for the last week or two. @jrevels asked me in private a little while ago to write up my plan about performance testing that I am 30% through enacting, and so I am going to data-dump it here so that everyone can see it, critique it, and help shape it moving forward to make something equally usable by all. I'm personally not so concerned with the testing methodology, statistical significance, etc. of our benchmarking. We have much more qualified minds to duke that out; what I'm interested in is the infrastructure: how do we make this easy to set up, easy to maintain, and easy to use? Here's my wishlist/design doc for performance eval of Base; this is completely separate from package performance tracking, which is of lower priority IMO.
Right now, 90% of what I do with Julia revolves around creative abuse of buildbot. I have a system set up where my Perftests.jl package gets run on every commit of
Right now, that includes 0.4 and 0.5, but could possibly include 0.3. Obviously, there will be tests that we don't want to run on older versions, or even tests that we will drop as APIs change. But having our test infrastructure independent of This is why I made the
I like making pretty visualizations. But I am nothing compared to what the rest of the Julia community is capable of, and I'd really like to make getting at and visualizing our data as easy as possible. To me, that means storing the data in something robust, public-facing, and easily queried. For our use cases, I think InfluxDB is a reasonably good choice, as I don't think reinventing the database wheel is a good use of our time, and it provides nice, standard ways of getting at the data. In my nanosoldier/Perftest.jl world, my next step would be to write a set of Julia scripts that parse the That server, being designed for timeseries and publicly available, would likely function better than anything we would cobble together ourselves, and would open the path to writing our own visualization software (a la codespeed) to even using something that someone else has already written (Kibana, Grafana, etc...). That's been my plan, and I'm partway toward it, but there are some holes, and I'm not married to my ideas, so if others have alternative plans I'd be happy to hear them and see how we can most efficiently move from where we are today, to where we want to be. I am under no illusions that I will be able to put a significant amount of work toward any proposal, so it's best if the discussion that comes out of this behemoth of a github post is centered around what others want to do, rather than what I want to do. Either that, or we just have patience until I can get around to this. |
Sounds about right. I'd prefer a tagged comment listener hook ( |
Another option might be a "run perftests" github label. Edit: ah, but you wouldn't be able to specify foo. |
Thanks for the write-up, @staticfloat. I've definitely been keeping in mind the things we've discussed when working on BenchmarkTrackers. I'd love it if you could check out the package when you have the time.
I've been advocating the trigger-via-comment strategy because I think it will encourage more explicitly targeted benchmark cycles that will make better use of our hardware compared to a trigger-via-push strategy. One is still able to trigger per-commit runs by commenting on the commit with the appropriate trigger phrase, and that way you don't have to clutter up your commit messages with benchmark-related jargon.
The logging component that BenchmarkTrackers uses for history management is designed to be swappable so that we can support third-party databases in the future. It currently only supports JSON and JLD serialization/deserialization, but there's nothing stopping us from extending that once we get the basic CI cycle going. |
Yes, please. The whole push-triggered model is so broken. Just because I pushed something doesn't mean I want to test it or benchmark it. And if I do, posting a comment is not exactly hard. I do think that we should complement comment-triggered CI and benchmarking with periodic tests on master and each release branch. |
I added a benchmark tag so we can tag performance related PR's that need objective benchmarks. |
@jrevels I don't have the bandwidth to meaningfully contribute to this, but my BasePerfTests.jl package serves a similar purpose to @staticfloat's, in that it was a thought experiment for disconnecting performance tests from the Julia version, and what a culture of adding a performance regression test for performance issues would look like (analogous to adding regression tests for bugs) |
CI performance tracking is now enabled! There are still some rough edges to work out, and features that could be added, but I've been testing the system on my Julia fork for a couple of weeks now and it's been stable. Here's some info on how to use this new system.
The Benchmark Suite
The CI benchmark suite is located in the BaseBenchmarks.jl package. These benchmarks are written and tagged using BenchmarkTrackers.jl.
Triggering Jobs
Benchmark jobs are submitted to MIT's hardware by commenting in pull requests or on commits. Only repository collaborators can submit jobs. To submit a job, post a comment containing the trigger phrase
The allowable syntax for the tag predicate matches the syntax accepted by
Examining Results
The CI server communicates back to the GitHub UI by posting statuses to the commit on which a job was triggered (similarly to Travis). Here are the states a commit status might take:
Failure and success statuses will include a link back to a report stored in the BaseBenchmarkReports repository. The reports are formatted in markdown and look like this. That's from a job I ran on my fork, which compared the master branch against the release-0.4 branch (I haven't trawled through the regressions caught there yet). Note that GitHub doesn't do a very good job of displaying commit statuses outside of PRs. If you want to check the statuses of a commit directly, I usually use GitHub.jl's
Rough Edges/Usage Tips
Finally, I'd like for anybody who triggers a build in the next couple of days to CC me when you do so, just so that I can keep track of how everything is going server-side and handle any bugs that may arise. P.S. @shashi @ViralBShah @amitmurthy and anybody else who uses Nanosoldier: |
Sounds great. How easy will it be to associate reports in https://github.com/JuliaCI/BaseBenchmarkReports to the commit/pr they came from? Posting a nanosoldier response comment (maybe one per thread with edits for adding future runs?) might be easier to access than statuses, though noisier. |
The reports link back to the triggering comment for the associated job, and also provide links to the relevant commits for the job. Going the other way, clicking on a status's "Details" link takes you to the report page (just like clicking on the "Details" link for a Travis status takes you to a Travis CI page). That only works in PRs, though. I'm onboard for getting @nanosoldier to post automated replies on commit comments (that's the last checkbox in this issue's description). I'm going to be messing around with that in the near future. |
Exciting stuff, @jrevels! |
This is really cool @jrevels, so glad you've taken this up. |
I just updated the CI tracking service to incorporate some recent changes regarding report readability and regression detection. I've started tagging BenchmarkTrackers.jl such that the latest tagged version corresponds to the currently deployed version. The update also incorporates the recent LAPACK and BLAS additions to BaseBenchmarks.jl (they're basically the same as the corresponding benchmarks in Base). Similar to BenchmarkTrackers.jl, I've started tagging the repo so that it's easy to see what versions are currently deployed. I depluralized the existing tags (e.g. |
Responding here to discussion in #14623.
This is exactly the intent of BenchmarkTrackers. Recently, my main focus has been setting up infrastructure for Base, but the end-goal is to have package benchmarks be runnable as part of PackageEvaluator. If they want to get a head start on things, package authors can begin using BenchmarkTrackers to write benchmarks for their own package.
After we use the existing infrastructure for a while, we could consider folding some unified version of the benchmarking stack (BaseBenchmarks.jl + Benchmarks.jl + BenchmarkTrackers.jl) into Base, and have a
|
That's a lot of code to bring into base, and the wrong direction w.r.t. #5155. I think we can add more automation and levels of testing that aren't exactly Base or its own tests or CI, but would run and flag breakages frequently. |
So it would be |
PackageEvaluator now lives at https://github.com/JuliaCI/PackageEvaluator.jl, a little bit of refactoring would be needed there to accept commits to test against programmatically. I've been patching that manually for my own runs but shouldn't be too bad to make more flexible. |
It wouldn't be out of the question to separate the infrastructure from the benchmark-specific stuff in BenchmarkTrackers.jl, and put it in a "Nanosoldier.jl" package that could be used to handle multiple kinds of requests delivered via comment. That way all job submissions to @nanosoldier could easily share the same scheduler/job queue. |
That might mean less new semi-duplicated code to write. You apparently already had BenchmarkTrackers set up to be able to use the same nanosoldier node I've been using manually, right? |
For testing new versions of the package, yeah. The master/slave nodes it uses are easily configurable. |
I'll be doing a ForwardDiff.jl sprint next week, but after that I'd be down to work on this - I pretty much know how to do it on the CI side of things. The challenging part might be learning how PackageEvaluator works under the hood, but that doesn't seem like it will be overly difficult. |
I'll help on that side since I'll want to use this right away. |
I'll also help in terms of providing advice drawn on any lessons I've learnt |
@jrevels Does runbenchmarks against a branch (e.g. master) run against the merge-base (e.g. of master and the current commit) or the tip of master? |
If the job is triggered in a PR, benchmarks will run on that PR's merge commit (i.e. the result of the head commit of the PR merged into the PR's base). If there's a merge conflict, and the merge commit doesn't exist, then the head commit of the PR is used instead. Comparison builds (specified by the |
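The commit-selection rule described above can be sketched in a few lines. This is only an illustration in Python, not the bot's actual code: the function name `select_benchmark_commit` and its parameters are hypothetical, and the sketch assumes a missing merge commit is represented as `None`.

```python
from typing import Optional


def select_benchmark_commit(head_sha: str, merge_sha: Optional[str]) -> str:
    """Pick the commit to benchmark for a job triggered in a PR.

    merge_sha is None when the PR has a merge conflict, so that no
    merge commit exists (an assumption of this sketch).
    """
    # Prefer the PR's merge commit; fall back to the PR's head commit.
    return merge_sha if merge_sha is not None else head_sha


# PR merges cleanly: the merge commit is benchmarked.
clean = select_benchmark_commit(head_sha="abc123", merge_sha="def456")
# PR has a conflict: the head commit is benchmarked instead.
conflicted = select_benchmark_commit(head_sha="abc123", merge_sha=None)
```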
We really need to be running this against master on a regular schedule and saving the results somewhere visible. Only getting a comparison when you specifically request one is not a very reliable way to track regressions. |
It's definitely been the plan for the on-demand benchmarking service to be supplemented with data taken at regularly scheduled intervals. Armed with the data we have from running this system for a while, I've been busy rewriting the execution process to deliver more reliable results, and that work is close to completion (I'm at the fine-tuning and doc-writing phase of development). After switching over to this new backend, the next step in the benchmarking saga will be to end our hacky usage of GitHub as the public interface to the data and set up an actual database instance, as @staticfloat originally suggested. We can then set up a cron job that benchmarks against master every other day or so and dumps the results to the database. |
Ref #16128, I'm reopening until this runs on an automated schedule. |
An update: @nanosoldier will be down for a day or two while I reconfigure our cluster hardware. When it comes back up, the CI benchmarking service will utilize the new BenchmarkTools.jl + Nanosoldier.jl stack I've been working on for the past couple of months. The BenchmarkTools package is a replacement for the Benchmarks + BenchmarkTrackers stack, while the Nanosoldier package provides an abstract job submission framework that we can use to add features to our CI bot (e.g. we can build the "run pkgeval by commenting" feature on top of this). A practical note for collaborators: Moving forward, you'll have to explicitly at-mention @nanosoldier before the trigger phrase when submitting a job. For example, instead of your comment containing this:
...you'll need this:
More @nanosoldier documentation can be found in the Nanosoldier.jl repo. |
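To illustrate the at-mention requirement, here is a rough Python sketch of a comment filter. It is not how Nanosoldier.jl parses comments: the function name `extract_job` is hypothetical, and the sketch assumes the mention and the trigger phrase are separated only by whitespace. The authoritative syntax is whatever the Nanosoldier.jl documentation specifies.

```python
import re
from typing import Optional


def extract_job(comment: str,
                bot: str = "@nanosoldier",
                trigger: str = "runbenchmarks") -> Optional[str]:
    """Return the trigger phrase (and the rest of its line) if the comment
    at-mentions the bot directly before it, else None.

    Assumption: mention and trigger are separated by whitespace only.
    """
    pattern = re.escape(bot) + r"\s+(" + re.escape(trigger) + r"\b.*)"
    match = re.search(pattern, comment)
    return match.group(1).strip() if match else None


# A bare trigger phrase no longer submits a job under this rule.
ignored = extract_job("runbenchmarks please")
# With the at-mention, the job text is picked up.
accepted = extract_job("Looks good. @nanosoldier runbenchmarks now")
```

Requiring the mention means ordinary discussion that happens to contain the trigger phrase cannot submit a job by accident.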
What needs to be done to get this running nightly and putting up a report somewhere people can see it? |
The easiest thing to do would just be to set up a cron job that causes @nanosoldier to submit CI jobs to itself on a daily basis. My work during the week has to be devoted to paper-writing at the moment, but I can try to set something up this weekend. |
Starting today, @nanosoldier will automatically execute benchmarks on master on a daily basis. The generated report compares the current day's results with the previous day's results. All the raw data (formatted as JLD) is compressed and uploaded along with the report, so you can easily clone the report repository and use BenchmarkTools to compare any day's results with any other day's results. |
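As a rough picture of what a day-over-day comparison involves, here is a minimal Python sketch. It is not BenchmarkTools' judgment logic: the function name `find_regressions`, the 5% tolerance, and the sample timings are all made up for illustration.

```python
from typing import Dict, List, Tuple


def find_regressions(previous: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.05) -> List[Tuple[str, float]]:
    """Return (benchmark name, time ratio) pairs whose ratio of current to
    previous time exceeds 1 + tolerance. Only benchmarks present in both
    result sets are compared."""
    regressions = []
    for name in sorted(previous.keys() & current.keys()):
        ratio = current[name] / previous[name]
        if ratio > 1.0 + tolerance:
            regressions.append((name, ratio))
    return regressions


# Hypothetical timings (arbitrary units) for two consecutive days.
yesterday = {"sort": 100.0, "sum": 20.0, "parse": 50.0}
today = {"sort": 130.0, "sum": 20.5, "parse": 49.0}
flagged = find_regressions(yesterday, today)
```

Here only `sort` is flagged. The 2.5% change in `sum` falls inside the tolerance, which is the kind of noise band a real comparison also has to allow for.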
The first daily comparison against a previous day's build has executed successfully, so I'm going to consider this issue resolved. There is definitely still work to be done here - switching over to a real database instead of abusing git, adding more benchmarks to BaseBenchmarks.jl, and making a site that visualizes the benchmark data in a more discoverable way are things that I'd love to see happen eventually. Any subsequent issues - errors, improvements, etc. - can simply be raised in the appropriate project repositories in JuliaCI. As before, we can still use this thread for PSAs to the wider community when user-facing changes are made. |
Does it make sense to run against the previous release as well? Otherwise, I'm thinking regressions could be introduced bit by bit, where each part is small enough to disappear in the noise. |
That does make sense to me, though if we're collecting absolute numbers and expect the hardware to remain consistent, probably no need to run the exact same benchmarks against the exact same release version of Julia every day? Maybe re-run release absolute numbers a little less often, once or a handful of times per week? |
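The concern about small regressions compounding can be made concrete with some arithmetic. The numbers below are invented for illustration (a 1% slowdown per day against a 5% noise tolerance); nothing here reflects measured data.

```python
def cumulative_slowdown(step: float, n_steps: int) -> float:
    """Total time ratio after n_steps successive slowdowns of `step` each."""
    return (1.0 + step) ** n_steps


tolerance = 0.05   # hypothetical noise tolerance for a single comparison
step = 0.01        # hypothetical 1% slowdown introduced each day
days = 20

# Each day-over-day comparison sees only a 1% change, below the tolerance.
day_over_day_flagged = step > tolerance

# Compared against a fixed release baseline, the slowdowns compound.
total = cumulative_slowdown(step, days)  # roughly 1.22
against_release_flagged = (total - 1.0) > tolerance
```

Under these assumptions no single daily report flags anything, while a comparison against the fixed baseline shows a slowdown of about 22%.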
Let's continue this discussion in JuliaCI/Nanosoldier.jl#5. |
As progress moves forward on v0.5 development (especially JuliaLang/LinearAlgebra.jl#255), we'll need an automated system for executing benchmarks and identifying performance regressions.
The Julia group has recently purchased dedicated performance testing hardware which is hosted at CSAIL, and I've been brought on to facilitate the development of a system that takes advantage of this hardware. I'm hoping we can use this issue to centralize discussion/development efforts.
Desired features
Any implementation of a performance tracking system should:
parallel benchmarks, or only string benchmarks)
Feel free to chime in with additional feature goals - the ones listed above just outline what I've been focusing on so far.
Existing work
In order to make progress on this issue, I've been working on JuliaCI/BenchmarkTrackers.jl, which supplies a unified framework for writing, executing, and tracking benchmarks. It's still very much in development, but currently supports all of the goals listed above. I encourage you to check it out and open up issues/PRs in that repository if you have ideas or concerns. Just don't expect the package to be stable yet - I'm in the process of making some drastic changes (mainly to improve the testability of the code).
Here are some other packages that any interested parties will want to be familiar with:
@time.
Eventually we will want to consolidate "blessed" benchmarking/CI packages under an umbrella Julia group on Github (maybe JuliaCI)?
I saw that Codespeed was used for a while, but that effort was abandoned due to the burden of maintaining the server through volunteer effort. I've also been told that Codespeed didn't integrate well with the Github-centric CI workflow that we've become accustomed to.
Resolving this issue
Taking into account the capabilities previously mentioned, I imagine that a single CI benchmark cycle for Base would go through these steps:
a CI server to build Julia at that commit
BenchmarkTrackers is used to process an external package of performance tests (similar to Perftests, but written using BenchmarkTrackers)
If we can deliver the workflow described above, I think that will be sufficient to declare this issue "resolved."
Next steps
The following still needs to get done:
Regression Examples
Here are some examples from regression-prone areas that I think could be more easily moderated with an automated performance tracking system:
@parallel
#12794, Regression in task switching performance? #12223)