Identify Computational Bottlenecks in Zgoubi #37
This has been at least partially done in SOW1. The takeaway is: a full 10.5% of the runtime is devoted to computing sines and cosines (`__sincos`).
What additional information are we looking for to consider this issue resolved? Some work can be done on a different platform to look at metrics like L2 cache miss rate and the percentage of cycles stalled waiting on resources, or to identify memory bandwidth/latency problems with ParaTools ThreadSpotter. However, these profiling results agree pretty well with the first study.
There are many things we need to look at here. This is for SOW2, which has yet to begin.
I agree, I’m just looking to better define the problem.
@robnagler Are we authorized to start work on SOW 2? If so, @dtabell, shall we plan to talk at the usual time next Tuesday?
SOW2 conversation is happening offline. The acceptance requirements for this Issue are still undecided. Here are my thoughts about what was already provided in #4. When I look at the inclusive graph, I see that transf.f consumes 88 seconds to __sincos's 32 seconds. Maybe I'm not reading it right. I can't "scroll right" on the png, but there are many unknown and direct libc calls made from transf.f. One thing I would like is more raw data and fewer graphs. I like being able to sort the results. I don't know what TAU can output, but a simple CSV or other text output would make analysis much easier. This would also allow us to measure improvements programmatically. What I would expect is less discussion about cache utilization, and more understanding of what algorithms are being used that could possibly be replaced by more modern approaches. For example, in the __sincos case, would a table lookup or a memoized solution be more efficient? What's doing all those sin/cos calls and why? Is there a better algorithm at the outer scope? Once we understand the algorithms, we'll have a better idea of how parallelization can proceed.
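For concreteness, the table-lookup idea raised above might look roughly like the following. This is a hypothetical sketch, not zgoubi code: the table size, the nearest-entry lookup, and the lack of careful range reduction are all assumptions, and the accuracy loss (on the order of the table spacing) would need to be weighed against zgoubi's requirements.

```fortran
module sincos_table
  implicit none
  integer, parameter :: nt = 4096                       ! table resolution (assumed)
  real(8), parameter :: two_pi = 6.283185307179586d0
  real(8) :: sin_tab(0:nt-1), cos_tab(0:nt-1)
contains
  ! Fill the tables once at startup.
  subroutine init_table()
    integer :: i
    do i = 0, nt-1
      sin_tab(i) = sin(two_pi*real(i,8)/real(nt,8))
      cos_tab(i) = cos(two_pi*real(i,8)/real(nt,8))
    end do
  end subroutine init_table

  ! Nearest-entry lookup: trades accuracy (~two_pi/nt) for speed.
  subroutine fast_sincos(x, s, c)
    real(8), intent(in)  :: x
    real(8), intent(out) :: s, c
    integer :: i
    i = modulo(nint(x/two_pi*real(nt,8)), nt)
    s = sin_tab(i)
    c = cos_tab(i)
  end subroutine fast_sincos
end module sincos_table
```

Whether something like this actually pays off depends on where the sin/cos arguments come from and how much accuracy the tracking needs, which is exactly the algorithmic question asked above.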
@robnagler and @dtabell: @zbeekman is ready to start working on this issue. This is a good time for a teleconference on this to better define the objectives, an important outcome of which would be checkboxes that specify what accomplishments would represent the completion of this task.
NOTE: This post is currently a WIP.

# Performance Study of Zgoubi

NOTE: To make images larger, please click on them.

## Abstract

TAU Commander was used to gather performance data for zgoubi. TAU Commander provides an abstraction of the performance engineering workflow, and organizes instances of performance experiments.

## Methodology and Background

### TAU and TAU Commander Basics and Background

TAU Commander is a front end for the TAU Performance System® developed at the University of Oregon. TAU Commander simplifies the TAU Performance System® by using a structured workflow approach that gives context to a TAU user’s actions. This eliminates the troubleshooting step inherent in the traditional TAU workflow and avoids invalid TAU configurations. TAU Commander reduces the unique steps in the workflow by ~50%, and reduces the commands a user must know from around eight to exactly one.

TAU Commander activities are organized around three basic components: Target, Application and Measurement (TAM). This is illustrated in the figure below. The first, the target, describes the environment where data is collected. This includes the platform the code runs on, its operating system, CPU architecture, interconnect fabric, compilers, and installed software. The second is the application, which consists of the underlying items associated with the code being studied: whether it uses MPI, OpenMP, threads, CUDA, OpenCL, and such. The measurements define what data will be collected and in what format. Even though an application uses OpenMP or MPI, the measurements may or may not measure those items.

In addition to that basic structure, there are a couple more components that complete the TAU Commander interface. The first is the project. A project is the container for a developer’s grouping of defined activities, settings and system environments. Last is the experiment. An experiment consists of one target, one application and one measurement. One experiment is active at a time, and that is what will be executed when developers collect data. When an experiment is run, data is collected, and that completed data set is a trial. The project configuration, including any applications, targets, measurements, experiments and their associated trials, is stored together in a database managed by TAU Commander. A high level snapshot of this database and associated entities is provided by the `tau dashboard` command.

### Project Specific TAU Commander Usage and Setup

TAU Commander was installed into the user's home directory from the `unstable` branch:

```
cd ~/src
git clone --branch=unstable https://github.com/ParaToolsInc/taucmdr
cd taucmdr
make INSTALLDIR=${HOME}/taucmdr USE_MINICONDA=false install
export PATH="${HOME}/taucmdr/bin:$PATH"
```

Once installed, a new project can be initialized for zgoubi:

```
cd ~/src/zgoubi
tau init --linkage dynamic --compilers GNU
```

This will set the target name to the local host name and create default application and measurement configurations. A copy of the application configuration was made with `tau application copy zgoubi z-ptcl`, where `z-ptcl` is used for the version of zgoubi with the new particle update. By default, TAU Commander will create 5 measurements:
The three measurements that are anticipated to be of value to the daily development of zgoubi are:

In general, whether using sampling or instrumentation, TAU will throttle instrumentation of procedures which are called more than 100,000 times and take less than 10 ms to execute. This still results in calls to TAU's API, but TAU then returns immediately without pushing or popping a timer on the stack. Even so, these calls can still incur significant overhead, which is why the addition of a selective instrumentation file will improve the accuracy of the generated profiles and minimize artificial runtime dilation. The time (or hardware counter metrics) associated with a procedure that is throttled or skipped ends up being attributed to the first parent call that is not excluded or throttled.

For the purposes of the present study, additional measurements were created and used; however, they are either unavailable for daily use or not anticipated to be of much additional benefit to the zgoubi developer. Some are unavailable because VirtualBox does not allow direct access to CPU hardware performance counters, and, even if it did, these counters are CPU specific and often only available on server-grade CPUs. The others are not anticipated to be useful on a daily basis because they may incur additional overhead and cause the results to be difficult to interpret.

## Results

### Quantification of Measurement Overhead

First, the most general results are presented, and the measurement overhead is quantified for sampling and compiler-based instrumentation for each of the 4 test cases, across the development version and the new particle update version of the code. The table below lists the total runtime for each case and the measurement-induced runtime dilation.
The first four rows of the table represent the 4 test cases running the un-updated development branch of the code. The second 4 rows are those same test cases running the version of zgoubi with the new particle update algorithm. The first column of data gives the total program runtime as measured by the uninstrumented baseline run.

The "comp-inst" (compiler instrumentation) data and the "no I/O comp-inst" data are less straightforward. A selective instrumentation file was used to limit instrumentation to the most time consuming procedures, but different test cases spend time in different locations. The "comp-inst" measurement excludes all procedures that are throttled by TAU (procedures that are called more than 100,000 times and return within 10 ms), except for some I/O procedures that account for a significant portion of time in some of the tests.

The total runtimes reported for the development branch version of zgoubi seem relatively respectable: very good agreement with the corresponding uninstrumented runtimes, with the exception of the spin18GeV case, in which computation outweighs I/O but a non-trivial quantity of I/O still occurs. Here the instrumentation overhead appears to be about 7.5%. For the updated particle tracking algorithm, instrumentation appears to come at a bigger cost: 3 out of 4 cases took over 20% longer. To combat this apparent inflation of measured runtime, an additional measurement was created, using a different selective instrumentation file which also excludes procedures that are dominated by I/O. The result was that the total runtimes of the new particle update cases seemed to be within the measurement noise; however, the tests using the old particle update appeared to run faster than the un-instrumented program.

Despite these noisy findings, relative percentages of time spent in different program regions may be examined. One should keep in mind the following points when analyzing results:
### Sampling Profiles

Sampling is usually the fastest path to performance data and can give you line numbers for expensive lines as well as call path information. However, it will also pick up system libraries and can be a bit difficult to interpret. The plot below shows exclusive time for the four test cases, warmSnake, spinESRF_18GeV, spinSaturne and solenoid, in that order from top to bottom. These profiles use the development branch of zgoubi. The times are listed in seconds. Time spent in

From inspecting compiler-instrumentation profiles, presented further below, it is clear that the first case (window 0, corresponding to warmSnake) and the last case (window 3, corresponding to solenoid) are completely dominated by I/O heavy routines. Looking at the sampling based profiles above, we see that these two cases appear to spend a lot of time in

The updated tracking algorithm's sampling based profile for the top six most sampled procedures matches closely the findings above for the two I/O dominated cases shown in windows 0 and 3, warmSnake and solenoid. However, for the two computationally intensive spin calculations, the new particle update procedures make their way into the top six most sampled procedures. The sampling profiles for these cases are shown below. In particular, line 70 of the new particle update code reads:

```fortran
derivB(:,k) = matmul( dnB(:, 1:orderEnd(k)), monomsU(1:orderEnd(k)) )
```

There are three potential issues with this line:
### Instrumentation Based Profiles

While automatic source based instrumentation with TAU often yields the most accurate results with the least overhead, the parsers used by PDT and TAU have difficulty with both the very old deprecated constructs such as

Direct instrumentation, whether compiler based, manual, or automatic using PDT and its parsers, results in calls to TAU's API being inserted at the entrance to, and exit from, every procedure within zgoubi's source code. As a consequence, it is the most accurate form of profiling, but it also poses the greatest risk of artificially inflating runtimes if calls to the TAU instrumentation API are made too frequently and/or from procedures which do not contain any measurable work. As such, the performance engineer will typically start by instrumenting everything, and then narrow the files and procedures instrumented by use of a selective instrumentation file. This file consists of a list of procedures and/or files to exclude (blacklist) or include (whitelist). Using this approach, only the innermost kernels responsible for the majority of program runtime are instrumented, while procedures that are called with great frequency but contain little work can be excluded. (Any time spent in such a procedure will end up attributed to the first instrumented parent caller on the current call stack.)

Because testing was conducted across multiple cases with different paths taken through the code, two selective instrumentation files were generated for the development version of zgoubi and two were generated for the new particle update version. The first excludes all but the most time expensive computational and I/O procedures for both versions of the app. The second excludes everything except the most computationally expensive procedures, including even the I/O procedures that are important for the two most I/O heavy test cases. Below, cases 0 and 3 (warmSnake and solenoid) are presented using the selective instrumentation file that includes I/O. The two middle cases, 1 and 2 (spin_ESRF18GeV and spinSaturne), are not dominated by I/O; therefore, in order to accurately profile the computationally intensive procedures, the more restrictive instrumentation file that excludes I/O was used, to minimize inaccuracy and scrutinize the particle update procedures.

The profile above shows the compiler-based instrumentation profile for zgoubi from the main development branch. The profile shows exclusive time in each procedure as a percentage of total program runtime. Unlike the sampling based profiles shown earlier, here the
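For reference, TAU selective instrumentation files follow a simple block-list format; a minimal sketch is shown below. The routine and file names here are placeholders rather than the actual lists used in this study ('#' acts as the wildcard for routine names and '*' for file names).

```
BEGIN_EXCLUDE_LIST
SOME_SMALL_HELPER
UTILITY_#
END_EXCLUDE_LIST

BEGIN_FILE_EXCLUDE_LIST
io_heavy_*.f
END_FILE_EXCLUDE_LIST
```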
Examining the profiles for the two computationally intensive spin cases (cases 1 and 2), it becomes evident that the majority of time is attributed to the

If we compare the profiles for these two computationally intensive cases to the profiles produced using the same methodology for the new particle update code, shown below, a larger fraction of total runtime is spent in

Performing a direct comparison between the old, development branch code and the new particle update algorithm shows that

Despite this, the new particle update algorithm still shows promise. In addition to the 3 potential optimizations identified above concerning line 70 of

While instrumenting procedures that would normally be throttled by TAU and have very little work in them is a bad idea if one cares about accurate timings, such an approach is somewhat viable for comparing relative timings, and is completely plausible for comparing hardware counter metrics. In this vein, additional experiments were performed on an HPE SGI 8600 system equipped with Intel Xeon Platinum 8168 (Skylake) CPUs. First, TAU throttling was turned completely off, and instrumentation was created for

The statistics table view for the spinSaturne case is shown below. The portions of the call graph which have been instrumented are shown. Times are given in microseconds. Note the number of calls for each procedure, the

## Next Steps

An obvious first step is to increase the aggressiveness of compiler optimizations. Enabling link-time optimization and more aggressive inlining may reduce some of the overhead associated with making many procedure calls, each having very little work. In addition, some if statements are present in the new particle update code. Since this code represents an inner kernel and is called many times, these if statements have the potential to create conditional branch instructions which may hinder efficient execution.

Beyond testing additional cases and continuing to work with the zgoubi development team, an additional study using ThreadSpotter may yield some additional insight and indicate potential fixes. ThreadSpotter uses runtime sampling (often accompanied by extreme runtime dilation, which does not matter since the code is not being timed) paired with parsing of the x86 opcodes issued by the program to analyze memory and threading issues. It produces a report annotating the original program source code, ranking problems in order of importance, characterizing them, and suggesting fixes explicitly (e.g., fuse the loop on line 27 with the loop on line 43, or try implementing loop tiling/blocking on the outermost loop at line 76). This can typically be done relatively quickly and easily, and the report offers the programmer good, actionable suggestions to improve performance. In addition, iterative performance experiments, guided by profiles produced using TAU Commander, can help guide optimization efforts.

The suggestions offered for improving the line calling

Fundamentally, the algorithms, both the original zgoubi particle update procedure and the new particle update, seem to be viable; however, their implementations should be further examined. Further work collecting hardware performance counters may yield additional insight into the impact of conditional branch instructions and into how efficient cache utilization is.
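To make the discussion of the line-70 `matmul` more concrete, the sketch below shows one way that statement could be restructured as an explicit, unit-stride loop nest. This is a hypothetical rearrangement under assumed array shapes (`dnB(nDim, maxOrder)`, `monomsU(maxOrder)`, `derivB(nDim, nKinds)`); it is not code taken from zgoubi and would need to be checked against the real data layout and then profiled before drawing any conclusions.

```fortran
! Equivalent, per k, to:
!   derivB(:,k) = matmul( dnB(:, 1:orderEnd(k)), monomsU(1:orderEnd(k)) )
! but written as a plain loop nest so the compiler sees contiguous,
! unit-stride accesses over the first index that it can vectorize.
subroutine accumulate_derivB(derivB, dnB, monomsU, orderEnd, nDim, maxOrder, nKinds)
  implicit none
  integer, intent(in)  :: nDim, maxOrder, nKinds
  integer, intent(in)  :: orderEnd(nKinds)
  real(8), intent(in)  :: dnB(nDim, maxOrder)
  real(8), intent(in)  :: monomsU(maxOrder)
  real(8), intent(out) :: derivB(nDim, nKinds)
  integer :: i, j, k

  do k = 1, nKinds
    derivB(:, k) = 0.0d0
    do j = 1, orderEnd(k)
      ! Unit-stride access over i favors SIMD vectorization and prefetching.
      do i = 1, nDim
        derivB(i, k) = derivB(i, k) + dnB(i, j) * monomsU(j)
      end do
    end do
  end do
end subroutine accumulate_derivB
```

Whether this beats the intrinsic `matmul` (or a BLAS `dgemv` call) will depend on the array sizes and the compiler, so it is offered only as a candidate to measure, not a recommendation.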
Transforming the code so that the innermost loop is over the first index in an array of particles (or multiple arrays for different properties of the particles) might lead to improved utilization of SIMD instructions on modern processors, better prefetching accuracy, and better cache utilization. This is analogous to transforming the code to use a Structure of Arrays rather than an Array of Structures (or scalar loops over structures) and will fundamentally change how data is packed in memory; a rough sketch of such a layout is given at the end of this comment.

I have attached the TAU Commander project directory here, which contains the TAU Commander project database as well as the TAU profile files. The profile files are stored in
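As promised above, here is a minimal sketch of the Array-of-Structures versus Structure-of-Arrays idea for the particle data. The type and field names are purely illustrative (they are not zgoubi's actual data structures), and any real transformation would need to be reconciled with zgoubi's existing storage.

```fortran
module particle_layouts
  implicit none

  ! Array-of-Structures (AoS): convenient, but the x of particle i and the
  ! x of particle i+1 are far apart in memory, hindering unit-stride SIMD.
  type :: particle_t
    real(8) :: x, y, z, px, py, pz
  end type particle_t

  ! Structure-of-Arrays (SoA): each property is contiguous, so a loop over
  ! the particle index streams through memory with unit stride.
  type :: particles_soa_t
    real(8), allocatable :: x(:), y(:), z(:), px(:), py(:), pz(:)
  end type particles_soa_t

contains

  ! Illustrative kernel: the innermost loop runs over the particle index.
  subroutine drift(p, dt, n)
    type(particles_soa_t), intent(inout) :: p
    real(8), intent(in) :: dt
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
      p%x(i) = p%x(i) + p%px(i)*dt
      p%y(i) = p%y(i) + p%py(i)*dt
      p%z(i) = p%z(i) + p%pz(i)*dt
    end do
  end subroutine drift

end module particle_layouts
```

With the SoA layout, the drift loop touches each property array with unit stride, which is what enables efficient SIMD loads/stores and accurate hardware prefetching.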