Add Internal Benchmark tooling #1

Open
1 task done
wildintellect opened this issue Feb 7, 2024 · 13 comments
@wildintellect
Collaborator

wildintellect commented Feb 7, 2024

To facilitate tracking of benchmarks internal to the job runs, the MAAP team recommends Scalene for Python code (a minimal wrapping sketch follows the task item below).

  • Wrap existing algorithm in Scalene
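A minimal sketch of what that wrapping might look like, assuming the entry point is a script such as get_dem.py with each algorithm step factored into its own function (the layout and the Scalene flags shown are assumptions, not the actual repository code):

# get_dem.py -- illustrative layout only: each step is its own function so
# Scalene can attribute time and memory usage to it.

def get_dem(bbox):
    """Download and stitch the DEM for the bounding box (placeholder)."""
    ...

def do_computations(dem):
    """Run the compute-heavy step on the DEM (placeholder)."""
    ...

def main():
    dem = get_dem(bbox=(-156.0, 18.8, -154.7, 20.3))
    do_computations(dem)

if __name__ == "__main__":
    main()

# Then run the script under Scalene to produce a profile, for example
# (flag names may vary by Scalene version):
#   scalene --json --outfile profile.json get_dem.py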
@chuckwondo
Contributor

I implemented this on a new branch: https://github.com/MAAP-Project/get-dem/tree/scalene

I deployed a new version of the algorithm as GET-DEM:scalene and ran it with the following inputs:

  • bbox: -156.0 18.8 -154.7 20.3
  • compute: true
  • scalene_args: (left empty to use defaults)

I copied the outputs to /projects/shared-buckets/dschuck/get-dem/

I also tested that I can run it successfully from the ADE.

@chuckwondo
Contributor

Closed too soon. Need PR approved first.

@chuckwondo chuckwondo reopened this Mar 4, 2024
@chuckwondo
Contributor

@nemo794, I made adjustments to produce a JSON profile by default, rather than HTML. (I'm continuing to make adjustments to the scalene branch, which we can pick apart for submitting incremental PRs once #2 lands.)

I also modified the notebook in a manner that should aid our eventual effort to collate/aggregate profiling metrics once we're ready to kick off a slew of jobs. The primary change was to associate names with the sample bboxes so that we can use the names as part of the tags for the jobs we submit, because a job's tag is used to create a directory along the path to the job's output directory.

More specifically, in the notebook, a job's tag (the identifier argument to submitJob, but labeled "Tag" in the Jobs UI) is structured as {bbox_name}__compute__{queue} when compute is True, or {bbox_name}__no-compute__{queue} when compute is False.
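For clarity, a minimal sketch of that tag construction (the function name is illustrative, not the actual notebook code):

def make_job_tag(bbox_name: str, compute: bool, queue: str) -> str:
    # e.g. make_job_tag("Italy", True, "maap-dps-worker-32vcpu-64gb")
    #   -> "Italy__compute__maap-dps-worker-32vcpu-64gb"
    mode = "compute" if compute else "no-compute"
    return f"{bbox_name}__{mode}__{queue}"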

As an example, I ran "Italy", and I copied the output to /projects/shared-buckets/dschuck/get-dem/scalene/Italy__compute__maap-dps-worker-32vcpu-64gb/2024/03/08/07/57/58/466407/ so you can see the results.

Note that the path from get-dem onward is the full structure I copied from /projects/my-private-bucket/dps_output. Also note that the directory Italy__compute__maap-dps-worker-32vcpu-64gb is the job tag.

The information we want to extract from profile.json is within the files/<python filename>/functions arrays (only 1 python file in our case).
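As a rough sketch, walking that structure looks like this (assuming profile.json is available locally; the exact keys inside each function record may vary by Scalene version):

import json

with open("profile.json") as f:
    profile = json.load(f)

# profile["files"] maps each profiled filename to its per-file stats,
# and the "functions" array holds the per-function records we care about.
for filename, per_file in profile["files"].items():
    for record in per_file["functions"]:
        print(filename, record)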

Unfortunately, Scalene does not produce an HTML file corresponding to the JSON file, so it's not immediately clear which stats from the JSON file are the ones that are nicely rendered in the HTML file. My next step is therefore to run another job with the same inputs, but pass arguments to Scalene to produce an HTML file instead. Then I'll compare what I see in the browser for the HTML profile against the stats in the JSON file to match things up. (Obviously, the numbers will differ, but they should be close enough for me to decipher things.)

@chuckwondo
Contributor

@nemo794, it looks like these are the 2 primary values we want to pull out of the JSON profile (a minimal extraction sketch follows the list):

  • elapsed_time_sec (time)
  • max_footprint_mb (space)
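A minimal extraction sketch for those two top-level values:

import json

with open("profile.json") as f:
    profile = json.load(f)

elapsed_time_sec = profile["elapsed_time_sec"]  # total wall-clock time
max_footprint_mb = profile["max_footprint_mb"]  # peak memory footprint
print(elapsed_time_sec, max_footprint_mb)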

What other information might be needed for the large spreadsheet you presented in the recent meeting about gathering get-dem performance figures from both NASA and ESA jobs?

@chuckwondo
Contributor

@nemo794, for comparison between the 2 Italy runs (one producing profile.json and the other profile.html), I've copied the second run to /projects/shared-buckets/dschuck/get-dem/scalene/Italy__compute__maap-dps-worker-32vcpu-64gb/2024/03/08/10/03/45/609726/

@nemo794
Collaborator

nemo794 commented Mar 13, 2024

Hi @chuckwondo, this is great progress! Thank you! We discussed this offline, but for the record, I really like how you modularized each algorithm step into its own function in order to wrap each step individually in the Scalene wrapper.

The DPS outputs are really helpful! A few thoughts:

  • Is Scalene able to generate both HTML and JSON outputs at the same time? The JSON has higher priority because we need to parse it algorithmically, but the HTML is much friendlier for human spot-checking.
  • max_footprint_mb (space)
    • Good call -- this overview metric is helpful. If the number of cores remains constant, I don't see any reason this value should change from run to run... but it will help us select the correct instance types, see the impact of changing the number of cores, and sanity-check that the algorithm isn't doing anything fishy. For now, I don't think we need a finer level of granularity.
  • elapsed_time_sec (time)
    • This appears to be the total time of the algorithm. This is definitely a good metric to have; the MAAP platforms' DPS metrics should capture the same value, so now we'll be able to verify that the platforms are recording that info correctly. Nice!
    • For benchmarking, we'll need a finer level of granularity. It'd be great to have timings for each step. Looking at the JSON, I'm only seeing the percentages of time and total time, but maybe I'm missing something? Here are a few screenshots to help clarify:
      (annotated screenshots of profile.json, showing only the time percentages and the total elapsed time)

After this, hopefully there's a way to get even finer granularity by digging into the sardem step via calling it directly from Python, but one step at a time. :)

@chuckwondo
Contributor

Hi @nemo794, thanks for the great annotated screenshots. They are very helpful.

This perhaps got buried in one of my earlier comments:

Unfortunately, Scalene does not produce an HTML file corresponding to the JSON file

That is, Scalene won't produce both formats for a given run. It's one or the other, sadly.

Regarding, "finer granularity by digging into the sardem step via calling it directly from Python," I forgot to mention that I've already done that. The following shows the replacement of the system call:

(screenshot of the change within get_dem: the sardem system call replaced with a direct Python call)

That's within the get_dem function, so now we are getting the mem+cpu+io information that we were not seeing when we were using the system call.

Regarding the absolute time values vs. the percentages, it seems strange that the profile does not contain the individual elapsed time values. Unfortunately, I think that leaves us with having to do the extra calculations ourselves, but at least we have the necessary numbers to do so.

@chuckwondo
Contributor

Fixed by #6.

@nemo794
Collaborator

nemo794 commented Apr 23, 2024

Let's keep this open until the JSON parser is integrated. Thanks!

@nemo794 nemo794 reopened this Apr 23, 2024
@chuckwondo
Contributor

@nemo794 and @arthurduf, I did a bit more digging into the profiling metrics captured by Scalene.

Unfortunately, I think Scalene's metrics are not providing the information we're looking for, or at least I haven't deciphered it yet, if it's there. I'm going to dig a bit deeper.

The problem is this: neither the percentages shown in the previous screenshot nor those shown in the screenshot below allow us to compute the fraction of elapsed time taken by each function.

To illustrate, the timings printed to _stderr.txt are as follows:

2024-03-08 07:33:55 [get_dem] [INFO] get_dem: 628.8 seconds
2024-03-08 07:33:57 [get_dem] [INFO] read_dem_as_array: 2.5 seconds
2024-03-08 07:57:57 [get_dem] [INFO] do_computations: 1440.1 seconds
2024-03-08 07:57:57 [get_dem] [INFO] main: 2071.4 seconds

The last number, 2071.4, closely aligns with the value of ~2072 for elapsed_time_sec in profile.json. So far, so good. An elapsed time of 628.8 seconds for get_dem is ~30.3% of 2072, and an elapsed time of 1440.1 seconds for do_computations is ~69.5% of 2072, giving us 99.8% of the total, with read_dem_as_array accounting for the remainder.

Unfortunately, neither the percentages circled in the preceding diagram, nor those highlighted below align with these percentages.

I believe the reasons are as follows:

  • The percentages circled in the previous diagram are memory consumption percentages, so it makes sense that they do not correspond to runtime percentages
  • The percentages below are CPU-time percentages, which are related to elapsed time but not equal to it: they do not account for non-CPU (i.e., I/O) time.

Therefore, I'm going to do a bit more digging into the other parts of the profiling metrics to see if the I/O time is accounted for somewhere.

If the I/O time cannot be accounted for within the profile metrics, where does that leave us?

I'm thinking that we could use Scalene's profile simply to obtain max memory usage, but then rely on the output from our own code for capturing the runtime metrics. If that's the case, I recommend we tweak our output to write the runtime values to a separate JSON file because we don't want to attempt to parse the values out of the _stderr.txt file.
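As a sketch of that suggestion (the helper and file names below are illustrative, not the code on the branch):

import json
import time

timings = {}

def timed(name, fn, *args, **kwargs):
    # Record the wall-clock time of one algorithm step under the given name.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[name] = time.perf_counter() - start
    return result

# ... wrap each step, e.g. timed("get_dem", get_dem, bbox) ...

with open("runtimes.json", "w") as f:
    json.dump(timings, f, indent=2)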

(screenshot of the profile with the CPU-time percentages highlighted)

@arthurduf
Collaborator

I'm thinking that we could use Scalene's profile simply to obtain max memory usage
--> Agreed
we don't want to attempt to parse the values out of the _stderr.txt file.
And also, that file does not exist on the ESA DPS.

@chuckwondo
Contributor

Per discussions w/ @nemo794, I have created a branch named gather-profile-metrics for working on gathering metrics from a batch of jobs.

Here's what I've done so far on that branch:

  • Modified get_dem.py to write function elapsed times to a file named elapsed.json (in addition to logging), since Scalene, unfortunately, does not appear to capture this specific metric for each function.
  • Added simplify_profile.py to extract max_footprint_mb and elapsed_time_sec from profile.json and combine them with the elapsed function times from elapsed.json into a file named simple_profile.json (sketched after this list).
  • Added aggregate_profiles.py to aggregate (specifically, compute the mean of) values from a list of paths to simple_profile.json files. It currently prints the mean values to stdout.
  • Added esa/README.md and nasa/README.md with instructions for running the algorithm locally, simulating the ESA env and the NASA env, respectively.
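For reference, here is a sketch of what the simplify_profile.py step amounts to (illustrative only, not the actual script on the branch; the output keys mirror the sample output further below):

import json

def simplify(profile_path="profile.json", elapsed_path="elapsed.json",
             out_path="simple_profile.json"):
    with open(profile_path) as f:
        profile = json.load(f)
    with open(elapsed_path) as f:
        elapsed = json.load(f)  # per-function elapsed times written by get_dem.py

    simple = {
        "max_footprint_mb": profile["max_footprint_mb"],
        "total_elapsed_sec": profile["elapsed_time_sec"],
        **elapsed,
    }
    with open(out_path, "w") as f:
        json.dump(simple, f, indent=2)

if __name__ == "__main__":
    simplify()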

If you want to test this out, you can run everything locally by following the instructions in esa/README.md and nasa/README.md. After successfully following the instructions in both of those files, you should end up with the following files in your output directory:

$ tree --filesfirst output/
output/
├── dem.tif
├── elapsed.json
├── profile.json
├── simple_profile.json
└── esa
    ├── dem.tif
    ├── elapsed.json
    ├── profile.json
    └── simple_profile.json

At this point, you can compute the means by running the following:

./aggregate_profiles.py output/simple_profile.json output/esa/simple_profile.json

This will output the mean values to stdout, which should give you something like the following:

$ ./aggregate_profiles.py output/simple_profile.json output/esa/simple_profile.json 
{
  "max_footprint_mb": 814.0371088981628,
  "total_elapsed_sec": 31.518759727478027,
  "download_and_stitch_sec": 19.908729314804077,
  "compute_sec": 4.638479709625244
}
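For reference, a sketch of the aggregation step (illustrative, not the actual aggregate_profiles.py):

#!/usr/bin/env python3
import json
import sys
from collections import defaultdict

def main(paths):
    # Sum each metric across all simple_profile.json files, then divide by the
    # number of files (assumes every file contains the same set of keys).
    sums = defaultdict(float)
    for path in paths:
        with open(path) as f:
            for key, value in json.load(f).items():
                sums[key] += value
    means = {key: total / len(paths) for key, total in sums.items()}
    print(json.dumps(means, indent=2))

if __name__ == "__main__":
    main(sys.argv[1:])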

@chuckwondo
Contributor

@nemo794, okay to consider this closed now?
