As a developer, I want to investigate the performance benefits (if any) of feature-batched retrieval #270
Comments
Original Redmine Comment One of the possibilities opened up by the refactoring for #95438 is to allow feature-batched retrieval. This is practically the same as multi-feature pooling, only the batched pool is then decomposed into its constituent 1-feature pools. It is almost certainly true that retrieval of time-series data for N features at once does not take N times as long as retrieval of N features separately - it is probably closer to 1:1. Thus, there may be a performance advantage for multi-feature evaluations w/r to retrieval time. But there's no free lunch: batching would necessarily increase the risk of an oome in the same way that feature pooling increases that risk (only the batch sizes would be calibrated to reflect that - edit: perhaps even dynamically if we were clever, but then cleverness means extra code, more brittleness). |
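For illustration, feature-batched retrieval as described here amounts to partitioning the singleton features into batches, retrieving each batch with one query, and then decomposing the batched result back into single-feature pools. A minimal sketch of that shape, with hypothetical types and names (not the actual WRES code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class FeatureBatchingSketch
{
    /** Partitions the features into batches of at most batchSize. */
    static <F> List<List<F>> batch( List<F> features, int batchSize )
    {
        List<List<F>> batches = new ArrayList<>();
        for ( int i = 0; i < features.size(); i += batchSize )
        {
            batches.add( features.subList( i, Math.min( i + batchSize, features.size() ) ) );
        }
        return batches;
    }

    /** Decomposes one batched retrieval into its constituent single-feature pools. */
    static <F, T> Map<F, List<T>> decompose( List<T> batchedSeries, Function<T, F> featureOf )
    {
        Map<F, List<T>> byFeature = new LinkedHashMap<>();
        for ( T series : batchedSeries )
        {
            byFeature.computeIfAbsent( featureOf.apply( series ), f -> new ArrayList<>() )
                     .add( series );
        }
        return byFeature;
    }
}
```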
Original Redmine Comment This is analogous to feature-batched retrieval from nwis or wrds on the reading side, I guess. |
Original Redmine Comment Consider this retrieval script from #96951, which contains a feature group of 130 features (not in prepared statement form).
Warmed up, the average of 5 retrievals is around 60ms. Now consider the same script with a single location. The average-across-five retrieval time after warm-up is about 40ms. No real surprise, but it emphasizes the dramatic speed-up in retrieval times that can be expected for multi-feature evaluations by feature grouping (regardless of whether the evaluation involves feature pooling or singleton feature tuples): 130 features retrieved together cost well under 1ms per feature, versus around 40ms for a feature retrieved on its own. |
Original Redmine Comment In other words, think wpod-style evaluations. With feature-batched retrieval already enabled for feature pooling, the time to implement this in a minimal way would be fairly small. However, in order to avoid arbitrarily large evaluations in memory, we'd probably want to extend the @retriever@ api to allow for retrieval from a batched retriever by feature tuple, thereby providing an interface to the underlying @ResultSet@ to acquire only some of the data from it, as needed. This is a little different from the solution to #95488, which is true streaming or deferred reading, because in this case we only want to read part of the @ResultSet@ (edit: whereas #95488 is about reading all time-series, one by one, each one generating a pull from the @ResultSet@). |
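A hedged sketch of what such an api extension might look like; the interfaces and method names below are assumptions for illustration, not the actual WRES @Retriever@ api. The idea is simply that a batched retriever exposes an extra method that pulls only one feature tuple's series from the shared @ResultSet@, as needed:

```java
import java.util.stream.Stream;

/** Hypothetical retriever abstraction; not the actual WRES api. */
interface Retriever<T>
{
    /** Retrieves all time-series covered by this retriever. */
    Stream<T> get();
}

/** A feature-batched retriever that can also serve one feature tuple at a time. */
interface FeatureBatchedRetriever<F, T> extends Retriever<T>
{
    /**
     * Retrieves only the time-series for the prescribed feature tuple from the
     * underlying (batched) result set, reading it incrementally as needed.
     */
    Stream<T> get( F featureTuple );
}
```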
Original Redmine Comment I want to look at this for a couple of reasons. First, the wpod (large-in-space, short-in-time) evaluations are a significant bottleneck in cowres, some of them taking up to 10-12 hours and disrupting other (e.g., hefs) evaluations through interleaving ingest and retrieval, thereby reducing throughput overall and contributing to outliers. Around one-third to one-half of the total time is spent in retrieval, the rest in reading/ingest, of which most seems to be wrds. Second, this should be relatively easy to achieve, given the way retrieval is now abstracted and now that feature pooling is implemented, because it essentially amounts to feature pooling on the front end followed by unpooling/decomposition on the back end. This type of optimization can be implemented in the @poolfactory@, which decides how best to retrieve and cache data for pooling with the aid of a @RetrieverFactory@. Necessarily, this involves a trade-off between efficient retrieval (scripts that retrieve more features at once) and increased memory usage (more features in memory at once), so it will need to be configurable on the basis of what we actually see in production, with conservative initial assumptions. In general, large-in-space evaluations are short in time and vice versa, simply because large-in-space and long-in-time evaluations are impractical/intractable. |
Original Redmine Comment The idea would be to have a threshold at which feature-batched retrieval kicks in: say, an evaluation that contains 10 or more singleton feature groups (groups that are not singletons are already feature-batched) gets its features batched. Then there would be a batch size, say 10 features per batch (or the smaller of that number and the total features in the evaluation, since the number would be configurable). These two things would be configurable in @wresconfig.xml@ and overridable with runtime system properties. My expectation is that they would be quite easy to set in practice and would rarely, if ever, get touched, but they could be in the mix with ram increments as "what shall we tweak to make best use of this extra ram?". |
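A minimal sketch of how the two settings might gate batching, using the @featureBatchThreshold@ and @featureBatchSize@ names that appear later in this ticket; the surrounding class is illustrative, not the @poolfactory@ code:

```java
/** Illustrative only: decides whether and how to feature-batch retrieval. */
final class FeatureBatchSettings
{
    private final int featureBatchThreshold; // e.g., 10 singleton feature groups
    private final int featureBatchSize;      // e.g., 10 (or 50, the value eventually chosen)

    FeatureBatchSettings( int featureBatchThreshold, int featureBatchSize )
    {
        this.featureBatchThreshold = featureBatchThreshold;
        this.featureBatchSize = featureBatchSize;
    }

    boolean shouldBatch( int singletonFeatureGroups )
    {
        return singletonFeatureGroups >= this.featureBatchThreshold;
    }

    int effectiveBatchSize( int totalFeatures )
    {
        // The smaller of the configured size and the features actually present.
        return Math.min( this.featureBatchSize, totalFeatures );
    }
}
```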
Original Redmine Comment #99040 |
Original Redmine Comment Made a decent amount of progress with this. Indeed, it was relatively easy to achieve. Need to do some sensibility checking of the debug logging on @wres.io.retrieval@ and elsewhere to make sure the expected feature-batched retrievals are happening without any duplication. Also need to expose the two parameters in the @wresconfig.xml@ and choose initial values. Then need to top-level-integration-test w/ some large-in-space evaluations to see what performance gains are accrued. I think the largest evaluation (edit: spatially) I have locally is scenario703, so I might need to get an explicit list of features for a wpod evaluation. Hank, any chance you could help with an explicit list of features from a wpod evaluation? I think there are around 2k, from memory. I'm not sure whether they use a wrds region definition or list them explicitly. Thanks! |
Original Redmine Comment James, Sure thing. I'll see if I can get to it later this morning, Hank |
Original Redmine Comment James, The list below is what I generated to support the dStore evaluation of NWM forecasts for the performance tests. Dave/Juzer, I believe, use the WRDS tag "usgs_gages_ii_ref_headwater" and this was constructed based on that tag. Good enough? Hank =============================================================
|
Original Redmine Comment Thanks, that looks ideal. I probably don't have sufficient nwm data locally to test this for a wpod-style evaluation (I think scenario703 has around a day, whereas wpod is 30 or 60 days, I think). Still, it should provide an indication of the speed-up, percentage wise - we will need to uat on the wpod-style evaluation anyway, also to check that the batching parameters (specifically, the batch size) are reasonable. |
Original Redmine Comment Doing some test runs now, several with no batching, several with batches of 100 features to begin with. A 100-feature batch may not be conservative enough w/r to ram usage, tbd (we can use a wpod evaluation that contains 60 days of data or whatever as a better guide). |
Original Redmine Comment The retrieval portion of this test is pretty quick - around a minute or so. To reiterate, it would be good to test with something at the other end of the spectrum too, i.e., 60 days or so. Still, the spread around the results is pretty tight, so it should give an indication, percentage wise. |
Original Redmine Comment Some results for the evaluation portion of each execution (which includes retrieval). Three runs, no feature batching:
Three runs, batches of 100 features:
Mean without batching: PT86.36100833S. In other words, it is around 35% faster with 100-feature batching when compared to v5.15 behavior. Not bad. How that scales to heavier retrievals (same number of features) and to other batch sizes is tbd. Likewise, ram usage is a variable/trade-off. Given the very small coefficient of variation (around 4% at most), one run is probably good enough to test the various possibilities, no need to average over N. |
Original Redmine Comment Trying with 50-feature batches. |
Original Redmine Comment Around 52 seconds for 50-feature batches (2 instances), so just as good. |
Original Redmine Comment Looking at the logging of the retrievals, it makes sense: I see the retrievals happening in clumps and they encapsulate the expected feature batches. One thing to bear in mind is that, just because pool tasks are submitted to an executor service in a particular order, that doesn't mean the underlying pull on the data will happen in exactly the same order, so it is possible that pools could hang around in memory longer than would be strictly necessary (were they completed in exactly the order they were submitted). A pool will only become eligible for gc when there are no more references to it. This could probably be tightened up, but most likely at the expense of throughput. However, it is also a reason to be conservative with the batch sizes, because there will probably be some outlier scenario that uses more memory than expected. |
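To illustrate the ordering point with a generic example (not the WRES pool executor): tasks submitted to an executor in one order may complete in another, so a pool's retrieved data can stay referenced, and therefore ineligible for gc, longer than the submission order alone would suggest.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PoolOrderingSketch
{
    public static void main( String[] args )
    {
        ExecutorService poolExecutor = Executors.newFixedThreadPool( 6 );

        // Pools are submitted in order 0..11, but they may complete in any
        // order across the 6 threads, so a pool's data can remain referenced
        // (not gc-eligible) longer than the submission order suggests.
        List<CompletableFuture<Void>> pools = IntStream.range( 0, 12 )
                .mapToObj( i -> CompletableFuture.runAsync( () -> evaluatePool( i ), poolExecutor ) )
                .collect( Collectors.toList() );

        CompletableFuture.allOf( pools.toArray( new CompletableFuture[0] ) ).join();
        poolExecutor.shutdown();
    }

    private static void evaluatePool( int poolNumber )
    {
        // Placeholder for retrieving and evaluating one feature-batched pool.
        System.out.println( "Completed pool " + poolNumber + " on "
                            + Thread.currentThread().getName() );
    }
}
```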
Original Redmine Comment Can't detect any difference in memory profile for no batching vs. a 500-feature-batched instance of the same evaluation, but then these are very tiny retrievals, so that is not too surprising. Again, instrumentation of a 60-day evaluation will reveal more. The spaghetti after the quiescent period is the evaluation/retrieval portion. Unbatched: !no_batching.png! Batched (500 features per batch): !500_batching.png! |
Original Redmine Comment Adding the two parameters to the system settings. But, in looking at the system settings, I see that many of the supplied settings are only partially validated. For example, it looks like negative thread counts would be accepted. Separate ticket. |
Original Redmine Comment System settings and overrides work fine. |
Original Redmine Comment Initially setting the default threshold to 10 features (an evaluation with at least this many singletons will be feature-batched) and the default feature batch size to 50. Trying the system tests. |
Original Redmine Comment Tests pass. |
Original Redmine Comment Pushed in commit:wres|ef9f7d7f1c3b9894855e1e86f355c651bb50cca4. I doubt this will make it into v5.16. Regardless, it will need UAT. The main thing is to experiment with a WPOD-style evaluation, both to see how much faster the retrieval portion of the evaluation might be and to tweak the batch size, if necessary. The experiments above suggested a performance gain of about 35% w/r to retrieval time for an evaluation that contained ~900 features, but only 24 valid times per pool; this could vary with shape of evaluation. On hold, pending UAT. |
Original Redmine Comment Probably 5.17, unclear. |
Original Redmine Comment See #99040-223 and #99040-224. James, You asked for this in #99040-225:
Check_MK isn't giving me access to -ti03, so I am limited to whatever I can get from the command line. I already provided a @top@ result:
You can see it's using 9.5% of 32 GB, or about 3 GB. It's allocated "Max Memory: 2493MiB", per the logging I shared in #99040-223. I'm probably just doing the math wrong, but regardless it appears to be at its upper limit. Let me search for better tools to do a memory profile, Hank |
Original Redmine Comment We don't have @jstat@ available on the -ti03 and the @pmap@ provides no information. I know we've been using some other Java tools on these processes, so I'll search the tickets or wiki for the proper command. I don't run them often enough to remember, apparently. Given how slow it's progressing, after getting a memory profile, I think I should stop the evaluation, set the batch size to 1, and start it again to get a baseline for performance. If that runs as before, and it should, then we can discuss deploying the batch size of 1 to production for 5.16 to get it out the door, and then do a proper study of the optimal batch size for 5.17. Thoughts? Hank |
Original Redmine Comment If you look at the jfrs under @/mnt/wres_share/heap_dumps/wres/@, you may find a temporary directory that contains chunked files that can be loaded into jmc (edit: that cover the evaluation in progress). For a longer series, you could copy a bunch of them and merge/finalize them. At the end of a run, these files are merged into a final jfr. Whether the slow performance is down to feature batching or something else is really tbd. A memory profile will help. The loading of the db machine looks pretty low. High memory usage and high gc on the app machine is a possibility. Tweaking the batch size for the wpod evaluations was a recognized thing to do as part of the uat (#95867-30), so I don't think we should defer that. After getting a baseline with a batch size of 1, we can increase to 10. How much data are we talking about here? 60 days, hourly valid times (aggregated to three-hourly), forecasts issued 4x per day, 7 members, edit: and a forecast horizon of 8.5 days (but these are pooled into 3 hours per pool, I think)? ( Then multiplied by 50 locations per retrieval, 6 pooling threads at once. ) |
Original Redmine Comment Your numbers look right. Add to it 204 hours of lead times for each forecast. There is no jmc on the path in -ti03. Ugh. Do we have a wiki somewhere explaining how to make use of those .jfr files? Looking, Hank |
Original Redmine Comment ( Right, edited to include the horizon after you read it. ) Can you do an ls in that jfr dir and report back? Do you see some subdirs that contain chunked jfrs? |
Original Redmine Comment Back of the envelope. 3 (valid times per lead duration pool) * 7 members * 4 (times per day) * 60 (days) * 50 (locations) * 6 (threads) = (up to) 1,512,000 forecast values in memory. 8 bytes per forecast value. All the observations on top of that as wrapped doubles, 24 bytes each. Plus some other overhead, of course (array definitions, instants etc.). Regardless, that's tiny. What factor(s) am I missing? |
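The same back-of-the-envelope estimate written out, with the counts taken from the comment above; the per-value costs and the one-observation-per-valid-time count are rough assumptions:

```java
class BackOfEnvelopeSketch
{
    public static void main( String[] args )
    {
        long validTimesPerPool = 3;  // valid times per lead-duration pool
        long members = 7;            // ensemble members
        long issuesPerDay = 4;       // forecasts issued per day
        long days = 60;
        long locations = 50;         // locations (features) per retrieval batch
        long poolingThreads = 6;     // batches potentially in memory at once

        long forecastValues = validTimesPerPool * members * issuesPerDay
                              * days * locations * poolingThreads;   // 1,512,000

        long forecastBytes = forecastValues * 8;             // primitive doubles
        long observationValues = forecastValues / members;   // assumption: one obs per valid time
        long observationBytes = observationValues * 24;      // wrapped doubles, ~24 bytes each

        System.out.printf( "%,d forecast values (~%,d bytes) plus ~%,d bytes of observations%n",
                           forecastValues, forecastBytes, observationBytes );
    }
}
```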
Original Redmine Comment Forgot to share the declaration. With domains omitted:
|
Original Redmine Comment First, it appears that the evaluation I'm using may be a good one to use to test out batch size once we have this issue worked out. The production evaluation Anna started spent only about 5 minutes in retrieval, while the one I just performed using staging took about 18 minutes, as reported above. This is not apples to apples, but it does point to a need to do at least one run each of batch size 50 and batch size 1 for the Anna evaluation once we've figured out what's going on with the difference in outputs. Speaking of which... This appears to be a real problem. If I take the declaration I just shared and shorten it to a day,
I see this using production:
and this using the latest revision:
Again, it appears as though the baseline results are overwriting the right-side results. I'm going to report this in the 6.8 deployment ticket, since it will hold up deployment. I'll also ask which ticket this likely relates to. Hank |
Original Redmine Comment To recap, the remaining work is to do some extra testing in -ti with a feature batch size of 50 versus no feature batching. |
Original Redmine Comment Dear Hank, Don't forget about this! Yours truly, Hank |
Original Redmine Comment I've deployed @featureBatchSize=50@ to staging, confirmed it was 50 in a smoke test, and have started the evaluation that reproduces #108993. Then I realized it was the retro sim evaluation that I needed to run. D'oh! Hank |
Original Redmine Comment I've stopped the #108993 job. For the #108361 job, here was the official reported retrieval/computation time:
The staging job testing with @featureBatchSize=50@ is 4300312082155091194. The earlier job was posted to production, so I may need to run this in staging with @featureBatchSize=1@, later, for a proper comparison. Hank |
Original Redmine Comment I managed to post the job before my VPN connection went down. However, I won't be able to check on progress until it comes back. Hank |
Original Redmine Comment The job I posted (the retro sim evaluation from #108361) failed due to OOME:
I can increase the -Xmx as an experiment on Monday, but if there is anything you want me to look at before then, let me know. I'll check it out Monday morning or so. Have a great weekend! Hank |
Original Redmine Comment No, this is just the balance to strike, batch size vs. memory. This evaluation is on the larger side per feature/pool. Although they aren't forecasts, let alone ensemble forecasts, there is also no pooling by lead duration to make the pools smaller. |
Original Redmine Comment Currently, the @mem_limit@ for the workers is 3584 MB. There appears to be no @-Xmx@ option specified in the Java options for the worker, but I see this in a smoke test:
I'm not clear on where those computations come from. Some sort of default for Java? Anyway, it seems significantly lower than the 3584 MB @mem_limit@ for the container, though some of that has to be reserved for the worker-shim, of course. The total RAM on the COWRES machines is about 32 GBs. The services are assigned @mem_limit@ values as follows: Entry Machine: |container|mem_limit each (MB)|total (MB)| Workers-only Machine: |container|mem_limit each (MB)|total (MB)| I'm thinking about setting the @-Xmx@ to "4096m" for the workers and upping the worker container @mem_limit@ to 5120 MB, allowing for 1024 MB of overhead for the worker-shim. That will still fall well under the 32 GB RAM on the machines. Thoughts? Hank |
Original Redmine Comment Hank, Yes, we have default jvm args set for the wres app in the @build.gradle@:
Yes, there is an overhead for the worker shim container, including the os. As to whether we set these higher, we can consider that, certainly, but I guess that is a separate ticket. Of course, if you want to experiment with a larger setting for the worker in the context of this ticket in order to retain the feature batch size of 50, you could refresh the app default args in the docker options (@INNER_JAVA_OPTS@) temporarily, but we would not want to do it that way, eventually (not good to override the options in multiple contexts and rely on the last set being the set used, even though it typically is). Cheers, James |
Original Redmine Comment Just as an experiment, I wanted to override the settings and see what happens. Looks like I might have found an unexpected NPE:
The NPE happens when @somethingFound@ is set equal to @outputDirectoryWatchService.poll( 1, TimeUnit.SECONDS );@. I've never seen this before. I'll take a quick look to see if the problem is obvious, but otherwise, this might need to be a new ticket. Ugh. Hank |
Original Redmine Comment In any case, a new ticket needs to be created: there needs to be a null check on @outputDirectoryWatchService@. Hank |
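A minimal sketch of the null guard being suggested; the field and surrounding structure are assumptions based on the snippet quoted above, not the actual @JobOutputMessenger@ code:

```java
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

class OutputWatcherSketch
{
    private WatchService outputDirectoryWatchService; // may be null if registration never happened

    WatchKey pollForOutput() throws InterruptedException
    {
        // Guard against the watch service never having been created, e.g.,
        // because the engine failed to start and no output directory exists.
        if ( this.outputDirectoryWatchService == null )
        {
            return null;
        }

        return this.outputDirectoryWatchService.poll( 1, TimeUnit.SECONDS );
    }
}
```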
Original Redmine Comment I'm backing out my change and trying to bring up the staging COWRES, again. I'm not sure how my change could cause the NPE, however. Hank |
Original Redmine Comment Appears to be the case. I had simply added "-Xms=4096m -Xmx=4096m" at the beginning of @INNER_JAVA_OPTS@ for the worker. I also upped the @mem_limit@ to 5120m. Why would that cause the NPE? Hank |
Original Redmine Comment I'm abandoning my experiment to increase the memory in staging, because my first attempt led to errors in the worker, and I have other work I need to focus on. Would it be worthwhile for me to reduce the @featureBatchSize@ to, say, 20, and try it again? Anyway, turning my attention elsewhere for the next couple of hours at least, Hank |
Original Redmine Comment I don't think that exception in the worker shim is anything to worry about. If you look through the worker shim logs, you will see that the polling of the output dir is routinely interrupted. I see a ton of those interruptions when I deploy locally. It is documented in the @JobOutputMessenger@ as "routine". It may be ugly, but it doesn't mean the service is broken. |
Original Redmine Comment It was a syntax error in the @-Xmx@ and @-Xms@ settings. They don't take an '=' sign (i.e., @-Xmx4096m@, not @-Xmx=4096m@). To be clear, my mistake caused the WRES engine to fail to start, which likely led to the issue attempting to "observe" the output folder for files. The output messaging from the worker-shim basically told me nothing useful:
After fixing my syntax, the smoke test passes. I'll start the test run, Hank |
Original Redmine Comment OK, that makes more sense. |
Original Redmine Comment Yeah, I've made that mistake multiple times over the years, too. The '=' sign just seems more intuitive to me. Anyway... Next experiment is job 224927122560749953 posted with additional memory:
Hank |
Original Redmine Comment I accidentally interrupted the job 224927122560749953 by bringing down its worker. The job was immediately picked up by another worker and has started from scratch. That means I may not have anything to say about this job until tomorrow morning given how long it usually takes. Hank |
Original Redmine Comment The job OOMEd again:
More experimentation tomorrow. Thanks, Hank |
Original Redmine Comment As an aside, I don't think you need more than 6*50 features to reproduce an oome because there are 6 pooling threads and 50 features per thread. That might make it easier to test. Also, you can probably do some kind of back-of-the-envelope calculation based on the number of pairs per pool and the cost in memory of each pair, roughly. Otherwise, this is going to be a very tedious experiment :-) |
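A rough sketch of that kind of back-of-the-envelope calculation; every input below is a placeholder to be read off a real run, and the per-pair cost in particular is an assumption:

```java
class BatchSizeEnvelope
{
    public static void main( String[] args )
    {
        long pairsPerFeature = 500_000;  // pairs retrieved per feature (placeholder)
        long bytesPerPair = 100;         // rough in-memory cost of one pair, boxing included (assumption)
        long poolingThreads = 6;         // feature batches potentially in memory at once
        long heapBudgetBytes = 4L * 1024 * 1024 * 1024; // e.g., with -Xmx4096m

        long bytesPerFeature = pairsPerFeature * bytesPerPair;
        long maxBatchSize = Math.max( 1, heapBudgetBytes / ( poolingThreads * bytesPerFeature ) );

        System.out.printf( "Roughly %d features per batch fit within the heap budget%n",
                           maxBatchSize );
    }
}
```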
Original Redmine Comment All that said, this is probably a great evaluation for setting the limits of memory and/or batch size. It's possible we will encounter bigger ones in reality and an oome is kind of a bad outcome in the wild, so we probably want to be conservative about the batch size, but this use case will help a lot, I think. |
Original Redmine Comment I don't want to have to refer to other tickets to find the declaration I'm using to stress test the COWRES in staging. I've pasted it below. This is the evaluation that has OOMEd in staging with a @featureBatchSize@ of 50 even with the memory upped to @-Xmx4096m@. I plan to discuss next experiments during today's meeting. Hank =====================================================================
|
Original Redmine Comment Notes from Dev Call: This style of evaluation, being a real use case, is reasonable to use to assess the balance. We do have significantly more memory that we can use; we can up the amount of RAM per worker, making sure the container @mem_limit@ is upped accordingly. We can also bring down the feature batch size, but keep an eye on the compute/retrieval time improvement. First run using a @featureBatchSize@ of 20 today errored out due to unrelated issues. I'll report a ticket before COB today. Hank |
Original Redmine Comment Using a @featureBatchSize@ of 20 also failed. Evidence from the start of the run: @featureBatchThreshold=10,featureBatchSize=20@ Result:
It did appear to complete about 100 pools before bombing out, so maybe it's not far from running. I'll try it with the increased memory amount in a bit. I have an evaluation running that will allow me to check if there is an option to select the metric displayed in the map, and I don't want to interrupt it. The next experiment for this ticket should be later this morning. Hank |
Original Redmine Comment FYI... I just started another #108993 reproduction using the 20 feature batch size. I don't think that's really going to test much, since it had no problem at 50. Just saying. I'll start another batch size = 20 run with more memory on Tuesday, my only workday next week, and then continue the experiments after Thanksgiving. Hank |
Original Redmine Comment I was supposed to start the next experimental run, batch size 20 with more memory, before I left today. That won't happen. Staging is currently busy with #110224. Hank |
Original Redmine Comment I think this is blocked by #110660. |
Author Name: James (James)
Original Redmine Issue: 95867, https://vlab.noaa.gov/redmine/issues/95867
Original Date: 2021-09-03
Original Assignee: James
Given an evaluation that contains N singleton features, such as scenario703
When I consider how to evaluate it performantly
Then I want to consider feature-batched retrieval for N/M features at once
Related issue(s): #286
Redmine related issue(s): 98818, 99120, 99338, 99680, 99719, 99827, 99932, 99964, 99980, 109030