Batch and interactive results differ #660
Thank you for the report, Gene.
Unless you have already verified it, I'd make sure that the environments are indeed the same by printing them with |
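For anyone reproducing that check, a minimal sketch (file names hypothetical, csh/tcsh assumed since that is the shell used by the STAR jobs) of one way to dump and diff the two environments:

```csh
# On the interactive node:
env | sort > env_interactive.txt
set | sort > set_interactive.txt

# In the batch job (condor / scheduler / CRS), produce the same dumps with
# different names, e.g. env_condor.txt and set_condor.txt, then compare:
diff env_interactive.txt env_condor.txt
diff set_interactive.txt set_condor.txt
```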
Environments: I did check this earlier today, both for "set" and "setenv", and there are a lot of differences. I tried looking for things that were common across my batch tests but different for my interactive tests. Here are the environment variables that match that pattern from what I saw:
An example of something that didn't match the pattern of interest: PATH is different for the CRS job, but is consistent between the condor, STAR scheduler, and interactive jobs. Nothing from "set" matched the pattern of interest. When running the job in condor, I found a patch of code in the .csh files that the STAR Scheduler generates which properly sets up the $HOME environment variable. So you have options for running batch:
The STAR Scheduler places a whole bunch of other stuff in the csh file that you don't need for this test. So for option 2, you could chop off everything except the $HOME setup and the few lines of user code (that's what I did when I submitted directly to condor; a stripped-down sketch follows below). Regardless, here's a STAR-scheduler submission xml file,
Then simply execute |
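Not the actual file from this exchange, but a minimal csh sketch of what option 2 (a stripped-down scheduler .csh submitted directly to condor) could look like; the home directory, library release, and chain options are placeholders:

```csh
#!/bin/csh
# Keep only the $HOME setup the scheduler-generated .csh provides,
# plus the few lines of user code. Everything below is a placeholder sketch.
setenv HOME /star/u/<username>   # batch jobs may not inherit the login $HOME

starver SL23e                    # select the library release used in the tests
root4star -b -q 'bfc.C(...)' >& test_condor.log
```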
Thank you for the additional information, Gene.
Hi, Dmitri
On Feb 27, 2024, at 1:00 PM, Dmitri Smirnov ***@***.***> wrote:
Could you remind me if we should expect numerical differences between optimized and non-optimized libraries?
I can't remind you because I can't remember that detail. But it seems plausible that optimization could lead to different rounding errors in the least significant bits from performing the math differently (more efficiently), and that could push some value to one side or the other of a threshold. I'm not sure to what degree we should see that in track counts.
-Gene
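As a concrete illustration of the last-bit effect being described (standard IEEE-754 double precision, not taken from the STAR code), merely changing the order of operations changes the final bit of the result:

$$(0.1 \oplus 0.2) \oplus 0.3 = 0.6000000000000001 \neq 0.6 = 0.1 \oplus (0.2 \oplus 0.3),$$

where $\oplus$ denotes rounded double-precision addition. If a selection cut happens to sit between two such values, a track can be counted in one build and not in the other.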
I do see some significant difference in the counts when I set/unset NODEBUG. Specifically, using the same test as above, I get 9801 and 9812, respectively. In both cases I run on an interactive node, and the NODEBUG variable is the only switch I toggled. According to the logs, the libraries are picked up from the expected location. One difference in the logs which I wouldn't expect is these lines:
Could this be the reason for the difference in observed counts?
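For reference, a sketch of how that comparison can be set up in the STAR csh environment (chain options and log names are placeholders, not the actual test):

```csh
# Optimized libraries: NODEBUG set
setenv NODEBUG yes
starver SL23e
root4star -b -q 'bfc.C(...)' >& counts_optimized.log

# Debug (non-optimized) libraries: NODEBUG unset, same node, same macro
unsetenv NODEBUG
starver SL23e
root4star -b -q 'bfc.C(...)' >& counts_debug.log
```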
Following the observation from @plexoos, I checked on the differences in … However, this modification results in no impact on the track counts. My investigation of |
Other than the environment variables, I would check contents of |
Yes, but what difference do you expect? A different architecture? 🙂 Also, I think it would be a bad joke if SDCC provided "incompatible" (in whatever sense...) machines for interactive and farm nodes. I already checked the system library versions; libc and the C++ libraries all appear to be identical... Here are a couple of other unlikely things I can think of...
|
That is to check whether a different microarchitecture is used, yes.
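One way to capture and compare the vectorization-related CPU details on the two kinds of nodes (file names are hypothetical):

```csh
# Run on each node, adjusting the output file name accordingly:
grep -m1 'model name' /proc/cpuinfo
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | egrep '^(sse|ssse|avx|fma)' | sort > cpu_flags_interactive.txt

# After collecting the equivalent file from a batch node:
diff cpu_flags_interactive.txt cpu_flags_batch.txt
```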
Two things... First, /proc/cpuinfo is available here if you want to look; there are a lot of differences:
Second, I ran the chain with "debug2" and found that there are no differences in the TPC hits, but there are differences beginning in the CA track seed finding. By putting in a few print statements, I was able to conclude that...
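For completeness, a sketch of that kind of comparison, i.e. running the same chain with the "debug2" option interactively and in batch and diffing the logs (event count, chain options, and file names are placeholders):

```csh
# Interactively:
root4star -b -q 'bfc.C(10,"<chain options>,debug2","input.daq")' >& debug2_interactive.log

# Run the identical command inside the condor job, writing debug2_condor.log, then:
diff debug2_interactive.log debug2_condor.log | less
```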
The code inside TPCCATracker is awash with |
I've done a brief review of the Vc library code and observed that it includes checks for CPU vectorization capabilities. It also appears that the code can distinguish between CPUs manufactured by AMD and Intel. I'm uncertain whether this information is actually used to select different code at runtime, or what the rationale would be, but it seems possible, since I tried to eliminate other variables by running in a container. Specifically, my tests consistently yield differing results when executing the test job within the container /cvmfs/singularity.opensciencegrid.org/star-bnl/star-sw:SL23d on Intel and AMD CPUs, respectively. It may be worth mentioning that the Vc code we're currently using was released at least a couple of years before the CPU models used in the test.
For the record, the command executing the test in the container:
|
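The actual command is not reproduced in this extract; purely as a hypothetical illustration of running such a test in the image named above (the bind mounts, library setup, and macro invocation are all assumptions):

```csh
singularity exec -B /cvmfs -B /star \
    /cvmfs/singularity.opensciencegrid.org/star-bnl/star-sw:SL23d \
    csh -c 'starver SL23d; root4star -b -q "bfc.C(...)"' >& container_test.log
```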
I discussed running interactively on a node reserved for batch (the "spool" nodes) with some SDCC folks and we got this done. Conclusion: running interactively on that node gave the same result as running in condor. That fits with the idea that vectorization is performed slightly differently for the Intel processors on the batch nodes than for the AMD processors on the interactive nodes. In that case, this is probably not worth pursuing much further, and I'll close the issue in a few days if no one has any further ideas/comments. Thanks, @plexoos, for spending some time on this too.
I've found that running a test job in batch (condor directly, condor through the STAR Scheduler, or CRS [for starreco only]) gets different results than running the job interactively, for things printed in the log files like track counts. The implication is that something is different in the batch environment. I should note that the batch jobs all get the same results as each other.
Things I've tested:
…the genevb and starreco accounts, using SL23e optimized. I see the same patterns regardless, and genevb and starreco are identical to each other.

Summarizing these observations: there are various comparisons for which rounding differences cause some slight differences in results, but batch vs. interactive execution should not lead to such differences. Yet in only one comparison test did I see batch and interactive identical to each other (64-bit unoptimized).
Test job:
A simple thing to check:
...gives these results:
interactive:
condor:
STAR scheduler:
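The specific check and numbers are not shown in this extract; as a purely hypothetical illustration of this kind of side-by-side comparison, one could grep a count out of each platform's log:

```csh
# Hypothetical log names and search pattern; print a simple count per platform.
foreach f ( interactive.log condor.log scheduler.log )
    echo "== $f =="
    grep -c 'track' $f
end
```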