The exchange probability gap is too large when using GPU for HREX #1177

Open
wu222222 opened this issue Jan 17, 2025 · 4 comments

@wu222222

Dear plumed users:

This is my configuration:

GROMACS version:    2020.7-plumed-2.9.2
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-2.5.0
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 11.4.0
C compiler flags:   -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 11.4.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops SHELL:-fopenmp -O3 -DNDEBUG
CUDA compiler:      /usr/local/cuda-12.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2023 NVIDIA Corporation; Built on Tue_Jun_13_19:16:58_PDT_2023; Cuda compilation tools, release 12.2, V12.2.91; Build cuda_12.2.r12.2/compiler.32965470_0
CUDA compiler flags: -gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_75,code=compute_75;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops SHELL:-fopenmp -O3 -DNDEBUG
CUDA driver:        12.20
CUDA runtime:       12.20

When performing Hamiltonian Replica Exchange (HREX), I set the scaling factor for all replicas to 1.0, so theoretically the exchange probabilities should all be 1.0. Indeed, when running on the CPU the exchange probabilities are all 1.0, but with GPU acceleration they vary significantly:
Replica exchange statistics

Repl  399 attempts, 200 odd, 199 even
Repl  average probabilities:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .92  .92  .18  .25  .92  .92  .21  .22  .92  .91  .92
Repl  number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      185  177   39   56  182  187   43   48  176  182  189
Repl  average number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .93  .89  **.19**  .28  .91  .94  .22  .24  .88  .91  .94

Is this normal? I recall that limited floating-point precision on GPUs can lead to significant errors, but would such large discrepancies in exchange probabilities affect the final results? If not, why not?
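
(For reference, a sketch of the textbook Metropolis criterion for replica exchange, not specific to this GROMACS/PLUMED patch: a swap between replicas i and j is accepted with probability

$$
P_{\mathrm{acc}} = \min\!\left(1,\ e^{-\Delta}\right),\qquad
\Delta = \beta\,\big[\,U_i(x_j) + U_j(x_i) - U_i(x_i) - U_j(x_j)\,\big]
$$

With all scaling factors equal to 1 the two Hamiltonians are identical, so $\Delta = 0$ exactly and every attempt should be accepted; any deviation from 1.0 must come from how the energies are evaluated, not from the physics.)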

The command I used is as follows:
nohup mpirun --use-hwthread-cpus -np 12 gmx_mpi mdrun -v -deffnm rest -nb gpu -pin on -ntomp 1 -replex 1000 -hrex -multidir rest0 rest1 rest2 rest3 rest4 rest5 rest6 rest7 rest8 rest9 rest10 rest11 -dlb no -plumed plumed.dat > nohup.out 2>&1 &

I am looking forward to your replies.

@GiovanniBussi
Member

GiovanniBussi commented Jan 17, 2025

Hi. This is known. My understanding is that the energy calculation on the GPU is not reproducible, I guess because of differences in the order in which the energy terms are added up. On a large system, even a tiny relative difference can translate into a different acceptance.
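
As a minimal illustration of the summation-order effect (a sketch with made-up magnitudes, not GROMACS code):

import numpy as np

rng = np.random.default_rng(0)
# ~1e5 single-precision terms, standing in for per-interaction energy
# contributions on a large system (hypothetical numbers)
terms = rng.normal(0.0, 100.0, size=100_000).astype(np.float32)

e1 = np.float32(0.0)
for t in terms:                    # one summation order
    e1 += t
e2 = np.float32(0.0)
for t in rng.permutation(terms):   # same terms, different order
    e2 += t

print(e1, e2, e1 - e2)  # the two totals differ in the last digits

Even a relative difference of ~1e-6 on a total potential energy of ~1e5 kJ/mol is ~0.1 kJ/mol, already a few percent of kT at 300 K (≈2.5 kJ/mol); larger relative errors shift exp(-Δ) correspondingly further from 1 once Δ is no longer exactly zero.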

Empirically, I never identified problems due to this.

Formally, I am not aware of any justification. A couple of handwaving considerations:

First, the acceptances obtained from the Metropolis formula are sufficient to sample the correct distribution, but not necessary (a toy numerical check is sketched after this list). For instance:

  • if you take the acceptances and multiply all of them by 0.5, you will still get the correct distribution; it will just take longer to converge
  • if you take all the acceptances and multiply all of them by a different random number between 0 and 1, again you will get the right result
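
To make the first point concrete, here is a toy numerical check (a sketch, unrelated to GROMACS): random-walk Metropolis sampling of a standard normal, once with the usual acceptance and once with every acceptance halved. Both target the same distribution.

import numpy as np

def metropolis(n_steps, scale=1.0, seed=0):
    # Random-walk Metropolis for a standard normal target; 'scale'
    # multiplies every acceptance probability (0 < scale <= 1).
    rng = np.random.default_rng(seed)
    x = 0.0
    out = np.empty(n_steps)
    for i in range(n_steps):
        y = x + rng.normal()
        if rng.random() < scale * min(1.0, np.exp(0.5 * (x * x - y * y))):
            x = y
        out[i] = x
    return out

for s in (1.0, 0.5):
    smp = metropolis(200_000, scale=s)
    print(s, smp.mean(), smp.var())  # mean ~0, variance ~1 in both cases

The halved-acceptance chain mixes more slowly but samples the same distribution, because detailed balance only constrains the ratio of forward and backward acceptances, which a constant factor leaves untouched.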

Strictly speaking, you should check that the ratio between the "GPU acceptances" and the "CPU acceptances" (=1.0) is independent of the coordinates of the system. I don't know how to do this, or whether it is even possible.

Second, I suspect that any such "GPU errors" will also be present when you integrate the equations of motion. So my feeling is that even if you introduce some small errors in the exchange procedure, the effect will be negligible anyway.

I am not sure how convincing these arguments are.

@wu222222
Author

Thank you very much for your response. However, I'm sorry, I didn't quite understand. Isn't the exchange probability supposed to reach a certain level, for example between 30% and 40%, for the HREX to be considered successful?
Or are you saying that the exchange probability cannot determine the success or quality of the HREX?

@GiovanniBussi
Member

Sorry, I just noticed that these average acceptances are computed over ~200 attempts each, so I would not expect them to be so different from each other just from statistical noise.

In addition, all replicas are identical except possibly for the initial coordinates, right?

Can you please report:

  • whether you are using a barostat or running at constant volume
  • in the latter case, whether all replica volumes are identical (they should be)

Then, ideally, could you plot the histogram of the acceptance for each pair of replicas? It would be useful to know whether problems are present at all attempts or whether the distributions are bimodal. Finally, for each pair, the time series of the acceptance could also be useful (if it's not too messy).

Thanks!!
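
A minimal sketch of how one might extract these numbers, assuming the md.log of the first replica prints per-attempt probabilities on lines beginning with "Repl pr" (as the GROMACS versions I have checked do; verify against your own log). Note that the printed columns alternate between odd and even pairs on successive attempts, so mapping values to specific pairs is left to the reader; this sketch simply histograms all of them. The path follows the -deffnm/-multidir settings in the command above.

import re
import matplotlib.pyplot as plt

probs = []  # printed probabilities, one sublist per exchange attempt
with open("rest0/rest.log") as fh:  # path from -multidir/-deffnm above
    for line in fh:
        if line.startswith("Repl pr"):
            # values are printed like "1.0" or ".43"
            probs.append([float(tok) for tok in re.findall(r"\d*\.\d+", line)])

flat = [p for attempt in probs for p in attempt]
plt.hist(flat, bins=40)
plt.xlabel("per-attempt exchange probability")
plt.ylabel("count")
plt.savefig("acceptance_hist.png")  # bimodality would show up here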

@wu222222
Author

wu222222 commented Jan 18, 2025

Thank you very much for your response. However, I'm sorry, I couldn't quite follow your point. I ran another REST2 simulation with 12 replicas spanning effective temperatures from 310 K to 510 K, and the exchange probability was also low:

Replica exchange statistics
Repl  399 attempts, 200 odd, 199 even
Repl  average probabilities:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .13  .12  .10  .14  .12  .12  .19  .31  .14  .13  .10
Repl  number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl       26   26   19   26   28   24   38   63   26   25   22
Repl  average number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .13  .13  .09  .13  .14  .12  .19  .32  .13  .13  .11

Additionally, the md.mdp file is attached at the end.

Below are plots of the replica traversal, where the y-axis indicates which position each replica occupies.
I hope this is helpful.

[Four images: replica traversal plots, replica position vs. time]

And here is md.mdp:

md.txt
