Parallel: Shake Error after couple minutes. Serial: No error. #9

Open
amrein opened this issue Mar 16, 2015 · 6 comments

Comments

@amrein

amrein commented Mar 16, 2015

When running f0f004.inp, Qdyn5p ends with a Shake Error after a couple of minutes.
When running with qdyn5 (the serial version), there is no shake error.

I am not sure if this is a bug, if this is just very unlucky, or even if this is expected.

Files and more information can be found in:
https://www.dropbox.com/sh/5oboetmc2mz2yyp/AADIVOCtw6WH3vigghBg4M4xa?dl=0
password: Bug9

@acmnpv
Contributor

acmnpv commented Mar 16, 2015

In my experience, shake errors are always caused by errors in your inputs; not getting them with one setup just means you are (un-)lucky. I checked your files and see that you have hot atoms in your run. Check those and fix the parameter problems with them.

@amrein
Author

amrein commented Mar 16, 2015

There is no error in the input. What you looked at is f0f004.log, which is the log file of the MPI run that crashed.
If you run serial, there is no hot atom and no shake error.

The hot atoms occur ONLY when run in parallel. Hence I don't believe that this error has anything in common with what you describe.

@amrein amrein reopened this Mar 16, 2015
@acmnpv
Contributor

acmnpv commented Mar 16, 2015

There are 2865 warnings in your log file, starting with atom 5128. Run in parallel with dcd output at 10 steps per frame and look at what is exploding. Then you can say whether there is a parameter issue or not.
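
A minimal sketch of how such warnings could be tallied from a log file, so the same count can be taken for the serial and the parallel log. The search pattern is an assumption: the exact wording Qdyn uses for hot-atom and shake messages is not shown in this thread, so `WARNING_PATTERN` and the helper name `summarize_warnings` are hypothetical and need adjusting to the real log lines.

```python
import re

# Assumed wording: the exact text Qdyn writes for these warnings is a guess.
# Adjust the pattern to match the lines actually seen in the log.
WARNING_PATTERN = re.compile(r"hot atom|shake", re.IGNORECASE)


def summarize_warnings(log_text, pattern=WARNING_PATTERN):
    """Return (count, first_line_number, first_line) for lines matching pattern.

    Line numbers are 1-based. If nothing matches, returns (0, None, None).
    """
    count = 0
    first = None
    for number, line in enumerate(log_text.splitlines(), start=1):
        if pattern.search(line):
            count += 1
            if first is None:
                # Remember where the trouble starts, e.g. which atom is named first.
                first = (number, line.strip())
    if first is None:
        return 0, None, None
    return count, first[0], first[1]
```

Reading f0f004.log from each run directory and passing its contents to `summarize_warnings` would show how many flagged lines each run produced and where the first one appears.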

@acmnpv acmnpv closed this as completed Mar 16, 2015
@amrein amrein self-assigned this Mar 16, 2015
@amrein
Author

amrein commented Mar 16, 2015

Obviously, there must be warnings in the log file, or it would not have crashed. Look into singletest (the non-parallel run) and check f0f004.log there: there is no shake error, and not a single warning until I killed the simulation at step 150000. Why should I get hot atoms in the parallel run and nothing in the serial version?

If this is expected, then we should add a test to make sure that all our parallel versions die, and all our serial versions don't die, when restarted from f0f002.re.

I will reopen this bug a final time, so I can check it out later when I have time. Feel free to close and I will forget about it.
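
A rough sketch of what such a serial-versus-parallel check could look like, working only on the two log files produced from the same input and restart file. The failure pattern and the function names are hypothetical; the actual shake failure message in the Qdyn log may be worded differently.

```python
import re

# Assumed failure marker; the real Qdyn message may be worded differently.
FAILURE_PATTERN = re.compile(r"shake.*(fail|error)", re.IGNORECASE)


def run_failed(log_text):
    """True if any line of the log looks like a shake failure."""
    return any(FAILURE_PATTERN.search(line) for line in log_text.splitlines())


def compare_serial_parallel(serial_log, parallel_log):
    """Classify a pair of logs from the same input and restart file."""
    serial_bad = run_failed(serial_log)
    parallel_bad = run_failed(parallel_log)
    if serial_bad and parallel_bad:
        return "both fail"
    if parallel_bad:
        return "parallel only"
    if serial_bad:
        return "serial only"
    return "both pass"
```

A result of "parallel only" for runs restarted from f0f002.re would reproduce the behaviour described here, while "both pass" or "both fail" would point away from a serial/parallel difference.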

@amrein amrein reopened this Mar 16, 2015
@acmnpv
Contributor

acmnpv commented Mar 16, 2015

I just ran the test with a differently compiled version of qdyn5p (master) and also get no errors (I sent you the file by mail), so it is not the difference between serial and parallel. I will leave this open for you, but I think there is nothing for me to fix.

@acmnpv
Contributor

acmnpv commented Mar 23, 2015

OK, I have an addition to this: running Qdyn5p (master) on Tintin gave me one segmentation fault and one shake failure, too, with a setup that should be working and that completed ~1000 runs on both Abisko and Triolith. I will have to test the reproducibility of those errors and will look into the issue again.
