Deleting exec-host with jobs in 'dr' state is allowed #9
Comments
I have no bugfix, but maybe a hint on how to get the system up and running again. SGE uses a database to store its runtime information. There are two possibilities: files and Berkeley DB. Make a backup before you poke around in the config!!! The file-based database can easily be edited, and the nodes can be deleted manually with a text editor. With Berkeley DB you can use some of the tools that are installed together with SGE in "utilbin", or the generic Berkeley DB utilities (it is usually hard to find the correct version). I hope this helps to bring SGE back into a running state.
Thanks... I managed to fix my problem by deleting the jobs db after backing
it up, restarting the master, re-adding the node whose deletion caused the
problem, then stopping the master, copying back the old jobs db, and
restarting the master. This would have confused clients, except I stopped
their daemons first.
Not ideal, of course.
I believe the bug is in allowing a node with 'dr' jobs to be deleted.
Shouldn't be hard to fix if someone knows the code.
(In reply to the comment above, written by Marco Schmidt on Thu, Nov 15, 2018 at 2:02 AM.)
See email chain pasted below.
The basic issue, I believe, is that you can do
qconf -de some_host
when there are jobs in state 'dr' on that host. That crashes the gridengine master, and restarting it is not possible: message in /var/spool/gridengine/qmaster/messages is:
11/10/2018 16:23:27| main|deb8qmaster|C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!
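For anyone checking whether they have hit the same crash, here is a minimal Python sketch that pulls critical entries out of the messages file. It assumes every line has the five pipe-separated fields seen in the line above (timestamp, thread, host, level, text) and that a level of `C` marks a critical entry. Both are inferred from this single log line, and the names `QmasterMessage`, `parse_message_line` and `critical_messages` are made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class QmasterMessage(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[QmasterMessage]:
    """Split one line of the messages file into its five fields.

    Assumes the layout 'timestamp|thread|host|level|text'; returns None
    for lines that do not have that shape.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = parts
    return QmasterMessage(
        timestamp.strip(), thread.strip(), host.strip(), level.strip(), text
    )


def critical_messages(lines: Iterable[str]) -> List[QmasterMessage]:
    """Return only the entries whose level field is 'C' (assumed: critical)."""
    parsed = (parse_message_line(line) for line in lines)
    return [msg for msg in parsed if msg is not None and msg.level == "C"]
```

To use it, open /var/spool/gridengine/qmaster/messages and pass the file object to `critical_messages`; an entry whose text mentions `EH_name` would point to the problem described here.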
I'm not sure which part of the code deals with this; it should probably be fixed.
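Until the master itself refuses the deletion, a site-side check before running `qconf -de` might help. The sketch below is only a workaround idea, not a fix in the gridengine code. It assumes the default plain-text `qstat` layout (job ID first, state in the fifth column, queue instance as `queue@host` in the eighth) and that any state containing `d` means the job is marked for deletion. The function name `jobs_pending_deletion` is hypothetical.

```python
from typing import List


def jobs_pending_deletion(qstat_output: str, host: str) -> List[str]:
    """Return IDs of jobs on `host` whose state contains 'd' (e.g. 'dr').

    Assumes the default qstat column layout; hosts are compared by their
    short name so that 'node01' matches 'node01.example.org'.
    """
    wanted = host.split(".")[0]
    hits = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; this skips header and ruler.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        state, queue_instance = fields[4], fields[7]
        if "@" not in queue_instance:
            continue  # pending jobs have no queue instance yet
        queue_host = queue_instance.split("@", 1)[1].split(".")[0]
        if "d" in state and queue_host == wanted:
            hits.append(fields[0])
    return hits
```

The output of `qstat -u "*"` can be passed in as the first argument, and `qconf -de some_host` run only if the returned list is empty. Column positions may differ between versions or with other `qstat` options, so check the parsing against real output first.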