Occasional lockups dumping planet #24
Comments
We had one good run but it looks like last week locked up again :-(
Weird that this is happening so often now. 🤔 I wonder if something changed? Thanks very much for the thread backtraces, they were very helpful. It looks like one of the output writer threads has died at some point while outputting the relations and this isn't handled properly. It looks like the incron script that runs
Unfortunately the wrapper script deletes the log after it has mailed it so it's no longer there. I assume it must have been empty though, or it should have been emailed to you as you say and I see no signs of that.
There have been some lock-ups recently running planet-dump-ng in production (#24). Thanks to thread backtraces, it seems that a writer thread was dying (although there was no output?) and therefore no longer participating in the barrier to pump data from the reader, so the whole program was locking up. The new behaviour is for the dying thread to still participate in pumping messages, but without the calls to the output writer. If the reader thread encounters an exception, it will abort. Hopefully this means that if a single writer dies, we get all the other output, and if the reader dies then we get a crash instead of a hang.
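To illustrate the behaviour described above, here is a minimal sketch in Python. It is not the planet-dump-ng code (which is C++); the names `run_dump`, `writer_loop`, `batches` and `writers` are hypothetical, and the two-phase barrier is an assumed simplification of how the reader pumps data to the writers. The point it shows: a writer that hits an error records it and stops writing, but keeps waiting on the barrier, so the reader and the other writers can still finish. The case where the reader itself fails and aborts is left out.

```python
import threading


def run_dump(batches, writers):
    """Pump each batch from the reader to every writer in lock-step.

    `writers` maps a (hypothetical) writer name to a callable taking a batch.
    Returns a dict of writer name -> exception for writers that failed.
    """
    # One party per writer plus one for the reader (this thread).
    barrier = threading.Barrier(len(writers) + 1)
    current = {"batch": None, "done": False}
    errors = {}

    def writer_loop(name, write):
        failed = False
        while True:
            barrier.wait()  # phase A: reader has published a batch (or "done")
            if current["done"]:
                return
            if not failed:
                try:
                    write(current["batch"])
                except Exception as exc:
                    # Record the failure and stop writing, but keep
                    # taking part in the barrier so nobody else hangs.
                    failed = True
                    errors[name] = exc
            barrier.wait()  # phase B: this batch has been consumed

    threads = [
        threading.Thread(target=writer_loop, args=(name, write))
        for name, write in writers.items()
    ]
    for t in threads:
        t.start()

    for batch in batches:
        current["batch"] = batch
        barrier.wait()  # phase A: release the writers
        barrier.wait()  # phase B: wait until all writers are done with it

    current["done"] = True
    barrier.wait()  # final phase A: let the writers see "done" and exit
    for t in threads:
        t.join()
    return errors
```

With two writers where one raises on the second batch, the healthy writer still receives every batch and the call returns instead of hanging. If the failing writer simply returned from its loop instead, the next `barrier.wait()` would never be satisfied, which matches the kind of lock-up described in this ticket.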
Yeah, I thought the same, but the log file should never be empty. Until July 2020, I was getting weekly confirms of the form:
But then I got a few errors around the 14-16th July, and nothing since. (Clearly I wasn't doing an awesome job noticing these emails, or I'd have realised they'd stopped coming before now.) I think I made a change that could help (assuming this is how the thread exits...). Please could you try version 1.2.2 and see if that helps? https://github.com/zerebubuth/planet-dump-ng/releases/tag/v1.2.2
I've deployed that, and I think I've figured out the email problem and fixed it. I think you've broken something though because I get a stream of errors now if I try and start the dump:
I assume this is something to do with e2d9c70?
Ooops, fail. Looks like the machine I was testing on had a truly ancient version of PostgreSQL (9!). I reverted that commit and pushed a new version, v1.2.3. https://github.com/zerebubuth/planet-dump-ng/releases/tag/v1.2.3
I ran a quick test with a simulated runtime_error thrown at relation 10000, the very first relation id in the dump:
Result on 3e48263:
Result on b190303:
echo $? returns 0 in case of an issue. This may not be ideal for monitoring...
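As a rough illustration of the monitoring concern, a dump process could turn recorded writer failures into a non-zero exit status. The sketch below is Python and not taken from planet-dump-ng; `exit_status` and the shape of the `errors` dict (writer name to exception) are assumptions made for the example.

```python
import sys


def exit_status(errors):
    """Return 0 only if no writer reported an error, otherwise 1.

    `errors` is assumed to map a writer name to the exception it raised.
    """
    for name, exc in errors.items():
        # Report each failure on stderr so it ends up in the log or email.
        print(f"writer {name} failed: {exc}", file=sys.stderr)
    return 1 if errors else 0


# Typical use at the end of a run (hypothetical):
#   sys.exit(exit_status(errors))
```

A wrapper script or cron job can then check the status (for example with `echo $?`) rather than having to parse the log output.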
I was wondering if the issue was reproducible, i.e. when processing the same db dump twice, it would show exactly the same behavior as with the previous run. Given that this happened in relations (which tend to be much larger than ways and nodes), I could imagine that a certain combination of large relations might trigger some rare bug in the pbf writer code, e.g. due to lack of space in a pbf block.
So, do the mails with errors arrive now at least? It seems planet dump has failed again? Today is 2022-09-20, and the last one is planet-220905.osm.pbf (created on 2022-09-10).
Yes it failed, yes I got email, yes our alerts went off, yes we are continuing to investigate.
As you can see if you look at #25 which is the actual ticket where we're currently dealing with this...
Which is actually a totally separate issue - it's a crash rather than a lockup.
We've had a couple of instances - one in July (openstreetmap/operations#552) and one this week (openstreetmap/operations#568) - where planet-dump-ng has experienced some sort of thread deadlock and stopped making progress.
I grabbed a full backtrace of all threads this time (https://gist.github.com/tomhughes/250e1504f4689fc31a0ca4e0ab4e029e) and from analysing the output it looks like it had completed nodes and ways and was in the middle of the relations when it stopped.