
parser "hang" found in 2020 historical data #352

Open
philbudne opened this issue Nov 19, 2024 · 3 comments

Comments

@philbudne
Contributor

I noticed the parser queue for the 2020 historical reingest slowing down, and parser exits (shown by dots on the "app max run time" Grafana graph). docker ps -a showed exited parser containers, and all of them followed the same pattern: the last URL parsed was the same, and when the parser tried to forward the message on, it crashed because RabbitMQ had closed the connection after processing took over 30 minutes.

I scaled the hist-fetcher service to zero, then the parser service to zero, and extracted two stories from the parser-in queue using ./run-qutil.sh dump_archives parser-in (run by docker exec'ing into an importer container). I moved the resulting archive to the data directory to preserve it.

I'm able to reproduce the hang with the attached warc file by sourcing my development venv and running:

./bin/run-parser.sh --test-file-prefix test-warcs/parser-hang-2024-11-19 --rabbitmq-url x
@philbudne
Contributor Author

Hangs like this had been seen when processing RSS files during "canonical url extraction", but unlike those cases, the offending file is HTML.

@philbudne
Copy link
Contributor Author

One solution would be to wrap the process_message call in Worker._process_one_message in a SIGALRM-based timeout, like:
https://github.com/mediacloud/feed_seeker/blob/main/feed_seeker/feed_seeker.py#L19 (but with a class whose name starts with a Capital Letter, please!)

BUT this is not a thread-safe solution (there is only one ITIMER_REAL timer per process).
The libc timer_create call MIGHT be able to create multiple CLOCK_REALTIME timers per process, but it is not available in Python, AND the signal would need to be delivered to the thread that issued the call, and from what I can tell, all signals are processed in the main thread in Python. The only multi-threaded app in story-indexer is tqfetcher.py; rss-fetcher handles this by doing all fetches in a subprocess with its own SIGALRM.
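A minimal sketch of what such a SIGALRM-based timeout wrapper could look like (class and parameter names are hypothetical, not the feed_seeker or story-indexer code; as noted above, this is only safe when called from the main thread, since there is one ITIMER_REAL timer per process):

```python
import signal


class ProcessTimeout:
    """Context manager raising TimeoutError if the body runs too long.

    Uses SIGALRM/alarm(), so it must run in the main thread and only
    one instance can be active per process at a time.
    """

    def __init__(self, seconds: int):
        self.seconds = seconds

    def _handler(self, signum, frame):
        raise TimeoutError(f"timed out after {self.seconds} seconds")

    def __enter__(self):
        # Install our handler, remembering the previous one.
        self._old_handler = signal.signal(signal.SIGALRM, self._handler)
        signal.alarm(self.seconds)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, self._old_handler)
        return False  # never suppress exceptions
```

Usage in _process_one_message might then look like (limit value illustrative only):

```python
try:
    with ProcessTimeout(30 * 60):
        self.process_message(chan, method, properties, body)
except TimeoutError:
    ...  # e.g. quarantine the story and ack the message
```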
