
parser "hang" found in 2020 historical data #352

Open
philbudne opened this issue Nov 19, 2024 · 3 comments

Comments

@philbudne
Contributor

I noticed the parser queue for the 2020 historical reingest slowing down, and parser exits (shown by dots on the "app max run time" Grafana graph). docker ps -a showed exited parser containers, and all of them followed the same pattern: the last URL parsed was the same, and when the parser tried to forward the message on, it crashed because RabbitMQ had closed the connection after processing took over 30 minutes.

I scaled the hist-fetcher service to zero, then the parser service to zero, and extracted two stories from the parser-in queue using ./run-qutil.sh dump_archives parser-in (run by docker exec'ing into an importer container). I moved the resulting archive to the data directory to preserve it.

I'm able to reproduce the hang with the attached warc file by sourcing my development venv and running:

./bin/run-parser.sh --test-file-prefix test-warcs/parser-hang-2024-11-19 --rabbitmq-url x
@philbudne
Contributor Author

Hangs like this had been seen when processing RSS files during "canonical url extraction", but unlike those cases, the offending file is HTML.

@philbudne
Copy link
Contributor Author

One solution would be to wrap the process_message call in Worker._process_one_message in a SIGALRM-based timeout, like:
https://github.com/mediacloud/feed_seeker/blob/main/feed_seeker/feed_seeker.py#L19 (but with a class whose name starts with a Capital Letter, please!)

BUT this is not a thread-safe solution (there is only one ITIMER_REAL timer per process).
The libc timer_create call MIGHT be able to create multiple CLOCK_REALTIME timers per process, but it is not available in Python, AND the signal would need to be delivered to the thread that issued the call, and from what I can tell, all signals are processed in the main thread in Python. The only multi-threaded app in story-indexer is tqfetcher.py; rss-fetcher handles this by doing all fetches in a subprocess with its own SIGALRM.
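A minimal sketch of what such a SIGALRM-based timeout wrapper could look like (class and parameter names are hypothetical, not the feed_seeker or story-indexer code; as noted above, this is only safe when called from the main thread, since there is one ITIMER_REAL timer per process):

```python
import signal


class ProcessTimeout:
    """Context manager raising TimeoutError if the body runs too long.

    Uses SIGALRM/alarm(), so it must run in the main thread and only
    one instance can be active per process at a time.
    """

    def __init__(self, seconds: int):
        self.seconds = seconds

    def _handler(self, signum, frame):
        raise TimeoutError(f"timed out after {self.seconds} seconds")

    def __enter__(self):
        # Install our handler, remembering the previous one.
        self._old_handler = signal.signal(signal.SIGALRM, self._handler)
        signal.alarm(self.seconds)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, self._old_handler)
        return False  # never suppress exceptions
```

Usage in _process_one_message might then look like (limit value illustrative only):

```python
try:
    with ProcessTimeout(30 * 60):
        self.process_message(chan, method, properties, body)
except TimeoutError:
    ...  # e.g. quarantine the story and ack the message
```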
