refextract on next is crashing for high number of citations #617

DonHaul · 2024-11-25T12:21:36Z

For some workflows, namely those with a large number of citations (200+), processing takes quite a long time, and the worker running the workflow runs out of RAM and crashes with exit code 137.

After some investigation, I found that the issue comes from the function extract_references_from_raw_refs
https://github.com/inspirehep/inspire-next/blob/615ca8053a0af72736389c07c3eb3648d15f09c8/inspirehep/modules/workflows/tasks/refextract.py#L251
where, for every single reference, the code does the following:

- downloads the kb file from the web and saves it as a temp file
- processes it to read it into a dict; this step sometimes takes around 4 seconds per individual reference

Since the kb is saved to a temp file, one would expect the memory to be deallocated after each reference, but for some reason unknown to me (maybe the way Python does garbage collection?), RAM usage keeps increasing.
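One way to check whether memory really is retained across references is to compare traced allocations before and after the loop with the standard-library tracemalloc module. The sketch below is only an illustration: process_reference and the retained list are hypothetical stand-ins that simulate a leak on purpose, and they are not the actual code in refextract.py.

```python
import tracemalloc


def process_reference(raw_ref, retained):
    """Hypothetical stand-in for the per-reference work.

    It builds a kb-like dict on every call and, to simulate the suspected
    problem, keeps a reference to it so it can never be garbage collected.
    """
    kb = {i: str(i) for i in range(10000)}
    retained.append(kb)  # simulated leak, assumed for illustration only
    return raw_ref


tracemalloc.start()
retained = []
before, _ = tracemalloc.get_traced_memory()

for raw_ref in ["some raw reference"] * 20:
    process_reference(raw_ref, retained)

after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# A growth figure that scales with the number of references suggests that
# something is holding on to the per-reference data.
growth = after - before
print("traced memory growth in bytes:", growth)
```

If the growth stays roughly flat as the number of references increases, the per-reference data is being freed and the RAM increase comes from somewhere else.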

Here are some possible next actions:

- start using the refextract service we have, by calling extract_references_from_text_data
  - it seems to take different inputs: obj.extra_data['formdata.references'] instead of obj.extra_data, 'formdata.references', so some data massaging may be required
- rework extract_references_from_raw_refs so that the kb is loaded only once, thus greatly reducing processing time and RAM usage
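The second option, loading the kb once instead of once per reference, can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual refextract code: load_kb, extract_reference, extract_references, the kb contents and the "journal-titles.kb" name are all hypothetical, and the expensive download-and-parse step is replaced by a small in-memory dict.

```python
from functools import lru_cache

# Counts how often the expensive load actually runs (for illustration only).
load_calls = 0


@lru_cache(maxsize=None)
def load_kb(kb_source):
    """Hypothetical stand-in for the expensive download-and-parse step."""
    global load_calls
    load_calls += 1
    # The real task would fetch the kb file and parse it into a dict here.
    return {"Phys. Rev. D": "Phys.Rev.D"}


def extract_reference(raw_ref, kb):
    """Hypothetical per-reference step that only reads the loaded kb."""
    for name, normalized in kb.items():
        raw_ref = raw_ref.replace(name, normalized)
    return raw_ref


def extract_references(raw_refs, kb_source):
    kb = load_kb(kb_source)  # loaded once, outside the per-reference loop
    return [extract_reference(raw_ref, kb) for raw_ref in raw_refs]


raw_refs = ["A. Author, Phys. Rev. D 1 (2020) 1"] * 250
results = extract_references(raw_refs, "journal-titles.kb")
print(load_calls, len(results))
```

Hoisting the load out of the loop is what removes the repeated work. The lru_cache decorator additionally keeps the parsed kb around for later calls in the same worker process, at the cost of holding one copy in memory for the lifetime of that process.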


DonHaul added the project: next and type: bug labels Nov 25, 2024

DonHaul changed the title ~~ref-extract on inspire-next is crashing for high number of citations~~ refextract on inspirehep is crashing for high number of citations Nov 25, 2024

DonHaul removed the project: next label Nov 25, 2024

DonHaul added the project: next label Dec 5, 2024

DonHaul changed the title ~~refextract on inspirehep is crashing for high number of citations~~ refextract on next is crashing for high number of citations Dec 5, 2024
