For some workflows, namely those with a large number of citations (200+), besides taking quite a long time, the worker processing the workflow is running out of RAM and crashing with exit code 137.

For each reference it is doing some processing to read it into a dict; this function takes at times around 4 seconds to execute per individual reference.

As it is being read into a temp file, one would guess the memory would be deallocated, but for some reason unknown to me (maybe the way Python does garbage collection?) RAM usage keeps increasing.
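As a hedged sketch (not the actual inspire-next code), the per-reference temp-file pattern described above might look like the following. The function name and the dict shape are hypothetical; the point is that even when the temp file is reliably closed and deleted, a long-lived worker's resident memory can still grow, because CPython frees the dict by refcounting but the process allocator may not return the pages to the OS:

```python
import tempfile

def read_ref_via_tempfile(raw_ref):
    # Hypothetical sketch of the per-reference processing: the raw
    # reference is written to a temp file and read back into a dict.
    # The context manager guarantees the file is closed and unlinked;
    # the dict itself is freed by refcounting once the caller drops it,
    # but freed pages may stay with the process, so RSS can keep
    # climbing across hundreds of references.
    with tempfile.NamedTemporaryFile(mode="w+", delete=True) as tmp:
        tmp.write(raw_ref)
        tmp.flush()
        tmp.seek(0)
        return {i: line.rstrip("\n") for i, line in enumerate(tmp)}
```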
Here are some recommended next actions:

- Start using the refextract service we have, by calling `extract_references_from_text_data`. It seems to take different inputs (`obj.extra_data['formdata.references']` instead of `obj.extra_data, 'formdata.references'`), so some data massaging may be required.
- Rework `extract_references_from_raw_refs` so that the KB is loaded a single time there, greatly reducing processing time and RAM usage.
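The second action above could look something like this minimal sketch. `load_kb` and its contents are hypothetical stand-ins for the real journal-title knowledge base, not the actual inspire-next API; the idea is just to memoize the expensive load so every reference reuses it:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_kb():
    # Hypothetical stand-in for the expensive KB load that
    # extract_references_from_raw_refs currently repeats per reference.
    return {"phys rev lett": "Phys. Rev. Lett."}

def extract_all(raw_refs):
    kb = load_kb()  # loaded once, reused for every reference
    return [kb.get(ref.lower().strip(), ref) for ref in raw_refs]
```

With the cache in place the KB is built on the first call only, so both the per-reference latency (around 4 seconds each) and the cumulative RAM churn should drop sharply.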
DonHaul changed the title from *ref-extract on inspire-next is crashing for high number of citations* to *refextract on inspirehep is crashing for high number of citations* on Nov 25, 2024.
DonHaul changed the title from *refextract on inspirehep is crashing for high number of citations* to *refextract on next is crashing for high number of citations* on Dec 5, 2024.
After some investigation I've found the issue comes from the function `extract_references_from_raw_refs` (https://github.com/inspirehep/inspire-next/blob/615ca8053a0af72736389c07c3eb3648d15f09c8/inspirehep/modules/workflows/tasks/refextract.py#L251), where it does the processing described above for every single reference.