refextract on next is crashing for high number of citations #617

Open
DonHaul opened this issue Nov 25, 2024 · 0 comments
Labels
project: next type: bug Something isn't working

DonHaul commented Nov 25, 2024

For some workflows, namely those with a large number of citations (200+), processing takes quite a long time and the worker handling the workflow runs out of RAM and crashes with exit code 137 (128 + 9, meaning the process was killed with SIGKILL, which is typical of an out-of-memory kill).

After some investigation I've found that the issue comes from the function extract_references_from_raw_refs
https://github.com/inspirehep/inspire-next/blob/615ca8053a0af72736389c07c3eb3648d15f09c8/inspirehep/modules/workflows/tasks/refextract.py#L251
where, for every single reference, the code does the following processing:

Since it is being read into a temp file, one would expect the memory to be deallocated, but for some reason unknown to me (maybe the way Python does garbage collection?), RAM usage keeps increasing.
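One way to confirm that memory grows with each reference is to record the process's peak resident set size around every call. The following is only a minimal sketch using the standard `resource` module (Unix only); `profile_per_item` and the `leaky` workload are hypothetical stand-ins, not part of inspire-next or refextract.

```python
import resource


def peak_rss():
    """Peak resident set size of this process so far.

    The unit is platform dependent: kilobytes on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def profile_per_item(items, process):
    """Call process(item) for each item and record the growth in peak RSS."""
    growth = []
    for item in items:
        before = peak_rss()
        process(item)
        growth.append(peak_rss() - before)
    return growth


# Hypothetical workload that keeps memory alive on every call, mimicking a leak.
retained = []


def leaky(ref):
    retained.append(bytearray(1024 * 1024))


growth = profile_per_item(range(5), leaky)
print(growth)
```

If the per-reference growth stays positive across hundreds of references instead of levelling off, the memory is being retained rather than merely reused. Swapping `leaky` for the real per-reference call would show where the growth happens.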

Here are some possible recommended next actions:

  • start using the refextract service we have, by calling extract_references_from_text_data
    • it seems to take different inputs (obj.extra_data['formdata.references'] instead of obj.extra_data, 'formdata.references'), so some data massaging may be required
  • rework extract_references_from_raw_refs so that the kb is loaded a single time there, which should greatly reduce processing time and RAM usage.
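The second option amounts to hoisting the knowledge-base load out of the per-reference loop. Below is a rough sketch of that pattern only; `load_journal_kb`, `extract_reference`, and `extract_references` are hypothetical names and do not reflect the actual inspire-next or refextract API.

```python
load_calls = 0


def load_journal_kb():
    """Hypothetical stand-in for the expensive knowledge-base load."""
    global load_calls
    load_calls += 1
    return {"phys. rev. d": "Phys.Rev.D", "jhep": "JHEP"}


def extract_reference(raw_ref, kb):
    """Hypothetical per-reference step: look up a journal name in the kb."""
    lowered = raw_ref.lower()
    for variant, canonical in kb.items():
        if variant in lowered:
            return {"raw_ref": raw_ref, "journal": canonical}
    return {"raw_ref": raw_ref, "journal": None}


def extract_references(raw_refs):
    """Load the kb once, then reuse it for every reference."""
    kb = load_journal_kb()  # loaded a single time, outside the loop
    return [extract_reference(raw_ref, kb) for raw_ref in raw_refs]


refs = [
    "A. Author, Phys. Rev. D 99 (2019) 1",
    "B. Author, JHEP 01 (2020) 2",
    "C. Author, unpublished",
]
results = extract_references(refs)
print(load_calls, [r["journal"] for r in results])
```

With this shape the cost of the load is paid once per workflow instead of once per reference, regardless of how many citations the record has.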
@DonHaul DonHaul added project: next type: bug Something isn't working labels Nov 25, 2024
@DonHaul DonHaul changed the title ref-extract on inspire-next is crashing for high number of citations refextract on inspirehep is crashing for high number of citations Nov 25, 2024
@DonHaul DonHaul changed the title refextract on inspirehep is crashing for high number of citations refextract on next is crashing for high number of citations Dec 5, 2024