fixtures: profile upload performance #88

Open
tiborsimko opened this issue Nov 22, 2024 · 1 comment

Current behaviour

Seen on the QA instance on November 15th.

Updating ATLAS records from cernopendata/opendata.cern.ch#3688 using cernopendata-portal image 0.1.11 works very fast, both locally and on PROD:

$ time docker exec -i -t opendatacernch-web-1 cernopendata fixtures records --mode insert-or-replace -f /content/data/records/atlas-CERN-EP-2024-159.json
...
docker exec -i -t opendatacernch-web-1 cernopendata fixtures records --mode    0.01s user 0.01s system 0% cpu 2.563 total

However, on QA the same upload process got stuck:

$ kubectl exec -i -t web-68-8vg94 /bin/bash
bash-5.1$ time cernopendata fixtures records --mode replace -f /tmp/data/records/atlas-CERN-EP-2024-159.json
/opt/invenio/var/instance/python/lib/python3.9/site-packages/invenio_config/default.py:77: UserWarning: Set configuration variable SECRET_KEY with random string
  warnings.warn(

/opt/invenio/var/instance/python/lib/python3.9/site-packages/invenio_rest/ext.py:30: FutureWarning: CSRF validation will be enabled by default in the version 1.3.x
  self.init_app(app)

/opt/invenio/var/instance/python/lib/python3.9/site-packages/flask_caching/__init__.py:145: DeprecationWarning: Using the initialization functions in flask_caching.backend is deprecated.  Use the a full path to backend classes directly.
  self._set_cache(app, config)

Loading records from /tmp/data/records/atlas-CERN-EP-2024-159.json (1/1)...

There was no reply for many minutes; the process seems to "run away".

I interrupted it after about 6 minutes:

^C
Aborted!
/usr/lib64/python3.9/site-packages/XRootD/client/finalize.py:46: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
  if isinstance(obj, File) and obj.is_open():


real    6m20.950s
user    0m11.243s
sys    0m0.737s
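
A quick reading of these timings: user plus sys time is about 12 seconds out of roughly 381 seconds of wall-clock time, which suggests the command was mostly waiting (for example on a database, a network service, or storage) rather than computing. The Python sketch below only does this arithmetic; the helper name `parse_time_field` is illustrative and not part of any project code.

```python
import re


def parse_time_field(value: str) -> float:
    """Convert a bash `time` field such as '6m20.950s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unrecognised time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the `time` output above.
real = parse_time_field("6m20.950s")
user = parse_time_field("0m11.243s")
sys_time = parse_time_field("0m0.737s")

# Share of the wall-clock time that was actually spent on the CPU.
cpu_fraction = (user + sys_time) / real
print(f"wall-clock: {real:.2f}s, CPU: {user + sys_time:.2f}s, CPU share: {cpu_fraction:.1%}")
```

A low CPU share does not identify the cause by itself; it only indicates that a profile should pay attention to blocking calls.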

Expected behaviour

The records should be updated quickly, within 2-3 seconds, as with 0.1.11.

Notes

This is especially interesting because the change in the record JSON was only minimal:

$ git diff -p upstream/pr/3688~1..upstream/pr/3688 -- data/records/atlas-CERN-EP-2024-159.json | cat
diff --git a/data/records/atlas-CERN-EP-2024-159.json b/data/records/atlas-CERN-EP-2024-159.json
index 34aa42f83..63c349c3f 100644
--- a/data/records/atlas-CERN-EP-2024-159.json
+++ b/data/records/atlas-CERN-EP-2024-159.json
@@ -245,7 +245,8 @@
     "type": {
       "primary": "Dataset",
       "secondary": [
-        "Derived"
+        "Derived",
+        "Simulated"
       ]
     },
     "usage": {

That is, there was no change in attached files when performing this update, and the record itself has only about 34 files attached, all directly and not via index files... So waiting for 6 minutes seems excessive.

It would be good to profile the fixture loading command to see where this extra time was spent. (Perhaps some missing DB indexes and an inefficient DB query are causing slowdowns?)
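
As a starting point for such profiling, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The functions `load_records` and `slow_query` are hypothetical stand-ins, not the actual fixture loader; the point is only to show how a report sorted by cumulative time exposes where a run is blocked. `cProfile` measures wall-clock time by default, so time spent waiting in blocking calls shows up in the report.

```python
import cProfile
import io
import pstats
import time


def slow_query():
    """Stand-in for a blocking call such as a slow DB query (hypothetical)."""
    time.sleep(0.01)


def load_records(n_files):
    """Hypothetical placeholder for the fixture loader: one blocking call per file."""
    for _ in range(n_files):
        slow_query()


def profile_call(func, *args, top=10):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return buffer.getvalue()


report = profile_call(load_records, 5)
print(report)
```

The same idea could in principle be applied to the real command by running its entry-point script under `python -m cProfile`, though that has not been tried here.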

psaiz self-assigned this Nov 26, 2024

psaiz commented Nov 27, 2024

@tiborsimko: if the change is only in the metadata, using the --skip-files option would definitely reduce the time needed. Note that this record has 17 file indices. In versions 0.1.11 and older, file indices are stored as normal files. Starting with 0.2, the file indices are read and the files listed inside each index are processed. This particular record has more than 2000 files that have to be deleted and reinserted (unless the --skip-files option is specified).
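
To illustrate the difference described above, here is a small Python sketch that counts how many file entries a replace would touch when index files are treated as ordinary files versus when they are expanded. The record layout (a `files` list whose index items carry a nested `entries` list) and the function name `count_file_operations` are assumptions made for illustration, not the portal's actual data model or code, and the numbers are invented.

```python
def count_file_operations(record, expand_indices):
    """Count file entries that a replace would delete and reinsert.

    `record` uses a hypothetical layout: each item in record["files"] is
    either a plain file or an index with a nested "entries" list.
    """
    total = 0
    for item in record["files"]:
        entries = item.get("entries")
        if expand_indices and entries is not None:
            # Newer behaviour as described: process every file listed in the index.
            total += len(entries)
        else:
            # Older behaviour as described: the index counts as one ordinary file.
            total += 1
    return total


# Illustrative numbers only, not taken from the real record.
record = {
    "files": [
        {"uri": "index_a.json", "entries": [{"uri": f"a_{i}.root"} for i in range(120)]},
        {"uri": "index_b.json", "entries": [{"uri": f"b_{i}.root"} for i in range(80)]},
        {"uri": "readme.txt"},
    ]
}

as_plain_files = count_file_operations(record, expand_indices=False)
expanded = count_file_operations(record, expand_indices=True)
print(as_plain_files, expanded)
```

With expansion, the work grows with the number of files listed in the indices rather than with the number of index files, which matches the jump from a handful of attached files to more than 2000 processed files mentioned in the comment.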
