Timing out on exporting documents & annotations #274

Open
twinkarma opened this issue Jan 23, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@twinkarma
Collaborator

Sylvia reported a 502 error when exporting her project (4500 documents and 13500 annotations). I have confirmed that this is happening.

It's most likely because the server is timing out due to the number of documents and annotations.

  • Short-term fix
    • Increase the timeout on the server
  • Proposed fix
    • Run the export code in a background process (e.g. using Celery) and show a generation progress meter to the user.
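
The proposed fix above can be sketched without any extra infrastructure. This is only a rough illustration of the idea, using a standard-library thread as a stand-in for a Celery worker; `start_export`, `get_progress` and the in-memory `_jobs` registry are hypothetical names, not part of Teamware.

```python
import threading
import uuid

# In-memory job registry (assumption: a real deployment would keep this in
# Celery's result backend or the database, not in process memory).
_jobs = {}
_lock = threading.Lock()


def _run_export(job_id, documents, export_one):
    """Export each document in turn, recording progress after every item."""
    try:
        for done, doc in enumerate(documents, start=1):
            export_one(doc)
            with _lock:
                _jobs[job_id]["done"] = done
        final_state = "finished"
    except Exception:
        final_state = "failed"
    with _lock:
        _jobs[job_id]["state"] = final_state


def start_export(documents, export_one):
    """Start the export in the background and return a job id immediately."""
    documents = list(documents)
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"state": "running", "done": 0, "total": len(documents)}
    worker = threading.Thread(
        target=_run_export, args=(job_id, documents, export_one), daemon=True
    )
    worker.start()
    return job_id


def get_progress(job_id):
    """Return a snapshot that a progress meter could poll."""
    with _lock:
        return dict(_jobs[job_id])
```

The HTTP request that starts the export returns straight away with the job id, so it cannot hit the timeout; the frontend would then poll `get_progress` and offer the download once the state is `finished`.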
twinkarma added the bug label on Jan 23, 2023
@davidwilby
Contributor

Note to self: Try gunicorn extra args approach in the first instance.
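
For reference, a minimal sketch of what raising the gunicorn timeout could look like. Gunicorn config files are plain Python, and `timeout` and `graceful_timeout` are real gunicorn settings, but the file name and the values below are assumptions; how Teamware actually passes arguments to gunicorn is not shown in this thread.

```python
# gunicorn.conf.py (hypothetical file; values are illustrative only)

# Seconds a worker may stay silent before the master kills and restarts it.
# Gunicorn's default is 30, which a long export can easily exceed.
timeout = 300

# Seconds a worker gets to finish in-flight requests after a restart signal.
graceful_timeout = 30
```

The same setting can be given on the command line as `--timeout 300`. If nginx sits in front of gunicorn, its `proxy_read_timeout` (60 seconds by default) would probably need raising as well, otherwise the proxy gives up first.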

@johann-petrak

I tested importing and exporting the other day and was surprised by how slow it is. Is there a reason for it to take so long that might be fixable?
I tested this with around 8000 documents, each containing only a (short) text and an id, and it took orders of magnitude longer than (de-)serialization and zipping via Python on the same machine, which I found odd.

@davidwilby
Contributor

We're looking into this soon so we'll have a look at what's slow and do some profiling. Thanks @johann-petrak

@johann-petrak

johann-petrak commented Oct 2, 2023

OK I am apparently hitting this bug and we urgently need those 8000 annotated documents exported.
No matter what I do I get a 502 Bad Gateway response from the server.

Here are the last few lines I get in the log:

teamware01-backend-1    | 2023-10-02 14:10:15,650 backend.rpcserver INFO     Called get_project_annotators
teamware01-nginx-1      | 193.171.142.175 - - [02/Oct/2023:14:10:15 +0000] "POST /rpc/ HTTP/1.1" 200 883 "http://pflaume.ofai.at:8076/project/1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" "-"
teamware01-nginx-1      | 193.171.142.175 - - [02/Oct/2023:14:10:24 +0000] "POST /rpc/ HTTP/1.1" 499 0 "http://pflaume.ofai.at:8076/project/1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" "-"
teamware01-nginx-1      | 193.171.142.175 - - [02/Oct/2023:14:10:24 +0000] "POST /rpc/ HTTP/1.1" 499 0 "http://pflaume.ofai.at:8076/project/1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" "-"
teamware01-nginx-1      | 193.171.142.175 - - [02/Oct/2023:14:10:24 +0000] "POST /rpc/ HTTP/1.1" 499 0 "http://pflaume.ofai.at:8076/project/1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" "-"
teamware01-nginx-1      | 193.171.142.175 - - [02/Oct/2023:14:10:24 +0000] "POST /rpc/ HTTP/1.1" 499 0 "http://pflaume.ofai.at:8076/project/1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" "-"
teamware01-backend-1    | 2023-10-02 14:10:42,657 backend.rpcserver INFO     Called get_project_annotators
teamware01-backend-1    | 2023-10-02 14:10:42,677 backend.rpcserver INFO     Called get_possible_annotators
teamware01-backend-1    | 2023-10-02 14:10:42,698 backend.rpcserver INFO     Called get_possible_annotators
teamware01-backend-1    | 2023-10-02 14:10:42,718 backend.rpcserver INFO     Called get_possible_annotators
teamware01-backend-1    | [2023-10-02 14:11:13 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11)
teamware01-nginx-1      | 2023/10/02 14:11:13 [error] 50#50: *23416 upstream prematurely closed connection while reading response header from upstream, client: 193.171.142.175, server: , request: "GET /download_annotations/1/all/json/raw/500/anonymize/ HTTP/1.1", upstream: "http://172.27.0.4:8000/download_annotations/1/all/json/raw/500/anonymize/", host: "pflaume.ofai.at:8076", referrer: "http://pflaume.ofai.at:8076/project/1"
teamware01-backend-1    | [2023-10-02 14:11:13 +0000] [11] [INFO] Worker exiting (pid: 11)
teamware01-nginx-1      | 193.171.142.175 - - [02/Oct/2023:14:11:13 +0000] "GET /download_annotations/1/all/json/raw/500/anonymize/ HTTP/1.1" 502 157 "http://pflaume.ofai.at:8076/project/1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" "-"
teamware01-backend-1    | [2023-10-02 14:11:13 +0000] [12] [INFO] Booting worker with pid: 12

We really need those annotated documents exported urgently. How can I work around this error?

@johann-petrak

No combination of export options seems to help with this either.

@johann-petrak

According to the docker container names, the version I am running is 2.1.0.

Not sure why my compose file contains version: "3.3" as the first line.

@twinkarma
Collaborator Author

The version: "3.3" line is the Docker Compose file format version, so it's not related to the Teamware version.

I'll see if I can put together a quick script for downloading the annotations for you. I'm assuming you have full access to the machine that's hosting this teamware instance?

@johann-petrak

Thanks!
Yes, I have deployed this and I have sudo rights on that machine as well if needed

@ianroberts
Member

The code that generates the download is DownloadAnnotationsView. The problem is that, as implemented, the entire export has to be written to a temporary file before any bytes can be sent back over the HTTP response, so the whole request times out if generating the temp file takes longer than the HTTP response timeout. If there were some way to stream the zip file bytes out to the response as they're generated, rather than buffering everything in a temporary file, it would be less of an issue: chunks would arrive at regular intervals and not trigger the timeout.

In the short term, you could probably write a Django admin command that creates an instance of DownloadAnnotationsView, calls its generate_download generator method, and streams the resulting byte chunks to a file.
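
The "stream the byte chunks to a file" step might look roughly like the helper below. It is a sketch only: `write_chunks` is a hypothetical name, and it accepts any iterable of bytes. In a management command the iterable would come from `generate_download`, whose arguments are not shown in this thread and are therefore left out here.

```python
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: str) -> int:
    """Write byte chunks to `destination` as they arrive; return bytes written.

    Only one chunk is held in memory at a time, and no HTTP response is
    involved, so no request timeout applies.
    """
    total = 0
    with open(destination, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            total += len(chunk)
    return total
```

Because the helper never materialises the whole export, the same pattern works regardless of how large the project is.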

@twinkarma
Collaborator Author

Yes I'm writing that script now

@twinkarma
Collaborator Author

twinkarma commented Oct 2, 2023

Go into the container:

docker exec -it gate-teamware_backend_1 /bin/bash

For example, to create a zip file testdownload.zip for project id 1 with all document types, in JSON and not anonymised, run:

./manage.py download_annotations testdownload.zip 1 all json False

Then you should be able to run the following to copy the file out of your container:

docker cp gate-teamware_backend_1:/app/testdownload.zip testdownload.zip

@johann-petrak

I am still seeing this with a "larger" project (really only 4k small documents), which is a bit frustrating. I am running TW version 2.1.1.

I am not sure why it is necessary to create a ZIP file at all. Why not stream to a gzip-compressed JSONL file, which would allow the server to start responding immediately?
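
A minimal sketch of that gzip-compressed JSONL idea, using only the standard library. `stream_gzip_jsonl` and its `flush_every` parameter are hypothetical names, and the record shape is left to the caller; this is not how Teamware currently exports.

```python
import json
import zlib


def stream_gzip_jsonl(records, flush_every=100):
    """Yield gzip-compressed bytes for `records`, one JSON object per line.

    Output is produced incrementally, so a server can start responding
    before the last record has been serialised.
    """
    # wbits=31 selects the gzip container rather than a raw zlib stream.
    compressor = zlib.compressobj(wbits=31)
    for count, record in enumerate(records, start=1):
        line = json.dumps(record, ensure_ascii=False) + "\n"
        chunk = compressor.compress(line.encode("utf-8"))
        if count % flush_every == 0:
            # Force buffered data out so the client keeps receiving bytes.
            chunk += compressor.flush(zlib.Z_SYNC_FLUSH)
        if chunk:
            yield chunk
    # Final flush writes the remaining data and the gzip trailer.
    yield compressor.flush()
```

In a Django view, a generator like this could be handed to `StreamingHttpResponse`, so the first bytes leave the server almost immediately. The periodic sync flush costs a little compression ratio in exchange for regular output.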

@ianroberts
Member

Even zip files can be streamed (cf. Java's ZipOutputStream), if we can find a suitable Python library capable of doing that.

@johann-petrak

Ah OK! There is, for example, https://github.com/sandes/zipfly, which seems to work fine.
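
As a possible alternative to a third-party library, Python's own `zipfile` module can write to a non-seekable stream, which is enough to produce a zip in chunks. The sketch below is an illustration under that assumption; `_ChunkSink` and `stream_zip` are hypothetical names, and each entry is still held in memory as a whole.

```python
import zipfile


class _ChunkSink:
    """Write-only, non-seekable sink that collects whatever zipfile writes."""

    def __init__(self):
        self._parts = []

    def write(self, data):
        self._parts.append(bytes(data))
        return len(data)  # zipfile uses this to track the archive offset

    def flush(self):
        pass

    def drain(self):
        """Return and clear everything written since the last drain."""
        data = b"".join(self._parts)
        self._parts.clear()
        return data


def stream_zip(files):
    """Yield zip archive bytes for an iterable of (name, payload) pairs.

    One chunk is yielded per entry, followed by the central directory.
    """
    sink = _ChunkSink()
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files:
            archive.writestr(name, payload)
            yield sink.drain()
    # Closing the archive writes the central directory; emit it last.
    yield sink.drain()
```

Because the sink cannot seek, `zipfile` records sizes in a data descriptor after each entry instead of patching the header, which is what makes the output streamable.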


4 participants