Document load under fleet #2003

Open · vondele opened this issue May 13, 2024 · 114 comments
@vondele (Member) commented May 13, 2024

Just to keep some numbers for later reference: this is the average time in ms needed server-side when the fleet (50k cores) is present but not overloading the server, as obtained from the worker output (e.g. Post request https://tests.stockfishchess.org:443/api/request_task handled in 1719.69ms (server: 265.24ms)):

failed_task 13.6646
request_spsa 4.04172
request_task 149.143
request_version 0.241336
update_task 4.36213
upload_pgn 92.8058

corresponding observation on the worker side:

failed_task 742.879
request_spsa 319.939
request_task 756.641
request_version 627.497
update_task 539.884
upload_pgn 1213.44

Server-side sampling of the stack traces, for calls not blocked on a lock, looks like:

21
     File "rundb.py", line 432, in flush_buffers self.runs.replace_one(
7
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 738, in sync_request_task w = self.workerdb.get_worker(my_name)
                    File "workerdb.py", line 17, in get_worker r = self.workers.find_one(
5
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 902, in sync_request_task committed_games += (
4
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 870, in sync_request_task if connections >= connections_limit:
3
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 837, in sync_request_task task_name = worker_name(task["worker_info"], short=True)
                    File "util.py", line 47, in worker_name name = "{}-{}cores".format(username, cores)

Essentially all stack traces go via request_task.

@vondele (Member, Author) commented May 13, 2024

And some API times of the server under load (IIRC 75k cores), at various stages of 'distress'.
medium:

# logging from [08/May/2024:18:26:11 +0000] to [08/May/2024:18:27:30 +0000] ########
# duration (seconds)            : 79
# calls in total                : 10000
# calls per second              : 126
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum
                      /api/beat     5523   2100.429      0.380      0.001      4.014
               /api/update_task     3154    723.048      0.229      0.002      4.014
              /api/request_task      376    138.410      0.368      0.038      3.648
           /api/request_version      371     62.196      0.168      0.000      3.232
                   /tests/view/       45     45.010      1.000      0.025      3.406
                /api/upload_pgn      367     19.217      0.052      0.007      0.289
                  /tests/tasks/       18      6.594      0.366      0.045      2.577
                         /tests        9      6.145      0.683      0.029      0.936
              /api/request_spsa       42      3.253      0.077      0.002      0.363
                   /tests/user/        2      1.172      0.586      0.540      0.632
                   /api/actions       28      0.840      0.030      0.003      0.161
                       /api/nn/       59      0.625      0.011      0.003      0.059
                  /contributors        1      0.293      0.293      0.293      0.293

5 request_task / s

high:

# logging from [08/May/2024:18:38:01 +0000] to [08/May/2024:18:39:06 +0000] ########
# duration (seconds)            : 65
# calls in total                : 10000
# calls per second              : 153
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum
                     /api/beat     4854  12774.715      2.632      0.002     24.684
              /api/update_task     2673   4104.362      1.535      0.002     24.689
          /api/request_version     1711   1581.257      0.924      0.001     23.757
             /api/request_task      583   1036.156      1.777      0.042     24.179
                 /api/get_elo/        7     29.748      4.250      0.108     24.629
                        /tests       10      8.301      0.830      0.040      1.138
             /api/request_spsa       31      7.610      0.245      0.005      0.739
               /api/upload_pgn       57      5.803      0.102      0.008      3.694
              /tests/live_elo/        4      3.639      0.910      0.223      2.038
                  /tests/user/        1      0.517      0.517      0.517      0.517
                  /api/actions       23      0.412      0.018      0.002      0.061
                      /api/nn/       41      0.288      0.007      0.003      0.047
                  /tests/view/        1      0.261      0.261      0.261      0.261
               /tests/finished        1      0.121      0.121      0.121      0.121

9 request_task / s

too much:

# logging from [08/May/2024:18:42:28 +0000] to [08/May/2024:18:43:19 +0000] ########
# duration (seconds)            : 51
# calls in total                : 10000
# calls per second              : 196
# calls not reaching the backend: 
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
                      /api/beat     4662  23688.011      5.081      0.002     30.029
               /api/update_task     4074  20350.370      4.995      0.003     30.031
           /api/request_version      767   4114.809      5.365      0.003     29.990
              /api/request_task      341   1665.680      4.885      0.071     29.987
              /api/request_spsa       66     70.501      1.068      0.010     12.693
               /api/failed_task        9     48.770      5.419      0.012     29.984
                   /tests/view/        4      7.470      1.867      0.175      6.781
                         /tests        9      6.444      0.716      0.022      1.260
                     /tests/run        2      5.032      2.516      1.115      3.917
                /api/upload_pgn       32      1.834      0.057      0.008      0.272
                   /tests/user/        2      1.174      0.587      0.498      0.676
                   /api/actions       22      0.470      0.021      0.007      0.136

7 request_task / s

If we go with the above timing for request_task (~149 ms server-side), we can handle 1000 / 149 ≈ 7 of them per second, so that's consistent with the numbers we have here (not counting the timeouts occurring at various rates in profiles 2 and 3).

@vdbergh (Contributor) commented May 13, 2024

@vondele Can you explain why the internal throttling in Fishtest does nothing (at most five concurrent accesses to request_task)? I had the PR #1987 to set the number to 1 but I closed it since I did not understand what was going on.

@vondele (Member, Author) commented May 13, 2024

No, I don't know why that happens. Admittedly, I don't quite know how the frameworks around our Python code do things; if they only allowed one concurrent access to the same endpoint, it would be obvious.

@ppigazzini (Collaborator) commented May 13, 2024

# Limit concurrent request_task
task_lock = threading.Lock()
task_semaphore = threading.Semaphore(4)

def request_task(self, worker_info):
    if self.task_semaphore.acquire(False):
        try:
            with self.task_lock:
                return self.sync_request_task(worker_info)
        finally:
            self.task_semaphore.release()
    else:
        print("request_task too busy", flush=True)
        return {"task_waiting": False}

It should queue at most 4 requests and drop the others until a slot in the semaphore is released...

@Disservin (Member) commented May 13, 2024

The entire thing is basically single threaded from the start because of how Python does things. I think Pasquale has worked around that by configuring nginx so that certain endpoints are forwarded to different Python processes (I think we have 2 running)? So all requests to the same endpoint are handled sequentially, without any concurrency, iirc.

@ppigazzini (Collaborator) commented May 13, 2024

###
# wsgi server configuration
###
[server:main]
use = egg:waitress#main
host = 127.0.0.1
port = %(http_port)s
trusted_proxy = 127.0.0.1
trusted_proxy_count = 1
trusted_proxy_headers = x-forwarded-for x-forwarded-host x-forwarded-proto x-forwarded-port
clear_untrusted_proxy_headers = yes
# Match connection limit with max number of workers
connection_limit = 100
threads = 4

Each of the 3 instances should have 4 threads, so the semaphore should never be exhausted.

@ppigazzini (Collaborator) commented May 13, 2024

PROD is running with 10 threads per instance. I will check the log for rejected requests.

@vdbergh (Contributor) commented May 13, 2024

> PROD is running with 10 threads per instance. I will check the log for rejected requests.

If Disservin is correct then the number of threads is irrelevant because of the GIL.

EDIT: that's not quite right. Threads can switch between the execution of bytecodes.
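A quick way to see what the GIL does and does not serialize (a standalone sketch, not Fishtest code): I/O-bound threads overlap because blocking calls release the GIL, while pure-Python CPU-bound threads are effectively serialized.

import threading
import time

def io_bound():
    time.sleep(1.0)  # a blocking sleep releases the GIL, so these overlap

def cpu_bound():
    sum(i * i for i in range(5_000_000))  # pure-Python work holds the GIL between bytecodes

for name, target, n in (("I/O-bound", io_bound, 10), ("CPU-bound", cpu_bound, 4)):
    start = time.time()
    threads = [threading.Thread(target=target) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # expect ~1s for the 10 I/O-bound threads, but roughly n x single-thread time for the CPU-bound ones
    print(f"{n} {name} threads took {time.time() - start:.2f}s")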

@vondele (Member, Author) commented May 13, 2024

Let me add some workers to check this; I need to pause the progression tests for that.

@ppigazzini (Collaborator):

import threading
import time

class Test:
    def __init__(self):
        self.task_lock = threading.Lock()
        self.task_semaphore = threading.Semaphore(4)

    def access_resource(self, thread_id):
        if self.task_semaphore.acquire(False):
            try:
                with self.task_lock:
                    print(f"Thread {thread_id} is accessing the resource.")
                    time.sleep(0.001)
            finally:
                self.task_semaphore.release()
                print(f"Thread {thread_id} has released the resource.")
        else:
            print(f"Thread {thread_id} could not acquire the semaphore, resource is too busy.")

test = Test()

threads = []
for i in range(1000):
    thread = threading.Thread(target=test.access_resource, args=(i,))
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

@ppigazzini (Collaborator) commented May 13, 2024

May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy

@vondele (Member, Author) commented May 13, 2024

Just FYI, while the workers are ramping up, the initial phase has the following server timings:

request_spsa 11.578
request_task 39.5495
request_version 0.349775
update_task 6.072

2-3min later:

failed_task 32.0517
request_spsa 9.76163
request_task 55.4572
request_version 0.594289
update_task 3.42941

about 9min later:

failed_task 36.864
request_spsa 7.39204
request_task 61.0517
request_version 0.543516
update_task 3.66163
upload_pgn 15.84

22min later:

failed_task 37.8486
request_spsa 9.56024
request_task 72.8793
request_version 0.499205
update_task 10.0572
upload_pgn 23.6664

34min later:

failed_task 38.7616
request_spsa 8.77399
request_task 80.093
request_version 0.464707
update_task 9.20768
upload_pgn 27.9647

It is clear that the request_task timing grows quickly with the server load.
A typical worker:

Post request https://tests.stockfishchess.org:443/api/request_task handled in 148.79ms (server: 14.41ms)
Post request https://tests.stockfishchess.org:443/api/request_task handled in 356.46ms (server: 197.66ms)

or

Post request https://tests.stockfishchess.org:443/api/request_task handled in 171.36ms (server: 13.02ms)
Post request https://tests.stockfishchess.org:443/api/request_task handled in 351.89ms (server: 184.89ms)

So in the ramp-up phase, with few workers active, we see about 14ms for a call; later this is closer to 190ms per call.

@Disservin (Member):

> PROD is running with 10 threads per instance. I will check the log for rejected requests.
>
> If Disservin is correct then the number of threads is irrelevant because of the GIL.
>
> EDIT: that's not quite right. Threads can switch between the execution of bytecodes.

Well, I only mentioned the bits I've collected; it's also the first time I'm seeing the prod ini with the threads, so take everything from me with a grain of salt.

@ppigazzini (Collaborator):

It's the first proof that waitress threads are for real :)

@vondele (Member, Author) commented May 13, 2024

As we've done the check, I'll take the workers down.

As shown in #2003 (comment), it looks clear that the request_task API is getting increasingly expensive under load, going from roughly 15 to 190ms per call between the first call of a worker and the second.

@vdbergh (Contributor) commented May 13, 2024

I think the number of connections in waitress should be equal to the number of threads (so that waitress does not queue connections) and then the value of the request_task semaphore should be adjusted so that Fishtest does not become overloaded.
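A minimal sketch of what such a waitress section could look like (illustrative values, not a tested configuration): with connection_limit equal to threads, waitress keeps no queue of its own, and excess requests are left to nginx or rejected by the Fishtest semaphore.

[server:main]
use = egg:waitress#main
host = 127.0.0.1
port = %(http_port)s
# no waitress-side queue: one accepted connection per worker thread
connection_limit = 10
threads = 10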

@ppigazzini (Collaborator) commented May 13, 2024

For years we ran the server with connection_limit = 500 and 16 threads. I didn't measure any improvement with 2000 machines / 17000 cores in PROD, so I restored the waitress defaults (connection_limit = 100 and 4 threads) in b960e2b
https://docs.pylonsproject.org/projects/waitress/en/stable/arguments.html

In nginx we keep 64 connections alive for the upstream, to avoid the cost of closing and opening connections (not sure the two configs are talking about the same "connection"):

upstream backend_6543 {
    server 127.0.0.1:6543;
    keepalive 64;
}

@vdbergh (Contributor) commented May 13, 2024

The thing is that 100 is too big.

Waitress will simply queue connections without giving Fishtest a chance to reject them. That's why we are seeing 30s API calls: they are just queued in waitress.

EDIT: I mean too big compared to the number of worker threads.

@ppigazzini (Collaborator):

We set a 30 second timeout in nginx:

    location / {
        rewrite ^/$       /tests permanent;
        rewrite ^/tests/$ /tests permanent;
        proxy_set_header Connection "";
        proxy_set_header X-Forwarded-Proto  $scheme;
        proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host   $host:$server_port;
        proxy_set_header X-Forwarded-Port   $server_port;
        proxy_set_header X-Country-Code     $region;

        client_max_body_size        120m;
        client_body_buffer_size     128k;
        proxy_connect_timeout       5s;
        proxy_send_timeout          30s;
        proxy_read_timeout          30s;     # timeout
        proxy_max_temp_file_size    0;
        proxy_redirect              off;
        proxy_http_version          1.1;

        proxy_cache cache_zone;

        gunzip on;

        proxy_pass http://$backends;

        location /api/serialize_run/ {
            deny all;
        }
    }

@vdbergh (Contributor) commented May 14, 2024

Yes nginx caps the connection time. But either waitress or Fishtest should reject the connections. It should not be left to nginx.

@ppigazzini (Collaborator) commented May 14, 2024

It's really the job of nginx. We can set connection limits and rates per host, URI, or binary_remote_addr, globally or even for any single path.

limit_conn_zone $host zone=perhost:10m;
limit_conn perhost 10;
limit_conn_status 429;
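For request_task specifically, a per-path rate limit would look roughly like this (zone name, rate and burst values are placeholders, not a tested configuration):

# in the http block
limit_req_zone $binary_remote_addr zone=req_task:10m rate=5r/s;
limit_req_status 429;

# nested inside the existing "location /" block, next to /api/serialize_run/
location /api/request_task {
    limit_req zone=req_task burst=20 nodelay;
    proxy_pass http://$backends;
}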

@vdbergh (Contributor) commented May 14, 2024

> It's really the job of nginx. We can set limits for host, uri, even for any single path.
>
> limit_conn_zone $host zone=perhost:10m;
> limit_conn perhost 10;

Well then nginx should limit the number of connections to waitress to the number of threads in Fishtest. Otherwise waitress will just queue the connections.

The point is we need to give Fishtest a chance to reject the request_task connections using the semaphore.

@vdbergh (Contributor) commented May 14, 2024

Note that Fishtest will reject the connection immediately. There will be no 30s timeout.

@ppigazzini (Collaborator) commented May 14, 2024

I set 10 connections in nginx (matching the waitress threads) and tested by opening 50 connections concurrently:

2024/05/14 00:52:57 [error] 13428#13428: *790 limiting connections by zone "per_host_conn", client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
2024/05/14 00:52:57 [alert] 13429#13429: ignore long locked inactive cache entry 522ba7397b573b51acc828305a6d6a06, count:7
2024/05/14 00:53:59 [error] 13428#13428: *1075 limiting connections by zone "per_host_conn", client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
2024/05/14 00:53:59 [error] 13428#13428: *1073 challenging request, client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"

@ppigazzini (Collaborator) commented May 14, 2024

I set 30 connections in nginx and 10 connections in waitress (matching the 10 threads).
IMO the web works with queues to deal with spikes, and we don't want the workers making requests too fast when the server is overloaded...

@ppigazzini (Collaborator):

Commented out the nginx limit.
(image attached)

@vdbergh (Contributor) commented May 14, 2024

> I set 30 connections in nginx and 10 connections in waitress (matching the 10 threads). IMO the web works with queues to deal with spikes, and we don't want the workers making requests too fast when the server is overloaded...

That is of course true. But one can overdo it, which leads to bufferbloat (https://en.wikipedia.org/wiki/Bufferbloat). Currently we have an nginx queue, a waitress queue and a queue in Fishtest (of threads waiting for locks to clear), and it becomes hard to understand how they interact.

I think the waitress queue should be eliminated, since nginx is already queuing, so that Fishtest can take control.

@vdbergh (Contributor) commented May 14, 2024

Of course I may be wrong, but it seems natural to me that if Fishtest can handle, say, 7 request_task calls per second, it should be given the opportunity to reject more. Now it does not have that opportunity, since it seems to me that most connections are queued by waitress until they are closed by nginx after 30 seconds.

@vondele (Member, Author) commented May 14, 2024

I really think we should understand and fix why Fishtest can only handle so few request_task calls.

If we go by the profile in #2003 (comment), we see that it spends most of its time in:

  • accessing the workerdb to get worker info (might depend on the load caused by the workers); might be hard to avoid?
  • counting committed games, by iterating over all tasks (grows with the number of workers), to avoid assigning too many
  • counting connections, to see if the limit is reached, by iterating over active_tasks (grows with the number of workers)
  • checking if a worker name is already connected, by iterating over active_tasks (grows with the number of workers).

So 3 of the 4 hotspots are checks; can they be written or implemented differently, or are they even needed at this point in the code?
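One direction (a hypothetical sketch, not the current Fishtest code): maintain these quantities as counters that are updated when a task is assigned or finished, so request_task does a dictionary lookup instead of scanning all active tasks.

import threading
from collections import defaultdict

class TaskBookkeeping:
    """Hypothetical incremental counters replacing the per-request scans."""

    def __init__(self):
        self._lock = threading.Lock()
        self._connections = defaultdict(int)      # remote_addr -> number of active tasks
        self._committed_games = defaultdict(int)  # run_id -> games already handed out
        self._active_workers = set()              # short worker names currently connected

    def on_task_assigned(self, remote_addr, worker_name, run_id, games):
        with self._lock:
            self._connections[remote_addr] += 1
            self._committed_games[run_id] += games
            self._active_workers.add(worker_name)

    def on_task_finished(self, remote_addr, worker_name, run_id, games_left):
        with self._lock:
            self._connections[remote_addr] -= 1
            self._committed_games[run_id] -= games_left
            self._active_workers.discard(worker_name)

    def can_connect(self, remote_addr, worker_name, connections_limit):
        with self._lock:
            return (self._connections[remote_addr] < connections_limit
                    and worker_name not in self._active_workers)

    def committed_games(self, run_id):
        with self._lock:
            return self._committed_games[run_id]

Keeping the counters consistent with failed or dead tasks is the tricky part, but each request_task call then becomes O(1) instead of O(active tasks).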

@vondele (Member, Author) commented May 14, 2024

If we can't fix request_task, we should probably rate limit it at the nginx level, while at the same time modifying the worker to catch that condition. Not really as an error (i.e. no other messages sent to the server), but just for the worker to go to sleep and retry a while later; not really an error condition, rather a signal.
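On the worker side, that signal could be handled with something like the following (a sketch with hypothetical names, not the actual worker code): a 429 from nginx, or a "no task" answer from the server, just means sleep with jitter and ask again.

import random
import time

import requests

def request_task_with_backoff(session, url, payload, initial_wait=30.0, max_wait=900.0):
    """Treat rate limiting as a signal to back off, not as an error."""
    wait = initial_wait
    while True:
        try:
            response = session.post(url, json=payload, timeout=60)
            if response.status_code != 429:       # 429 = rate limited at the nginx level
                result = response.json()
                if result.get("task_waiting", True):
                    return result                 # got a task
        except requests.RequestException:
            pass                                  # server overloaded or unreachable: retry later
        time.sleep(wait + random.uniform(0, wait / 2))  # jitter avoids synchronized retries
        wait = min(wait * 2, max_wait)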

@vondele (Member, Author) commented May 15, 2024

Certainly, it limits concurrent access to that API, but it does not seem to hold that this unclogs the waitress queue, at least not sufficiently, as we still see the 30s API timeouts with this config.

While I don't know the exact cause of the API timeouts, the semaphore still allows 100% of the server capacity to be consumed by request_task: it doesn't limit the number of calls per second, it just limits the concurrent access to the API. Queues can still grow with other API calls (such as heartbeat or update_task), and these can still time out?

@vdbergh (Contributor) commented May 15, 2024

But there weren’t any 30s timeouts in the latest log you posted #2003 (comment) (158 calls/s).

@vdbergh (Contributor) commented May 15, 2024

Compare this to the log of a similar load (153 calls/s) you posted in the beginning #2003 (comment).

@vondele (Member, Author) commented May 15, 2024

Any of these logs can happen. With the current config, the calls/s doesn't mean so much; it depends on what is being called (cheap calls like request_version / heartbeat vs expensive calls like request_task), and the server max times can change quite drastically within seconds (e.g. right after a large test finishes).

@ppigazzini (Collaborator):

Here is the original commit introducing the semaphore.

2c9e381

This was the waitress configuration at the time of the semaphore commit, set with these commits c1c086d a932f79:

# Match connection limit with max number of workers
connection_limit = 500
threads = 16

With the probability distribution of the API calls and their timings, we should be able to configure the proper queue lengths (for example: why a semaphore of 4?).

(The comment suggests that at the time Fishtest was able to withstand at most 500 workers :) )
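A back-of-the-envelope way to size the queues from those two inputs is Little's law, L = λ·W. A sketch using the "medium" profile above as input; W there is the full response time seen by nginx (queueing included), so L estimates the connection slots in use rather than the busy worker threads.

# calls, log duration (s), average response time (s) taken from the "medium" profile above
profile = {
    "/api/beat":         (5523, 79, 0.380),
    "/api/update_task":  (3154, 79, 0.229),
    "/api/request_task": (376, 79, 0.368),
}

for route, (calls, duration, avg_response) in profile.items():
    arrival_rate = calls / duration          # lambda, calls per second
    in_flight = arrival_rate * avg_response  # L = lambda * W
    print(f"{route:20s} {arrival_rate:6.1f} calls/s  ~{in_flight:5.1f} requests in flight")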

@xoto10 (Contributor) commented May 15, 2024

Could we reduce the waitress time limit now from 30s to something lower? 5s? (Maybe 10-15s first)

If there are hundreds (or more) of workers asking for request_task within a second or two, presumably a queue of ~20 forms in each thread. Does the timeout apply to the ones in the queues, and does waitress reject further requests?
I think we want the waitress timeout to be hit at really busy times.

I imagine the Fishtest "request_task too busy" path is pretty quick, but all 5 threads will be trying to execute request_task sometimes, so the arriving requests could fill the waitress connections pretty fast.

EDIT: I should have checked first: the semaphore means only 4 threads can be running request_task fully; with 5 threads, one will be rejecting requests if they all get them. Hopefully this can do hundreds per second.

@xoto10 (Contributor) commented May 15, 2024

I guess we could increase that semaphore (and the thread count to match) to try to work the server harder. It might be most efficient at less than full CPU load, but that could still be a bit higher than 4. Try Semaphore(8) with 10 threads? 16 and 18?

@xoto10 (Contributor) commented May 15, 2024

> ... primary instance CPU 55-85% with some short spikes at 95% (it seems just after a run finished).

Ah, I missed this bit. Maybe Semaphore(4) is good then if we're hitting 95% at times.

I would increase the threads slightly (to 8?) to allow other things to happen when lots of request_task calls are arriving. Imagine if an upload happens at the same time.
Connections could probably increase, but no idea to what, 200? 500? It would be useful to know how quickly Fishtest can reject those extra request_task calls.
And the waitress timeout could come down so that it helps reduce the load if Fishtest is struggling and improves the user experience, say 5s but maybe as low as 1s or 2s?

@vdbergh (Contributor) commented May 15, 2024

@xoto10

> I would increase the threads slightly (to 8?) to allow other things to happen when lots of request_task calls are arriving. Imagine if an upload happens at the same time.

At most one thread can enter request_task (I am not sure if this is good or bad). So there are plenty of other threads available.

The bad thing happens if all threads are queuing to enter request_task, but this is prevented by the semaphore.

ppigazzini added a commit that referenced this issue May 16, 2024

Tested working fine with a fleet of 75k cores / 10k machines,
Waitress configured with threads=4 and connection_limit=100.
HW usage: RAM 72%, primary Waitress instance CPU 85-95%.

This is a good compromise between threads "privileged" for the "request_task" API
and threads free for any other API, while keeping Waitress threads=4.

Raising the number of Waitress threads uses more RAM and can
require lowering the MongoDB cache size to avoid the OOM killer.

see #2003

Drop old mongodb requirement.
@vondele (Member, Author) commented Jul 1, 2024

Coming back here with some more documentation and the latest server-side timings. Since the initial issue we have made a lot of progress, and also upgraded the server. We had a successful run with 190k+ cores working on tests. Some knowledge gained:

  • need to be able to deal with a large number of pgn uploads, or somehow throttle the upload rate. Currently 2-3 pserves are needed just to deal with the pgn uploads when tests finish (but this can be handled by leaving it to nginx).
  • need to have 10 threads for the main pserve + a semaphore of 5 for request_task (and/or significantly speed it up).
  • scheduled tasks might need to be scheduled somewhat differently (i.e. not every 1s, but every 1s after completion), as the length of the work can exceed the scheduling period if locks are contended (see the sketch after this list).
  • it might be that pgn_upload can lead to some kind of hang / deadlock under extreme load (i.e. all workers quitting and uploading at the same time).
  • the profile needs to be analyzed and we need to see if beat / update_task can possibly be further sped up (their validate part?).
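A minimal sketch of the delay-after-completion scheduling mentioned above (hypothetical code, not the actual util.py scheduler): the delay is counted from the completion of the previous run, so a slow flush_buffers under lock contention cannot make invocations pile up back-to-back.

import threading
import time

class RepeatedTask:
    """Run `work` repeatedly, waiting `delay` seconds after each completion."""

    def __init__(self, delay, work):
        self.delay = delay
        self.work = work
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _loop(self):
        # wait() returns True only when stop() was called; otherwise it times out after `delay`
        while not self._stop.wait(self.delay):
            start = time.time()
            self.work()
            elapsed = time.time() - start
            if elapsed > self.delay:
                print(f"work took {elapsed:.1f}s, longer than the {self.delay}s delay", flush=True)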

Some typical timings:

# logging from [30/Jun/2024:13:48:53 +0000] to [30/Jun/2024:13:49:46 +0000] ########
# duration (seconds)            : 53
# calls in total                : 10000
# calls per second              : 188
# calls not reaching the backend: 
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
               /api/update_task     5130   4432.285      0.864      0.001     30.002
                      /api/beat     3914   4138.031      1.057      0.001     30.002
              /api/request_task      360    164.171      0.456      0.002     30.001
           /api/request_version      370    127.952      0.346      0.001     30.000
                /api/upload_pgn      157      4.860      0.031      0.008      0.148
                         /tests        5      3.769      0.754      0.018      1.051
                     /tests/run        2      2.704      1.352      0.492      2.212

Note the 30s maximum (which leads to failing workers), but the fairly low average times.

A sampled profile under load looks like:

===============================================================
Collecting 600 samples for pserve 6543 spaced 1.0 seconds apart
Total time needed 600 seconds 
Found pid 39924 for pserve 6543
Showing 10 most frequent samples traces: 
576 
     File "util.py", line 782, in __next_schedule self.__event.wait(delay) 
23 
     File "util.py", line 784, in __next_schedule next_task._do_work() 
          File "util.py", line 647, in _do_work self.worker(*self.args, *self.kwargs) 
               File "rundb.py", line 800, in flush_buffers self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run) 
23 
     File "api.py", line 314, in beat self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 721, in get_run with self.run_cache_lock: 
16 
     File "api.py", line 244, in update_task self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 721, in get_run with self.run_cache_lock: 
16 
     File "api.py", line 228, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 1052, in request_task with self.request_task_lock: 
12 
     File "api.py", line 245, in update_task result = self.request.rundb.update_task( 
          File "rundb.py", line 1400, in update_task with lock: 
7 
     File "api.py", line 245, in update_task result = self.request.rundb.update_task( 
          File "rundb.py", line 1401, in update_task return self.sync_update_task(worker_info, run_id, task_id, stats, spsa) 
               File "rundb.py", line 1404, in sync_update_task run = self.get_run(run_id) 
                    File "rundb.py", line 721, in get_run with self.run_cache_lock: 
6 
     File "api.py", line 245, in update_task result = self.request.rundb.update_task( 
          File "rundb.py", line 1401, in update_task return self.sync_update_task(worker_info, run_id, task_id, stats, spsa) 
               File "rundb.py", line 1531, in sync_update_task self.buffer(run, False) 
                    File "rundb.py", line 745, in buffer with self.run_cache_lock: 
5 
     File "api.py", line 228, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 1053, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 1136, in sync_request_task unfinished_runs = [self.get_run(run_id) for run_id in self.unfinished_runs] 
                    File "rundb.py", line 721, in get_run with self.run_cache_lock: 
5 
     File "api.py", line 228, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 1053, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 1065, in sync_request_task w = self.workerdb.get_worker(my_name) 
                    File "workerdb.py", line 17, in get_worker r = self.workers.find_one( 

@vdbergh (Contributor) commented Jul 1, 2024

576 
     File "util.py", line 782, in __next_schedule self.__event.wait(delay) 
23 
     File "util.py", line 784, in __next_schedule next_task._do_work() 
          File "util.py", line 647, in _do_work self.worker(*self.args, *self.kwargs) 
               File "rundb.py", line 800, in flush_buffers self.runs.replace_one({"_id": oldest_run["_id"]}

I am reading this as the thread handling scheduled tasks being mostly idle.

@vondele (Member, Author) commented Jul 1, 2024

That's possible (but this was with a local patch applied: self.__next_schedule = datetime.now(timezone.utc) + self.period + jitter).

@vondele (Member, Author) commented Jul 1, 2024

(Also, at some point we had the flush_buffers period at 10s instead of 1s, which did improve the load, but that was prior to the above-mentioned change.)

@vdbergh (Contributor) commented Jul 1, 2024

> the profile needs to be analyzed and we need to see if beat / update_task can possibly be further sped up (their validate part?)

Validating worker APIs on the main instance should be cheap (it only invokes some cache lookups).

On the secondary instances it is more complicated. This currently only applies to upload_pgn.

@vondele (Member, Author) commented Jul 1, 2024

The secondary instances don't matter too much; the load on the primary instance is what limits us.

@vdbergh (Contributor) commented Jul 1, 2024

> That's possible (but this was with a local patch applied: self.__next_schedule = datetime.now(timezone.utc) + self.period + jitter).

Ah. I would have liked to know how much difference this makes. In previous tests flush_buffers() was so fast on average that an occasional long tail didn't matter.

The issue with your patch is that it only applies to a single task. If hypothetically we were to split flush_buffers somehow into two tasks, flush_buffers1 and flush_buffers2, then they could still load the task thread 100% even with your patch.

I wonder if we should introduce an idle task which would just sleep 0.2s. In that way the task thread would never be 100% loaded.

EDIT: No, I think that would be a bad idea.

@vondele (Member, Author) commented Jul 1, 2024

So the honest answer is that this is very difficult to measure, as the load on the server fluctuates with the type of tests on Fishtest, etc. However, visually, at that point the load seemed to drop from 100% to 50%. We did have two instances where the server seemed to hang; the only thing the sampling returned was this:

$ sudo bash sample_pserve.sh 6543 20
===============================================================
Collecting 20 samples for pserve 6543 spaced 1.0 seconds apart
Total time needed 20 seconds 
Found pid 14360 for pserve 6543
Showing 10 most frequent samples traces: 
19 
     File "util.py", line 776, in __next_schedule self.__event.wait(delay) 

Not solid evidence, but well, it's not so easy to figure out. With that fix in place we didn't observe any hangs.

I think the main issue is that with the locks in place, we have no guarantee of what the timings can be. If the locks cause the timings to become longer than the period we have an issue.

@vondele (Member, Author) commented Jul 1, 2024

We did measure at a given point (with the patch in place, and no issues on the server) the distribution of the time taken for _do_work:

$ sudo journalctl -u fishtest@6543 --since "60 minutes ago" | grep '_do_work' | grep -v print | awk '{printf("%8.1f\n",$NF)}' | sort -n | uniq -c
    165      0.0
     19      0.1
      2      0.2
      2      0.3
      1      0.4
      1      0.8
      1      2.1
      1      2.4

So, in most cases less than 0.1s, but there were outliers above 2s.

@ppigazzini (Collaborator) commented Jul 1, 2024

A couple of times we got 100% CPU load on the pserve at 6543, with the count of cores dropping to zero. The hang was fixed only with a pserve restart. This hints at a race condition, not simply at the CPU being the bottleneck.
We didn't see the hang after raising the scheduled period of flush_buffers() to 10s, or when scheduling the work using a delay instead of a period.

@vdbergh (Contributor) commented Jul 1, 2024

I see. I still think that long tails are harmless, but I guess the average duration of flush_buffers() goes up under load.

One obvious effect that might amplify this is that the run_cache_lock is held while writing the selected run to the db. Only the active_run_lock (for the specific run being written) should be held.
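Something along these lines (a hypothetical sketch of the pattern, not the actual rundb.py code): pick the run under the cache lock, drop it, and do the slow MongoDB write while holding only the per-run lock.

def flush_oldest_run(self):
    # hold the global cache lock only for the cheap in-memory selection
    with self.run_cache_lock:
        oldest_run = self.pick_oldest_dirty_run()  # hypothetical helper
    if oldest_run is None:
        return
    # the slow db write is protected only by the per-run lock
    with self.active_run_lock(str(oldest_run["_id"])):
        self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)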

@vondele (Member, Author) commented Jul 1, 2024

Oh, that could make quite a difference?

@vdbergh (Contributor) commented Jul 1, 2024

I think so, because the run_cache_lock is a kind of "stop the world" lock, as the cache is accessed in many places.

@vdbergh (Contributor) commented Jul 1, 2024

I made two PRs: one which releases the run_cache_lock before writing to the db (#2092), and one which introduces a min_delay parameter for create_task(), which can be used to guarantee a minimum time interval between repeated tasks. I set it to 1.0 for flush_buffers() (#2093).
