This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Improve Jobs API concurrency #2220

Closed
msrb opened this issue Feb 12, 2018 · 10 comments


msrb commented Feb 12, 2018

From @tuxdna on February 2, 2018 10:27

We are currently encountering Gateway Timeouts when Jobs API calls take a very long time to process. This is already a known issue: fabric8-analytics/fabric8-analytics-jobs#164

Another issue I realized is that the number of API workers is currently set to only 1. This causes other clients to wait until a request completes or fails. We can improve concurrency by increasing the number of API workers.
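To illustrate why a single synchronous worker serializes clients, here is a toy stdlib-only simulation (not the actual service): requests that each take ~0.1 s complete one after another with one worker, but overlap with several.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i):
    time.sleep(0.1)  # stand-in for real request processing
    return i

def serve(n_requests, n_workers):
    """Simulate serving n_requests with a pool of n_workers; return (wall_time, results)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(handle_request, range(n_requests)))
    return time.monotonic() - start, results

serial_time, _ = serve(4, n_workers=1)    # requests queue up behind each other
parallel_time, _ = serve(4, n_workers=4)  # requests overlap
```

With one worker the wall time grows linearly with the number of queued clients, which is exactly what the waiting clients observe as a Gateway Timeout once it exceeds the proxy's limit.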

Copied from original issue: fabric8-analytics/fabric8-analytics-jobs#247


msrb commented Feb 12, 2018

From @fridex on February 2, 2018 10:31

Another issue I realized is that the number of API workers is currently set to only 1. This causes other clients to wait until a request completes or fails. We can improve concurrency by increasing the number of API workers.

BTW see https://github.com/fabric8-analytics/fabric8-analytics-jobs/blob/master/f8a-jobs.py#L109-L112
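The linked lines configure how the app is started. As a hedged sketch only (assuming the service is a connexion 2.x app; module layout, port, and options are illustrative, not taken from the repo), switching the backing server is a one-keyword change:

```python
import connexion

app = connexion.FlaskApp(__name__)
# server='flask' uses the single-threaded Werkzeug dev server;
# server='gevent' swaps in gevent's cooperative WSGI server.
# Neither option adds OS-level worker processes by itself.
app.run(port=34000, server='gevent', debug=False)
```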


msrb commented Feb 12, 2018

From @tuxdna on February 2, 2018 10:35

There are two options:

As @fridex mentioned above, we also have to ensure that only one Jobs scheduler is running in either case.

Interesting mix of constraints here :-)


msrb commented Feb 12, 2018

Having the scheduler run as part of the server process seems to be the limiting factor here. I think it would be much better to separate the two components (API and scheduler). This hidden scheduler is an easy way to shoot ourselves in the foot.


tuxdna commented Feb 19, 2018

With server='flask'

$ ab -n 500 -c 100 'http://localhost:34000/api/v1/bookkeeping/npm/serve-static/1.0.0'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Server Software:        
Server Hostname:        localhost
Server Port:            34000

Document Path:          /api/v1/bookkeeping/npm/serve-static/1.0.0
Document Length:        1701 bytes

Concurrency Level:      100
Time taken for tests:   6.683 seconds
Complete requests:      500
Failed requests:        0
Total transferred:      887000 bytes
HTML transferred:       850500 bytes
Requests per second:    74.82 [#/sec] (mean)
Time per request:       1336.530 [ms] (mean)
Time per request:       13.365 [ms] (mean, across all concurrent requests)
Transfer rate:          129.62 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    7  15.3      0      50
Processing:    66 1197 302.4   1315    1380
Waiting:       66 1197 302.4   1315    1380
Total:        116 1205 289.3   1315    1381

Percentage of the requests served within a certain time (ms)
  50%   1315
  66%   1328
  75%   1335
  80%   1341
  90%   1352
  95%   1364
  98%   1375
  99%   1377
 100%   1381 (longest request)




$ ab -n 5000 -c 1000 'http://localhost:34000/api/v1/bookkeeping/npm/serve-static/1.0.0'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            34000

Document Path:          /api/v1/bookkeeping/npm/serve-static/1.0.0
Document Length:        1701 bytes

Concurrency Level:      1000
Time taken for tests:   64.004 seconds
Complete requests:      5000
Failed requests:        889
   (Connect: 0, Receive: 0, Length: 889, Exceptions: 0)
Total transferred:      7292914 bytes
HTML transferred:       6992811 bytes
Requests per second:    78.12 [#/sec] (mean)
Time per request:       12800.797 [ms] (mean)
Time per request:       12.801 [ms] (mean, across all concurrent requests)
Transfer rate:          111.27 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   39 171.5      0    1064
Processing:    27 11743 22041.0   1340   60826
Waiting:        0 1493 4043.5   1320   60826
Total:        154 11782 22118.5   1340   61891

Percentage of the requests served within a certain time (ms)
  50%   1340
  66%   1359
  75%   1379
  80%   1407
  90%  59803
  95%  59821
  98%  60801
  99%  60804
 100%  61891 (longest request)



$ ab -n 10000 -c 1000 -g flask-out.data 'http://localhost:34000/api/v1/bookkeeping/npm/serve-static/1.0.0'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            34000

Document Path:          /api/v1/bookkeeping/npm/serve-static/1.0.0
Document Length:        1701 bytes

Concurrency Level:      1000
Time taken for tests:   147.466 seconds
Complete requests:      10000
Failed requests:        1782
   (Connect: 0, Receive: 0, Length: 1782, Exceptions: 0)
Total transferred:      14578732 bytes
HTML transferred:       13978818 bytes
Requests per second:    67.81 [#/sec] (mean)
Time per request:       14746.643 [ms] (mean)
Time per request:       14.747 [ms] (mean, across all concurrent requests)
Transfer rate:          96.54 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   66 247.3      0    1071
Processing:    99 12236 22425.8   1351   84670
Waiting:        0 1672 4164.9   1309   68392
Total:        180 12302 22560.8   1351   84767

Percentage of the requests served within a certain time (ms)
  50%   1351
  66%   1589
  75%   2038
  80%   5362
  90%  60083
  95%  60999
  98%  61134
  99%  61142
 100%  84767 (longest request)

With server='gevent'

$ ab -n 500 -c 100 'http://localhost:34000/api/v1/bookkeeping/npm/serve-static/1.0.0'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Server Software:        
Server Hostname:        localhost
Server Port:            34000

Document Path:          /api/v1/bookkeeping/npm/serve-static/1.0.0
Document Length:        1701 bytes

Concurrency Level:      100
Time taken for tests:   6.745 seconds
Complete requests:      500
Failed requests:        0
Total transferred:      887000 bytes
HTML transferred:       850500 bytes
Requests per second:    74.13 [#/sec] (mean)
Time per request:       1349.057 [ms] (mean)
Time per request:       13.491 [ms] (mean, across all concurrent requests)
Transfer rate:          128.42 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    4   8.5      0      31
Processing:   107 1218 286.7   1314    1483
Waiting:      107 1218 286.8   1314    1483
Total:        133 1222 279.2   1314    1499

Percentage of the requests served within a certain time (ms)
  50%   1314
  66%   1327
  75%   1334
  80%   1338
  90%   1390
  95%   1404
  98%   1418
  99%   1452
 100%   1499 (longest request)


$ ab -n 5000 -c 1000 'http://localhost:34000/api/v1/bookkeeping/npm/serve-static/1.0.0'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            34000

Document Path:          /api/v1/bookkeeping/npm/serve-static/1.0.0
Document Length:        1701 bytes

Concurrency Level:      1000
Time taken for tests:   67.156 seconds
Complete requests:      5000
Failed requests:        839
   (Connect: 0, Receive: 0, Length: 839, Exceptions: 0)
Total transferred:      7381614 bytes
HTML transferred:       7077861 bytes
Requests per second:    74.45 [#/sec] (mean)
Time per request:       13431.259 [ms] (mean)
Time per request:       13.431 [ms] (mean, across all concurrent requests)
Transfer rate:          107.34 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   80 267.8      0    1053
Processing:    33 11929 22422.7   1301   63059
Waiting:        0 2004 6497.8   1283   56314
Total:         69 12009 22589.4   1301   63135

Percentage of the requests served within a certain time (ms)
  50%   1301
  66%   1320
  75%   1347
  80%   1399
  90%  60999
  95%  63044
  98%  63082
  99%  63107
 100%  63135 (longest request)



$ ab -n 10000 -c 1000 -g gevent-out.data 'http://localhost:34000/api/v1/bookkeeping/npm/serve-static/1.0.0'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            34000

Document Path:          /api/v1/bookkeeping/npm/serve-static/1.0.0
Document Length:        1701 bytes

Concurrency Level:      1000
Time taken for tests:   145.984 seconds
Complete requests:      10000
Failed requests:        1748
   (Connect: 0, Receive: 0, Length: 1748, Exceptions: 0)
Total transferred:      14639048 bytes
HTML transferred:       14036652 bytes
Requests per second:    68.50 [#/sec] (mean)
Time per request:       14598.350 [ms] (mean)
Time per request:       14.598 [ms] (mean, across all concurrent requests)
Transfer rate:          97.93 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   23 138.7      0    1049
Processing:    31 12410 22938.4   1320   69129
Waiting:        0 2042 6902.7   1280   69129
Total:         86 12433 22984.8   1320   69130

Percentage of the requests served within a certain time (ms)
  50%   1320
  66%   1507
  75%   2082
  80%   5287
  90%  60260
  95%  66257
  98%  66303
  99%  66323
 100%  69130 (longest request)


tuxdna commented Feb 19, 2018

In the comment above I switched between flask (the current config) and gevent, but there is no significant change in performance.

The next thing I am trying out is a per-container lock file that ensures exactly one process/thread has an active scheduler.
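A minimal sketch of that idea using an advisory `flock()` (the lock path and function name are illustrative, not from the repo). The kernel releases an `flock()` automatically when the holding process exits, even on a crash, which sidesteps the unreliable-release problem on shutdown; note, however, that it only works within a single host and cannot coordinate schedulers across pods.

```python
import fcntl

LOCK_PATH = "/tmp/f8a-jobs-scheduler.lock"  # hypothetical path

def try_acquire_scheduler_lock(path=LOCK_PATH):
    """Return the open lock file if we won the lock, else None.

    The file object must be kept alive for as long as the lock is
    needed; closing it (or process exit) releases the lock.
    """
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None

if try_acquire_scheduler_lock() is not None:
    # We won the lock: this is the only process allowed to start the scheduler.
    pass
```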


msrb commented Feb 19, 2018

@tuxdna I realized that we have a persistent volume attached to the service at the moment. That's something we will need to tackle separately, but unfortunately it means that even if we find a workaround for the scheduler here, we still won't be able to easily scale up the service (increase the number of replicas).


tuxdna commented Feb 19, 2018

@msrb That's correct. The root cause of not being able to increase the replicas is the way containers are started and stopped. At startup we can definitely acquire a lock (the lock could be an arbitrary mechanism, not just a file), but at shutdown (for example, when the container dies) we cannot guarantee that the lock is released. i.e.

  • Container startup --> acquire_lock() --> ensure only one process/thread has the active scheduler
  • Container shutdown --> release_lock(), but unfortunately we cannot guarantee this will be called.

Another approach I am considering is to completely separate out the scheduler service. The scheduler and the API are coupled together into one service, and that is what causes this entanglement. I do believe the right thing to do is to separate the two.
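A decoupled scheduler could be as small as a standalone process that owns the job loop and hands work off to the API/worker side. A stdlib-only sketch (the job names and the dispatch function are hypothetical placeholders, not from the repo):

```python
import sched
import time

def dispatch(job_name):
    # In a real deployment this would enqueue work for the workers,
    # e.g. via a message bus or an HTTP call; here it is a placeholder.
    print(f"dispatching {job_name}")

def run_scheduler(jobs, runner=dispatch):
    """jobs: iterable of (delay_seconds, job_name) pairs.

    Runs each job once after its delay, then returns. A real scheduler
    process would loop forever and re-enter periodic jobs.
    """
    s = sched.scheduler(time.monotonic, time.sleep)
    for delay, name in jobs:
        s.enter(delay, 1, runner, argument=(name,))
    s.run()

run_scheduler([(0.05, "sync-npm"), (0.01, "bookkeeping-refresh")])
```

Because this process owns nothing but the schedule, the API can be scaled to many replicas while exactly one scheduler replica runs, without any locking tricks.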


msrb commented Feb 19, 2018

@tuxdna yeah, decoupling the scheduler seems to be the best approach. But that alone will not allow us to increase the number of replicas, because the persistent volume can only be attached to a single pod. That's out of scope for this issue though.


msrb commented Mar 7, 2018

We will not focus on this now. Scaling up this service would mean rewriting a significant portion of it.


msrb commented Jan 14, 2019

The Jobs API is now deprecated, so this issue is irrelevant.

@msrb msrb closed this as completed Jan 14, 2019