Speedup aiohttp web server #2779

asvetlov · 2018-02-27T17:38:32Z

The typical web handler looks like:

async def handler(request):
    ...  # do business logic
    return web.Response(text=..., content_type=...)

Unfortunately construction of `web.Response` is relative slow, `resp.prepare()` is even slower.
The code has too many checks for overlapping `content_type` with `headers['Content-Type']`, response length with `headers['Content-Length']`, HTTP chunk processing etc. etc.

We can speed up the situation if we assume that `web.Response` is constructed with well knows single text or binary data.

The suggestion is:
- Introduce `web.text_response()` and `web.bin_response()` like already present `web.json_response()`. Functions will just make `web.Response` instances on this first step. The main point is teaching people to use the new API.
- `web.Response` should be kept as is for backward compatibility.
- After that we can refactor our response classes hierarchy by making a `BaseResponse` as parent for `StreamResponse` and deriving private classes for binary and text responses from `BaseResponse`. Obviously `BaseResponse.prepare()` should exist, not sure about other API attributes.
- These private classes can do as minimal job as possible, they don't need to pass redundant checks for conflicting HTTP headers.

We should forbid the following HTTP headers in `web.text_response()` / `web.bin_response()`:
-  Connection
-  Keep-Alive
-  Trailer
-  Transfer-Encoding
-  Content-Length
-  Upgrade

The check for headers is quite expensive, let's do it in debug mode only (`loop.get_debug() is True` or `PYTHONASYNCIODEBUG=1`.

That's it.

Discussion is welcome.
PR for the first step (adding `web.text_response()` and `web.bin_response()`) is even more welcome.

The text was updated successfully, but these errors were encountered:

pfreixes · 2018-02-27T21:22:15Z

Also, the compression and the chunk features that are mixed between StreamResponse and StreamWriter doesn't help.

Allowing a clean happy path for some ad-hoc cases - that surely are the most used - as json_response, text_response, bin_resposne with a none prone error interface- as you mentioned - which each one will have a restricted set of parameters will help speeding up the response part.

I'm aligned with you, but I would try first to speed up the json_response one, taking into account that already has its own method. Hence, its a matter of start working on a new hierarchy of classes, and as a bonus/later move the compression and chunking entirely in the StreamResponse class.

hubo1016 · 2018-02-28T01:35:37Z

I think it's OK to ignore any content-length and transfer-encoding settings for these responses, and always use the content-type specified by user if there is any.

But I don't quite understand the speedup, I thought building the headers are always slower...

spcharc · 2018-03-30T16:06:59Z

the check for headers is quite expensive <- I don’t understand
It’s just some comparing work and it’s fast. Looking up in a multidict is also fast

asvetlov · 2018-03-30T16:31:47Z

response.prepare() (and response._start()) take much longer time for simple responses in comparison as it should be.
I've created a gist for demonstration my steps to get profile results: https://gist.github.com/asvetlov/13295efdc1301bea241d7d35496e6f81
I hope you can reproduce snakeviz output by following the gist instructions (just snapshot in not informative because of interactive nature of the tool).
Actually it points to StreamResponse._start() and StreamResponse.content_length.
StreamWriter.write_headers() should be optimized too.

P.S.
Multidict could be much faster as well, see aio-libs/multidict#97
But it is "good enough" for now.

asvetlov · 2018-03-30T16:44:37Z

Also web.Response.__init__ is relative expensive.
My striving to support both text and binary responses in the same class was a mistake, as result we have too many checks in a constructor, just read the code.

hubo1016 · 2018-03-31T08:22:56Z

How much do they cost comparing to a simple json encode?

asvetlov · 2018-03-31T08:33:37Z

How is json encoding related to the issue?

hubo1016 · 2018-03-31T10:00:59Z

If it costs far less than json encode, it may not worth it.

asvetlov · 2018-03-31T10:10:29Z

Say again, why should I care about JSON encoding cost at all?
JSON is not the onlly possible data structure for HTTP payload.
Following this way should I compare reponse sending with making a request to database as well? To which database? With what network connections? (The same is true for JSON BTW, encoding cost depends from incoming data size, types, JSON library and passed flags).

hubo1016 · 2018-03-31T16:50:56Z

Take it easy... When optimising, we begin from the slowest part. I have tested the performance of aiohttp before, It is about 4000qps/CPU core, that is 250us per request. We must reduce 50us or more to make a difference. And if I remember It correctly, encoding of a JSON response takes about 10us. So I was wondering if doing checks are really a bottleneck now.

I don't have a testing environment with me until next week. I will look at your gist after that.

kxepal · 2018-03-31T16:55:04Z

@hubo1016
You can always use third-party fast json library like ujson, rapidjson or else in-place of stdlib ones today. That's out of scope of overall aiohttp optimizations which you couldn't replace with something faster today.

hubo1016 · 2018-03-31T17:00:42Z

@kxpal understood. I am just using it as a benchmark baseline, since different hardware produces very different numbers.

asvetlov · 2018-03-31T17:05:40Z

4000 RPS?
On my laptop I have 12000 RPS (using the single thread/process), see test files/params in gist above.

thomaszdxsn · 2018-03-31T17:36:08Z

8200 RPS on my laptop

asvetlov · 2018-03-31T17:37:53Z

@thomaszdxsn python debugger is off, access log is off, uvloop is used?

thomaszdxsn · 2018-03-31T17:44:49Z

https://gist.github.com/asvetlov/13295efdc1301bea241d7d35496e6f81

I take this gist, and not use cProfile.

python debugger is off

what mean? I know python have a debugger call pdb

asvetlov · 2018-03-31T17:51:11Z

Sorry, my mistake -- should be "profiler".
Python uses the same low-level tracing API for both debugging and profiling.
The question was about usingpython run_aiohttp.py 8080 to run server (no -m cProfile usage).

thomaszdxsn · 2018-03-31T17:57:48Z

yes, this is my command:

$ pipenv run python run_aiohttp.py 8888

wrk command:

$ wrk -t 10 http://localhost:8888

output:

Running 10s test @ http://localhost:8888
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.17ms  358.79us   9.55ms   85.53%
    Req/Sec     0.86k    94.33     1.73k    81.85%
  85793 requests in 10.10s, 12.35MB read
Requests/sec:   8491.46
Transfer/sec:      1.22MB

another wrk:

$ wrk -t8 -c200 -d30s --latency http://localhost:8888

output:

Running 30s test @ http://localhost:8888
  8 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.25ms   20.63ms 263.13ms   94.34%
    Req/Sec     1.06k   272.09     1.44k    82.98%
  Latency Distribution
     50%   22.36ms
     75%   27.56ms
     90%   36.88ms
     99%  125.95ms
  252024 requests in 30.05s, 36.29MB read
  Socket errors: connect 0, read 3, write 0, timeout 0
Requests/sec:   8387.49
Transfer/sec:      1.21MB

my computer:

MacBook Pro 2015
2.7 GHz Intel Core i5
8 GB 1867 MHz DDR3

maybe reason is MacOS

hubo1016 · 2018-03-31T17:58:51Z

@asvetlov ah, I didn't turn off the access log. I just use the default settings.

It makes sense if you are targeting 20k+qps. Shall we rewrite the whole class with Cython first? That might be easier.

fafhrd91 · 2018-04-05T19:28:02Z

from my experience, biggest win in term of performance is related to ability to run async code in place
and return early if async function does not execute actual async operations or async result available immediately. at the moment aiohttp always run one task, I'd run handler as soon as full http message available in data_received callback and would start task only if handler returns not completed future.
that would require some low level generators code, but speedup should be significant, especially for benchmarks. we had such code on client side before 2.0, but I am not sure if it would work with async/await.

asvetlov · 2018-04-06T07:17:12Z

Good point

pfreixes · 2018-04-07T06:36:41Z

interesting @fafhrd91! Will be nice to see if the naive code path that has a completed response without needing a loop iteration exists right now in the current implementation, plenty of awaits that might be breaking the execution.

novoxd · 2018-12-22T21:24:10Z

It will be nice to implement http/2 , that will give a really good performance boost on real world apps. As http/2 has a HPACK header compression, it is possible to use caching for header parsing, validation, and all other stuff. What will give blasting performance boost and top limit of uvloop tcp performance (100k rps) can be reached nearly, as http/2 is a binary protocol. And that's not all - almost all real world apps uses sessions, no matter what kind they are (kv, rdms, jwt) they all need some work for cpu, this work can bee cached to. And there is a ready implementation of http/2 with python binding , what would not be hard to integrate.

asvetlov · 2018-12-22T21:45:08Z

Are you volunteering for the task?

novoxd · 2018-12-22T21:52:28Z

Maybe. If it will be requirement for my current project (we have some performance issues against heavy application level ddos). This task is interesting for me, but I cant give any promises about my part in it because there are always some factors that may break promises.

bdraco · 2024-09-10T03:41:03Z

There has been a lot of improvement, but there is still a lot that could perform better

bdraco · 2024-09-10T03:51:47Z

Without profiler
% wrk -t 10 http://localhost:8080
Running 10s test @ http://localhost:8080
10 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 237.23us 22.55us 1.12ms 84.31%
Req/Sec 4.20k 56.62 4.54k 78.81%
421797 requests in 10.10s, 63.56MB read
Requests/sec: 41763.44
Transfer/sec: 6.29MB

With profiler
% wrk -t 10 http://localhost:8080
Running 10s test @ http://localhost:8080
10 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 454.83us 34.72us 1.99ms 86.39%
Req/Sec 2.20k 32.69 2.32k 75.64%
220865 requests in 10.10s, 33.28MB read
Requests/sec: 21867.62
Transfer/sec: 3.30MB

bdraco · 2024-09-10T03:53:53Z

I didn't expect add_app to be called every request

Dreamsorcerer · 2024-09-10T11:43:32Z

I didn't expect add_app to be called every request

I'm guessing that's UrlMappingMatchInfo.add_app(), which would make sense, because you get a new match info for each request and you want the app available on there. It just sets it as the current_app and adds it to an empty list (when routing through sub apps, then I think you end up with a list representing the chain of apps).

bdraco · 2024-11-15T01:31:43Z

diff --git a/aiohttp/web_response.py b/aiohttp/web_response.py
index 73063890f..a7c37c121 100644
--- a/aiohttp/web_response.py
+++ b/aiohttp/web_response.py
@@ -93,7 +93,7 @@ class StreamResponse(BaseClass, HeadersMixin):
         self._compression = False
         self._compression_strategy: int = zlib.Z_DEFAULT_STRATEGY
         self._compression_force: Optional[ContentCoding] = None
-        self._cookies = SimpleCookie()
+        self._cookies: Optional[SimpleCookie] = None
 
         self._req: Optional[BaseRequest] = None
         self._payload_writer: Optional[AbstractStreamWriter] = None
@@ -209,6 +209,8 @@ class StreamResponse(BaseClass, HeadersMixin):
 
     @property
     def cookies(self) -> SimpleCookie:
+        if self._cookies is None:
+            self._cookies = SimpleCookie()
         return self._cookies
 
     def set_cookie(
@@ -230,6 +232,9 @@ class StreamResponse(BaseClass, HeadersMixin):
         Sets new cookie or updates existent with new value.
         Also updates only those params which are not None.
         """
+        if self._cookies is None:
+            self._cookies = SimpleCookie()
+        else:
             old = self._cookies.get(name)
             if old is not None and old.coded_value == "":
                 # deleted cookie
@@ -277,6 +282,7 @@ class StreamResponse(BaseClass, HeadersMixin):
         Creates new empty expired cookie.
         """
         # TODO: do we need domain/path here?
+        if self._cookies:
             self._cookies.pop(name, None)
         self.set_cookie(
             name,

#2779 (comment) should have skipped some of the pop operations if nothing set.

Can get another ~2% out of it with the above. That will also have some small per request memory benefit to not create a SimpleCookie if not needed

related issue #2779 We create these objects even if there are no cookies

Dreamsorcerer · 2024-11-15T02:28:02Z

I don't see many opportunities to optimize left.

My only question is whether the proposal in the original issue (i.e. basically replacing direct web.Response() instantiation with helper functions) is something we want to look at implementing, or will it not make a difference?

bdraco · 2024-11-15T03:34:50Z

Maybe _prepare_headers could be optimized a bit more, but I'm not sure its needed to split it. Would need to experiment a bit with it

bdraco · 2024-11-15T03:48:24Z

Profile above shows more time is spent in the tuple compares. #2779 (comment)

#9896 is a test to see if that makes a difference to unpack them

#2779 (comment)

bdraco · 2024-11-15T04:09:54Z

Tried a few methods but nothing made it faster in a meaningful way. All the time is spent in these 3 places

bdraco · 2024-11-15T04:11:44Z

and the bulk is really the time syscall to make the rfc822_formatted_time

bdraco · 2024-11-15T04:20:46Z

I don't see a way to speed any of those up

The extra super() property call in

aiohttp/aiohttp/web_response.py

Line 214 in 5ab2d47

return super().content_length

is likely why content_length is coming up as a bit slower

bdraco · 2024-11-15T04:25:33Z

even if we clean that all up a bit more its likely still less than everything we would get from #2779 (comment)

bdraco · 2024-11-15T04:39:23Z

I can't come up with anything to make it faster that doesn't significantly hurt readability or make the code much less dry, and definitely nothing worth that kind of trade off. Could benefit from a second set of eyes looking at the codspeed flame graphs

bdraco · 2024-11-15T04:41:50Z

and the bulk is really the time syscall to make the rfc822_formatted_time

While that going to be painfully slow on older arm systems that have errata that prevents using the vdso call, on most modern linux systems its going to use vdso for time and its not that bad

bdraco · 2024-11-15T04:57:37Z

Resource.resolve shows up in the profile.. #9899 attempt to make it faster

bdraco · 2024-11-15T05:25:02Z

Maybe Response.__init__ and StreamResponse.__init__ could be optimized a bit more, StreamResponse always copies the headers that Response makes so it always get an extra memcpy

asvetlov · 2024-11-15T09:20:06Z

Another interesting optimization would be in using https://pypi.org/project/hyperscan/ for route path search.
It could scan over all registered routes at once instead of looping.
The library support a limited set of CPU architectures, thus it should be optional.
Also, it doesn't provide an API for getting match object, thus it would require additional re.pattern.match() call for extracting match_info dict for variable patterns.
Good news, it seems that hyperscan supports unicode and could compile all regex patterns that we use in aiohttp.

asvetlov · 2024-11-15T12:40:08Z

See #9907 for hyperscan POC

Dreamsorcerer · 2024-11-15T14:14:02Z

Another interesting optimization would be in using https://pypi.org/project/hyperscan/ for route path search.

Interesting. I would note that while you were away we did a big refactor in #7829. For an application with 5000 routes, this reduced route resolution from ~5s to ~100ms. So, we've already seen a substantial improvement on that front, though if there's more to be had, then great.

Dreamsorcerer · 2024-11-15T14:16:44Z

I think if @asvetlov agrees that the original proposal is not worth doing, then we can probably close this issue. If there are any more specific suggestions for optimisations anyone has, then a new issue can be opened for each suggestion (or just a PR).

asvetlov · 2024-11-15T18:45:53Z

I see a lot of improvements here, thank you guys.
Please keep the issue open, I'd like to look at call graphs again to see the current picture.

related issue #2779 We create these objects even if there are no cookies

asvetlov · 2024-11-20T17:37:21Z

Everything looks good now.
Thank to everybody who took a part in working on it.

Dreamsorcerer added the enhancement label Sep 8, 2024

bdraco added a commit that referenced this issue Nov 15, 2024

Defer creation of SimpleCookie objects in the web server until needed

35030e8

related issue #2779 We create these objects even if there are no cookies

bdraco mentioned this issue Nov 15, 2024

Defer creation of SimpleCookie objects in the web server until needed #9895

Merged

bdraco mentioned this issue Nov 15, 2024

DNM: Attempt to speed up _prepare_headers #9896

Closed

bdraco added a commit that referenced this issue Nov 15, 2024

DNM: Attempt to speed up version checks in client

4603372

#2779 (comment)

bdraco mentioned this issue Nov 15, 2024

DNM: Attempt to speed up version checks in client #9897

Closed

bdraco added a commit that referenced this issue Nov 15, 2024

DNM: Benchmark overhead of setdefault calls in _prepare_headers

a1cfe90

#2779 (comment)

bdraco mentioned this issue Nov 15, 2024

DNM: Benchmark overhead of setdefault calls in _prepare_headers #9898

Closed

bdraco mentioned this issue Nov 15, 2024

Refactor Resource.resolve to avoid linear search of methods #9899

Merged

bdraco mentioned this issue Nov 17, 2024

Add benchmarks for creating web responses #9940

Merged

bdraco added a commit that referenced this issue Nov 17, 2024

Defer creation of SimpleCookie objects in the web server until needed

1b63b8c

related issue #2779 We create these objects even if there are no cookies

asvetlov closed this as completed Nov 20, 2024

github-project-automation bot moved this to Done in Sprint @ Man Group London Hackathon 2024, Nov 9 Nov 20, 2024

Speedup aiohttp web server #2779

Speedup aiohttp web server #2779

Comments

asvetlov commented Feb 27, 2018

pfreixes commented Feb 27, 2018

hubo1016 commented Feb 28, 2018

spcharc commented Mar 30, 2018

asvetlov commented Mar 30, 2018

asvetlov commented Mar 30, 2018

hubo1016 commented Mar 31, 2018

asvetlov commented Mar 31, 2018

hubo1016 commented Mar 31, 2018

asvetlov commented Mar 31, 2018

hubo1016 commented Mar 31, 2018

kxepal commented Mar 31, 2018 • edited Loading

hubo1016 commented Mar 31, 2018

asvetlov commented Mar 31, 2018

thomaszdxsn commented Mar 31, 2018

asvetlov commented Mar 31, 2018

thomaszdxsn commented Mar 31, 2018 • edited Loading

asvetlov commented Mar 31, 2018

thomaszdxsn commented Mar 31, 2018 • edited Loading

hubo1016 commented Mar 31, 2018

fafhrd91 commented Apr 5, 2018

asvetlov commented Apr 6, 2018

pfreixes commented Apr 7, 2018

novoxd commented Dec 22, 2018 • edited Loading

asvetlov commented Dec 22, 2018

novoxd commented Dec 22, 2018

bdraco commented Sep 10, 2024

bdraco commented Sep 10, 2024

bdraco commented Sep 10, 2024

Dreamsorcerer commented Sep 10, 2024

bdraco commented Nov 15, 2024 • edited Loading

Dreamsorcerer commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024

bdraco commented Nov 15, 2024 • edited Loading

asvetlov commented Nov 15, 2024

asvetlov commented Nov 15, 2024

Dreamsorcerer commented Nov 15, 2024

Dreamsorcerer commented Nov 15, 2024

asvetlov commented Nov 15, 2024

asvetlov commented Nov 20, 2024

kxepal commented Mar 31, 2018 •

edited

Loading

thomaszdxsn commented Mar 31, 2018 •

edited

Loading

thomaszdxsn commented Mar 31, 2018 •

edited

Loading

novoxd commented Dec 22, 2018 •

edited

Loading

bdraco commented Nov 15, 2024 •

edited

Loading

bdraco commented Nov 15, 2024 •

edited

Loading