"Slow" reaction on TCP telegrams #142

squonk11 · 2021-03-02T17:37:04Z

I see the problem that TCP communication seems to be relatively "slow". I put "slow" in quotes because it is "somehow" fast but not really fast enough for what I intend to do. In my application I send TCP telegrams from PC to the ESP32. The telegram is more or less tunneled to the asynchronous interface without major modifications. The reply coming from the asynchronous interface is then packed into a TCP telegram and sent back as response. The response time of this system is approx. 10 - 15ms (measured with Wireshark on the PC). Here I was hoping to have a response time of maybe 5ms. The response time on the serial communication is approx. 1ms only - so this can not be the problem. The delay obviously occurs somewhere in Wifi, Lwip, Smooth.
I checked already several times all menuconfig options for possible optimizations and I also distributed all tasks to the two cores of the ESP32 in order to have a somehow balanced load.
So, my question is: do you have an idea if there is somewhere a delay in Smooth coming e.g. from a task cycle time? Do you have any other hint how I could speedup the communication?

PerMalmberg · 2021-03-03T09:30:37Z

All networking is done by the SocketDispatcher running on its own thread and transferred to/from other tasks via queues. It polls the sockets on a 1ms interval when there was nothing to do on the last call, see note here: https://github.com/PerMalmberg/Smooth/blob/master/lib/smooth/core/network/SocketDispatcher.cpp#L112

There's also a 10ms timeout when doing the select-call here https://github.com/PerMalmberg/Smooth/blob/master/lib/smooth/core/network/SocketDispatcher.cpp#L78 which might attribute to the response times, but I honestly don't remember what the criteria is for it to reach the timeout as sockets operate in blocking mode.

There's also the transfer between tasks that may delay things, esp. if the tasks are written such that they are using delays instead of waiting for events to arrive.

Does it make a difference in response timie if you have a constant stream of data compared to single messages?

squonk11 · 2021-03-03T17:55:34Z

I tweaked a little bit with the two timing parameters you mentioned (sleep_for and the timeouts for select). The sleep_for seems to have not so much influence - if I increase it to 2 or 3 the response time does not change so much. But the timeout values for select() seem to have some more influence. Reducing them to zero give a bit faster responses. The linux man mage for select() says:

If both fields of the timeval structure are zero, then
select() returns immediately. (This is useful for
polling.)

I think setting this to zero seems to make much sense since here we are actually polling. Shall I modify this also in the next PR?

But nevertheless the responsetimes are still not what I am expecting. Currently I am trying to implement some CPU-load measurement for each CPU in oder to better balance the load - unfortunately right now I am getting weired results: during higher network load the CPU load of both cores reduce significantly. Strange...

PerMalmberg · 2021-03-04T16:23:13Z

I think setting this to zero seems to make much sense since here we are actually polling. Shall I modify this also in the next PR?

Well, we are polling non-blocking sockets, so I expect select() to return immediately anyway, but I might be wrong. If it indeed does make a difference then we should perhaps make it configurable such that its easy to adjust without needing to do code changes. I'm sure I set it at the current value for a reason.

squonk11 · 2021-03-07T18:40:31Z

I used vTaskGetRunTimeStats( char *pcWriteBuffer ) to obtain some runtime statistics from FreeRTOS. This is the result:


SocketDispatche 43413072        86%
Console Task    54172           <1%
IDLE            13752405        27%
IDLE            31896409        63%
LED             288470          <1%
SD_Mounter      60794           <1%
main            4397505         8%
tiT             1879716         3%
UDPHello task   82824           <1%
UART Reader     1808900         3%
Tmr Svc         15              <1%
esp_timer       58525           <1%
wifi            3086702         6%
ipc0            54724           <1%
ipc1            72974           <1%
sys_evt         5450            <1%

It seems as if SocketDispatcher consumes really much ressources. Do you have any idea if I could have a chance to optimize the runtime behaviour somehow? I tried already setting for some functions the attribute IRAM_ATTR. But this did not help. But I also have some doubts on this statistic data because the percentage of the SocketDispatcher plus the percentage of either of the IDLE tasks exceeds 100%. Strange...

PerMalmberg · 2021-03-09T07:38:38Z

I did similar tests way back, also got weird results. It gets better if you run on a single core iirc.

I have no ideas as to what to optimize, but I'd start with ensuring that SocketDispatcher doesn't perform work when it doesn't need to. Perhaps I have made an error in how it waits for work?

squonk11 · 2021-03-09T17:41:25Z

In order to check for possible optimizations, I am currently trying to understand how Smooth works internally. As far as I understand, SocketDispatcher is calling select() in order to check for work. If there is work, it calls it->second->readable(*this); or it->second->writable();. Do these functions call directly (somewhere) the applications overrides for e.g. data_received(Packet& packet, int len) of IPacketAssembly<Protocol, Packet>? Or is there another mechanism behind like another loop or Task polling for events coming from e.g. readable()?

PerMalmberg · 2021-03-16T11:31:28Z

You've probably figured it out already, but each Task gets notified via a condition_variable when there is data available for that task. So there is a handover of the data through queues between the socketdispatcher and the tasks. Or, in other words, no readable() and writable() are only run in the socketdispatcher.

squonk11 · 2021-03-17T17:30:46Z

I still don't understand fully the internals of Smooth, especially the passing of messages through queues. And so I don't understand if somewhere there might be some delay.
Nevertheless I still see a response time af approx. 15ms (on average) in Wireshark. Approx. 1,5ms of this is due to the UART communication. In our compay we are doing a similar thing with a 300MHz µC with integrated Ethernet MAC (no UART in between). Here we have a response time of approx. 1ms. So the communication with the ESP32 is more than a factor of 10 slower. I would have expected that the 240MHz ESP32 (with UART in between) will have a response time of max. 5ms - or even less. But maybe this is also related to the fact that the ESP32 executes a lot of code from serial flash which is tremendously slow. Maybe the new version of the ESP32 (the ESP32-S3) might be faster since it has an octal serial flash (and not 4 bit like today).
In order to figure out if lwip or smooth is the bottleneck I will setup a new project doing this communication with bare esp-idf and lwip functionality only. I will let you know the result - but it will take some time.

PerMalmberg · 2021-03-17T23:17:19Z

Oh, I'm certain that there are things in Smooth that needs optimization. I really wish I had the time to be of more active assistance, I know this would be a fun thing to dig into. Perhaps I can make some time and do some testing on Linux, at least I'd not have to setup the HW. Smooth doesn't see a difference between the two platform. It might be tricky to see the problem on Linux though as the speed of the computer isn't really comparable with the ESP32. Looking forward to your results.

…

On Wed, 17 Mar 2021, 18:31 Lothar, ***@***.***> wrote: I still don't understand fully the internals of Smooth, especially the passing of messages through queues. And so I don't understand if somewhere there might be some delay. Nevertheless I still see a response time af approx. 15ms (on average) in Wireshark. Approx. 1,5ms of this is due to the UART communication. In our compay we are doing a similar thing with a 300MHz µC with integrated Ethernet MAC (no UART in between). Here we have a response time of approx. 1ms. So the communication with the ESP32 is more than a factor of 10 slower. I would have expected that the 240MHz ESP32 (with UART in between) will have a response time of max. 5ms - or even less. But maybe this is also related to the fact that the ESP32 executes a lot of code from serial flash which is tremendously slow. Maybe the new version of the ESP32 (the ESP32-S3) might be faster since it has an octal serial flash (and not 4 bit like today). In order to figure out if lwip or smooth is the bottleneck I will setup a new project doing this communication with bare esp-idf and lwip functionality only. I will let you know the result - but it will take some time. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#142 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAU2LLBYB5T5LLAXLFRLNE3TEDRNRANCNFSM4YPNRPWA> .

squonk11 · 2021-03-27T13:40:30Z

Now I experimented a little bit with the TCP socket server example. After applying some "optimizations" I achieved TCP echo response time of approx. 2-3ms. The bare example code had a response time of approx. 6-8ms. These optimizations are:

menuconfig settings
Here I activated the following options: ESP32_WIFI_IRAM_OPT, ESP32_WIFI_RX_IRAM_OPT, LWIP_IRAM_OPTIMIZATION. This gave a speedup of approx 1ms.
Then I found somewhere a hint that for small data packets it would be an advantage to disable WIFI AMPDU TX and RX. Also this gave a slight improvement. But de-activating this probably does not make much sense if I am operating in parallel a webserver.
I also tested to activate PSRAM - but this resulted in a slowdown of approx 1-2ms. So I deactivated it again.
switch to WiFi AP mode (instead of STA)
After making the ESP32 an access point I had a direct connection from my PC to the ESP32 without my local network/router involved. This also gave a speedup - approx. 2ms.
Nagle's algorithm
I deactivated LWIP's Nagle algorithm by setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (void *)&nodelay, sizeof(int));. This did not change anything. Obviously Nagle's algorithm is already deactivated by default.
recv with flag MSG_DONTWAIT set
The call to recv() by default is a blocking call and if data arrives it seems to wait a little longer in order not to return too often with small chunks of data. In my case with small TCP telegrams in ping pong communication this does not make sense. So I made recv() a non-blocking call by passing the flag MSG_DONTWAIT to it. Also this gave a speedup of approx. 1ms.
polling routine in IRAM
I put the polling routine do_retransmit() into IRAM by setting the IRAM_ATTR for it. This also gave a small speedup of approx. 1ms.

After all this I come to the conclusion that the ESP32 is quite slow - probably due to the fact that a lot of code is executed from slow serial flash memory. The "overhead" of the socket management in Smooth makes it even slower. Next I will check if I could optimize Smooth a little somehow. But I have some doubt that I will be able to achieve my target of 5ms response time using Smooth.
Nevertheless I could imagine that my target of having a response time of 5ms for my communication protocol (incl. UART communication) might be achievable if I use bare esp-idf functionality. Then the question would be if it might be possible to operate the esp-idf socket communication in parallel to the webserver using Smooth?
But before thinking about this I will look for some optimizations on Smooth.

PerMalmberg · 2021-04-02T06:50:40Z

That's a good reduction in time just from fairly small changes.

And yes, it should be possible to operate sockets in parallel with Smooth. SocketDispatcher only works with the sockets that are registered with it.

squonk11 · 2021-04-02T16:41:10Z

In the meantime I tried to speedup Smooth a little bit. I did the following optimizations:

restart_inactive_sockets(); and check_socket_timeouts();:
These functions are called during each tick() interval of the SocketDispatcher. I guess it is not needed to call these functions each tick() interval. So I just call these every 100 tick() intervals.
select() timeout:
The timeout for select is currently 10ms. I reduced this value to 1ms. I also tried to set it to 0ms in order to make it non-blocking. But this results in a watchdog timeout on the core where SocketDispatcher::tick() is running. Currently I do not understand why. Probably some other setting via socket options is necessary to make it non-blocking. Do you have an idea?
loop over file descriptors should start at LWIP_SOCKET_OFFSET:
In this loop the index starts at 0. But lwip starts with its file descriptors at LWIP_SOCKET_OFFSET. So it makes sense to start the loop with LWIP_SOCKET_OFFSET.
declaration of methods with inline:
Smooth contains many "trivial" methods which just have a simple return statement; e.g. like: this. I made lots of those functions inline. This has the advantage that the instruction pipeline and cache of the µc-core is not destroyed just for calling a simple function.
Direct WiFi connection:
I realized that it also gives an interesting speedup if I use the ESP32 in AP mode - without my home network structure involved. Currently it seems that the WiFi code uses only channel no 1 in AP mode. Next I will try to set a wifi channel which is currently less "loaded". Currently in the low channel range several routers of my neighbors are operating. Maybe this will also give an improvement.

After all this I was able to drill the response time down to 7-8ms (from initially 10-15ms). So, these actions seem to make much sense - but I currently don't know which of these brings how much - maybe there are some modifications which might not be so urgent.
Unfortunately my target of 5ms is still not reached. But I think I will work with this status for the time being. Maybe later on I will do the switch to a lwip/esp-idf only socket communication in parallel to Smooth or I might switch to the new ESP32-S3.

PerMalmberg · 2021-04-02T18:50:33Z

I think a configurable timeout in ms is better than x * tick since you can set the tick interval in configuration and the the call would then vary based on that. With the default tick interval of 10ms (if I remember correctly) it'd take up to a second for a new socket to be accepted instead of 10-ish ms when doing it your way.
All sockets in Smooth operate in a non-blocking mode. Select can however not do that iirc. As soon as there's s work to be done on an active socket it will react to it. So I think there's no use trying to adjust this one.
Hm interesting. I'm not saying you're wrong but we need to be really careful with this one. I distinctly remember working with that offset early in the development of Smooth. Is the socket id in smooth really the same as that of lwip?
While inlining functions may make things faster there's no guarantee it does and it may just cause the binary to become larger. Usually the compiler makes a good choice for the respective method. Unless you can prove that explicitly inlining a function increases performance, please don't do it.

squonk11 · 2021-04-03T12:15:01Z

yes, you are absolutely right. Defining some kind of timeout interval makes much more sense.
yes, I have seen that you use TCP_NODELAY already which (according to my understanding) switches off Nagle's algorithm for sending data. But select also has a timeout parameter which (according to my understanding) will block until this time is expired - even if some data already arrived. In my case this means that if after e.g. 1ms my request telegram arrived it will wait for another 9ms for more data (which will not happen because the PC first waits for the response before sending the next request). Due to this I think that in the case of a ping-pong communication (like in my case) a short timeout makes sense. For other socket applications (e.g. sending lager data to the ESP32) a longer timeout could give better performance due to processing larger data chunks during each tick().
So, my suggestion would be to have e.g. 10ms as default but this timeout could be made adjustable at runtime.
From lwip documentation I read LWIP_SOCKET_OFFSET==n: Increases the file descriptor number created by LwIP with n. This can be useful when there are multiple APIs which create file descriptors. When they all start with a different offset and you won't make them overlap you can re implement read/write/close/ioctl/fnctl to send the requested action to the right library (sharing select will need more work though).. In the esp-idf implementation it is defined as #define LWIP_SOCKET_OFFSET (FD_SETSIZE - CONFIG_LWIP_MAX_SOCKETS). In my configuration the value is 54. I don't know if the socket id in Smooth is the same variable as that of lwip but as far as I tested they always had the same value.
I must confess that I set the compiler optimization to "optimize for size". I need to check again if it also makes a difference if I set the compiler optimization to "optimize for performance". I will let you know the result.

Edit: I just checked the influence of the select timeout: reducing it from 10ms to 1ms gives on average a 4ms faster response. So this is one of the more interesting improvements for my scenario.

PerMalmberg · 2021-04-04T06:56:36Z

Ok then I suggest the select-timeout is made configurable.
Alright, so when running on the ESP the loop should start at an offset, but not when running on Linux. I guess this also means that it's not possible to use sockets outside Smooth since select would be affected by all sockets.

squonk11 · 2021-04-04T17:55:26Z

Now I also checked the performance difference of making several functions inline. With compiler options set to "optimize for performance" it seems to have only a small advantage; probably less than 1ms. So we can neglect this.

PerMalmberg added the Discussion label Mar 16, 2021

squonk11 mentioned this issue Apr 16, 2021

Byte error in transmission using "responses::FileContentResponse" #159

Open

Elfelsoufim mentioned this issue Jun 2, 2021

MQTT maximum throughput #161

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Slow" reaction on TCP telegrams #142

"Slow" reaction on TCP telegrams #142

squonk11 commented Mar 2, 2021

PerMalmberg commented Mar 3, 2021

squonk11 commented Mar 3, 2021

PerMalmberg commented Mar 4, 2021

squonk11 commented Mar 7, 2021

PerMalmberg commented Mar 9, 2021

squonk11 commented Mar 9, 2021

PerMalmberg commented Mar 16, 2021

squonk11 commented Mar 17, 2021

PerMalmberg commented Mar 17, 2021 via email •

edited

Loading

squonk11 commented Mar 27, 2021

PerMalmberg commented Apr 2, 2021

squonk11 commented Apr 2, 2021 •

edited

Loading

PerMalmberg commented Apr 2, 2021

squonk11 commented Apr 3, 2021 •

edited

Loading

PerMalmberg commented Apr 4, 2021

squonk11 commented Apr 4, 2021 •

edited

Loading

"Slow" reaction on TCP telegrams #142

"Slow" reaction on TCP telegrams #142

Comments

squonk11 commented Mar 2, 2021

PerMalmberg commented Mar 3, 2021

squonk11 commented Mar 3, 2021

PerMalmberg commented Mar 4, 2021

squonk11 commented Mar 7, 2021

PerMalmberg commented Mar 9, 2021

squonk11 commented Mar 9, 2021

PerMalmberg commented Mar 16, 2021

squonk11 commented Mar 17, 2021

PerMalmberg commented Mar 17, 2021 via email • edited Loading

squonk11 commented Mar 27, 2021

PerMalmberg commented Apr 2, 2021

squonk11 commented Apr 2, 2021 • edited Loading

PerMalmberg commented Apr 2, 2021

squonk11 commented Apr 3, 2021 • edited Loading

PerMalmberg commented Apr 4, 2021

squonk11 commented Apr 4, 2021 • edited Loading

PerMalmberg commented Mar 17, 2021 via email •

edited

Loading

squonk11 commented Apr 2, 2021 •

edited

Loading

squonk11 commented Apr 3, 2021 •

edited

Loading

squonk11 commented Apr 4, 2021 •

edited

Loading