Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: wolfSSL_free crash on ESP32s3 #6637

Closed
PaulMartinsen opened this issue Jul 20, 2023 · 36 comments
Closed

[Bug]: wolfSSL_free crash on ESP32s3 #6637

PaulMartinsen opened this issue Jul 20, 2023 · 36 comments
Assignees
Labels

Comments

@PaulMartinsen
Copy link

Contact Details

No response

Version

a026d84 (master branch ca 8:33 am, 15/7/2023)

Description

Calling wolfSSL_free to dispose of a TLS session from a 'cleanup' thread leads to an assertion failure in xQueueGenericSend in esp-idf/v5.0/components/freertos/FreeRTOS-Kernel/queue.c. This seems to occur only when wolfSSL_free is called from a different task to the one that created it. In both case all communication is finished and it is time to free the memory. And wolfSSL_free is only being called once for the session.

Stack trace (below), shows this occurs when esp_sha_hw_unlock calls esp_CryptHwMutexUnLock(&sha_mutex). The same application code operates correctly running on Windows (however, that's not using the Espressif port).

Currently Windows decided it can't find the ESP32s3 JTAG interface, so I haven't been able to look at what's happening in any more detail. Will add new info as it comes to hand.

I'm using this version which is a clean fork from a recent master branch with only trivial changes for building on Windows. @gojimmypi , do you think your new PR might make a difference?

Reproduction steps

No response

Relevant log output

0x4038171d: esp_system_abort at C:/Programming/Toolchains/VisualGDB/ESP32/v5/esp-idf/v5.0/components/esp_system/esp_system.c:135
0x40388e65: __assert_func at C:/Programming/Toolchains/VisualGDB/ESP32/v5/esp-idf/v5.0/components/newlib/assert.c:78
0x403820ec: xQueueGenericSend at C:/Programming/Toolchains/VisualGDB/ESP32/v5/esp-idf/v5.0/components/freertos/FreeRTOS-Kernel/queue.c:837 (discriminator 8)
0x42095433: esp_CryptHwMutexUnLock at k:/Lumos/Vendor/WolfSSL/wolfssl/wolfcrypt/src/port/Espressif/esp32_util.c:80
0x42094dd0: esp_sha_hw_unlock at k:/Lumos/Vendor/WolfSSL/wolfssl/wolfcrypt/src/port/Espressif/esp32_sha.c:819
0x42094eb5: esp_sha_init_ctx at k:/Lumos/Vendor/WolfSSL/wolfssl/wolfcrypt/src/port/Espressif/esp32_sha.c:260 (discriminator 15)
0x42094f62: esp_sha_init at k:/Lumos/Vendor/WolfSSL/wolfssl/wolfcrypt/src/port/Espressif/esp32_sha.c:99
0x4208cd9d: InitSha256 at k:/Lumos/Vendor/WolfSSL/wolfssl/wolfcrypt/src/sha256.c:771
0x4208d5a6: wc_Sha256Free at k:/Lumos/Vendor/WolfSSL/wolfssl/wolfcrypt/src/sha256.c:1855
0x4209715f: FreeHandshakeHashes at k:/Lumos/Vendor/WolfSSL/wolfssl/src/internal.c:6845
0x42097cc2: SSL_ResourceFree at k:/Lumos/Vendor/WolfSSL/wolfssl/src/internal.c:7929
0x42097e61: FreeSSL at k:/Lumos/Vendor/WolfSSL/wolfssl/src/internal.c:8470
0x420606ba: wolfSSL_free at k:/Lumos/Vendor/WolfSSL/wolfssl/src/ssl.c:1491
0x4204dfef: CTLSSession::Dispose() at k:/Lumos/DeviceWebservicesLibrary/Communications/TLSContext.cpp:77
0x4204e01d: CTLSSession::~CTLSSession() at k:/Lumos/DeviceWebservicesLibrary/Communications/TLSContext.cpp:69
0x4204d249: CTcpConnectionInfo::~CTcpConnectionInfo() at k:/Lumos/DeviceWebservicesLibrary/Communications/TcpConnectionInfo.h:13
 (inlined by) CTcpConnectionInfo::~CTcpConnectionInfo() at k:/Lumos/DeviceWebservicesLibrary/Communications/TcpConnectionInfo.h:13
0x4200bbdf: std::default_delete<CConnectionInfo>::operator()(CConnectionInfo*) const at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/unique_ptr.h:85
 (inlined by) std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >::~unique_ptr() at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/unique_ptr.h:361
 (inlined by) void std::destroy_at<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > >(std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >*) at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/stl_construct.h:88
 (inlined by) void std::allocator_traits<std::allocator<std::_Fwd_list_node<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > > >::destroy<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > >(std::allocator<std::_Fwd_list_node<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >&, std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >*) at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/alloc_traits.h:533
 (inlined by) std::_Fwd_list_base<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::_M_erase_after(std::_Fwd_list_node_base*, std::_Fwd_list_node_base*) at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/forward_list.tcc:86
0x4204c65f: std::_Fwd_list_base<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::~_Fwd_list_base() at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/forward_list.h:346
 (inlined by) std::forward_list<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::~forward_list() at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/forward_list.h:591
 (inlined by) remove_if<CConnectionManager::DiscardOldConnections(const time_point&)::<lambda(const std::unique_ptr<CConnectionInfo>&)> > at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits/forward_list.tcc:330
0x4204c6c5: CConnectionManager::DiscardOldConnections(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > const&) at k:/Lumos/DeviceWebservicesLibrary/Communications/ConnectionManager.cpp:200
0x4204c779: CConnectionManager::Process() at k:/Lumos/DeviceWebservicesLibrary/Communications/ConnectionManager.cpp:161
0x4204e639: non-virtual thunk to CConnectionFactory::Process() at k:/Lumos/DeviceWebservicesLibrary/Services/ConnectionFactory.h:54
0x4216b0f5: CManualBackgroundTaskDispatcher::RunEachTaskOnce() at k:/Lumos/DeviceWebservicesLibrary/Services/ManualBackgroundTaskDispatcher.cpp:21
0x4200c354: app_main at C:\Users\Paul\Projects\BLS\FPH Projects\2022 SDC\Software\Lumos\Lumos32i\build\VisualGDB\Debug/../../../main/Lumos32i.cpp:544
0x42174c2b: main_task at C:/Programming/Toolchains/VisualGDB/ESP32/v5/esp-idf/v5.0/components/freertos/FreeRTOS-Kernel/portable/port_common.c:131 (discriminator 2)
0x403856f5: vPortTaskWrapper at C:/Programming/Toolchains/VisualGDB/ESP32/v5/esp-idf/v5.0/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:151
@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen - thank you for reporting this issue.

There's definitely an improvement in my #6624 regarding properly locking and releasing the hardware, and falling back to SW as needed.

Given that, if you disable the HW acceleration, do you see the same problem?

Note that I've recently refactored the macros in #6587, removing the WROOM32, so different settings are needed in user_settings.h to disable HW acceleration. It looks like you've already applied that update, so this comment is for other viewers.

Coincidentally I am now working on updating the ESP32 client and server examples. See also wolfssl-examples/ESP32. There are no changes to these apps in the open PR.

The work on I am doing is with regards to supporting a new cipher suite, but I've confirmed at least my current server works properly with a Linux client. If you use my example, just make sure you don't enable the SM Ciphers.

If you have some example code to reproduce the error, that would be very helpful. I didn't immediately see it in your fork.

Regarding your JTAG problems, I can certainly relate! I've found Zadig to be quite helpful when Windows decides to change the drivers. I set libusbK for the Tigard and for the ESP32-S3 on-board JTAG for my dev board, it uses WINUSB.

image

You are still using VisualGDB, eh?

For reference, here's my debug setting dialog box:

image

Other ideas: different USB cable, restart Windows, power cycle board. Exit/restart Visual Studio. VM running? Make sure the device does not get used there (yes, I've had that happen more than once!). Perhaps also try using Windows to completely remove the USB driver. Which exact board model are you using?

Thanks again for all your input! Cheers

@PaulMartinsen
Copy link
Author

Thanks @gojimmypi . I was a little worried reporting the problem with so little info, but some very helpful suggestions there as always so I'm glad I did.

I'm using the ESP32-S3-DevkitC-1 v1.0 board with the P2N8R2 variant on it. I always overlook testing w/o the hardware acceleration, but that sounds like a good first step. And your new PR. This is part of a much larger project now, but I will try to create a simplified example that illustrates the problem, if the PR doesn't resolve, based on the library examples.

One odd thing I notice with zadig: I get two devices when I connect the boards jtag port: USB JTAG/serial debug unit (Interface 0) and USB JTAG/serial debug unit (Interface 2). No Interface 1. Interface 0 is using usbser (v10.0.22621.1778) and interface 2 is using WINUSB (v10.0.22621.608). Have you had any experience like that? Maybe switching one or both to WINUSB (v6.1.7600.16385) is a path worth trying?

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen, I'm glad you found the suggestions helpful.

I'm using the ESP32-S3-DevkitC-1 v1.0

Ah yes, excellent, I have exactly the same board. The -R2 (2 MB PSRAM) is not enough to run Linux, but I also have a -R8 for that.

There are several options for allocating the external PSRAM. Which are you using?

I get two devices when I connect the boards jtag port

hm, that does sound odd. I only see one JTAG device. The UART is on a second physical connector. In contrast, my Tigard shows both USB devices on one USB connector.

If you just cannot get it working, I'd suggest going into Device Manager and removing the driver. Be sure to unplug the device & allow Windows to reinstall the driver. I've also seen VisualGDB itself detect bad driver problems and offer to correct automatically with their UsbDriverTool.

TLS session from a 'cleanup' thread leads to an assertion failure

I'll be looking into this more in the coming days.

One thing I've found helpful when troubleshooting odd problems is the ESP stack high water mark. I think it is much better to monitor the actual level, rather than just rely on the stack smashing monitor & compiler options. I have an example in my latest benchmark app.

Have a great weekend. Cheers

@PaulMartinsen
Copy link
Author

Hi @gojimmypi , Just confirming I do not see the cross-task crash if I disable hardware acceleration (that is NO_ESP32_CRYPT, NO_WOLFSSL_ESP32_CRYPT_RSA_PRI, NO_WOLFSSL_ESP32_CRYPT_HASH and NO_WOLFSSL_ESP32_CRYPT_AES defined in user_settings.h).

Currently I don't think I'm using the PSRAM at all (i.e., not configured, free ram reports 73k). Next stage of the project is going to be the memory intensive bit and I'll investigate the different options for that. Do you have recommendations?

I vaguely recall reading the jtag connector can support serial comms too so I'm assuming that's why two devices are showing up with just that USB port plugged in. I know I spent a lot of time trying to get it going on my old machine; then it just worked on the new one through VisualGDB. Until now.

Thanks for the reminder about the high-water mark. That's a great tool too add to my health monitoring service.

You have a great weekend too.
paul

@PaulMartinsen
Copy link
Author

Quick update: I tried PR6624, but the cross-task crashing problem with hardware acceleration enabled still occurs in that branch unfortunately.

@PaulMartinsen
Copy link
Author

Hi @gojimmypi . Turns out it is an id10t programmer error. While building a example to demonstrate I noticed I'd failed to call wc_ShaFree after setting a server id for session resumption. Doing so resolves the crash both for PR6624 and the master branch mentioned.

I'll mark this closed with apologies for the distraction, thanks for your help and thanks to whomever wrote the comment in sha256.c that helped me connect the dots:

        /* probably due to unclean shutdown, error, or other problem.
         *
         * if you find yourself here, code needs to be cleaned up to
         * properly release hardware. this init is only for handling
         * the unexpected. by the time free is called, the hardware
         * should have already been released (lockDepth = 0)
         */
        (void)InitSha256(sha256); /* unlock mutex, set mode to ESP32_SHA_INIT */

@PaulMartinsen
Copy link
Author

PaulMartinsen commented Jul 23, 2023

Hi @gojimmypi . It looks like I jumped to conclusions closing this one. The problem resurfaced when I tidied up my fix and remains persistent. I haven't been able to create a simplified version to share yet, but instrumented a couple of files (attached) to:

  1. track ctx->lockDepth in esp32_sha.c
  2. track the lock not being released in SendTls13ClientHello in tls13.c.

What I have noticed is that a lock on the sha hardware is taken but never released during SendTls13ClientHello. This leads to the crash when wc_Sha256Free tries to unwind the lock (to recover from the unreleased lock). SendTls13ClientHelloHashOutputHashRaw first calls wc_Sha256Update, which gets the lock, then calls wc_Sha512Update which sees the lock in use and reverts to software calculation. They seem to be using different WC_ESP32SHA values though, so I'm not sure why wc_Sha512Update can't use the hardware. The relevant piece of the log file (line 115ff) is:

Client hello 4533
HashRaw, 9749: wc_Sha256Update of WC_ESP32SHA @ 3fcd8020d
Lock depth @ 779 = 1 for WC_ESP32SHA @ 3fcd8020d
HashRaw, 9770: wc_Sha512Update of WC_ESP32SHA @ 3fcd8110d
�[0;32mI (15005) wolf_hw_sha: 
Hardware in use for WC_ESP32SHA @ 0x3fcd8110d; Mode REVERT to ESP32_SHA_SW
HashRaw, 9794: EdDSA_Update
HashRaw, 9800
Client hello 4538

I've attached the log output I captured, the instrumented files and wolf config files. I'm using your PR6624 version of the WolfSSL library.

In this scenario:

  1. server on the ESP32s3 receives a command to subscribe to an event (here an event is a message sent periodically with the ESP32s3 as the client and a .net program as the receiving server).
  2. a connection service task opens a connection to the server, completes the TLS handshake, then signals the notification task that actually transmits the event messages. I call wolfssl_Debugging_On immediately after creating the session for the TLS handshake (i.e., after calling wolfSSL_new.
  3. a couple of event messages are sent over the connection from the notification task.
  4. the connection is closed by the notification task. It is this point where the assert described above fails.

I turn on logging immediately after creating the TLS handshake at step 2, which is in the connection service task:

  WOLFSSL* pNewSession = wolfSSL_new(m_Context.GetContext());
  CTLSSession NewSession(pNewSession);
  printf("Created session %0xd for %s\n", (unsigned)pNewSession, svServerId.data());
  wolfSSL_Debugging_ON();
  esp_log_level_set("wolfssl", ESP_LOG_INFO);
  esp_log_level_set("wolf_hw_sha", ESP_LOG_VERBOSE);

I also noticed when the SSL session for the server request is released, the lock depth is 1 one (line 62 in the log). In this case, wc_Sha256Free successfully unwinds the lock and recovers. Both the session creation and cleanup will be happening in a single task in this situation though. However, I'm not sure why the lock isn't released earlier.

Let me know if you'd like to do a video chat to discuss. It seems quite complex!

@PaulMartinsen PaulMartinsen reopened this Jul 23, 2023
@gojimmypi
Copy link
Contributor

gojimmypi commented Jul 23, 2023

Hi @PaulMartinsen

The problem resurfaced when I tidied up my fix and remains persistent.

It would be quite interesting to see the diff of before and after of the tidy. What changed after the problem went away?

They seem to be using different WC_ESP32SHA values

Let me explain in more detail some of the challenges. Perhaps I can inspire an "aha" moment and we can better isolate the root cause without me actually seeing your code.

The interesting thing in a multi-thread environment with multiple processes both between threads and within a given thread is the ctx object, and more importantly: if a copy has been made of the object.

That's what the initializer attempts to address: who am I and am I the one that called the HW encryption lock.

I keep track of the address of the object that was currently active when the lock was placed. That way, any copy of the object that may think it was doing HW encryption (due to object property settings), should be gracefully and quietly falling back to SW.

What's interesting is when a copy is made of the SHA ctx object. I suppose there's a non-zero chance of an in-progress SHA process that is abandoned before completion. I would think that the worst possible case scenario here though, would be an infinitely long lock. This seems quite unlikely in normal circumstances.

What might be interesting though is if in one thread with ctx HW in process, that before completion it is abandoned and at nearly the same time another process in another thread is able to allocate a new ctx in exactly the same place in memory as the one in the other thread that was just released. This too seems pretty unlikely, but I hope you can see where I'm going with these ideas: I'm looking for an errant copy of a ctx that is misbehaving.

The other place to look is why the unlock failed. It is very important to only call the wolfSSL locking. See the hack alert notes. See in particular the ESP-IDF periph_module_disable and periph_module_enable function implementations in periph_ctrl.c.

There's no protective logic in the ESP-IDF functions to prevent the caller from exceeding or omitting a call. For example, see the periph_module_disable that will happily decrement their lock depth counter, even if ref_counts[periph] is already zero with ref_counts[periph]--;.

What's even more curious is the tail end of your log. The "Bad lock depth" sounds like something I would have written, but I was not able to find those words in the code. Is that something you added? It does sound like something was not keeping track of who is actually in control of the hardware when the lock was released.

21:04:32.821	�[0;32mI (29495) wolfssl: Free'ing client ssl�[0m
21:04:32.821	Bad lock depth @ 1857 = 1 for 3fcd8020d
21:04:32.826	Lock depth @ 811 = 1 for WC_ESP32SHA @ 3fcd8020d
21:04:32.832	Lock depth @ 818 = 0 for WC_ESP32SHA @ 3fcd8020d
21:04:32.837	
21:04:32.837	assert failed: xQueueGenericSend queue.c:837 (pxQueue->pcHead != ((void *)0) || pxQueue->u.xSemaphore.xMutexHolder == ((void *)0) || pxQueue->u.xSemaphore.xMutexHolder == xTaskGetCurrentTaskHandle())

How to detect this?

So we need to be looking at code that may for some reason think it is in charge of the HW, but isn't. We should be looking for copies of the ctx object, or in particular anything else that may be using the HW directly and not via one of the wolfSSL functions.

Let me know if you'd like to do a video chat to discuss.

yes, I agree that might help. I'm available later today (late afternoon, early evening Pacific time), or we can schedule something during the week.

In the meantime I will study the information you provided in more detail & try to reproduce the problem.

btw, have you tried SINGLE_THREADED mode? (edit: I see you are not currently using this. Might be an interesting data point)

btw (2) - if you are using my branch, there's a new NO_WOLFSSL_ESP32_CRYPT_HASH to turn off just SHA HW acceleration.

  • edit: try turning these on to ESP_LOGI or ESP_LOGW to help detect the possible object copy instances:
                case ESP32_SHA_HW_COPY:
                    /* This is an interesting but acceptable situation:
                    ** an anticipated active HW copy that will demote to SW. */
                    ESP_LOGV(TAG, "HW WC_ESP32SHA ctx mode = ESP32_SHA_HW_COPY.");
                    break;

                default:
                    /* this will frequently occur during new init */
                    ESP_LOGV(TAG, "ALERT: unexpected WC_ESP32SHA ctx mode. "
                                  "Uninitialized?");
                    ctx->mode = ESP32_SHA_INIT;
                    break;

And now that I look at that code, I'm wondering if there's a missing ctx->mode assignment for case ESP32_SHA_HW_COPY? That might be one of my first changes: ctx->mode = ESP32_SHA_INIT; there. In fact, the assignment should probably even be outside the entire switch statement, as any copy should be in initialization mode. Hm... unless the copy is of a SW mode in progress. That would be odd. Still, this is someplace I'd definitely look first in your code. Add some more verbosity here and see if this is indeed a problematic area, in particular to confirm if we ended up here, the ESP32_SHA_HW_COPY state is handled properly. What I'm thinking here is what if the next operation after the copy is not an initialization?

  • edit(2) I'd be interested in seeing your call stack, single threaded if possible, during the error condition.

  • edit(3) I see from your logs there's mention of a "subscribe": are you using wolfMQTT?

@PaulMartinsen
Copy link
Author

@gojimmypi , removing the jtag driver & letting VisualGDB install a new one resolved the debugging issue for me. Sometimes the simple solutions are best.

This let me look into why the assert was failing in xGenericQueueSend. After a bit of mucking about I got the debugger to stop in the right place to inspect the values in the assert and found:
image
So the assert is failing because the pxQueue->u.xSemaphore.xMutexHolder is neither null, a mutex nor on the current thread. The current task handle matches the one for my Notify task (which is the one calling wolfSSL_Free) and the pxQueue->u.xSemaphore.xMutexHolder task handle matches the one for my Connect task (which is the one that created the WOLFSSL session). I copied the call stack to this location below.

This mutex comes from line 67 in esp32_sha.c: sha_mutex. It is initialized by esp_CryptHwMutexInitwc_InitMutexxSemaphoreCreateMutex.

Previously I thought we might have two different problems: something not releasing the hw and the cleanup not working. But the xSemaphoreCreateMutex doc says This type of semaphore uses a priority inheritance mechanism so a task ‘taking’ a semaphore MUST ALWAYS ‘give’ the semaphore back once the semaphore it is no longer required. So maybe if the hw remains locked by a different thread, then the lock can't be untangled because the mutex shoots itself in the foot?

Call stack to failing mutex (line number in queue.c is 6 higher than in the original issue due to extra code to trap the fault)::

xQueueGenericSend
	at C:\Programming\Toolchains\VisualGDB\ESP32\v5\esp-idf\v5.0\components\freertos\FreeRTOS-Kernel\queue.c(843)
esp_CryptHwMutexUnLock
	at K:\Vendor\WolfPR\wolfssl\wolfcrypt\src\port\Espressif\esp32_util.c(83)
esp_sha_hw_unlock
	at K:\Vendor\WolfPR\wolfssl\wolfcrypt\src\port\Espressif\esp32_sha.c(824)
esp_sha_init_ctx
	at K:\Vendor\WolfPR\wolfssl\wolfcrypt\src\port\Espressif\esp32_sha.c(260)
esp_sha_init
	at K:\Vendor\WolfPR\wolfssl\wolfcrypt\src\port\Espressif\esp32_sha.c(99)
InitSha256
	at K:\Vendor\WolfPR\wolfssl\wolfcrypt\src\sha256.c(773)
wc_Sha256Free
	at K:\Vendor\WolfPR\wolfssl\wolfcrypt\src\sha256.c(1858)
FreeHandshakeHashes
	at K:\Vendor\WolfPR\wolfssl\src\internal.c(6845)
SSL_ResourceFree
	at K:\Vendor\WolfPR\wolfssl\src\internal.c(7929)
FreeSSL
	at K:\Vendor\WolfPR\wolfssl\src\internal.c(8470)
wolfSSL_free
	at K:\Vendor\WolfPR\wolfssl\src\ssl.c(1492)
CTLSSession::Dispose
	at K:\DeviceWebservicesLibrary\Communications\TLSContext.cpp(87)
CTLSSession::~CTLSSession
	at K:\DeviceWebservicesLibrary\Communications\TLSContext.cpp(77)
CTcpConnectionInfo::~CTcpConnectionInfo
	at K:\DeviceWebservicesLibrary\Communications\TcpConnectionInfo.h(13)
CTcpConnectionInfo::~CTcpConnectionInfo
	at K:\DeviceWebservicesLibrary\Communications\TcpConnectionInfo.h(13)
std::default_delete<CConnectionInfo>::operator()
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\unique_ptr.h(79)
std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >::~unique_ptr
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\unique_ptr.h(361)
std::destroy_at<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > >
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\stl_construct.h(88)
std::allocator_traits<std::allocator<std::_Fwd_list_node<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > > >::destroy<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > >
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\alloc_traits.h(533)
std::_Fwd_list_base<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::_M_erase_after
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.tcc(86)
std::_Fwd_list_base<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::~_Fwd_list_base
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.h(590)
std::forward_list<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::~forward_list
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.h(591)
std::forward_list<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> >, std::allocator<std::unique_ptr<CConnectionInfo, std::default_delete<CConnectionInfo> > > >::remove_if<CConnectionManager::ReleaseInterest(const CConnectionInterest*)::<lambda(const std::unique_ptr<CConnectionInfo>&)> >(struct {...})
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.tcc(330)
CConnectionManager::ReleaseInterest
	at K:\DeviceWebservicesLibrary\Communications\ConnectionManager.cpp(176)
CConnectionInterest::~CConnectionInterest
	at K:\DeviceWebservicesLibrary\Communications\ConnectionUser.cpp(22)
CSubscriptionRecord::~CSubscriptionRecord
	at K:\DeviceWebservicesLibrary\Storage\SubscriptionRecord.h(39)
std::destroy_at<CSubscriptionRecord>
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\stl_construct.h(88)
std::allocator_traits<std::allocator<std::_Fwd_list_node<CSubscriptionRecord> > >::destroy<CSubscriptionRecord>
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\alloc_traits.h(533)
std::_Fwd_list_base<CSubscriptionRecord, std::allocator<CSubscriptionRecord> >::_M_erase_after
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.tcc(86)
std::_Fwd_list_base<CSubscriptionRecord, std::allocator<CSubscriptionRecord> >::~_Fwd_list_base
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.h(590)
std::forward_list<CSubscriptionRecord, std::allocator<CSubscriptionRecord> >::~forward_list
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.h(591)
std::forward_list<CSubscriptionRecord, std::allocator<CSubscriptionRecord> >::remove_if<CNotificationProvider::RemoveSubscriptions()::<lambda(const CSubscriptionRecord&)> >(struct {...})
	at c:\programming\toolchains\visualgdb\esp32\v5\tools\xtensa-esp32s3-elf\esp-2022r1-11.2.0\xtensa-esp32s3-elf\xtensa-esp32s3-elf\include\c++\11.2.0\bits\forward_list.tcc(330)
CNotificationProvider::RemoveSubscriptions
	at K:\DeviceServiceProviders\Communications\NotificationProvider.cpp(294)
CNotificationProvider::WorkerTask
	at K:\DeviceServiceProviders\Communications\NotificationProvider.cpp(179)
CWorkerTask::WorkerWrapper
	at K:\Espressif\Tasks\WorkerTask.cpp(47)
vPortTaskWrapper
	at C:\Programming\Toolchains\VisualGDB\ESP32\v5\esp-idf\v5.0\components\freertos\FreeRTOS-Kernel\portable\xtensa\port.c(151)

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen and thanks very much for all the detail. You have an excellent question.

removing the jtag driver ... install a new one resolved the debugging issue

Oh, that's great! I'm glad to hear your JTAG issues are resolved.

So the assert is failing because [... xMutexHolder is not ] on the current thread.

Please confirm if you are sharing a WOLFSSL_CTX *ctx or WOLFSSL *ssl or both between FreeRTOS tasks.

I reached out internally and asked about a given thread taking a mutex and releasing on a different thread. The response was:

the scheduler must support priority inversion for that

I looked at the Espressif docs and there are relatively few references to the term "inversion".

I'm not sure to what extent the Espressif FreeRTOS implemented the handing of priority inversion.

See FreeRTOS Mutexes. In particular:

Priority inheritance does not cure priority inversion! It just minimises its effect in some situations.

There's a comment on the wolfSSL forum:

Doing a wolfSSL read/write for the same WOLFSSL object at the same time is not thread safe by default. The WOLFSSL object should only be used from a single thread. However we do have a build-time feature you can enable to allow read and write on separate threads. This feature is enabled using the HAVE_WRITE_DUP build option.

See the referenced example tls/client-tls-writedup.c and Chapter 9 of the docs on Thread Safety.

As the problem here seems to be the HW mutex across threads, I'm wondering how that would occur in the first place. I would have thought that when given a mutex in one task that is locked, any other thread that attempts a lock should "see" that and fall back to SW.

The curiosity is how a different thread ended up trying to unlock the mutex. I'm going to spend some time today trying to force this situation & see what happens.

If you are sharing a WOLFSSL *ssl between FreeRTOS tasks, is there any chance your wolfSS_free crash occurs when trying to close and cleanup that shared ssl on one thread while another is still processing HW with an open connection?

I would have expected the processing of HW to have released the mutex upon completion, and certainly not waiting until calling wolfSSL_free as indicated in your call stack.

@PaulMartinsen
Copy link
Author

Please confirm if you are sharing a WOLFSSL_CTX *ctx or WOLFSSL *ssl or both between FreeRTOS tasks.

The WOLFSSL_CTX is only being used by me from the connect task. The WOLFSSL is being created in the connect thread then, once the TLS session is established, it gets used and (ultimately destroyed) in the notify task. So context only in one task, WOLFSSL handed from one task to another.

My working theory is that wc_Sha256Update is acquiring the lock (the hardware lock, the mutex or possibly both) during client hello and never giving it back. Based on:

  1. the call to wc_Sha512Update a few lines later in HashRaw falls back to s/w because it can't get the lock (see log below),
  2. I couldn't see anywhere releasing the lock in the file with wc_Sha256Update, and
  3. the mutex remains owned by the connect thread, which is where client hello happens. Not sure how this mutex relates to the hardware lock in esp32_sha.c yet though.

Grasping straws I know. Any idea where the lock taken in wc_Sha256Update get's released? Also, what is the role for the mutex in esp32_sha.c?

Log snippet from earlier:

Client hello 4533
HashRaw, 9749: wc_Sha256Update of WC_ESP32SHA @ 3fcd8020d
Lock depth @ 779 = 1 for WC_ESP32SHA @ 3fcd8020d
HashRaw, 9770: wc_Sha512Update of WC_ESP32SHA @ 3fcd8110d
�[0;32mI (15005) wolf_hw_sha: 
Hardware in use for WC_ESP32SHA @ 0x3fcd8110d; Mode REVERT to ESP32_SHA_SW
HashRaw, 9794: EdDSA_Update
HashRaw, 9800
Client hello 4538

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen and thank you for the clarifications.

The WOLFSSL is being created in the connect thread then, once the TLS session is established, it gets used and (ultimately destroyed) in the notify task. So context only in one task, WOLFSSL handed from one task to another.

Interesting, The general consensus is to use a WOLFSSL_CTX *ctx only in one thread (although with care might be able to be shared, but not recommended) ... and to never share a WOLFSSL *ssl between tasks/threads.

But I believe that was in the context of not sharing concurrent operations and not what you are doing: fulling handing off the object to a new thread and never again touching it in the originating thread. I'll need to do some more homework on that.

Any idea where the lock taken in wc_Sha256Update get's released?

Oh! That's a wonderful question. I've had regular discussions with folks on social media that claim to never need to use JTAG for debugging. Here's an excellent example of how to answer your question and why JTAG debug is so valuable.

I placed a single breakpoint on the first statement of esp_sha_hw_unlock and ran my wolfssl_client. Here's the result:

image

In particular, note the call stack. We ended up in esp_sha_hw_unlock from the wc_Sha256Final, specifically the return InitSha256(sha256) statement. (not sure if you knew you can double-click on the call stack functions and go directly to that code)

image

As for the question:

Also, what is the role for the mutex in esp32_sha.c?

The role of the mutex is to ensure only one SHA calculation is using the hardware at any given time. Interestingly, the wolfSSL implementation knows that the SHA calculations can take some time, so more than one can be started and they run concurrently. This is most evident during the TLS handshake. It's really quite clever. Of course for hardware, at least on the ESP32, this is impossible as the HW only supports one full SHA calculation at a time, and it must complete before starting the next one.

Thus if a second SHA calculation is requested and the mutex indicates the HW is busy, we fall back to SW calcs.

In your output example above, I see the most recent message is wc_Sha512Update of WC_ESP32SHA - thus until the ShaFinal is called and the mutext is released on the HW, future SHA hash attempts will fall back to SW.

On my to-do list is the interleaved message digest capability available on the ESP32-S3 that you are using. This method does allow concurrent HW SHA hashes to be calculated. See the bottom of page 825 in the Technical Reference Manual:

image

the mutex remains owned by the connect thread, which is where client hello happens

I think this is ultimately the answer. I'll try to demonstrate this before our call tomorrow.

@PaulMartinsen
Copy link
Author

Thanks for clarifying mutex use. I was confused between the mutex and the ctx in the WC_ESP32SHA data.
Yes, I was wondering if you were overlapping SHA calculations. I wonder if using DMA as well would have any benefit?

I had a few minutes over lunch, so taking another look at log with better understanding...
In my instrumented code, I'm reporting the two points where the ctx is updated. The first when incremented:
image

Second when decremented. This is in esp_sha_hw_unlock, same function in your debugging session, a few lines later:
image

I'm using the ctx address to track which instance is holding onto the hardware. Is there a better way? I know you spent a lot of time working on a copy issue, so maybe the hardware is being released off a copy.

Screenshot below lists all the instances the context at address 3fcd8020d is referenced in the log file from that zip file above. I see it being acquired on line 117 of the log, used a few times before showing up as a bad lock depth when attempting to destroy (line 783 from log). Connection is complete at log line 683 (the point when I hand the WOLFSSL from the connect to notify task). The context at 3fcd8020d is never used again until it is time to cleanup (i.e., not used during sending the messages from the notify task).
image

Actually, both my instrumented lines in esp32_sha.c have the keyword Lock depth @. There's a bit of toing and froing, taking and releasing the lock, up log line 117 when wc_Sha256Update starts doing its thing. Then no more activity in esp32_sha.c. So the lock isn't released anywhere (i.e., not even by a possible copy) until cleanup time.
image

If wc_Sha256Final is supposed to be releasing the lock, maybe I'm missing some cleanup after making the connection? This does look deep inside client hello though. Next step will have to wait to later; lunch is over.

@PaulMartinsen
Copy link
Author

Thanks so much for your time this morning @gojimmypi . I think we much much faster progress understanding the problem than trying to digest log files. Perhaps github needs to had video meetings to issue tracking!

I roughed out the concept we talked about. My testing has been limited to subscribing to an event like I demoed in our discussion. It didn't crash! Tail end of the log with the interesting bits below.

  • Log snippet starts with Removing subscription, which is the point all event messages have been sent and it is time to close the connection.
  • As before connection was opened on the connection task, closed on notify task.
  • As before, freeing the client session detects the non-zero lock in WC_ESP32SHA->ctx left over from wc_Sha256Update during client hello.
  • Now, however, wc_Sha256Free recovers and releases the lock w/o crashing, and continues closing the session.
  • I can create a new subscription with same behavior.
Removing subscription: 
urn:uuid:2dcf4629-04b4-48d8-a8a7-ff3f2bf1fcd9
  expires: 2023-07-29 02:05:17.974667000
  notify: https://merlin.lan:9027/Lumos/Health/
  end to: http://merlin.lan:9026/Lumos/SubscriptionNotifications/
  state: 1
  notify interest: 2
Disposing session 3fcd0250 on task Notify
�[0;32mI (45679) wolfssl: wolfSSL Entering wolfSSL_free�[0m
Free SSL: 3fcd0250d
�[0;32mI (45689) wolfssl: Free'ing client ssl�[0m
Bad lock depth @ 1857 = 1 for 3fcd2050d
Lock depth @ 979 = 1 for WC_ESP32SHA @ 3fcd2050d
Lock depth @ 986 = 0 for WC_ESP32SHA @ 3fcd2050d
�[0;32mI (45709) wolfssl: wolfSSL Entering ClientSessionToSession�[0m
�[0;32mI (45709) wolfssl: wolfSSL Entering wolfSSL_FreeSession�[0m
�[0;32mI (45719) wolfssl: wolfSSL_FreeSession full free�[0m
�[0;32mI (45719) wolfssl: wolfSSL Entering wolfSSL_sk_CIPHER_free�[0m
�[0;32mI (45729) wolfssl: wolfSSL Entering wolfSSL_sk_free�[0m
�[0;32mI (45729) wolfssl: wolfSSL Entering wolfSSL_sk_X509_pop_free�[0m
�[0;32mI (45739) wolfssl: wolfSSL Entering wolfSSL_sk_pop_free�[0m
�[0;32mI (45749) wolfssl: wolfSSL Entering wolfSSL_sk_X509_NAME_pop_free�[0m
�[0;32mI (45749) wolfssl: wolfSSL Entering wolfSSL_sk_pop_free�[0m
�[0;32mI (45759) wolfssl: CTX ref count not 0 yet, no free�[0m
�[0;32mI (45759) wolfssl: wolfSSL Leaving wolfSSL_free, return 0�[0m

So I think this might address the crash; though still we have the issue with the hardware being hogged by one calculation in some cases. As you noted, interleaving message digesting could address that. At least for the s3. Maybe for teh ESP32 as well. To me it looks like the differnce between the ESP32 & s3 might be only the s3 has separate input and output registers (which would make sequential operations with dma easier) while the ESP32 uses the same registers for input and output. On page 575 of they data sheet they talk about initializing with the first message block but I can't see why that couldn't be the last message block of a partial calculation with an extra step to write to the LOAD_REG to get the partial result out:
image

I noticed a couple of things that surprised me:

  • when unlocking, we also periph_module_disable(PERIPH_SHA_MODULE) to disable the SHA module. I imagined this is more a power saving thing and wondered if there was any significant cost turning it on and off for every calculation. Hmm. and would making it part of initialization simplify unwinding? I guess if there's not cost, keeping it self contained is nice.
  • the hardware lock mutex was released in esp_sha_hw_unlock regardless of the ctx->lockDepth value. I wondered if it should only be released when ctx->lockDepth is zero?
  • there might be a race condition when the global sha_mutex is initialized. i.e., two tasks trying to do sha calculations at the same time could end-up creating more than 1 mutex. I guess those tasks would have to be running on different CPUs? Seems unlikely.

@PaulMartinsen
Copy link
Author

Actually maybe we could use a critical section to protect InUse. It has the extra benefit of static initialization to eliminate the (theoretical) race condition. I pushed an update using critical sections.

Do you think this might solve the problem or have I missed something?

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen and thank you for your time as well on Friday & all the great input. I agree the discussion was very productive. It was good to talk with you and learn more about your project. It looks very cool!

First, for the record & as discussed: the official response from wolfSSL is that the ctx has some thread safety, so care should be used when sharing between RTOS tasks, and avoid sharing if possible. The ssl object is not thread safe, and should never be shared across RTOS tasks.

That said, I understand your desire to open a TLS connection is one task, and manage data exchange over TLS in a separate task. If care is taken in your code to ensure the wolfssl ctx and ssl objects do not encounter any concurrent access, then it may be possible to safely implement that design. I'd like to help.

I took a look at your suggested changes. I was not able to see the benefit of using the InUse over the current FreeRTOS mutex. In particular, although the simpler InUse may work now, it will be considerably more difficult in the future to implement the interleaved SHA HW processing we talked about.

Today, if the hardware is busy, we immediately fall back to software, as we are assuming only one HW has can be in progress at any given time. In the future, we'll want to actually share the HW between multiple, concurrent SHA calculations, such as that in the int HashRaw() in internal.c. We'll want to take advantage of the mutex waiting capabilities when the HW is busy with a different hash, and only fall back when that wait takes too long or the HW is otherwise not available.

I did implement several new features in my dev branch that allows better visibility into multi-threaded HW locking. See the new #define DEBUG_WOLFSSL_SHA_MUTEX in the user_settings.h.

I've also implemented a feature that cleans up HW lock when free is called before sha[nnn]final(). See esp_sha_release_unfinished_lock and esp_sha_mutex_ctx_owner.

I have a specific wolfssl_client_6637 project for this issue, and I've been testing with the corresponding updated wolfssl_server.

The code still need a bit of work, but I'd be interested in the results as related to the assert failure that we've seen in your code.

See also the client 6637 sample output, in particular the esp_sha_release_unfinished_lock 3fcc9ff8 on line 2474.

@PaulMartinsen
Copy link
Author

Thanks @gojimmypi . Threading is a tricky topic, so not surprising there could be dragons lurking if I'm not very careful.

For posterity, we're looking at this sequence, where | ↣ | is the (one way) boundary where responsibility of the WOLFSSL structure shifts from the connection to notification task: resolve host → create socket → configure socket → connect to peer → secure connection → | ↣ | → send secure message; → send another secure message → etc → close connection.
With appropriate synchronization privatives (currently using task notifications on RTOS and std::mutex with std::condition_variable on Windows).

Motivation, again for posterity: I could have a dozen open connections to send notifications to, most of them using TLSv1.3, adding new ones at any time. And I want to minimize latency for sending notifications. On the ESP32s3 it takes around one second to do a TLSv1.3 handshake. So notifications would stall for a second every time I need to send a message if I have to open a connection on the same thread.

A couple of other solutions I considered:

  • overlapping the IO on a single thread using poll etc. — this could be a better solution, but seems more difficult;
  • creating a separate task for each connection — fine on windows for a dozen connections, but I'm concerned about the overhead on ESP32s3 (the relatively large stack needed for the TLS fallback came to mind too).

An important part of my project is understanding the constraints and possibilities of the ESP32s3 environment, so very much appreciate your help navigating through. Happy to do whatever I can to find a solution.

@PaulMartinsen
Copy link
Author

I figured a new comment to address the other details...

I'm not sure if you saw the two versions, one with critical sections (the last change), the other with a mutex.

I took a look at your suggested changes. I was not able to see the benefit of using the InUse over the current FreeRTOS mutex. In particular, although the simpler InUse may work now, ...

The main point of critical sections was to use static initialization to avoid the race condition here (from esp_sha_try_hw_lock):

   if (sha_mutex == NULL) {
        ESP_LOGV(TAG, "Initializing sha_mutex");

        /* created, but not yet locked */
        ret = esp_CryptHwMutexInit(&sha_mutex);

That might be me over thinking things though. InUse could be protected using a mutex instead:

bool TryAcquireHardware()
{
    bool bCanUseHardware = false; 
    if (0 == esp_CryptHwMutexLock(&sha_mutex, (TickType_t)0))
    {
        if (!InUse)
        {
            InUse = 1;
            bCanUseHardware = true; 
        }
        esp_CryptHwMutexUnLock(&sha_mutex);
    }

    return bCanUseHardware;
}

void ReleaseHardware()
{
    if (0 == esp_CryptHwMutexLock(&sha_mutex, (TickType_t)0))
    {
        if (InUse)
        {
            InUse = 0;
        }
        else
        {
            assert(false); // bug: releasing hardware when not in use!!
        }
        esp_CryptHwMutexUnLock(&sha_mutex);
    }
}

Maybe with a critical section to protect its initialization if the race condition was a worry?

What I like about this is that responsibility for locking and unlocking the synchronization object remains entirely within esp32_sha.c. Access to the hardware is still serialized to a single user (across all tasks and CPUs), but it is impossible for the mutex to remained locked outside esp32_sha.c (though the hardware can be, mediated by InUse). This seems good to me because unlocking the mutex from the "wrong" thread appears to be the source of this crash". And in TryAcquireHardware there could be a time-out too when InUse is protected with a mutex.

But I'm a bit confused about:

In the future, we'll want to actually share the HW between multiple, concurrent SHA calculations, such as that in the int HashRaw() in internal.c. We'll want to take advantage of the mutex waiting capabilities when the HW is busy with a different hash, and only fall back when that wait takes too long or the HW is otherwise not available.

I don't see how waiting helps us in HashRaw(...), but I'm no longer clear why.... I think I assumed that the calculation was done by the time wc_Sha256Update returned, regardless of whether that calculation was done in hardware or software.

Has the hardware accelerated calculation finished when, for example, wc_Sha256Update(...) returns? Or does it continue in the background? I thought wc_Sha256Update wouldn't return until the calculation was done, but after looking closer I'm not so sure.

I spent too much time thinking about this tonight; I'll have to try the 6637 version tomorrow night.

@PaulMartinsen
Copy link
Author

Hi @gojimmypi , I tried the dev branch but got stuck subscribing to the event. That is, when the server on the ESP32s3 receives a message from my .net client to start sending the notifications. I suspect this is related to our problem though; recall unwinding showed up in the log for the server message as well. And the issue seems to be happening in hashing functions. But we've been looking at the client side because we see the whole transaction.

Anyway, I attached the new log output. In short, while trying to secure the TLS session for my .NET client, CreateTicket in internal.c returns BAD_TICKET_ENCRYPT (timestamp 20:25:41.956 in the log file):
image

Stepping through, it went into wc_ShaHash, but nothing stood out to me inside those:
image

I did have a bit of trouble with not being able to set breakpoints where I wanted and getting missed which makes me think there is some optimization going on even though I have selected the debug optoin for that.

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen thanks for all the details. I think we're making good progress,

got stuck subscribing to the event

I realize you've gone pretty far down the road of custom TCP payloads over TLS1.3, but I keep thinking about the benefits of using something like wolfMQTT :)

message from my .net client to start sending the notifications

Just for reference: In the absence of your source code I am using the same codebase for both ESP32 based client and server.

I did have a bit of trouble with not being able to set breakpoints

Yes, my Tigard is considerably more robust on the standard ESP32 as compared to the ESP32-S3 with the on-board JTAG. Frequently there are times that the S3 JTAG becomes unresponsive. It seems to be related to the running code. In particular some sort of rapid panic reboot causes subsequent JTAG flashing to be problematic. I have a wolfssl-examples/ESP32-hello-world expressly for this situation.

I wonder if using DMA as well would have any benefit?

Oh yes, absolutely! Once we have asynchronous calls to the HW hash, combining DMA would definitely provide a noticeable performance improvement. Even just the DMA option alone would have benefit.

I suspect this is related to our problem though; recall unwinding showed up in the log for the server message as well.

Yes, you are correct. Recall this is what we found in your prior log with the mutex assert failed:

21:04:32.809	Disposing session 3fcd5d58 on task Notify
21:04:32.809	�[0;32mI (29485) wolfssl: wolfSSL Entering wolfSSL_free�[0m
21:04:32.815	Free SSL: 3fcd5d58d
21:04:32.821	�[0;32mI (29495) wolfssl: Free'ing client ssl�[0m
21:04:32.821	Bad lock depth @ 1857 = 1 for 3fcd8020d
21:04:32.826	Lock depth @ 811 = 1 for WC_ESP32SHA @ 3fcd8020d
21:04:32.832	Lock depth @ 818 = 0 for WC_ESP32SHA @ 3fcd8020d
21:04:32.837	
21:04:32.837	assert failed: xQueueGenericSend queue.c:837 (pxQueue->pcHead != ((void *)0) || pxQueue->u.xSemaphore.xMutexHolder == ((void *)0) || pxQueue->u.xSemaphore.xMutexHolder == xTaskGetCurrentTaskHandle())

With my recent changes to ShaFree(), etc, I'm properly unlocking the HW with the new esp_sha_release_unfinished_lock() when the in-progress hash is abandoned and the expected ShaFinal() is not called.

In your latest log, I'm happy to see that at least the nasty assert no longer occurs:

20:25:42.006	�[0;32mI (36786) wolfssl: wolfSSL Entering wolfSSL_free�[0m
20:25:42.011	Free SSL: 3fcd3268
20:25:42.011	�[0;32mI (36796) wolfssl: Free'ing server ssl�[0m
20:25:42.017	2) esp_sha_hw_unlock Lock depth @ 944 = 1 for WC_ESP32SHA @ 3fcd5814
20:25:42.022	3) esp_sha_hw_unlock Lock depth @ 954 = 0 for WC_ESP32SHA @ 3fcd5814
20:25:42.027	�[0;33mW (36816) wolf_hw_sha: Unlocking for 3fcc20fc, from ctx 3fcd5814, & = 3fccd750, mutex_ctx_owner = 3fcc20fc�[0m
20:25:42.039	�[0;32mI (36826) wolfssl: wolfSSL Entering ClientSessionToSession�[0m
20:25:42.044	�[0;32mI (36826) wolfssl: wolfSSL Entering wolfSSL_FreeSession�[0m
20:25:42.050	�[0;32mI (36836) wolfssl: wolfSSL_FreeSession full free�[0m
20:25:42.055	�[0;32mI (36836) wolfssl: wolfSSL Entering wolfSSL_sk_CIPHER_free�[0m
20:25:42.061	�[0;32mI (36846) wolfssl: wolfSSL Entering wolfSSL_sk_free�[0m
20:25:42.066	�[0;32mI (36856) wolfssl: wolfSSL Entering wolfSSL_sk_X509_pop_free�[0m
20:25:42.072	�[0;32mI (36856) wolfssl: wolfSSL Entering wolfSSL_sk_pop_free�[0m
20:25:42.084	�[0;32mI (36866) wolfssl: wolfSSL Entering wolfSSL_sk_X509_NAME_pop_free�[0m
20:25:42.089	�[0;32mI (36866) wolfssl: wolfSSL Entering wolfSSL_sk_pop_free�[0m
20:25:42.095	�[0;32mI (36876) wolfssl: CTX ref count not 0 yet, no free�[0m
20:25:42.100	�[0;32mI (36886) wolfssl: wolfSSL Leaving wolfSSL_free, return 0�[0m
20:25:42.105	SSL Accept took 3133 ms
20:25:42.105	Error 30 fault detected [no subcode]
20:25:42.111	"Bad user ticket callback encrypt Error"
20:25:42.116	Detail: SSL/TLS handshake failed

This is exactly why we need a proper FreeRTOS mutex to manage the HW between tasks once we start using the SHA interleave capabilities.

It's really a shame that the current implementation grabs the first TLS hash (when multiple hash algorithms are enabled). For example the int HashRaw() that I mentioned above that may do this with all the SHA types enabled:

wc_ShaUpdate(&ssl->hsHashes->hashSha, data, sz);
ret = wc_Sha256Update(&ssl->hsHashes->hashSha256, data, sz);
ret = wc_Sha384Update(&ssl->hsHashes->hashSha384, data, sz);
ret = wc_Sha512Update(&ssl->hsHashes->hashSha512, data, sz);

Without interleaving, only the first and simplest ShaUpdate will use HW, subsequent calls to for instance wc_Sha512Update will fall back to SW as the hardware is still "in use" for the active ShaUpdate hash.

I don't see how waiting helps us in HashRaw(...)

You are correct. There's no benefit. Today

However, once we have hash interleaving (perhaps even asynchronously), we'll have multiple different tasks using the HW concurrently on different hashes. Even is there's an ssl & ctx properly constrained to one task, there may be other tasks with their own ssl & ctx , or even someone else just wanting to compute a hash.

Bottom line is that I'd like to ensure the ESP32 HW hash library is fully thread safe. To do this, we need FreeRTOS mutexes and the ability to wait for the resource to be freed.

"Bad user ticket callback encrypt Error"

So at this point, I think we are only dealing with TLS Session Resumption. I've been reviewing the comments in TLS1.3 Session Resumption #6172 this morning.

Can you post some sample code of how you are doing session resumption? More specifically: are you doing this in your connection task, or your data exchange task?

My thought at this time is that your ssl object is getting cleaned up in one task, and the other task does not know about it. That's probably why we were seeing the previous mutex assert error.

Do you have any InUse flags for your inter-task-shared ssl object? If an ssl object gets unexpectedly cleaned up (Free'd) in one task, how does the other task know?

Thanks again for all your input. Cheers.

@PaulMartinsen
Copy link
Author

A couple of things I can respond to quickly before I start the rest of the day; more later...

I'm pretty confident that assert, & subsequent crash, happen because xSemaphoreTake("sha_mutex,&hellip;") is being called in one task (connection) and xSemaphoreGive("sha_mutex") in another task (notify), and I think semaphores don't work in that situation. Take and give must happen on the same thread.

This arises because wc_Sha*Update (called in this case by HashRaw) start the hash, and the first one claims ownership of the hardware until wc_Sha*Finalize or wc_Sha*Free releases that ownership. Because HashRaw never releases the hash (it might have a good reason for that; I don't know), wc_Sha256Free ends up being called when the connection is closed. HashRaw is part of client hello, so in my case the xSemaphoreTake gets called on the connection thread (as part of client hello) and xSemaphoreGive on the notify thread (as part of freeing memory) = boom!

This also means that the server code doesn't crash because the HW lock unroll is happening on a single thread.

I'm 80% confident the only wolf library functions I'm calling on the notify thread are:

  • wolfSSL_write
  • wolfSSL_read
  • wolfSSL_get_error
  • wolfSSL_free

There's possibly a dozen that are called on the connection thread to set everything up; I'll have to work through the code to gather that list.

Can you post some sample code of how you are doing session resumption? More specifically: are you doing this in your connection task, or your data exchange task?

My session resumption setup currently is:

  uint8_t au8ServerId[SHA_DIGEST_SIZE];
  wc_Sha ServerId;
  if (0 == svServerId.length()
    || 0 != wc_InitSha(&ServerId)
    || 0 != wc_ShaUpdate(&ServerId, (const uint8_t*)svServerId.data(), static_cast<word32>(svServerId.length()))
    || 0 != wc_ShaUpdate(&ServerId, (const uint8_t*)&uServerPort, static_cast<word32>(sizeof(uServerPort)))
    || 0 != wc_ShaFinal(&ServerId, au8ServerId)
    || WOLFSSL_SUCCESS != wolfSSL_SetServerID(pNewSession, (const uint8_t*)au8ServerId, sizeof(au8ServerId), 0))
  {
    // This is only a warning because the server id is used as a key for session 
    // resumption. We can still make connections without it, they'll just take a
    // little longer. 
    cerr << __FUNCTION__ << "Warning: failed to set server id\n";
  }
  wc_ShaFree(&ServerId);

A later version replaces that with a CShaCalculator class to avoid calling wc_ShaFree if either of the first two predicates fail (i.e., if 0 == svServerId.length() or 0 != wc_InitSha(&ServerId) fail we shouldn't call wc_ShaFree). Worth noting that if you have the server id and port all together in a single string you can just pass them directly to wolfSSL_SetServerID now with the fix from #6172. And even if you dont have them all together, assembling that string and doing the hash once (which is what wolfSSL_SetServerID does for long strings) is probably faster.

I'll get a list of functions I'm calling from the connect thread and address other questions a bit later.

@PaulMartinsen
Copy link
Author

continuing on...

Wish I could use mqtt; unfortunately the standard we are following calls for challenging, memory hungry, complexity.

Thread safety

My thought at this time is that your ssl object is getting cleaned up in one task, and the other task does not know about it. That's probably why we were seeing the previous mutex assert error.

The WOLFSSL object is cleaned up in the data exchange task. By this stage, the connection task has washed its hands of the WOLFSSL object and is busy doing something else or waiting patiently for something to do (e.g., make a new connection for a different peer). I believe cross-task give and take causes the crash as above.

Do you have any InUse flags for your inter-task-shared ssl object? If an ssl object gets unexpectedly cleaned up (Free'd) in one task, how does the other task know?

I do have an in use flag, its even called m_bInUse, protecting the WOLFSSL data (and other connection related data such as the socket handle). For me, m_bInUse is a std::atomic_bool that's acquired and released by at most one task at any time:

bool CConnectionInfo::TryAcquireConnection(CConnectionUser& rReference) noexcept
{
  bool bCurrentInUse = false;
  if (!m_bInUse.compare_exchange_strong(bCurrentInUse, true))
  {
    // someone else is using the connection. 
    return false;
  }

  AddInterest();
  rReference.Attach(*this);
  return true;
}

void CConnectionInfo::ReleaseConnection() noexcept
{
  assert(m_bInUse.load());
  assert(m_nInterest > 0);
  m_LastUsed.store(m_rClock.MonotonicTime());
  RemoveInterest();
  m_bInUse.store(false);
}

Here CConnectionUser is a fancy mutex that mediates access to the CConnectionInfo (e.g., making sure users of a connection have the lock before doing anything to a connection), and maintains the lock life time (i.e., calling CConnectionInfo::ReleaseConnection to free the acquired lock in its destructor). Any task that wants to use the connection (i.e., connection or data exchange) can only do so through a CConnectionUser that has a lock on the CConnectionInfo. And there can only be one lock at any time (because m_bInUse is a boolean, and CConnectionInfo::TryAcquireConnection is the only way to get to the connection). There is also a xTaskNotify posted from the connection task to the data exchange task when the connection task has done everything it can (e.g., open the connection or fail at doing so), so the data exchange task can do its thing (e.g., send a message or try again later).

BTW, while trying to learn about lwip timeouts I found Espressif sockets are "thread-safe", so that's nice.

Interleaving

I'm still a bit confused about the future interleaving, so working it through so see what I missed...

It seems to me there are two 'in use' flags that are important here:

  1. SHA_BUSY_REG (on ESP32s3, different on other devices), and
  2. InUse or sha_mutex.

The first one indicates whether the hardware is occupied with a calculation and, if I understood correctly, the second one indicates whether there is (or soon will be) a calculation result in the hardware sha registers that has not been moved into application memory (i.e., the WC_ESP32SHA data structure). Then, I think

  • when using software hashing, wc_Sha256Update has completed the hashing calculation when it returns, but
  • when using hardware hashing, wc_Sha256Update only starts the hash calculation; it may or may not be complete when wc_Sha256Update returns. Certainly, the result from the hash is not yet transferred to application memory. The hash result is not transferred to application memory until wc_Sha256Finalize is called.

It seems to me that this difference in the behaviour of the Update function leads to the problem in HashRaw, where there are two sequential hashes (a wc_Sha256Update and a wc_Sha512Update). The current implementation has the second calculation fallback to software. Not necessarily because the hardware accelerator is busy, but possibly only because it holds an intermediate result that hasn't been moved out to application memory.

So I'm struggling a bit to see how waiting on a mutex that needs to be released in the same task that took it will be helpful since the wc_Sha256Update and wc_Sha512Update both also execute in that task. Maybe there could be a task or interrupt monitoring the SHA_BUSY_REG, but won't that also blow up when it tries to release the semaphore?

The big advantage of having wc_Sha256Update return while the hardware completes the calculation is that other things could happen. For example one sha calculation could be done on the accelerator and another on the CPU; keep all the calculators busy.

If the CPU calculation takes much longer than the HW one though, I wonder if it could be faster to simply wait for the accelerator to finish and store the intermediate result in application memory? Maybe with a token to detect sequential hash updates to save re-initializing the accelerator unnecessarily? If we can wait in a task friendly way, then other tasks could do stuff while the calculation is finishing up too, which is another way to keep the processors busy.

I'm not sure if this means the behaviour of wc_Sha*Update then becomes same as software (e.g., calculation is done & result stored on return), or there's other synchronization needed inside HashRaw. Maybe we can't overlook the fact that other hardware might have separate 256 and 512 bit accelerators so wc_Sha256Update and wc_Sha512Update could happen in parallel (if they don't separately and sequentially wait for the calculation to complete).

Do we need to park this bug until the strategy for interleaving is clearer? That stone might kill this bird too.

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen

SHA_BUSY_REG ... indicates whether the hardware is occupied with a calculation

Yes, I think we agree: This is a register than only means "the SHA calculation has not completed and is not yet available to read". It is not at all a hardware-sharing semaphore.

So I'm struggling a bit to see how waiting on a mutex

We're not today waiting on a mutex. However in the future with interleaved HW calculations, we would likely want to wait a bit.

Today, if the HW is not done (either a single digest, or a stream of multiple digests without final being called), we immediately fall back to SW.

If we were interleaving, we might have two different SHA hashes in two different tasks. Alternatively, in our TLS connection today, we are (somewhat) concurrently digesting both SHA256 and SHA512 hashes in the same thread. If the SHA256 is started first, it will use HW. When the SHA512 digest then starts, it falls back to SW.

If we had the interleaving capabilities, we could have both the SHA256 and SHA512 hashes to be incrementally digested both using hardware.

The wait lock would only be useful if a separate task also wanted to do interleaving hash digests. Today we only have one HW hash computed at a time. We don't allow a separate task to compute a parallel HW hash, as we don't yet support interleaving. Once the HW is locked, it stays locks (with the mutex) until it is completely finished.

If the CPU calculation takes much longer than the HW one though, I wonder if it could be faster to simply wait for the accelerator to finish and store the intermediate result in application memory? Maybe with a token to detect sequential hash updates to save re-initializing the accelerator unnecessarily? If we can wait in a task friendly way, then other tasks could do stuff while the calculation is finishing up too, which is another way to keep the processors busy.

I believe this is essentially the interleaving feature, no?

I'm pretty confident that assert, & subsequent crash, happen because xSemaphoreTake("sha_mutex,…") is being called in one task (connection) and xSemaphoreGive("sha_mutex") in another task (notify), and I think semaphores don't work in that situation. Take and give must happen on the same thread.

Yes, I do believe you are correct & that is what we are seeing. I've convinced myself of that earlier today.

I've applied some changes to my dev branch that now demonstrate that problem. I have a listening TLS server and a special 6637 client.

When setting #define WOLFSSL_TEST_STRAY 1 in the 6637 client user_settings.h, a stray ctx will be simulated as locking the hardware. This occurs after 8 HW locks (a somewhat arbitrary number), and then is attempted to be later cleaned up in the peek task, confirming that the free that occurs in your data task is in conflict with the ongoing (and ultimately incomplete) HW digest in your connection task.

I'm currently looking into possible solutions. Again, having a ctx shared across multiple tasks is not supported, but I do believe you've found an excellent bug that needs to be resolved in having multiple tasks allowed to use the HW properly, in particular to have the HW properly released by a long running and incomplete SHA digest that is Free'd before Final is called.

Thanks very much for all your help and insight. I think we are very close to a good solution.

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen

I've made a but of progress on our multi-thread lock/release problem. I have an update to my wolfssl_client_6637.

See the sample_output_stray_lock_recovery.txt log file. In particular:

  • There's a separate peek task, that locks a stray ctx HW semaphore after 8 iterations. (ctx owner = 3ffb4044)
  • I'm now keeping track of the task that performed the lock.
  • A short time later the peek task (in your case, the data task), tries to free the ctx no longer in use.
  • The new esp_sha_release_unfinished_lock is called during the various SHA Free operations, such as this wc_Sha256Free.
  • See the log RELEASE STRAY and subsequent proper HW locks.
  • Of particular interest is the new ESP32_SHA_FREED status for ctx->mode.
  • When the ESP32_SHA_FREED is encountered, any thread can attempt to release unused HW lock as needed.
  • This logic all assumes there are ongoing tasks in each thread that attempts a new HW lock after an abandoned hash.
  • One could alternatively implement their own HW unlock for any ctx found with a ESP32_SHA_FREED state.
  • Only the originating thread can of course release the lock, but now that thread can detect the ESP32_SHA_FREED status.

The code still needs a bit of TLC and cleanup (thread-safety issues), but is basically operational.

There are other ways that could accomplish this same goal. For this iteration, this simple implementation should suffice. I suspect when doing interleaved hashes we may need different logic.

If you get a chance, please give this initial working stray HW lock recovery commit a try with your separate connection and data tasks & let me know how it goes.

Cheers

@PaulMartinsen
Copy link
Author

Hi @gojimmypi , thanks very much for all your work on this. Sounds very promising. I'll try it out tomorrow night.

@PaulMartinsen
Copy link
Author

@gojimmypi , I tested the lasted commit (from 7 August, 12:55 p.m., "initial working stray HW lock recovery") in the ED25519_SHA2_fix branch. I added the following to my user_settings.h so it would compile.

#define DEBUG_WOLFSSL_SHA_MUTEX
#define WOLFSSL_TEST_STRAY 1

With this version I get an endless reboot cycle. Decoding the back trace shows it is failing in line 740 of sha.c
image

This occurs when the ESP32s3 server initializes the WOLFSSL_CTX, loading in the certificate authority certificates:

0x4208d441: wc_ShaUpdate at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/sha.c:740                                                                                                       
0x4208908c: wc_ShaHash at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/hash.c:1345                                                                                                       
0x42072427: CalcHashId_ex at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/asn.c:12531                                                                                                    
0x4207246c: GetCertName at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/asn.c:13413                                                                                                      
0x42072d5d: GetName at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/asn.c:14278                                                                                                          
0x420751f6: wc_GetPubX509 at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/asn.c:15107                                                                                                    
0x4207521d: DecodeToKey at k:/Lumos/Vendor/WolfPR/wolfssl/wolfcrypt/src/asn.c:15149                                                                                                      
0x420637ab: ProcessBuffer at k:/Lumos/Vendor/WolfPR/wolfssl/src/ssl.c:7399                                                                                                               
0x42063e95: ProcessFile at k:/Lumos/Vendor/WolfPR/wolfssl/src/ssl.c:8183                                                                                                                 
0x42063f8e: wolfSSL_CTX_use_certificate_file at k:/Lumos/Vendor/WolfPR/wolfssl/src/ssl.c:8741 (discriminator 6)                                                                          
0x4204f7be: CCredentialStore::InstallCredentials(WOLFSSL_CTX*) const at k:/Lumos/DeviceWebservicesLibrary/Storage/CredentialStore.cpp:144                                                
0x42012232: CSoapTcpServerContext::SetupTLS() at k:/Lumos/DeviceServiceProviders/Communications/SoapTcpServerContext.cpp:90                                                              
0x420122ef: CSoapTcpServerContext::Initialize(CSoapUserData const&, unsigned short, bool) at k:/Lumos/DeviceServiceProviders/Communications/SoapTcpServerContext.cpp:32 (discriminator 1)
0x4201123d: CTargetServiceProvider::InitializeSoapContext() at k:/Lumos/DeviceServiceProviders/Communications/TargetServiceProvider.cpp:39                                               
0x4201184e: CTargetServiceProvider::WorkerTask() at k:/Lumos/DeviceServiceProviders/Communications/TargetServiceProvider.cpp:51                                                          
0x4216c995: CWorkerTask::WorkerWrapper(void*) at k:/Lumos/Espressif/Tasks/WorkerTask.cpp:47                                                                                              
0x40385709: vPortTaskWrapper at C:/Programming/Toolchains/VisualGDB/ESP32/v5/esp-idf/v5.0/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:151                                                                                        

Actually, looks like it is inside esp_sha_try_hw_lock with a null pointer in stray_ctx:
image

Though I can't see where that pointer would be set. Maybe I wasn't supposed to #define WOLFSSL_TEST_STRAY 1.

Setting #define WOLFSSL_TEST_STRAY 0 instead, I get past the crash above.... but back to the Bad user ticket callback encrypt Error in the ESPs3 server when it is trying to receive the message to subscribe to an events. Drat.

@gojimmypi
Copy link
Contributor

Hi @PaulMartinsen and thankyou for taking a look at the interim progress.

Ah yes, if you are only using the library and not my sample code, that makes sense. I should have added more details.

I have a concurrent tls_peek task. Its job it to attempt to trip up the locking mechanism. It is (was) expected to be running and initializing that stray.

The peek task can be turned off by setting run_peek_task = 0.

My peek task initializes that stray ctx that you bumped into being a null pointer. The stray is later released to test the cleanup of an unfinished SHA, where free is never called on the respective ctx that locked HW.

Maybe I wasn't supposed to #define WOLFSSL_TEST_STRAY 1.

Hm, Maybe. But I must have missed something for your setup. In not having your code, I emulated your connection and data tasks. This option is turned on by setting test_separate_tasks = 1 in my example.

I've updated the code to turn off the peek task and enabled my test for multiple ctx tasks. I've also applied a change to check for nulls during testing.

As seen in the latest sample output, I successfully used two separate tasks to connect and send data. It still needs work, as after the initial exchange is successful, I see a bunch of tls_client: ERROR: tls_smp_client_task_send_data failed to read messages. I think it is the way I'm handling me TLS connection. It may actually work for you.

In any case, the point of interest is the successful release of unfinished lock.

Setting #define WOLFSSL_TEST_STRAY 0 instead, back to the Bad user ticket callback encrypt Error

Let's go ahead and leave #define DEBUG_WOLFSSL_SHA_MUTEX and #define WOLFSSL_TEST_STRAY 1 in place for now.

@PaulMartinsen
Copy link
Author

Ok. I updated to the latest version. And have #define DEBUG_WOLFSSL_SHA_MUTEX and #define WOLFSSL_TEST_STRAY 1 in my user_settings.h file. No build problems!

Uploaded.... oh. Still the same problem with Bad user ticket callback encrypt Error when I try to connect to the ESP32s3 server with TLSv1.3. I seem to be stuck on this one.

I put #define NO_WOLFSSL_ESP32_CRYPT_HASH in my user_settings.h. And... that doens't resolve either (same bad ticket error). But I was using this version this afternoon and that was ok. Would these changes have had any impact on the software hashing?

I still get the bad ticket problem without DEBUG_WOLFSSL_SHA_MUTEX or WOLFSSL_TEST_STRAY and all of

#define NO_ESP32_CRYPT
#define NO_WOLFSSL_ESP32_CRYPT_RSA_PRI 
#define NO_WOLFSSL_ESP32_CRYPT_HASH
#define NO_WOLFSSL_ESP32_CRYPT_AES

I think something must have gotten messed up at my end. I'll investigate further in a couple of days; going to be away.

@gojimmypi
Copy link
Contributor

Hello @PaulMartinsen and thank you again for your continued cooperation on this interesting issue.

Apologies for the delay. I've been busily preparing and practicing for my ESP32 webinar that was presented yesterday. I'm now back to this topic.

For other readers just catching up: we're working on something not officially supported but worth investigating: using a single wolfSSL connection in multiple freeRTOS tasks. Specifically one task for managing connections, the other for exchanging data. Although this is not fully thread safe, I do believe this is possible, given the right user code implementation. It's also helped to uncover some corner cases to cleanup for the SHA accelerator as related to ongoing and later abandoned hash calculations.

Still the same problem with Bad user ticket callback encrypt Error

I'm assuming this is still a session resumption problem as mentioned above. Please advise if you can never connect.

Perhaps I'm still missing something from your setup. I'm currently working from the client-tls13-resume. example and seeing if I can replicate the problem you are still encountering without actually having all of your code. Thanks for supplying your resumption logic, above.

@PaulMartinsen
Copy link
Author

I was travelling and wasn't able to join the webinar unfortunately, but a colleague did and found the seminar very helpful @gojimmypi .

I've been doing a bit of work to quantify the benefit of this non officially supported ability to create the secure connection in one task while another handles sending data. In my test, I'm sending a message (about 1kb) over a plain TCP connection (no TLS), with a single number that increments every 50 ms. While that's going on, I start another connection that requires a ~1 second TLS handshake from the ESP32s3 sever to accept a command that starts the connection then a second ~1 second TLS handshake from a ESP32s3 client to establish a secure connection for sending a different message.

When the connection and sending happens in a single task, I see the expected 1 second interruption to messaging.

When TLS handshaking happen in different tasks, I can sustain the 50ms transmission rate. Preliminary data below. Top plot is the difference in transmission time-stamp on the device (i.e., should be 50 ms when the ESP32s3 isn't too busy to send messages). The bottom plot is difference between counter value received (i.e., 1 when no messages are lost). One datapoint for each message received by the client. The second connection, with TLS handshake, starts at about message 50.

Only preliminary data, but here only a single one of the 50ms period messages was lost, and otherwise the ESP32s3 kept sending a message roughly every 50ms while it negotiated the second secure connection. At 50ms per message here, that negotiation would have taken about 20 messages.

image

This was keeping cross-thread locks inside esp32_sha.c.

I'm going to try to get to the bottom of the bad ticket over the weekend.

I'm assuming this is still a session resumption problem as mentioned above. Please advise if you can never connect.

I can never connect from the .net client to the ESP32s3 server with the test version. The ESP32s3 server is configured to support session resumption, but as far as I know the .net client doesn't use it (TLSv1.3 session resumption certainly isn't used when the ESP32s3 as a client is connecting to the .NET server, but I haven't checked the other direction).

@PaulMartinsen
Copy link
Author

PaulMartinsen commented Aug 12, 2023

Here's a breakdown of this afternoon's investigation.

Test 1. Test latest version from ED25519_SHA2_fix/6cb7eeef.
Same problem as before, with CreateTicket failing when it does the sanity checks in the ESP32s3 server. The log output output shows that it is using the software hash at this point.

Test 2. Check locking InUse concept works
Went back to my original concept version to see nothing else had broken along the way. That was okay; I can connect to the ESP32s3 server, create a TLS client connection on the "connect" thread, send a bunch of event messages and close the connection from the "notify" thread.

Test 3. Try without HAVE_SESSION_TICKET.
I went back to the latest version from ED25519_SHA2_fix/6cb7eeef. This time, removing HAVE_SESSION_TICKET from my user_settings.h for testing.

This got around the bad ticket problem so I can connect to the ESP32s3 server and the ESP32s3 can create client connections to the .NET program.

After checking connections worked (this is important in a second), I captured log messages while subscribing to, then sending event messages. No crashing this time when the WOLFSSL context was freed. I didn't see anything about the hardware at the end of that log file. Digging deeper, it looks like the hardware was never used throughout the whole subscribe → connect → send → free sequence. Each ESP32_SHA_INIT call has the hardware in use by 3fcd1d74. I'm not sure where that context was supposed to release the hardware, but it looks like it didn't.

Test 4. a fresh start
After resetting the esp32s3 for a fresh start. This time I've done a single subscribe → connect → send → free sequence. The log has everything from the start of the boot sequence. This time, esp_sha_release_unfinished_lock was called and it reports unable to unlock ctx mutex. So it detected it wasn't safe to unlock the mutex from a different thread.

About 45 minutes later, with the ESP32s3 just blinking its little light, I made another query-response connection to the ESP32s3 server. In ESP32_SHA_INIT, it reports:
Hardware in use by 3fcd0eb8; Mode REVERT to ESP32_SHA_SW for 3fcd6560. So it didn't get unlocked anywhere. My guess is that's why I never saw the hardware unlock in the previous log (i.e., it was still locked from a previous subscription).

@PaulMartinsen
Copy link
Author

Hi @gojimmypi , I was thinking about the missing unlock and had a go at mocking up a concept that could resolve both the missing unlock thread-safely, resolve the cross-thread crash and support overlapping updates . To keep things easy (for me), i just made a Windows console app to illustrate the idea with mocked hardware & standard C++ in visual studio.

The accelerator driver (similar to what esp32_sha.c does) is in CShaAcceleratorDriver. The rest is just supporting/test. What do you think: https://github.com/blue-leaf-software/accel-share.

Is it worth doing another teams session to discuss the way forward?

@gojimmypi
Copy link
Contributor

I was travelling and wasn't able to join the webinar

A recording of my Getting Started with wolfSSL on the ESP32 webinar is now available.

but a colleague did and found the seminar very helpful

❤️ thank you

I can never connect from the .net client to the ESP32s3 server ... as far as I know the .net client doesn't use it

That's an interesting comment. Are you not using wolfSSL on the .Net side? There's a C# Wrapper complete with Client and Server. I've not personally used these yet, but this might be an excellent opportunity to take them for a test drive,

I found dotnet/runtime#84136 that references a Microsoft Blog. It states:

This feature is automatically provided by SChannel on Windows

So it sounds like if you are using .NET 7, the C# side should support TLS 1.3 sessions resumption. I've not personally tested this.

had a go at mocking up a concept

It would definitely be good if we were working on the exact same code to resolve this. I've been trying to reproduce the problem you are without using your code.

I'll take a look at your accelerator-sharing example, but what I'd really like to see is the real multi-task TLS connection failure.

I'm not sure where that context was supposed to release the hardware, but it looks like it didn't.

In theory, my idea is that when wc_Sha256Free is called, this call to esp_sha_release_unfinished_lock would get cleaned up if it is on the same same task otherwise set ctx->mode = ESP32_SHA_FREED so that another task needing HW can later release it.

it didn't get unlocked anywhere.

Yes, there needs to be some mechanism to indicate an in-progress hash has been abandoned without sha[n]final() being called. If the current lock is freed without the final (i.e. has ctx->mode == ESP32_SHA_FREED) and another call in the same task wants to use HW, it should be released.

One of the first things I thought of when I saw your example code is the magical process of garbage collection in C++. I wonder how long before the finalize / free code is called?

Is it worth doing another teams session to discuss the way forward?

Yes. Let me do a little more homework and we can schedule something.

@PaulMartinsen
Copy link
Author

Just time for a couple of comments before rest of day starts...

My .net client only fails to connect with the proposed changes. Normally it connects fine. I'm using .net 4.8.2 for historical reasons. Some of microsoft documentation on tlsv1.3 resumption didn't match my tests. I eneded up using a wolfssl based server to check resumption worked for my client

My concept avoids the problem of abandoned hashes locking the hardware for good, or needing something else to free the lock in a specific thread, by keeping a hash state in the hash context which tracks where that hashes result is and keeping locks within the accelerator driver. I think you already have something similar to track initialization, which is where I got the idea.

C++ cleanup isn't asynchronous like garbage collection in c# . Destructor get called when stack variables go out of scope. More like the 'using' keyword in c#. I think I only take advantage of that in the concept to make sure the mutex locks are released when the function exits. In otherwords, the cleanup happens as the stack gets unwound.

@PaulMartinsen
Copy link
Author

Hi @gojimmypi , an update on interleaving to fix this issue...

I've had a first go at a C implementation of the interleaving concept for the sha accelerator on, with a VisualGDB demo/test program for the ESP32s3.

Currently only the Sha1 hash is implemented so definitely a work in progress with the following still to do:

  • test on ESP32 etc.
  • add wrappers for other hash sizes (still using previous approach).
  • test with WolfSSL tests.
  • work out if copy operation does the right thing (or anything useful).
  • figure out best way to deal with partial result buffer.
  • review for race conditions/ improve multi-threaded efficiency.

I spent a few hours trying to figure out the initializer and lockDepth inside WC_ESP32HA without success. So, I ended up starting with a clean slate, placing a version 2 into esp32_shaV2.c. Hopefully this is a useful prototype to see how to fit it amongst the locks and initializers, if those are still needed.

I found, in v5.0, Espressif have started a hardware abstraction layer. It lives in components/hal/ and (for example) components/hal/esp32s3/. Espressif note, the hardware abstraction API is an experimental feature and recommend making a copy if you want to rely on the API. But I used it in place. This greatly simplifies working with the sha accelerator registers and features across different targets, with pre-processor macros such as SOC_SHA_SUPPORT_RESUME and SOC_SHA_SUPPORT_SHA384 and functions such as sha_hal_hash_block, sha_hal_read_digest. I'm not sure if it was available before v5.0 though.

I also found a esp_crypto_sha_aes_lock_acquire() function which takes a system wide lock on the hardware accelerator, to manage accelerator contention across components (at least amongst those respecting this lock).

In the prototype,

  • I separated out enabling/disabling the sha accelerator and am not doing this automatically. This proved handy for testing. It could be trivially done automatically inside esp_sha_init_2 and esp_sha_free_2 and probably should be.
  • I added esp_sha_free_2, which gets called inside wc_ShaFinal and handles cleanup. On devices that don't support interleaving this makes the accelerator available for the next user.
  • esp_sha_process_2 and esp_sha_digest_process_2 either do the hash or return SHA_HW_FALLBACK to fall back to the software implementation. I didn't have a need for esp_sha_try_hw_lock or esp_sha_hw_unlock, but I may have missed something. It was straight forward to update wc_Sha with a minor change for this approach.
  • despite the hardware abstraction layer, flipping endians still remains necessary.
  • I don't use the initializer, mode or lockDepth fields in WC_ESP32SHA.

I'll keep working on the above to dos over the next week or so. But I was wondering what issues you see with this approach?

@gojimmypi
Copy link
Contributor

I'm closing this issue for now, but I've added it to my Mothball section in #6234.

See also the updates in #7535, #7505 and #7262 that may have addressed the issue here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants