Resolve MQTT DNS asynchronously #1298
It seems promising; is anyone able to test? @divadiow? @MaxineMuster?
I'm afraid I'm away now until 5th August.
I don't usually use MQTT, so I don't know exactly what to test. I installed Mosquitto and can confirm that all my devices (LN882H, BK7231N and W801) connect to MQTT via hostname. The log shows this for all devices:
Also works on w600:
I have now installed build "1298_merge_0b44b8f5db05" (https://github.com/openshwprojects/OpenBK7231T_App/actions/runs/10123274891) on a BL602, as I had a daily WATCHDOG reboot. So far there are no problems, but it is too early to tell whether it fixed my BL602 problem, which I am hoping is related.
Can we merge this?
It did not fix my BL_RST_SOFTWARE_WATCHDOG problem, but I experienced no other problems with the patch.
So it's ready to merge, let's say, tomorrow?
Should be. It has worked fine for a week already on both BK7231N and BL602.
What drivers do you run? BL_RST_SOFTWARE_WATCHDOG on BL602 can be caused by something blocking the main thread, as in the case this PR addresses, but it can also happen if something else crashes.
"0 drivers active, total 13". I'm not sure how to list the 13 drivers, and I'm not sure it's relevant since 0 are active.
That means nothing extra is running, which makes it even more curious what could be causing those reboots.
Besides that, I have now discovered that this change does not really solve MQTT name resolution. My MQTT server was down for an hour, and when I restarted it, the BL602 stayed in an endless name resolution for the MQTT host name without reconnecting. The log repeatedly shows "Info:MQTT:mqtt_userName ???????", and the home page says "MQTT State: disconnected RES: 0(ERR_OK)". I reverted to version 1.17.651.
It looks like the callback is never called in your case. I guess I need to add a timeout.
Is there anything you want me to try?
I have now limited the wait for DNS to 10 s. Hopefully that solves the issue @diepeterpan is seeing.
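Because the DNS callback may never arrive, the main loop needs a deadline it can check on every tick instead of waiting. A minimal sketch, assuming a free-running 32-bit millisecond tick counter; `dns_wait_expired` and `DNS_WAIT_LIMIT_MS` are hypothetical names for illustration, not the code in this PR:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the 10 s limit on waiting for the DNS callback. */
#define DNS_WAIT_LIMIT_MS 10000u

/* Returns true once more than limit_ms have elapsed since start_ms.
 * Subtracting first (unsigned arithmetic) keeps the check correct when
 * the 32-bit millisecond counter wraps around (about every 49.7 days). */
static bool dns_wait_expired(uint32_t now_ms, uint32_t start_ms, uint32_t limit_ms)
{
    return (uint32_t)(now_ms - start_ms) > limit_ms;
}
```

On expiry the loop would abandon the lookup and schedule a retry rather than keep waiting. Comparing the elapsed time against the limit, instead of comparing `now_ms` against `start_ms + limit_ms`, is what makes the check survive counter wraparound.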
I suspect DHCP renewal, based on the timing. I will assign a fixed IP, entering it in the BL602 web interface, to see whether it still gives a BL_RST_SOFTWARE_WATCHDOG error daily. For now, I will not test this MQTT change until I conclude the fixed-IP test. BTW, maybe it's just me, but the device does not seem to use the name from "Configure Names" to register in DHCP: whatever I set, it uses Bouffalolab_BL602-390daf. Even with Flag 29 ([NETIF] Use short device name as a hostname instead of a long name) set, the name Bouffalolab_BL602-390daf is registered. Can anybody else test my wild accusations? Looking at the source, the BL602 SDK does not call CFG_GetOpenBekenHostName, whereas the BK7231N SDK does: https://github.com/search?q=repo%3Aopenshwprojects%2FOpenBL602%20CFG_GetOpenBekenHostName&type=code
A custom host name is not implemented on BL602.
When I change to a fixed IP, my BL_RST_SOFTWARE_WATCHDOG error disappears on version 1.17.651. I am connecting to an OpenWrt router with a default DHCP lease time of 12 hours and always had BL_RST_SOFTWARE_WATCHDOG around the 12-hour mark, so DHCP renewal seems not to be working correctly. Is anybody else having a similar experience? (PS: I am still not testing the MQTT change; I want to run the current stability test on version 1.17.651 a bit longer.)
It works fine for me with the same 12 h lease time, so it could be something about your particular configuration or hardware. Maybe try increasing the lease time on the DHCP server and see whether the reboots follow? I'm also using OpenWrt 22.03.
Also, in the router log I see my socket sending a DHCPREQUEST message roughly every 4-6 hours, so the lease should never expire.
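For reference, RFC 2131 sets the default renewal timer (T1) to 0.5 and the rebinding timer (T2) to 0.875 of the lease time; these defaults apply when the server does not send options 58 and 59. A small hypothetical helper (for illustration only, not lwIP's or the SDK's code) makes the arithmetic concrete for the lease times discussed here:

```c
#include <stdint.h>

/* RFC 2131 default timers, used when the server does not send
 * option 58 (renewal time, T1) or option 59 (rebinding time, T2). */
typedef struct {
    uint32_t t1_s; /* start unicast DHCPREQUEST to the leasing server */
    uint32_t t2_s; /* fall back to broadcast DHCPREQUEST to any server */
} dhcp_timers_t;

/* Hypothetical helper: default T1/T2 in seconds for a lease of lease_s seconds. */
static dhcp_timers_t dhcp_default_timers(uint32_t lease_s)
{
    dhcp_timers_t t;
    t.t1_s = lease_s / 2u;                               /* 0.5 * lease */
    t.t2_s = (uint32_t)(((uint64_t)lease_s * 7u) / 8u);  /* 0.875 * lease */
    return t;
}
```

With a 12-hour lease (43200 s) this gives T1 = 21600 s (6 h) and T2 = 37800 s (10.5 h); with a 3-hour lease, T1 = 5400 s and T2 = 9450 s. A client whose renewals succeed never reaches lease expiry.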
I am not changing my DHCP server for just one device; I have 50 other devices that have run fine for more than 2 years. If it is stable with a fixed IP, then that is what I will have to do for OpenBeken on BL602. Thanks for the suggestions.
You don't need to change the DHCP server for that; you can just edit the lease time. Or you can set a static lease for that particular device and set the lease time on it.
Thanks for the tip. I will switch the device back to DHCP, set a specific MAC/IP reservation on OpenWrt with a 3-hour lease, and test.
It renewed the reserved IP after 3 hours without problems. One thing I noticed is that OpenWrt does not accept DNS names containing an underscore (_) when creating a DHCP reservation, e.g. Bouffalolab_BL602-390daf, so I had to change the reservation to Bouffalolab-BL602-390daf. I'm not sure whether that plays a role during renewal without a reservation, causing a BL_RST_SOFTWARE_WATCHDOG. It feels like I am hunting for a needle in a haystack. Let's see whether it is still good after 24 hours.
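Host names are limited to letters, digits and hyphens (RFC 952, as relaxed by RFC 1123), which is likely why OpenWrt rejects the underscore. A minimal sketch of turning the default name into a valid label, assuming ASCII input; `sanitize_hostname` and `is_ldh` are hypothetical helpers, not something OpenBeken or the BL602 SDK is known to provide:

```c
#include <stddef.h>

/* Letters, digits and hyphen: the characters RFC 952/1123 allow in host names. */
static int is_ldh(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '-';
}

/* Hypothetical helper: copies src into dst (capacity dst_size), replacing
 * every disallowed character with '-' and truncating to the 63-character
 * DNS label limit. Returns the number of characters written. */
static size_t sanitize_hostname(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    if (dst_size == 0)
        return 0;
    while (src[n] != '\0' && n < 63 && n + 1 < dst_size) {
        unsigned char c = (unsigned char)src[n];
        dst[n] = is_ldh(c) ? (char)c : '-';
        n++;
    }
    dst[n] = '\0';
    return n;
}
```

Applied to Bouffalolab_BL602-390daf, this yields Bouffalolab-BL602-390daf, the name used for the reservation above. The sketch does not strip leading or trailing hyphens (also disallowed) or handle an empty result; a fuller version would.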
I got a BL_RST_SOFTWARE_WATCHDOG within 12 hours of changing from a fixed IP to a 3-hour reserved DHCP lease. I am going back to a fixed IP for now.
I now have a second device to test on: 0 drivers active and no MQTT set up on the BL602. Using a 3-hour reserved DHCP entry, I received a BL_RST_SOFTWARE_WATCHDOG error within 6 hours. This is what the OpenWrt router showed for the BL602 device when it hit the watchdog: Fri Aug 9 14:46:04 2024 daemon.notice hostapd: wl0-ap0: AP-STA-DISCONNECTED 24:94:94:39:12:4d. If this is isolated to my environment, I will run these two devices on a fixed IP for now and make a note of it for myself.
OK, getting back to this: I configured the second device with a fixed IP, connected it to a test MQTT server that I can switch off and on, and installed Openbeken Built on Aug 6 2024 17:28:02 version 1298_merge_cf69b4d9b27d. I will report back in 24 hours.
I switched off my MQTT server for more than an hour and got a BL_RST_SOFTWARE_WATCHDOG on Openbeken Built on Aug 6 2024 17:28:02 version 1298_merge_cf69b4d9b27d. I am going to run the same test on version 1.17.653 to see whether there's a difference, i.e. whether this change actually improves anything, at least for me.
I don't see any difference between this merge build and the current version 1.17.653. I tested the same MQTT scenarios and got the same BL_RST_SOFTWARE_WATCHDOG problem when the MQTT server is down for an extended period, e.g. 40-70 min.
This was never intended to fix daily reboots, and it was always a stretch that it's the same issue; very likely it is not.
With the cf69b4d build?
Then I don't experience the issue this patch is trying to solve. Sorry for wasting your time.
Yes, your issue seems different. If you can provide a serial log (not the log from the browser) of the restart in a new bug report, then maybe we can find the cause. Without that, there is not much we can do.
Understood. Maybe some day I will have the energy to disassemble the bulb again and solder wires for the serial log. For now, a fixed IP seems to improve stability, and it's rare that my network or MQTT is down. Thanks for the offer. PS: I tested my fixed-IP theory further, and that also went down the drain after 12 hours. :-(
@giedriuslt Sorry, I am out of the loop. Is the "callback not getting called" issue fixed now?
Yes, that should be fixed.
Thank you, so let's give it a chance.
🎉 This PR is included in version 1.17.655 🎉 The release is available on: GitHub release. Your semantic-release bot 📦🚀
The 655 release did not help.
I am also running version 1.17.655 on a separate BL602 with serial logging captured, hoping to catch a BL_RST_SOFTWARE_WATCHDOG. Fingers crossed that we can get to the root of it. I will keep you posted if and when I catch da’DOG.
I got a WATCHDOG error; this is the serial log, 100 lines before and after. It does not look like it contains anything relevant. Is there anything I can do to get more information that might help?
Well, this indicates that something is taking a long time (>4 s) in the main loop, so the watchdog reboots the device. This is very similar to what I had to track down with the MQTT reboots on lost Wi-Fi, but in that case the reboot happened right after some MQTT connection info was printed, so it was rather easy to trace. Now there is nothing to go by... I can only think of adding more logging to the main loop and hoping that it points to the cause.
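One way to add that logging is to print each step's name on entry, so that if a step hangs and the watchdog fires, the last line on the serial console names it, and to report steps that overrun a budget well below the roughly 4 s watchdog limit. A minimal sketch; `loop_trace_t`, `trace_begin`, `trace_end` and the budget value are hypothetical, and this is not the code of #1322:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical tracer for main-loop steps. */
typedef struct {
    const char *current;   /* step currently running */
    uint32_t start_ms;     /* tick when it started */
    const char *slowest;   /* slowest step seen so far */
    uint32_t slowest_ms;
} loop_trace_t;

/* Log on entry: if the step hangs and the watchdog resets the chip,
 * this is the last line printed to the serial console. */
static void trace_begin(loop_trace_t *t, const char *name, uint32_t now_ms)
{
    t->current = name;
    t->start_ms = now_ms;
    printf("loop: enter %s\n", name);
}

/* Log on exit only if the step exceeded budget_ms; returns elapsed ms
 * and keeps track of the slowest step so far. */
static uint32_t trace_end(loop_trace_t *t, uint32_t now_ms, uint32_t budget_ms)
{
    uint32_t elapsed = (uint32_t)(now_ms - t->start_ms);
    if (elapsed > budget_ms)
        printf("loop: %s took %lu ms (budget %lu ms)\n", t->current,
               (unsigned long)elapsed, (unsigned long)budget_ms);
    if (elapsed > t->slowest_ms) {
        t->slowest = t->current;
        t->slowest_ms = elapsed;
    }
    return elapsed;
}
```

In the loop, each step would be wrapped in a `trace_begin(&t, "mqtt", now)` / `trace_end(&t, now, 1000)` pair. Logging every entry is noisy, so this belongs in a debug build such as the one linked below rather than in a release.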
Thank you, I will gladly run a test build and see whether we can catch it again with more logging. Should we open a new bug to try to trace the BL_RST_SOFTWARE_WATCHDOG?
I created a build which traces most of the main loop functionality to the serial console: #1322
This makes use of asynchronous DNS resolution in lwIP, which fixes the restarts that happen when the internet connection breaks. Tested on BK7231N and BL602.
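In lwIP 2.x, `dns_gethostbyname(hostname, &addr, found_cb, arg)` returns `ERR_OK` when the answer is already cached (`addr` is filled in), `ERR_INPROGRESS` when `found_cb` will deliver it later, and an error code otherwise; the callback receives a NULL address when resolution fails. The main loop therefore never has to wait. The sketch below models that contract with hypothetical stand-in types (`res_t`, and a `uint32_t` address instead of `ip_addr_t`) so it builds without lwIP; it is an illustration, not the code in this PR:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch builds without lwIP:
 * res_t mirrors ERR_OK / ERR_INPROGRESS / other err_t values, and a
 * uint32_t IPv4 address replaces ip_addr_t. */
typedef enum { RES_OK, RES_INPROGRESS, RES_ERROR } res_t;
typedef enum { DNS_IDLE, DNS_PENDING, DNS_DONE, DNS_FAILED } dns_state_t;

typedef struct {
    dns_state_t state;
    uint32_t addr;          /* valid only when state == DNS_DONE */
} dns_job_t;

/* Call before starting the lookup, so an early callback finds the job pending. */
static void dns_job_begin(dns_job_t *job)
{
    job->state = DNS_PENDING;
    job->addr = 0;
}

/* Mirrors the shape of lwIP's dns_found_callback (which takes
 * const ip_addr_t *); ip is NULL when resolution failed. */
static void on_dns_found(const char *name, const uint32_t *ip, void *arg)
{
    dns_job_t *job = (dns_job_t *)arg;
    (void)name;
    if (job->state != DNS_PENDING)
        return;             /* abandoned (e.g. timed out): ignore late answer */
    if (ip != NULL) {
        job->addr = *ip;
        job->state = DNS_DONE;
    } else {
        job->state = DNS_FAILED;
    }
}

/* Handle the immediate return code of the lookup call. Never blocks. */
static void dns_job_started(dns_job_t *job, res_t rc, uint32_t cached_addr)
{
    if (rc == RES_OK) {     /* answer came from the cache; no callback follows */
        job->addr = cached_addr;
        job->state = DNS_DONE;
    } else if (rc != RES_INPROGRESS) {
        job->state = DNS_FAILED;
    }
    /* RES_INPROGRESS: on_dns_found() completes the job later. */
}

/* Give up on a pending lookup, e.g. after a timeout. */
static void dns_job_abandon(dns_job_t *job)
{
    if (job->state == DNS_PENDING)
        job->state = DNS_FAILED;
}
```

The main loop only checks `job.state` on each tick and connects once it is `DNS_DONE`. The sketch assumes the callback and the main loop never touch the job concurrently; on a port where they can, access needs a lock. A fuller version would also tag each request so that a late answer cannot complete a newer lookup reusing the same job.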
Fixes #1284.
XR809 uses a very old lwIP, so there is no way to ask specifically for an IPv4 address, but it probably does not support IPv6 anyway, so this should not be an issue.