Make OTA more robust for battery powered devices #1552
base: dev
Two issues are addressed: (1) On congested networks, the upgrade process may fail because the radio fails delivery of a message (page or block request, of which there are potentially thousands for a new image). Instead of stopping the upgrade process right away, we let it run to allow the devices to retry fetching the page or block. (2) There are battery powered devices that need to be interacted with in a short time window to kick off the upgrade process. Not all battery powered devices do this (famous RWL021 at least in my network), but instead query for new images themselves. By not failing the initial notification AND by extending the timeout for the OTA, we give these devices an opportunity to query for this upgrade when they check in. At least for RWL021, this approach has worked for me.
This may also address the issues seen in #1401, where it likewise looks like OTA was cancelled upon receiving the first failed block.
I can now report that the Philips device has resumed the firmware update and managed to complete it, so that worked as I hoped it would:
2025-02-12 16:50:44.682 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (40 / 240760): 0.0166%
I'm hoping that zigpy/bellows#668 will help with the communication difficulties by having zigpy internally retry and thus communicate more reliably with end devices. For the most part, IKEA devices seem to sleep for about an hour at a time so I think we could reduce |
@puddly I think there is a misunderstanding. An IKEA end device negotiates the timeout (end device timeout) when it connects to its parent, provided the parent is a good Zigbee 3 router, and then uses it when polling its parent for new frames, so it depends on the pairing / last hop. But if you are thinking of the end device checking in with the coordinator, you are right: that normally happens about every 50 minutes, and the coordinator should acknowledge with fast poll and a timeout XX. During that time the device should stay "online" to receive commands until the timeout expires or the coordinator sends a command to end the fast poll.
@MattWestb It's a little hard to say right now since I don't have a sniffer handy but I was working on trying to improve the reliability of communication with aggressively sleeping end devices (like IKEA) and I'm unable to contact them even after maybe 10 consecutive I'm actually not sure if this is a misconfiguration on our part when it comes to setting the long poll interval upon joining but they're really the only ones that are troublesome for initiating OTA updates. |
@puddly Do you need a sniff of pairing a gen3 controller to a Zigbee 3 router?
Retrying more reliably will certainly help. But I guess it won't address all OTA issues with battery powered devices. In the case of the RWL021 switch, an attempt to update the device is not successful, even if the device is interacted with in the time window for starting the OTA upgrade. A reset may help (I haven't tried). With the one day timeout and a pending upgrade, the image is made available for the device, and once the device checks in (and queries for firmware updates) it will proceed. In the observed case it was able to start pulling the image. Then (because of congestion, interference, or whatever other reason) the update stalled. After some hours, the device checked in again and resumed pulling the firmware packets. Eventually, the upgrade of that device succeeded.
The specific IKEA device I am looking at (Tradfri on/off) has been sleeping for 24 hours. The Philips device has been sleeping for, I guess, 8 hours. So increasing the timeout to 1 hour wouldn't allow the upgrade to finish if the objective is to allow the device to resume the firmware download at a later point in time. A more generic approach will probably require two strategies for OTA: one for mains powered devices, where the coordinator initiates the upgrade, and one for battery powered devices, where the coordinator makes a firmware image available for the device to request when it is ready. In both cases, finishing (failing) the upgrade process when there is a radio problem and ignoring attempts by the device to retry fetching a block are, I believe, too aggressive. Devices (at least the two battery devices I am looking at in my network) will re-request after about 10 seconds when they don't receive the block they requested. I don't claim that this PR will fix all issues. But I believe it will improve the reliability of the upgrade process for battery powered devices with only small adjustments.
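The two strategies described above could be expressed as a small per-device policy. This is only a sketch under assumptions: `PowerSource`, `OtaPlan`, `plan_for` and the 5 minute mains timeout are hypothetical names and values for illustration, not part of zigpy; only the one day timeout comes from this discussion.

```python
from dataclasses import dataclass
from enum import Enum


class PowerSource(Enum):
    MAINS = "mains"
    BATTERY = "battery"


@dataclass(frozen=True)
class OtaPlan:
    """Hypothetical OTA policy; the field names are illustrative only."""

    require_notify_ack: bool      # fail early if the initial notify is not delivered
    timeout_s: float              # how long the image stays on offer
    stop_on_delivery_error: bool  # end the session on the first failed block


def plan_for(power: PowerSource) -> OtaPlan:
    """Pick an OTA policy based on how the device is powered."""
    if power is PowerSource.MAINS:
        # Coordinator-initiated: the device is expected to react to the notify.
        # The 5 minute value is an assumption for this sketch.
        return OtaPlan(require_notify_ack=True, timeout_s=5 * 60,
                       stop_on_delivery_error=False)
    # Device-initiated: keep the image on offer until the device checks in.
    return OtaPlan(require_notify_ack=False, timeout_s=24 * 60 * 60,
                   stop_on_delivery_error=False)
```

In both plans `stop_on_delivery_error` is `False`, reflecting the point that a single radio failure should not end the session.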
If IKEA controllers running current firmware (all of them, including gen 1 (ZLL), have received a Zigbee 3 update) are not doing check-ins, then they are not configured correctly and need to be reconfigured. They normally go to sleep very quickly after being paired, so it can be tricky to get this right, but they work fine once they are doing check-ins (although they can still have the OTA problems that you are working on).
The current firmware on it is 0x23079631 and it is right now downloading 0x24040006. I don't have debug logging active on this instance, but it looks like this switch is checking for new firmware every 24 hours. Pushing a button on the switch did not start the upgrade process. |
@MattWestb Sure! I will take a look. I know we had some issues with fast/slow polling for IKEA devices causing battery drain issues so maybe we're missing something with the newer ones that would help with this issue.
Interesting. This isn't something I've run into myself so if you are seeing devices do this, we should re-think the timeout strategy. My concern with increasing it to 24 hours is that the OTA progress dialog within Home Assistant will just be stalled for the entire duration with no way to cancel it or have any feedback as to why it isn't progressing. We could adjust the messaging to reflect any new behavior however. From what I recall, many devices check in about once a day so it'll be difficult to reliably initiate an OTA update via user interaction and have it work for every end device. Many do actually poll their parent router frequently enough to receive the notification (about once every 8 seconds). Others need to be "woken up" to poll, which yours should be doing. The fact that they aren't seems like a more fundamental bug either with the way we send requests to end devices or the device firmware itself. |
I fully agree that this is not good UX. But OTOH, having a device offering an update that I can never get is not good UX either.
Definitely - the major concern I would have is that there is no option to cancel the update (other than restarting HA) once it is started or times out after 24 hours. But those would be changes that are way out of my comfort zone.
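One way the cancellation concern could be handled is to wait on either the device's query or a cancel signal, whichever comes first, with the long timeout as a fallback. The following is a minimal `asyncio` sketch; `wait_for_device_query` and both events are hypothetical and not part of zigpy or Home Assistant.

```python
import asyncio


async def wait_for_device_query(query_seen: asyncio.Event,
                                cancel: asyncio.Event,
                                timeout_s: float) -> str:
    """Wait until the device asks for the image, the user cancels, or time runs out.

    Returns "started", "cancelled" or "timeout".
    """
    query_task = asyncio.ensure_future(query_seen.wait())
    cancel_task = asyncio.ensure_future(cancel.wait())
    done, pending = await asyncio.wait(
        {query_task, cancel_task},
        timeout=timeout_s,
        return_when=asyncio.FIRST_COMPLETED,
    )
    # Clean up whichever waiter did not fire.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if not done:
        return "timeout"
    return "cancelled" if cancel_task in done else "started"


async def demo():
    query_seen, cancel = asyncio.Event(), asyncio.Event()
    # Simulate the user pressing "cancel" shortly after starting the upgrade.
    asyncio.get_running_loop().call_later(0.01, cancel.set)
    first = await wait_for_device_query(query_seen, cancel, timeout_s=1.0)
    cancel.clear()
    # Nobody cancels and the device never checks in: the wait times out.
    second = await wait_for_device_query(query_seen, cancel, timeout_s=0.01)
    return first, second
```

With a real 24 hour timeout the same structure would let a UI end the pending upgrade without a restart.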
The approach would be different - instead of instructing the device to do an upgrade, from what I can see in my logs, devices will query the coordinator if there is a firmware upgrade available in intervals. Quite long intervals for battery powered devices, shorter intervals for mains powered devices. I am pretty sure the initial "notify" is not what is triggering the firmware request after some hours.
By no means do I claim to have any knowledge about how Zigbee works. I haven't read the specs; I just happen to have a network with about 100 various devices where I am seeing these kinds of issues. But as far as I understand, battery powered devices will turn off their radio unless they need to send something (like a button press) and from time to time (in an interval of hours) will check in for things like firmware upgrades, key changes, etc. You're saying that the specs say you can "wake up" a battery device over radio? And thus the expectation is that the initial notify is reaching the device? I can say that the Tradfri On/Offs, the Tradfri Switches and the Philips RWL021 do not do this in my network. The RWL021 will not receive the notify even when woken up by a button press. The only way I could get these devices to upgrade was to offer the upgrade file with the changes I made and let the device request it when it was ready.
The Somrig shortcut button requests an end device timeout of 8 minutes, so the parent should hold commands for it and not flag it as gone if the network asks for its parent.
Sniff sent to Puddly's Gmail! End device polling its parent router for commands: Check-ins: I had a very good Silabs paper on this but can't find it.
Found a good paper on the poll control mechanism: If the end device timeout is not used, the classic Zigbee HA poll timing applies:
FYI and for reference, a related issue with a feature request for check and block was also raised in the zha library repository; it is also very relevant to making OTA upgrades more robust for battery powered Zigbee devices: That in turn came out of a discussion about adding more manual checks, which without this are now recommended to users:
Quite a discussion has started around this topic, so this is an attempt to refocus on some of the issues I am seeing in my network around OTA. As I have already mentioned, I am not using a dedicated small network for this but my home network with about 100 devices. These consist of quite a mix of multiple generations of devices: ancient Osram plugs, old and recent bulbs and plugs from Innr, some Philips devices, quite a number of Müller Licht tint bulbs, IKEA switches and bulbs, some Ubisys devices and probably a few more that I do not remember off the top of my head.

My user experience with OTA in my network is, with the current state of affairs, a bit frustrating. I recently added some new bulbs and wanted to bind them to old IKEA on/off switches. I do know that for device binding to work properly, the IKEA switches need a firmware upgrade. Before the OTA revamp I had configured HA to auto-upgrade IKEA devices because of this, and this appeared to work. There was no user interaction and there were no logs, but after some time these devices got the new firmware and all was good. These days, I need to start the upgrade process manually, and this is where this all started.

It has been reported in various forums that OTA upgrades need patience. After having run into failed upgrades I started to research, and I remember one forum message where a user reported that eventually, after a week of continuously trying to upgrade an IKEA switch, they succeeded. And I could reproduce this. For about a week, I started an OTA for that switch before I left home, found that it stopped along the way, started again in the evening to find the next morning that it again just didn't complete, did a couple of factory resets and battery swaps until it suddenly completed the upgrade one evening.
My journey started when I looked into the logs of the upgrade process and found lines like these:
2025-02-14 08:48:19.026 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (41440 / 205488): 20.1666%
So the devices do in fact re-request blocks at times. Whether they are supposed to do this according to the specs I do not know (and honestly, do not care, because this IKEA device just does it). I also found that when the upgrade stops, it does so with a generic FAILURE: 1 in the logs. And this is where I started patching, eventually seeing that when delivering the upgrade block runs into an issue, it will just "finish" the upgrade process, failing to respond to block requests even if the devices send them. Like so (already with additional logging that I have in my version now):
2025-02-14 08:41:28.647 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (33280 / 205488): 16.1956%
What we see here is a successful request for blocks with offset 33280, then some exception and then a successful transmission from block 33360. This means that while the local stack reported an error, the device actually received the blocks. There are other cases where the device will re-request after some seconds (in this case 5; for RWL021 I did see 10):
2025-02-14 08:30:50.777 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (18760 / 205488): 9.1295%
This led me to the conclusion that the robustness of OTA upgrades can be improved when the upgrade is not stopped as soon as the first block delivery fails. As shown, at least IKEA as well as Philips devices will re-request after some time if they didn't get the block. If those devices do this, why ignore it by stopping the infrastructure that delivers image blocks? I therefore suggest splitting this PR into two parts: let's discuss the timeout and the potential impact on the UX in this PR.
If you agree, I will create another PR that only removes the "finish" calls on the OTA handler when it encounters failed block requests, so devices have a chance to re-request a failed block if they're up to it. I believe this is a low risk change that could be released in a short time frame, focusing just on the reliability of an OTA upgrade once it has started.
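The proposed behaviour can be sketched as follows. This is a simplified, hypothetical session object, not zigpy's actual OTA handler: `OtaSession`, `handle_block_request` and `send_block` are names invented for illustration, and `DeliveryError` is a local stand-in for `zigpy.exceptions.DeliveryError` so the sketch runs on its own. The 40 byte block size is taken from the offsets in the logs above.

```python
import asyncio


class DeliveryError(Exception):
    """Local stand-in for zigpy.exceptions.DeliveryError."""


class OtaSession:
    """Hypothetical, simplified OTA session that tolerates failed deliveries."""

    def __init__(self, image: bytes, send_block, block_size: int = 40):
        self.image = image
        self.send_block = send_block  # async callable(offset, data)
        self.block_size = block_size
        self.failed_deliveries = 0
        self.finished = False

    async def handle_block_request(self, offset: int) -> bool:
        """Answer one image block request; return True if it was delivered."""
        data = self.image[offset:offset + self.block_size]
        try:
            await self.send_block(offset, data)
        except DeliveryError:
            # Do not finish the session here: the device may re-request the block.
            self.failed_deliveries += 1
            return False
        if offset + len(data) >= len(self.image):
            self.finished = True
        return True


async def demo():
    attempts = {}

    async def flaky_send(offset, data):
        # Fail the first delivery of offset 40, succeed on everything else.
        attempts[offset] = attempts.get(offset, 0) + 1
        if offset == 40 and attempts[offset] == 1:
            raise DeliveryError("simulated delivery failure")

    session = OtaSession(bytes(100), flaky_send)
    results = []
    for offset in (0, 40, 40, 80):  # the device re-requests offset 40
        results.append(await session.handle_block_request(offset))
    return session, results
```

In the demo the second request fails, the session stays open, and the re-request for the same offset is answered, so the transfer still completes.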
When used in HA, the changes in this PR appear to improve how OTA can be done for battery powered devices. For example, this one is an RWL021 that refused to upgrade. With the change, it started fetching a new image when it checked in (hours after I started the upgrade in HA).
While the upgrade did not finish due to an issue on the radio, I see that it will at least retry fetching the block three times (this is with modified logging so I can observe what is happening for this device):
2025-02-12 09:44:10.967 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (78296 / 240760): 32.5204%
2025-02-12 09:44:11.585 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (78336 / 240760): 32.5370%
2025-02-12 09:44:26.158 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-12 09:44:31.303 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-12 09:44:41.091 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-12 09:44:51.092 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
It then stopped requesting new blocks. It remains to be seen whether it will resume downloading the image from where it left off or whether it will restart the transfer. Either way, the chances of a successful upgrade are greatly increased. Using these modifications, I have successfully upgraded two IKEA Tradfri switches that previously refused to complete the upgrade process.
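For anyone wanting to follow a stalled transfer in their own logs, a small parser along these lines may help. It is only a sketch that assumes the line formats shown in the excerpt above; note that the "OTA image_block handler exception" lines come from the modified logging in this PR, so a stock install may not emit them. `parse_progress` and `summarize` are hypothetical helper names.

```python
import re

# Matches e.g. "[0x2722] OTA upgrade progress: (78336 / 240760): 32.5370%"
PROGRESS_RE = re.compile(
    r"\[(0x[0-9a-f]{4})\] OTA upgrade progress: \((\d+) / (\d+)\): ([\d.]+)%"
)


def parse_progress(line):
    """Return (nwk, offset, size, percent) for a progress line, else None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    nwk, offset, size, percent = match.groups()
    return nwk, int(offset), int(size), float(percent)


def summarize(lines):
    """Return the last progress tuple seen and the number of handler exceptions."""
    last = None
    exceptions = 0
    for line in lines:
        progress = parse_progress(line)
        if progress is not None:
            last = progress
        elif "OTA image_block handler exception" in line:
            exceptions += 1
    return last, exceptions


SAMPLE = """\
2025-02-12 09:44:10.967 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (78296 / 240760): 32.5204%
2025-02-12 09:44:11.585 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (78336 / 240760): 32.5370%
2025-02-12 09:44:26.158 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
2025-02-12 09:44:31.303 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
2025-02-12 09:44:41.091 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
2025-02-12 09:44:51.092 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
""".splitlines()

last, exceptions = summarize(SAMPLE)
```

Here `last` holds the final offset reached before the stall and `exceptions` counts the failed attempts that followed it.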