Make OTA more robust for battery powered devices #1552

Open
wants to merge 1 commit into
base: dev

FlorianBruckner

Two issues are addressed:

(1) On congested networks, the upgrade process may fail because the radio fails delivery of a message (page or block request, of which there are potentially thousands for a new image). Instead of stopping the upgrade process right away, we let it run to allow the devices to retry fetching the page or block.

(2) There are battery powered devices that need to be interacted with in a short time window to kick off the upgrade process. Not all battery powered devices do this (famous RWL021 at least in my network), but instead query for new images themselves. By not failing the initial notification AND by extending the timeout for the OTA, we give these devices an opportunity to query for this upgrade when they check in. At least for RWL021, this approach has worked for me.

When used in HA, this appears to improve how OTA can be done for battery powered devices. E.g. this one is a RWL021 that refused to upgrade. With the change, it started fetching a new image when it checked in (hours after I started the upgrade in HA).

image

While the upgrade did not finish due to an issue on the radio, I can see that the device will at least retry fetching the block three times (this is with modified logging so I can observe what is happening for this device):

2025-02-12 09:44:10.967 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (78296 / 240760): 32.5204%
2025-02-12 09:44:11.585 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (78336 / 240760): 32.5370%
2025-02-12 09:44:26.158 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-12 09:44:31.303 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-12 09:44:41.091 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-12 09:44:51.092 INFO (MainThread) [zigpy.device] [0x2722] OTA image_block handler exception

It then stopped requesting new blocks. It remains to be seen whether it will resume downloading the image from where it left off or restart the transfer. Either way, the chances of a successful upgrade are greatly increased: using these modifications, I have successfully upgraded two IKEA Tradfri switches that previously refused to complete the upgrade process.
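
The idea of not ending the upgrade on the first delivery failure can be illustrated with a minimal, self-contained sketch. It does not use zigpy's real classes: `DeliveryError`, `FlakyRadio` and `TolerantOtaSession` are hypothetical stand-ins invented for this example, and the simulated device simply re-requests the same offset until the block arrives.

```python
import asyncio


class DeliveryError(Exception):
    """Stand-in for a radio delivery failure (hypothetical, not zigpy's class)."""


class FlakyRadio:
    """Fake radio that fails the first `failures` sends, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.sent: list[int] = []

    async def send_block(self, offset: int, data: bytes) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise DeliveryError(f"failed to deliver block at offset {offset}")
        self.sent.append(offset)


class TolerantOtaSession:
    """Keeps serving block requests after a delivery error instead of aborting."""

    def __init__(self, radio: FlakyRadio, image: bytes) -> None:
        self.radio = radio
        self.image = image
        self.finished = False
        self.delivery_errors = 0

    async def handle_block_request(self, offset: int, size: int) -> bool:
        """Answer one block request; return True if the block was delivered."""
        data = self.image[offset : offset + size]
        try:
            await self.radio.send_block(offset, data)
        except DeliveryError:
            # Do not mark the session as finished: the device may re-request.
            self.delivery_errors += 1
            return False
        if offset + len(data) >= len(self.image):
            self.finished = True
        return True


async def simulate() -> TolerantOtaSession:
    session = TolerantOtaSession(FlakyRadio(failures=2), image=bytes(120))
    offset = 0
    while not session.finished:
        # The simulated device asks for the same offset until it gets it.
        if await session.handle_block_request(offset, size=40):
            offset += 40
    return session


session = asyncio.run(simulate())
print(session.delivery_errors, session.radio.sent)  # 2 [0, 40, 80]
```

In this toy model the two failed deliveries only delay the transfer; the session stays open, so the retried request for offset 0 is answered and the image completes.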

@FlorianBruckner
Author

This may also address the issues seen in #1401, where it also looks like OTA was cancelled upon receiving the first failed block.

@FlorianBruckner
Author

I can now report that the Philips device has resumed the firmware update and managed to complete it - so that worked as I hoped it would:

2025-02-12 16:50:44.682 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (40 / 240760): 0.0166%
2025-02-12 16:50:45.206 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (80 / 240760): 0.0332%
2025-02-12 16:50:45.746 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (96 / 240760): 0.0399%
2025-02-12 16:50:46.275 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (64160 / 240760): 26.6489%
2025-02-12 16:50:46.798 INFO (MainThread) [zigpy.device] [0x2722] OTA upgrade progress: (64200 / 240760): 26.6656%

@puddly
Collaborator

puddly commented Feb 12, 2025

I'm hoping that zigpy/bellows#668 will help with the communication difficulties by having zigpy internally retry and thus communicate more reliably with end devices.

For the most part, IKEA devices seem to sleep for about an hour at a time so I think we could reduce MAX_TIME_WITHOUT_PROGRESS to 1 hour or even possibly have a separate timeout for sleeping end devices while retaining 30s for routing devices.
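
As a rough sketch of the "separate timeout" idea, the stall timeout could be chosen per device type. The 30 s and 1 hour values come from the comment above; `DeviceInfo`, `is_sleepy_end_device` and `max_time_without_progress` are hypothetical names for illustration and are not taken from zigpy.

```python
from dataclasses import dataclass
from datetime import timedelta

# Values discussed in the comment above; the constant names are illustrative.
ROUTER_TIMEOUT = timedelta(seconds=30)
SLEEPY_END_DEVICE_TIMEOUT = timedelta(hours=1)


@dataclass(frozen=True)
class DeviceInfo:
    """Minimal device description (hypothetical, not a zigpy type)."""

    name: str
    is_sleepy_end_device: bool


def max_time_without_progress(device: DeviceInfo) -> timedelta:
    """Pick the OTA stall timeout based on whether the device sleeps."""
    if device.is_sleepy_end_device:
        return SLEEPY_END_DEVICE_TIMEOUT
    return ROUTER_TIMEOUT


bulb = DeviceInfo("mains powered bulb", is_sleepy_end_device=False)
remote = DeviceInfo("battery remote", is_sleepy_end_device=True)
print(max_time_without_progress(bulb))    # 0:00:30
print(max_time_without_progress(remote))  # 1:00:00
```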

@MattWestb
Contributor

@puddly I think there is a misunderstanding here: an IKEA end device negotiates the timeout (end device timeout) when it connects to its parent, provided the parent is a good Zigbee 3 router, and then uses it when polling its parent for new frames, so it depends on the pairing / last hop.

But if you are thinking of the end device checking in with the coordinator, you are right: normally that happens around every 50 minutes, and the coordinator should acknowledge with a fast poll with timeout XX. During that time the device should be "online" to receive commands, until the timeout expires or the coordinator sends a command to end the fast poll.
One example from a Parasoll, captured today for another issue:
Parasoll01
But I think you are right that bellows very often gets a timeout when trying to communicate with sleepers that have checked in, and when trying to send commands to cluster 0x0020 (OTA), so something spooky is going on there, as you suspect.

@puddly
Collaborator

puddly commented Feb 12, 2025

@MattWestb It's a little hard to say right now since I don't have a sniffer handy but I was working on trying to improve the reliability of communication with aggressively sleeping end devices (like IKEA) and I'm unable to contact them even after maybe 10 consecutive MAC_INDIRECT_TIMEOUTs (so 60-70s of retries). Other devices poll regularly, once every 7-10 seconds so they're easy to reach, but IKEA sleeps for a very very long time.

I'm actually not sure if this is a misconfiguration on our part when it comes to setting the long poll interval upon joining but they're really the only ones that are troublesome for initiating OTA updates.

@MattWestb
Contributor

MattWestb commented Feb 12, 2025

@puddly Do you need a sniff of pairing a gen3 controller to a Zigbee 3 router?
I can capture one and mail it to you if you like, so you can look at the settings it requests from its parent (the end device timeout is the interesting thing).

@FlorianBruckner
Author

> I'm hoping that zigpy/bellows#668 will help with the communication difficulties by having zigpy internally retry and thus communicate more reliably with end devices.

Retrying more reliably will certainly help. But I guess it won't address all OTA issues with battery powered devices.

In the case of the RWL021 switch, an attempt to update the device is not successful, even if the device is interacted with in the time window for starting the OTA upgrade. A reset may help (I haven't tried).

With the one day timeout and a pending upgrade, the image is made available for the device, and once the device is checking in (and querying for firmware updates) it will proceed. In the observed case it was able to start pulling the image. In this case (because of congestion, interference, whatever other reason) the update stalled. After some hours, the device checked in again and resumed pulling the firmware packets. Eventually, the upgrade of that device succeeded.

> For the most part, IKEA devices seem to sleep for about an hour at a time so I think we could reduce MAX_TIME_WITHOUT_PROGRESS to 1 hour or even possibly have a separate timeout for sleeping end devices while retaining 30s for routing devices.

The specific IKEA device I am looking at (Tradfri on/off) has been sleeping for 24 hours. The Philips device has been sleeping for I guess 8 hours. So increasing the timeout to 1 hour wouldn't allow the upgrade to finish if the objective is to allow the device to resume the firmware download at a later point in time.
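
To make the timeout argument concrete, here is a small sketch that replays the stall seen in the logs earlier in this thread (last activity at 09:44:51, download resumed at 16:50:44). `PendingUpgrade` and `on_block_request` are hypothetical names invented for this illustration; only the two timestamps and the 1 hour / 24 hour values come from the discussion.

```python
from datetime import datetime, timedelta


class PendingUpgrade:
    """An offered image that stays valid while the device shows progress."""

    def __init__(self, max_idle: timedelta, started: datetime) -> None:
        self.max_idle = max_idle
        self.last_progress = started

    def on_block_request(self, now: datetime) -> bool:
        """Return True if the request is still served, False if the offer expired."""
        if now - self.last_progress > self.max_idle:
            return False
        self.last_progress = now
        return True


# Timestamps taken from the log excerpts in this thread.
stalled = datetime(2025, 2, 12, 9, 44, 51)
resumed = datetime(2025, 2, 12, 16, 50, 44)

one_hour = PendingUpgrade(timedelta(hours=1), started=stalled)
one_day = PendingUpgrade(timedelta(hours=24), started=stalled)

print(resumed - stalled)                   # 7:05:53
print(one_hour.on_block_request(resumed))  # False
print(one_day.on_block_request(resumed))   # True
```

With a gap of roughly seven hours, a 1 hour limit would have dropped the offer before the device came back, while a 24 hour limit keeps it available.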

A more generic approach will probably require two strategies for OTA: one for mains powered devices, where the coordinator initiates the upgrade, and one for battery powered devices, where the coordinator makes a firmware image available for the device to request when it is ready. In both cases, finishing (failing) the upgrade process when there is a radio problem and ignoring attempts by the device to retry fetching a block are, I believe, too aggressive. Devices (at least the two battery devices I am looking at in my network) will re-request after about 10 seconds when they don't receive the block they requested.

I don't claim that this PR will fix all issues. But I believe it will improve the reliability of the upgrade process for battery powered devices with only small adjustments.

@MattWestb
Contributor

> The specific IKEA device I am looking at (Tradfri on/off) has been sleeping for 24 hours.

If IKEA controllers running current firmware (all of them, including gen 1 (ZLL), have received a Zigbee 3 update) are not doing check-ins, then they are not configured correctly and need to be reconfigured. They normally go to sleep very quickly after being paired, so getting this right can be tricky, but once they do check-ins they work fine (though they can still have problems with OTA, which is what you are working on).

@FlorianBruckner
Author

> If IKEA controllers running current firmware (all of them, including gen 1 (ZLL), have received a Zigbee 3 update) are not doing check-ins

The current firmware on it is 0x23079631 and it is right now downloading 0x24040006. I don't have debug logging active on this instance, but it looks like this switch is checking for new firmware every 24 hours. Pushing a button on the switch did not start the upgrade process.

@puddly
Collaborator

puddly commented Feb 12, 2025

> Do you need a sniff of pairing a gen3 controller to a Zigbee 3 router?

@MattWestb Sure! I will take a look. I know we had some issues with fast/slow polling for IKEA devices causing battery drain issues so maybe we're missing something with the newer ones that would help with this issue.

> After some hours, the device checked in again and resumed pulling the firmware packets. Eventually, the upgrade of that device succeeded.

Interesting. This isn't something I've run into myself so if you are seeing devices do this, we should re-think the timeout strategy. My concern with increasing it to 24 hours is that the OTA progress dialog within Home Assistant will just be stalled for the entire duration with no way to cancel it or have any feedback as to why it isn't progressing. We could adjust the messaging to reflect any new behavior however.

From what I recall, many devices check in about once a day so it'll be difficult to reliably initiate an OTA update via user interaction and have it work for every end device. Many do actually poll their parent router frequently enough to receive the notification (about once every 8 seconds). Others need to be "woken up" to poll, which yours should be doing. The fact that they aren't seems like a more fundamental bug either with the way we send requests to end devices or the device firmware itself.

@FlorianBruckner
Author

> My concern with increasing it to 24 hours is that the OTA progress dialog within Home Assistant will just be stalled for the entire duration with no way to cancel it or have any feedback as to why it isn't progressing.

I fully agree that this is not good UX. But OTOH, having a device offer an update that I can never get is not good UX either.

> We could adjust the messaging to reflect any new behavior however.

Definitely - the major concern I would have is that, once the update is started, there is no option to cancel it (other than restarting HA) until it times out after 24 hours. But those would be changes that are way out of my comfort zone.

> From what I recall, many devices check in about once a day so it'll be difficult to reliably initiate an OTA update via user interaction and have it work for every end device.

The approach would be different: instead of the coordinator instructing the device to do an upgrade, the device asks. From what I can see in my logs, devices query the coordinator at intervals to find out whether a firmware upgrade is available. The intervals are quite long for battery powered devices and shorter for mains powered devices.

I am pretty sure the initial "notify" is not what is triggering the firmware request after some hours.

> Others need to be "woken up" to poll, which yours should be doing. The fact that they aren't seems like a more fundamental bug either with the way we send requests to end devices or the device firmware itself.

By no means do I claim to have any knowledge about how Zigbee works - I haven't read the specs, I just happen to have a network with about 100 various devices where I am seeing these kinds of issues. But as far as I understand, battery powered devices will turn off their radio unless they need to send something (like a button press), and from time to time (at an interval of hours) they will check in for firmware upgrades, key changes, etc.

You're saying that the specs say you can "wake up" a battery device over radio? And thus the expectation is that the initial notify is reaching the device? I can say that the Tradfri On/Offs, the Tradfri Switches and the Philips RWL021 do not do this in my network. The RWL021 will not receive the notify even when woken up by a button press. The only way I could get these devices to upgrade was to offer the upgrade file with the changes I did and let the device request it when it was up to it.

@MattWestb
Contributor

The Sommrig shortcut button requests an end device timeout of 8 minutes, so the parent should hold commands for it and not flag it as gone if the network asks for its parent.
We can't change that time; as far as I know it is set in the firmware.
Poll control can be configured from the system, but it should work fine if the coordinator responds correctly and can send data to the device as if it were online.

@MattWestb
Contributor

Sniff sent to Puddly's Gmail!

End device polling its parent router for commands:
If it is a Zigbee 3 device using poll control, there can be a very long time between the sleeper's polls of its parent for commands. If it is not, it polls as in the old Zigbee HA standard (the poll interval should be under half of the standard timeout, so the device can miss one poll before the parent throws away the command for the device, removes it from its child table and broadcasts that it no longer has the device as a child in the network; broadcasts must work and not be blocked by broadcast storms).
Sommrig polls its parent every 4 minutes minus 10 seconds, so one poll can be missed and things are still OK, because the end device timeout is 8 minutes (from sniffing, after sending the capture to Puddly).

Check-ins:
Only good Zigbee 3 devices support them.
The system sets an interval (IKEA appears to use 50 minutes), after which the device sends a check-in to the coordinator and polls its parent for commands with the normal timeout.
The coordinator responds with a fast poll request with a timeout.
Now the coordinator can send commands as if the device were not a sleeper, as long as the timeout has not run out and the coordinator has not sent an end fast poll command to put the device back to sleep.
If implemented correctly, this is a great time to do OTA things, but normally the device does that itself. It is awake then, but may be in the normal polling state (I have not sniffed whether it makes the OTA request during the check-in or in normal long poll mode).
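
The check-in flow described above can be sketched as a small queue on the coordinator side: commands for a sleeper are held until it checks in, then sent inside a fast poll window. This is a simplified model, not zigpy or ZCL code; `CheckInQueue`, the frame strings and the timeout in seconds are all assumptions made for illustration.

```python
from collections import defaultdict


class CheckInQueue:
    """Holds commands for sleepy devices until they check in (illustrative)."""

    def __init__(self, fast_poll_timeout_s: int) -> None:
        self.fast_poll_timeout_s = fast_poll_timeout_s
        self._pending: dict[str, list[str]] = defaultdict(list)

    def queue(self, device: str, command: str) -> None:
        """Remember a command to deliver at the device's next check-in."""
        self._pending[device].append(command)

    def on_check_in(self, device: str) -> list[str]:
        """Return the frames to send, in order, when `device` checks in."""
        commands = self._pending.pop(device, [])
        if not commands:
            # Nothing queued: let the device go straight back to sleep.
            return ["check_in_response(start_fast_polling=False)"]
        return [
            "check_in_response(start_fast_polling=True, "
            f"timeout_s={self.fast_poll_timeout_s})",
            *commands,
            "fast_poll_stop",
        ]


q = CheckInQueue(fast_poll_timeout_s=10)
q.queue("remote", "ota_image_notify")
first = q.on_check_in("remote")
second = q.on_check_in("remote")
print(first)
print(second)
```

The first check-in opens a fast poll window, delivers the queued notification and closes the window; the second finds nothing queued and declines fast polling.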

I had a very good Silabs paper on this but can't find it.

@MattWestb
Contributor

Found a good paper on the poll control mechanism:
https://community.silabs.com/s/article/the-poll-control-cluster-a-reliable-way-for-sed-to-receive-asynchronous-transmi?language=ko

If end device timeout is not used, the classic Zigbee HA poll timing applies:
ZigBee Pro PICS (the conformance specification) dictates a prescribed value of 7.680 seconds for this timeout and therefore recommends polling at least once per 7.5 seconds to ensure reliable communication, since polling outside of this timeout can result in missed messages or missed APS ACKs.
And if it is used, it is up to the end device and its parent to set it, as IKEA does: an 8 minute timeout = polling every 4 minutes minus 10 seconds.
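
The two timing schemes can be compared with a little arithmetic. The sketch below counts how many consecutive polls may be lost before the parent's timeout elapses, using only the numbers quoted above; the function name is made up for this example.

```python
def missed_polls_tolerated(timeout_s: float, poll_interval_s: float) -> int:
    """Number of consecutive polls that can be lost before the timeout elapses."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    # Polls that would arrive within one timeout window after the last good poll.
    polls_before_timeout = int(timeout_s // poll_interval_s)
    # One of them has to arrive; the rest may be missed.
    return max(polls_before_timeout - 1, 0)


classic = missed_polls_tolerated(timeout_s=7.680, poll_interval_s=7.5)
ikea = missed_polls_tolerated(timeout_s=8 * 60, poll_interval_s=4 * 60 - 10)
print(classic)  # 0
print(ikea)     # 1
```

With the classic 7.680 s timeout and 7.5 s polling there is no slack for a lost poll, whereas the 8 minute timeout with polling every 230 s leaves room for exactly one.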

@Hedda
Contributor

Hedda commented Feb 13, 2025

> (2) There are battery powered devices that need to be interacted with in a short time window to kick off the upgrade process. Not all battery powered devices do this (famous RWL021 at least in my network), but instead query for new images themselves. By not failing the initial notification AND by extending the timeout for the OTA, we give these devices an opportunity to query for this upgrade when they check in. At least for RWL021, this approach has worked for me.
>
> When used in HA, this appears to improve how OTA can be done for battery powered devices. E.g. this one is a RWL021 that refused to upgrade. With the change, it started fetching a new image when it checked in (hours after I started the upgrade in HA).

FYI and for reference, a related issue with a feature request for check and block was also raised here in the zha library repository, and it is also very relevant to making OTA upgrades more robust for battery powered Zigbee devices:

That in turn came out of the discussion here about adding more manual checks, which without this are now recommended to users:

@FlorianBruckner
Author

Quite a discussion has started around this topic. This is an attempt to refocus on some of the issues I am seeing in my network around OTA.

As I have already mentioned, I am not using a dedicated small network for this but my home network with about 100 devices. These consist of quite a mix of multiple generations of devices: ancient Osram Plugs, old and recent bulbs and plugs from Innr, some Philips devices, quite a number of Müller Licht tint bulbs, IKEA switches and bulbs, some Ubisys devices and probably a few more that I do not remember off the top of my head.

My user experience with OTA in my network is, with the current state of affairs, a bit frustrating. I recently added some new bulbs and wanted to bind them to old IKEA on/off switches. I do know that for device binding to work properly, the IKEA switches need a firmware upgrade. Before the OTA revamp I had configured HA to auto-upgrade IKEA devices because of this, and this appeared to work. There was no user interaction and there were no logs; after some time these devices simply got the new firmware and all was good. These days, I need to start the upgrade process manually, and this is where this all started.

It has been reported in various forums that OTA upgrades need patience. After having run into failed upgrades I started to research, and I remember one forum message where a user reported that eventually, after a week of continuously trying to upgrade an IKEA switch, they succeeded. And I could reproduce this. For about a week, I started an OTA for that switch before I left home, found that it had stopped along the way, started it again in the evening only to find the next morning that it again had not completed, and did a couple of factory resets and battery swaps, until it suddenly completed the upgrade one evening.

My journey started when I looked into the logs of the upgrade process and found lines like these:

2025-02-14 08:48:19.026 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (41440 / 205488): 20.1666%
2025-02-14 08:48:20.983 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (41440 / 205488): 20.1666%
2025-02-14 08:48:22.166 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (41480 / 205488): 20.1861%

So the device does in fact re-request blocks at times. Whether it is supposed to do this according to the specs I do not know (and honestly, I do not care, because this IKEA device just does it).

I also found that when the upgrade stops, it does so with a generic FAILURE: 1 in the logs. This is where I started patching, eventually seeing that when delivery of an upgrade block runs into an issue, the handler just "finishes" the upgrade process and stops responding to block requests, even if the device keeps sending them. Like so (already with the additional logging that I have in my version now):

2025-02-14 08:41:28.647 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (33280 / 205488): 16.1956%
2025-02-14 08:41:33.469 INFO (MainThread) [zigpy.device] [0xdf50] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-14 08:41:33.791 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (33360 / 205488): 16.2345%

What we see here is a successful request for the block at offset 33280, then an exception, and then a successful transmission of the block at offset 33360. This means that, while the local stack reported an error, the device actually received the blocks.

There are other cases where the device will re-request after some seconds (in this case 5, for RWL021 I did see 10):

2025-02-14 08:30:50.777 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (18760 / 205488): 9.1295%
2025-02-14 08:30:55.602 INFO (MainThread) [zigpy.device] [0xdf50] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-14 08:30:56.074 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (18760 / 205488): 9.1295%

This led me to the conclusion that the robustness of OTA upgrades can be improved by not stopping the upgrade when the first block delivery fails. As shown, at least IKEA and Philips devices will re-request after some time if they didn't get the block. If those devices do this, why ignore it by shutting down the infrastructure that delivers image blocks?
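
The re-request pattern in the excerpts above can be spotted mechanically. The following sketch scans log lines for consecutive progress entries with the same offset for the same device. The sample lines are copied from this comment; note that the "OTA image_block handler exception" line comes from my modified logging, and the helper name `find_rerequests` is made up for this example.

```python
import re

# Matches e.g. "[0xdf50] OTA upgrade progress: (18760 / 205488)"
PROGRESS_RE = re.compile(
    r"\[(0x[0-9a-f]{4})\] OTA upgrade progress: \((\d+) / (\d+)\)"
)


def find_rerequests(lines):
    """Return (device, offset) pairs where the same offset is logged twice in a row."""
    last_offset = {}
    repeats = []
    for line in lines:
        match = PROGRESS_RE.search(line)
        if not match:
            continue  # exception lines and tracebacks are skipped
        nwk, offset = match.group(1), int(match.group(2))
        if last_offset.get(nwk) == offset:
            repeats.append((nwk, offset))
        last_offset[nwk] = offset
    return repeats


LOG = """\
2025-02-14 08:30:50.777 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (18760 / 205488): 9.1295%
2025-02-14 08:30:55.602 INFO (MainThread) [zigpy.device] [0xdf50] OTA image_block handler exception
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>
2025-02-14 08:30:56.074 INFO (MainThread) [zigpy.device] [0xdf50] OTA upgrade progress: (18760 / 205488): 9.1295%
""".splitlines()

rerequests = find_rerequests(LOG)
print(rerequests)  # [('0xdf50', 18760)]
```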

I therefore suggest splitting this PR into two parts. Let's discuss the timeout and the potential impact on the UX in this PR.

If you agree, I will create another PR that only removes the "finish" calls in the OTA handler when it encounters failed block requests, so devices have a chance to re-request a failed block if they're up to it. I believe this is a low risk change that could be released in a short time frame, focusing just on the reliability of an OTA upgrade once it has started.
