Sonoff RF bridge OB38S003 Portisch FW freezes after 24-48 hours #19

Open
zd3sf opened this issue Nov 30, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@zd3sf
Contributor

zd3sf commented Nov 30, 2024

For receive only. Transmit keeps working fine!

Sonoff RF bridge V2.2 with [v0.4.5] portisch firmware and ESPHome 2024.11.2

After 24-48 hours of continuous operation, the MCU stops receiving 0xA4 codes. Restarting the ESP doesn't resolve the problem.

Workaround: reset the MCU using rfraw AA FE 55, momentarily send a command (e.g. bucket sniffing, advanced transmit, etc.), or unplug the power. Any of these works! The MCU can be reset on a schedule using Home Assistant or ESPHome.
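For reference, a minimal sketch of how the three bytes of the reset command fit together, assuming the usual framing of a start byte 0xAA, a command byte, an optional payload, and an end byte 0x55. The helper name build_frame and the constant names are hypothetical and are not taken from the firmware or from ESPHome.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_START 0xAA  /* first byte of the "AA FE 55" example */
#define FRAME_END   0x55  /* last byte of the "AA FE 55" example */
#define CMD_RESET   0xFE  /* command byte used in the workaround */

/* Hypothetical helper: writes start byte, command, optional payload and
 * end byte into out. Returns the frame length, or 0 if out is too small. */
size_t build_frame(uint8_t cmd, const uint8_t *payload, size_t payload_len,
                   uint8_t *out, size_t out_size)
{
    size_t total = payload_len + 3;

    assert(out != NULL);
    if (out_size < total)
        return 0;

    out[0] = FRAME_START;
    out[1] = cmd;
    if (payload_len > 0)
        memcpy(&out[2], payload, payload_len);
    out[2 + payload_len] = FRAME_END;
    return total;
}
```

Calling build_frame(CMD_RESET, NULL, 0, buf, sizeof buf) fills buf with the bytes AA FE 55.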

Logs: none yet, just changed ESPHome log level to Debug and will record results in an update.

Update: no logs from the MCU show up in ESPHome.

@otsoni

otsoni commented Dec 3, 2024

I ran into some problems right after receiving one long code (66 bytes, I think). It seems to keep receiving, but beeping causes issues. This might be a different problem from yours, but I will dig into my problem more and hopefully we can find a resolution to both. I already added a stack trap for the last two bytes of internal RAM, and it does not seem to be a stack overflow, though I am not really sure about that. I plan to add traps after the arrays in external RAM to see whether those are overflowing.

@otsoni

otsoni commented Dec 3, 2024

I think that my problem with beeping after receiving a large bucket is some other bug, as it seems that nothing is overflowing. I need to add more traps, keep the board running until it stops receiving buckets, and then check the traps. It will take some time.
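A minimal sketch of the memory trap idea, for anyone following along: a couple of guard bytes with a known value are placed right after a buffer and checked later. The sizes, the marker value, and the function names are all hypothetical, and the buffer and trap share one array here only so the demonstration stays within defined behavior on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_LEN   8     /* hypothetical buffer size */
#define TRAP_LEN  2     /* two guard bytes after the buffer */
#define TRAP_BYTE 0xA5  /* arbitrary marker value */

/* Data lives in region[0..BUF_LEN-1]; the trap follows directly after. */
static uint8_t region[BUF_LEN + TRAP_LEN];

/* Fill the guard bytes with the marker value. */
void trap_init(void)
{
    for (uint8_t i = 0; i < TRAP_LEN; i++)
        region[BUF_LEN + i] = TRAP_BYTE;
}

/* Returns true while every guard byte still holds the marker. */
bool trap_intact(void)
{
    for (uint8_t i = 0; i < TRAP_LEN; i++) {
        if (region[BUF_LEN + i] != TRAP_BYTE)
            return false;
    }
    return true;
}
```

A write to region[BUF_LEN], one element past the data, makes trap_intact() return false, which is the signal to look for after the board has stopped receiving.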

@otsoni

otsoni commented Dec 4, 2024

I found the beep bug. https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/main_portisch.c#L880 should be (RF_DATA[0] << 8) | RF_DATA[1] instead of (RF_DATA[2] << 8) | RF_DATA[1]
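A small sketch of why the index matters, assuming the value is a 16-bit number stored high byte first in bytes 0 and 1. The helper make_u16 and the sample bytes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two bytes into a 16-bit value, high byte first. */
uint16_t make_u16(uint8_t high, uint8_t low)
{
    return (uint16_t)(((uint16_t)high << 8) | low);
}

/* Made-up sample: bytes 0 and 1 hold 0x012C, byte 2 is unrelated data. */
const uint8_t sample[3] = { 0x01, 0x2C, 0xFF };
```

With this sample, make_u16(sample[0], sample[1]) gives 0x012C, while make_u16(sample[2], sample[1]) gives 0xFF2C because the unrelated third byte ends up in the high half.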

I think that I found the B1 overflow too, but I am not sure, because it works with the memory traps in place, yet the memory trap after the buckets (https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/portisch.c#L53) gets changed. I think that the check here https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/portisch.c#L729-L733 should be done before writing the duration on row 724. As I understand it, the code currently checks the overflow condition after the overflow has already happened, but correct me if I have been thinking something silly.

It seems that there is the same kind of bug with RF_DATA (https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/portisch.c#L737-L755) if I understand the logic correctly. I haven't tried those changes yet.

I tried to understand from the memory map what happens with the overflow. It seems to overwrite the bucket_sync variable, and I just don't understand why that disables the B1 listening. Maybe I don't need to understand everything. Or maybe my build with the memory traps maps the memory differently than the original.
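To illustrate the ordering problem described above, here is a minimal sketch. It is not the firmware's code: MAX_BUCKETS, the array layout, and the function names are hypothetical, and the extra slot at the end of mem stands in for whatever variable happens to sit right after the bucket array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BUCKETS 8  /* hypothetical capacity */

/* Buckets occupy mem[0..MAX_BUCKETS-1]; mem[MAX_BUCKETS] plays the role
 * of the neighbouring variable that follows the array in memory. */
static uint16_t mem[MAX_BUCKETS + 1];
static uint8_t bucket_count;

/* Start over with an empty bucket list and a known neighbour value. */
void reset_buckets(uint16_t neighbour)
{
    bucket_count = 0;
    mem[MAX_BUCKETS] = neighbour;
}

/* Problematic order: write first, then test the limit. */
bool add_bucket_write_then_check(uint16_t duration)
{
    mem[bucket_count] = duration;   /* lands on the neighbour when full */
    if (bucket_count >= MAX_BUCKETS)
        return false;               /* too late, neighbour already hit */
    bucket_count++;
    return true;
}

/* Suggested order: test the limit first, write only if there is room. */
bool add_bucket_check_then_write(uint16_t duration)
{
    if (bucket_count >= MAX_BUCKETS)
        return false;
    mem[bucket_count++] = duration;
    return true;
}
```

Both versions reject the ninth bucket, but only the second one leaves the neighbouring value untouched.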

@zd3sf
Contributor Author

zd3sf commented Dec 5, 2024

I'm sorry, this goes beyond what I know about these microcontrollers. I should also report that something strange is happening: my window blind (B0 transmit) goes up on its own without anyone's input. I checked the ESPHome logs and found nothing.

@mightymos
Owner

mightymos commented Dec 5, 2024

> I'm sorry, this goes beyond what I know about these microcontrollers. I should also report that something strange is happening: my window blind (B0 transmit) goes up on its own without anyone's input. I checked the ESPHome logs and found nothing.

@otsoni thanks for the catch on the beep, I think you are right and I think I fixed it. I do not have the buzzer installed on my board so I do not actually hear any beep. I will try to look at the original portisch logic and see if I understand if there are issues as you highlighted.

I think I made a mistake previously with setting state for the state machines in the main file instead of between two files. I think it is back to the original portisch steps, but I just did a quick check that hardware could send, receive, etc, no long time period checking.

I will eventually need to set the state in one file because it is too confusing to follow the logic when it is set between two files. But I am hoping that by making it like the original portisch for now, testing can be more stable. Sorry for all the testing needed. I will make the code more readable later. A new release is published.

EDIT: I am concerned about the blinds opening on their own, and will try to focus on that soon. I observed that once ESPHome receives a bucket decoding (0xB1) it sends back an acknowledge. On original portisch, receiving an ack kicks us out of sniffing mode and back to standard decoding (as tested manually with Tasmota). The same is also happening with ESPHome now. I guess my question is: is there a reason to leave it in sniffing mode, or was it just to stress the firmware to look for overflow or bug behavior?
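The mode change described in the EDIT can be sketched as a tiny transition function. This is only an illustration of the observed behavior, not the firmware's state machine: the enum, the function name, and the use of 0xA0 as the acknowledge byte are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_ACK          0xA0  /* assumption: acknowledge command byte */
#define CMD_BUCKET_SNIFF 0xB1  /* bucket sniffing, as named in this thread */

enum rx_mode {
    MODE_STANDARD_DECODE,
    MODE_BUCKET_SNIFFING
};

/* Returns the receive mode after the given command byte arrives. */
enum rx_mode next_mode(enum rx_mode mode, uint8_t cmd)
{
    switch (cmd) {
    case CMD_BUCKET_SNIFF:
        return MODE_BUCKET_SNIFFING;   /* host asks for bucket sniffing */
    case CMD_ACK:
        return MODE_STANDARD_DECODE;   /* an ack drops back to standard */
    default:
        return mode;                   /* other commands: no change here */
    }
}
```

Under this model, a host that acknowledges every 0xB1 report will leave sniffing mode after the first report unless it sends 0xB1 again.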

@zd3sf
Contributor Author

zd3sf commented Dec 5, 2024

Thanks guys for looking into it. The bridge was not in sniffing mode, just in standard decoding. I am not sure why the bridge is sending commands on its own, no one is touching the physical remote either.

@mightymos
Owner

mightymos commented Dec 6, 2024

> Thanks guys for looking into it. The bridge was not in sniffing mode, just in standard decoding. I am not sure why the bridge is sending commands on its own, no one is touching the physical remote either.

I just observed I receive 00C303 codes on my original portisch black box after flashing a new yaml to the white sonoff.
When I flash the yaml over wifi it takes a while for the logging to connect, so I do not know whether I am missing messages.

My guess is that ESPHome is executing a default action on startup so that the visual toggle in HA matches the last executed action.
Possibly something similar is happening with your blinds.

However, I am not yet ready to blame ESPHome until this behavior can be confirmed.

EDIT:
I added a simulated door sensor to the yaml and it also sends 0xDEADBE on startup that gets decoded on my original portisch.
So default off actions are performed at startup as best I can tell.

@otsoni

otsoni commented Dec 7, 2024

> I think that I found the B1 overflow too, but I am not sure, because it works with the memory traps in place, yet the memory trap after the buckets (https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/portisch.c#L53) gets changed. I think that the check here https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/portisch.c#L729-L733 should be done before writing the duration on row 724. As I understand it, the code currently checks the overflow condition after the overflow has already happened, but correct me if I have been thinking something silly.
>
> It seems that there is the same kind of bug with RF_DATA (https://github.com/mightymos/RF-Bridge-OB38S003/blob/main/src/portisch.c#L737-L755) if I understand the logic correctly. I haven't tried those changes yet.
>
> I tried to understand from the memory map what happens with the overflow. It seems to overwrite the bucket_sync variable, and I just don't understand why that disables the B1 listening. Maybe I don't need to understand everything. Or maybe my build with the memory traps maps the memory differently than the original.

I still think that this should fix the code receiving problem. I can't make a build for you with those fixes because for some reason I get an "Insufficient ROM/EPROM/FLASH memory" error when building the main branch. @mightymos have you done something special to get the builds to work, or is there something wrong in my build environment?

I think that I could try to use other rf-bridge in passthrough mode to send code with too many buckets and test if I can get other rf-bridge to freeze straight away.

@zd3sf
Contributor Author

zd3sf commented Dec 7, 2024

  1. I figured out the blinds moving on their own. I have an automation that sets the blind position to 40%. If it gets triggered while the blinds are already at 40%, there's a bug where ESPHome tries to move the blinds for a millisecond and then stops. The problem is that with RF, the Up/Down commands take some time to transmit, so the stop command is issued while the bridge is busy and never gets sent. Of course, it never registers in HA or ESPHome because, according to the front end, set position = current position = no action.
    See ESPHome bug here. I'll add a condition so that the automation doesn't run if set position = current position.

  2. Bridge still freezes. I have a reset command automation every hour, but every now and then, I see a freezing condition. No big deal but just reporting it.

@otsoni

otsoni commented Dec 7, 2024

Nice that you found the bug. Next I will try to create some kind of radio signal that causes the bridge to freeze. I think it should have at least 9 different bucket lengths. I hope that I can create it with the passthrough firmware and ESPHome. I will let you know when I have more information about that.

@mightymos mightymos added the bug Something isn't working label Dec 7, 2024
@mightymos
Owner

Support for learning mode has been added. Portisch port should be feature complete with original now.
To support learning mode I had to fix the timer to work with both microsecond and millisecond delays.

I do not fully understand it, but Portisch computes a checksum along with an 800 millisecond delay with repeat codes.
So as a result of the timer fix, you will see only one code even when repeats are sent and this is how the original behaved.
rcswitch just decodes as fast as it can and so repeats are still output.
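A rough sketch of the "one code even when repeats are sent" behavior, for illustration only. It assumes a repeat is simply the same code arriving within 800 milliseconds of the last reported one and compares the code values directly instead of a checksum. The names and the exact rule are hypothetical and may differ from what Portisch actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REPEAT_WINDOW_MS 800u  /* delay mentioned above */

static uint32_t last_code;       /* last code that was reported */
static uint32_t last_report_ms;  /* time of that report */
static bool have_last;           /* false until the first report */

/* Returns true if the code should be passed on to the host, false if it
 * looks like a repeat of the code reported less than 800 ms ago. */
bool should_report(uint32_t code, uint32_t now_ms)
{
    /* Unsigned subtraction keeps working when the ms counter wraps. */
    if (have_last && code == last_code &&
        (uint32_t)(now_ms - last_report_ms) < REPEAT_WINDOW_MS)
        return false;

    last_code = code;
    last_report_ms = now_ms;
    have_last = true;
    return true;
}
```

A decoder without such a filter, like the rcswitch behavior described above, would pass every repeat through.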

I also made some organizational changes to make state machines more readable for me.
I plan to stop making further changes because it is probably providing marginal improvement and the code needs to finally "freeze" to exhaustively investigate freezing or other bugs.

@otsoni If you are making changes to the code you need to monitor the .MEM file to make sure you are not going over microcontroller memory limits (ram, xram, flash). Thanks for trying out your testing strategy, it is difficult for me to test on my own.

@zd3sf Thanks for the additional work with the blinds. I know the example yaml you have is probably diverging from my own. If you want to somehow consolidate changes later or post your own custom yaml file I can include it in the example folder and not edit it once it seems complete.

@zd3sf
Contributor Author

zd3sf commented Dec 7, 2024

Thanks @mightymos, we can merge the YAMLs at some point. Most of the changes are situation-specific and not related to the general operation of the bridge. I added one line, restore_from_flash: true. This allows the state of the blinds to persist between reboots of the ESP.

esp8266:
  board: esp01_1m
  restore_from_flash: true

I tried the latest firmware. Standard and bucket receive work as intended, but I couldn't get transmit to work at all, standard or bucket. I tried re-sniffing the codes and transmitting them, but that didn't work.

@otsoni

otsoni commented Dec 8, 2024

> @otsoni If you are making changes to the code you need to monitor the .MEM file to make sure you are not going over microcontroller memory limits (ram, xram, flash). Thanks for trying out your testing strategy, it is difficult for me to test on my own.

I have been monitoring the .MEM file but it gives me the error also when building without any changes to the main branch. I will create separate issue about that.

@otsoni

otsoni commented Dec 8, 2024

> I have been monitoring the .MEM file but it gives me the error also when building without any changes to the main branch. I will create separate issue about that.

My bad. It seems that my Ubuntu WSL had an old version of sdcc (I think it was 4.0.0) in its repositories, and updating to 4.4.0 seems to fix the build problems. I will continue to debug the freezing problem.

@otsoni

otsoni commented Dec 8, 2024

@zd3sf I just realized that you have problems with 0xA4 codes freezing and I have been using just 0xB1 sniffing. It seems that the problems with overflowing buffers are not related to 0xA4, just 0xB1 sniffing. So there might be two separate freezing problems. (Or one that is the cause for both and I have been tracing just wrong things)

@mightymos
Owner

mightymos commented Dec 9, 2024

> Thanks @mightymos, we can merge the YAMLs at some point. Most of the changes are situation-specific and not related to the general operation of the bridge. I added one line, restore_from_flash: true. This allows the state of the blinds to persist between reboots of the ESP.
>
> esp8266:
>   board: esp01_1m
>   restore_from_flash: true
>
> I tried the latest firmware. Standard and bucket receive work as intended, but I couldn't get transmit to work at all, standard or bucket. I tried re-sniffing the codes and transmitting them, but that didn't work.

I made a new release to declare feature complete with learning mode.

During development I used one of the hardware timers for software uart to output debug information.
But this is usually unnecessary now since I could use the busybee kit instead.
So ironically this weekend I was basically undoing work I had done previously to match original portisch.

Anyway, using two timers for delays was the correct architecture choice.
I appear to be able to decode, sniff, transmit using esphome to original sonoff receiver.

If it is still not working with your devices I will need to hook up an oscilloscope to check signal timings.
But I wanted to post a fixed version.

If you could once again be patient and try it that would help, I'll hope for good news.

@mightymos
Owner

> I have been monitoring the .MEM file but it gives me the error also when building without any changes to the main branch. I will create separate issue about that.
>
> My bad. It seems that my Ubuntu WSL had an old version of sdcc (I think it was 4.0.0) in its repositories, and updating to 4.4.0 seems to fix the build problems. I will continue to debug the freezing problem.

Ah okay, good catch.
Yes I use version 4.4.0 on a Windows machine natively.

@zd3sf
Contributor Author

zd3sf commented Dec 10, 2024 via email

@zd3sf
Contributor Author

zd3sf commented Dec 10, 2024

Freezing happened overnight. Standard and B0 transmit remained working, but standard receive stopped working until an MCU reset.


3 participants