Changing settings #221
Comments
It isn't the HW-584 hardware, or the router. I think you are OK there. This must be a firmware bug. I haven't seen any other reports of this problem, so we need to dig a little deeper.
My initial questions above are to try to understand the following: Note that I have seen similar corruption of the EEPROM but only during development of code. Typically it is because I have an errant pointer that overwrites the EEPROM. I've built in a lot of code protection to prevent this. To my knowledge it hasn't occurred in the field - only in my test configurations while code is being changed. But there is always a first time :-) OTHER USERS: PLEASE RESPOND IF YOU HAVE SEEN THIS. It is not certain what will happen if the HW-584 modules have the hardware reset triggered frequently. That is very difficult to test, and I know that it is much more difficult to control DC power glitches when more relays are attached. There are several people using 16-relay boards successfully, so we just need to try to understand what might be different in your situation. |
This is very confusing. Everything you are doing looks right. |
Hello Mike |
I just don't see how heavy network loading could cause a problem. If there is a very large amount of messaging through the MQTT Server/Broker it is possible that messages might get lost I suppose. But I don't see how this would cause corruption of the EEPROM in the HW-584. So I will try to get a test going within the next day. |
I've configured 5 HW-584 devices to start running a Home Assistant test tonight. I'll add more if I don't see any replication. These first 5 devices run the "Upgradeable" version of code. Operationally this should be identical to the non-Upgradeable code, but it allows me to change code without needing to use the SWIM interface. |
While this first test runs I thought of a possible issue. |
Hi Mike |
Tomasz - Please email me at [email protected] |
Despite a significant number of email exchanges with @tomaszvil we've been unable to resolve the issue. I believe this to be a power glitch problem, but @tomaszvil has performed a significant number of measurements, power rearrangements, and tests and the problem persists. I am unable to reproduce in my setups, with the exception of seeing a similar problem with a degraded power supply long ago (issue #125 in Closed Issues). I'm out of ideas. |
@nielsonm236 Hello Michael! Configuration after I fixed IO Table: Note: Ignore "heat2floor" in MQTT Username. I just use same username and password for both modules. |
I can try to install 20240612 to check if it will fix the problem. I can dump EEPROM Data block before that if needed. |
BTW NetModule calculates page length wrongly.
|
The information that some pages may not be properly sized may be a clue. I will start there. If a page is not sized properly it might cause a buffer over-run and that could lead to data corruption. There is a very small chance that this is the problem, but it needs to be corrected. If the MAC address is corrupted it will interfere with HA. The MAC address is used as part of the device IDs. I'm not sure how the Browser continues to work with a corrupted MAC address. The router should be very unhappy about that and shouldn't be able to find the device. But ... I've forgotten more than I remember so maybe it can still work if the IP address is unchanged. Mike |
I do not think that page size is the problem. It becomes a problem only when the MAC address gets corrupted. Before that, page sizes are correct. When it is corrupted, I see it send the following data for the MAC address:
Which is 6 bytes (this might not be correct, as I did the hexdump from the clipboard). But when the unit works correctly it sends:
Which is 12 bytes. |
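The 6-byte versus 12-byte observation above suggests a simple sanity check on the MAC field as it arrives in the page. The following is a minimal sketch, assuming the healthy field is exactly 12 ASCII hex characters (as the 12-byte count implies); the function name and the sample values are hypothetical and are not taken from the firmware.

```python
import string


def looks_like_mac_text(raw: bytes) -> bool:
    """Return True if raw is exactly 12 ASCII hex characters.

    Assumption: a healthy MAC field is sent as 12 hex digits with no
    separators. Anything shorter, longer, or non-hex is flagged.
    """
    if len(raw) != 12:
        return False
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return all(ch in string.hexdigits for ch in text)


# Hypothetical values for illustration only.
healthy = b"c24d696b6501"
corrupted = b"\x00\x01\x02\x03\x04\x05"
print(looks_like_mac_text(healthy), looks_like_mac_text(corrupted))
```

A check like this only detects that the field is malformed; it says nothing about where the corruption happened (EEPROM, RAM, or in transit).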
I think the hexdump might not be correct. A MAC address without a mix of digit and alpha characters is very unlikely. Still the question is "how did the MAC corruption occur", because your screen captures very clearly show that it did happen. Since your system operated a very long time I agree power glitching is unlikely. Now I'm thinking that is also true for @tomaszvil. A buffer over-run can corrupt just about anything, depending on where the compiler placed those things in RAM. So I agree it is a long shot.
First I need to see if I can get this to reproduce so that I can add some debug. We may have to exchange some information by private message so I can set up a duplication of your Configuration in greater detail. Mike |
Regarding "Browser will toss not-allowed bytes". This can't be true as I was using |
OK. Now I'm completely baffled again. It ran fine for months then started to repeatedly have an EEPROM corruption problem? Thinking out loud to see if it provokes any ideas:
If we can think of any instrumentation to confirm or reject a suspect issue let's do it while you are seeing the problem. Mike |
Regarding "Browser will toss not-allowed bytes": I'm running on recollection here, which could be faulty. I know that if I code a string with multiple spaces in a series it will appear at the Browser as a single space. I can't remember where the spaces are collapsed ... perhaps just in the Browser itself. I also vaguely recall that if I sent bytes outside of the Browser character set they can cause problems. But I don't know if that will affect receive count or not. Notes on MAC address storage: |
It is a great list of potential places to look. Regarding MAC address storage: I also think that the MAC string IS being corrupted in RAM, because when I power cycled last time the unit's MAC address reappeared intact on the Configuration page. Something different happened this time. I will create a separate issue so as not to pollute this one: #228. So this is not the crash we are looking for... From the page /66: I clicked reboot and Nothing to do, let's wait and see if it will eventually crash. |
My interpretation of the Link Error Statistics report: Line 33 is very concerning.
Line 34:
Line 35:
To get a better view of "what has happened since we started testing" I'd suggest using command /67 to zero out the counters. That way we will eliminate the "long ago" history and see what is affecting operation right now. I recall you are using a Cisco business class switch, which is why you have Full Duplex enabled. I also recall we would see somewhat higher transmit and receive error counts with that switch, but what is shown above is much higher than I would expect even with the Cisco switch. Mike |
Uhh, so professional. You are developing as if for satellites to be launched into space =) No, I am not using a Cisco business class switch. The NetModule is connected to a cheap TP-Link TL-SG105E gigabit switch, which is connected through another switch to the main router. I can disable Full Duplex if needed; this should work in both modes, I think. Some years ago one power block already died with that module, so maybe this is why we see unexpected power loss. Both modules I have are configured for Full Duplex mode, and both have DS18B20 temperature sensors attached (the first one has 4, the second one has just 2). The settings are the same; only the number of configured IO outputs differs. Before writing here I used "Oxide clean & Protect liquid" to make sure the ethernet cable connectors have good contact. But if you say that tx/rx error counts are reset after every reboot then errors from before the cleaning should not be included. I have drilled some holes in the box to let some fresh air in. But neither chip on the board was hot to the touch. My infrared temperature sensor showed ~50 degrees Celsius on the main chip. Anyway, it will be cooler now. Here are the /66 pages of both of my modules before reset, if we want to compare.
Module 2
Now I cleared counters on both devices and rebooted them both. Initial states are almost equal (only 32 differ a bit):
I will recheck the error counts tomorrow. 20 minutes have passed and they are still zero. |
Full Duplex shouldn't be used except with those Cisco business class switches. The ENC28J60 isn't supposed to work very well in Full Duplex. It has bugs in the auto-negotiate hardware, so the specs recommend half-duplex. The only reason Full Duplex was allowed in my firmware is that it seemed to work a little better with the Cisco switch IF the user also forced the port on the Cisco switch to Full Duplex (no auto negotiate). So, having it set to Full Duplex is probably causing some of the network errors, particularly when network collisions occur (which is really a common thing on busy networks). At the present time I have no space launches planned. ;-) Mike |
Uhh, I should have studied the manual more diligently. I have disabled Full Duplex on both modules now, restarted and cleared the counters one more time. Will see how it will go. Thank you for all the help. |
Hello guys Thinking out loud: J. Vieira |
I hope either switching to Half Duplex or changing the power module gets this back in shape. Otherwise this will be a bugger to figure out. It is interesting that it is so similar to the @tomaszvil report. But he worked very hard on trying to get it working and had no success. I would love to have been at his location to help him track it down. But ... he is 5000 miles away. The problem with power is that it is not just average voltage level like you see with a meter, but high-frequency noise spikes that you could only see with a high-bandwidth scope, and then only if you get lucky enough to catch it. I found a very stable Lambda industrial power supply to run my test bed. But I still use cheap wall wart power supplies for devices scattered around the house. |
I forgot to mention that the green LED was on and the orange LED blinked normally, both before and after I disconnected the network cable. I also forgot to mention that this module is connected to a D-Link DGS-1024D gigabit switch. |
Once it stops responding there really isn't any way to access RAM error counters. About all we can do is run /66 frequently in hopes of capturing meaningful data. But so far even that has not worked well. You have other modules with older code that are still working on the same network, right? If yes then I think the broadcast filter is causing the problem. On Sep 9, 2024, at 5:08 PM, jmcvieira1 wrote:
I'm going to leave the module connected and I will not make any changes. If it happens again that the module stops responding via LAN, is there anything else I can check? The module is close to my personal computer, so I can connect the ST-Link programmer to it...
|
Yes, I have ten other modules in my house working with the "normal" Home and Domo firmware. |
@jmcvieira1 OK - I will get to work to make the "broadcast filter" a URL commanded option. It will normally be turned off (so there is no filtering) and hopefully you see your test module become stable again. If you do, it will tell us the broadcast filter should not be used. Unfortunately it will also mean we are no closer to a solution for the @tomaszvil and @yozik04 disconnects. |
Hello the module in question stopped communicating again today at 2:10 PM |
@jmcvieira1 I'm running some tests on a modification that lets you turn off (or on) the Broadcast filter. That is supposed to be the only major difference between the old code and the new code. If my tests are successful I will post the Domo BME280 version. Since your test case seems fairly repeatable maybe this will tell us if that is the cause of your loss of connection ... or if I have some other defect in code. |
@jmcvieira1 Jose - I've attached a Code Uploader and Domo BME280 build. This will let you turn on/off the Broadcast filter so we can determine if the disconnect you are seeing is related to that filter. URL /6e will turn ON the filter. URL /6f will turn OFF the filter. Executing either command causes a reset. If you load the code the filter should default to OFF, but you can run URL /6f to be sure it is off. |
Hello |
@jmcvieira1 privately messaged me with his results, which really only showed (as expected) that if the Broadcast Filter is turned on there are significant delays in response time to ARP requests, and when the Broadcast Filter is turned off the responsiveness problem goes away. I think we need to let the @jmcvieira1 testing continue to see if he loses connection over a longer period of time with the Broadcast Filter turned off. He asked a couple of questions, and I responded with the following: What you are seeing is not what I originally expected, but it is similar to what I'm seeing as we have made progress. The exception is that I never get a complete disconnect even if I have the Broadcast Filter turned on. You may already know this (or some of it). Broadcast messages can be used for several purposes, but the most common usage is to implement the Address Resolution Protocol (ARP). On a local network the IP address is not used to actually communicate over Ethernet. Instead the MAC address is used. But hosts need a means of determining which MAC address is associated with which IP address. So a Broadcast ARP Request is sent by the host to everyone on the network asking "Who owns IP xxx.xxx.xxx.xxx and what is your MAC address?". The device that owns the IP address generates an ARP Response identifying its IP/MAC pair, and after that point the Host can communicate with the device using the MAC address. All Hosts (and switches) see these transactions and each maintains an ARP table so that they don't need to send a Broadcast message UNLESS they attempt communication with a supposedly known MAC address and get no response. Then that host will send a new Broadcast ARP request. Another way that a device can let everyone know its IP address and MAC address is to generate a "Gratuitous ARP Response (GARP)" without any host asking for it. All hosts and switches are supposed to see this GARP and add the IP/MAC pair to their ARP tables.
Our recent experimentation with the Broadcast Filter was based on an assumption that something on the tomaszvil (and the yozik04) networks was generating too many Broadcast messages of some kind. With the Broadcast Filter turned off the HW-584 firmware must identify and throw away all Broadcast messages that are not needed. If there are hundreds per second the firmware can't keep up. So, an alternative is to turn on a Broadcast Filter in the ENC28J60 and let the hardware throw away ALL Broadcast messages. This takes the load off of the firmware, but it then requires that the HW-584 generate a GARP so that all hosts know the IP/MAC pairing. When the Broadcast Filter is turned on the firmware sends a GARP every 10 seconds. BUT - something is going wrong in this process when the Broadcast Filter is turned on. If in fact all the switches and hosts are seeing and processing the GARP messages from the HW-584 we should not be seeing any delays in responses exceeding about 10 seconds, and that 10 second delay should only exist when the ARP tables have timed out (usually after 4 hours). But GARP is definitely working to some extent or we would NEVER be able to connect at all. I cannot explain why it does not work smoothly, as I have implemented it per the spec (I think). So ... I'm thinking the Broadcast Filter / GARP method is hurting more than it is helping. If we must turn off the Broadcast Filter, which also returns us to the Traditional ARP method, we are basically starting over trying to figure out why the HW-584 doesn't work in the tomaszvil network, and why yozik04 saw disconnects. Reviewing the original symptoms reported in both networks, the symptom that is most disturbing is that some RAM corruption was being seen. I've added debug around my primary suspect for RAM corruption (ethernet message buffer pointers), but so far we haven't seen a capture. I should note that I performed an experiment that attempted to flood the ENC28J60 with Ping Requests. And ...
I saw a couple of RAM corruptions. This experiment was run before I added the buffer pointer debug code, so I should run the experiments again. And I should also add that I know I've seen RAM corruption when I had a very noisy / failing power supply. If we can eliminate power as the source of the problem I think we are left with "Some networks are simply too active for the ENC28J60 and firmware to keep up". That might be the case for tomaszvil (no proof yet), but I doubt it is the case for yozik04 (but also no proof). |
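To make the GARP description above concrete, here is a minimal sketch of how a gratuitous ARP reply frame is commonly laid out on Ethernet. This is not the HW-584 firmware's code, which is not shown in this thread. The function names and the example MAC/IP are hypothetical, and using the broadcast address as the target hardware address is one common convention that the firmware may or may not follow.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", "").replace("-", ""))


def build_garp_reply(mac: str, ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP reply.

    Sender and target protocol addresses are both the device's own IP,
    which is what makes the ARP message 'gratuitous'.
    """
    sender_mac = mac_to_bytes(mac)
    sender_ip = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType.
    eth_header = BROADCAST_MAC + sender_mac + struct.pack("!H", ETHERTYPE_ARP)
    # ARP fixed part: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 2 (reply).
    arp_fixed = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp_addrs = sender_mac + sender_ip + BROADCAST_MAC + sender_ip
    return eth_header + arp_fixed + arp_addrs


# Hypothetical device address for illustration only.
frame = build_garp_reply("c2:4d:69:6b:65:01", "192.168.1.4")
print(len(frame), frame.hex())
```

Whether a given host or switch actually updates its tables on receiving such a frame depends on that device, which may be relevant to the uneven behavior described above.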
@tomaszvil First a general comment from me: I think the Broadcast Filter hasn't solved anything, and may be adding additional confusion to our test results. I will eliminate it in further test releases. But, that takes us back to where we were a couple of months ago. Regarding "I have one additional one which is not registered in HA with Code Revision 20240901 TEST MQTT Home, it is only on the network and from time to time I check the communication via the browser, it can stop communicating two or three times a day." A device that is not accessed via your Browser for a period of 2 to 4 hours may be removed from the ARP tables of your host and your switches. With the Broadcast Filter turned ON it can take several attempts before a connection will occur. On my network (which is apparently a "quiet" network) I have seen this take up to 10 attempts, and I had one case where it seemed I could not access the HW-584 at all after many attempts, but 20 minutes later I was able to access it. In your network (which we suspect is "very busy") perhaps this problem is made much worse as attempts to re-establish the ARP tables may have a lot of interference from network activity. For this reason I think the Broadcast Filter is a bad idea. I don't think this will affect the MQTT Broker once the ARP tables are configured because MQTT pings should be keeping the ARP tables refreshed. So, we are back where we started. The rest of your tests illustrate that there is loss of MQTT communication across several older versions of firmware. 
As your situation is relatively unique (except that @yozik04 also is having a problem), then the common elements in your test are: a) The firmware itself may have a latent defect that appears in all of those releases that is provoked in your hosts / network / HW-584 configuration, b) The level of network activity overwhelms the hardware / firmware of the HW-584 (simply not enough processing power and memory resource), c) Power supply may still be a contributor. Did you try your power supply from home with one of the older firmware levels? |
@tomaszvil Back on July 28 I suggested running the Browser Only version of code on a HW-584 on your network. It could give us some additional information about network conditions using the Network Statistics report with URL /68 command. I don't think I ever saw a reply with these statistics from your network. |
I will test this version tomorrow. |
after about 4 and then 7,10 hours |
Your test results take us in a whole new direction. The counts suggest that the Broadcast traffic is actually very low. Even lower than on my network. However, the "Packets dropped due to wrong IP dest address" is exceptionally high, occurring about every two seconds. On my network I see this every 5 or 6 minutes. Despite "Packets dropped due to wrong IP dest address" being a very high count I don't think it should cause a problem. But because it is very different from anything I've seen I should take a closer look and try to figure out what it is. I will examine the code, but in general it indicates that something on the network is trying to send a packet to the HW-584 MAC address, but it is using the wrong IP address. When I would see it on my network I thought it was just some Microsoft network probing and I ignored it. But with your counts so high it might be something more significant. Maybe I can build a tracker that will tell us the sender IP/MAC values, and from that we can determine what is going on. I will research and let you know what this might be. If you have Wireshark expertise you might figure it out faster than I do. FYI, I do not have "Wireshark expertise", but I have played around with it. Mike |
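The "tracker" idea above can be sketched off-device, for example against frames captured on a PC. This is a rough illustration only, assuming untagged Ethernet II frames carrying IPv4; the function names and all addresses are hypothetical, and this is not how the firmware counts its drops.

```python
import socket
import struct
from collections import Counter

ETHERTYPE_IPV4 = 0x0800


def wrong_dest_sender(frame: bytes, my_ip: str):
    """Return (src_mac, src_ip, dst_ip) if the frame is IPv4 but addressed
    to an IP other than my_ip; otherwise return None.

    Assumes no VLAN tag, so the IPv4 header starts at byte 14.
    """
    if len(frame) < 34:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return None
    src_ip = socket.inet_ntoa(frame[26:30])
    dst_ip = socket.inet_ntoa(frame[30:34])
    if dst_ip == my_ip:
        return None
    src_mac = frame[6:12].hex(":")  # bytes.hex(sep) needs Python 3.8+
    return (src_mac, src_ip, dst_ip)


def tally_wrong_dest(frames, my_ip: str) -> Counter:
    """Count wrong-destination frames per (src_mac, src_ip, dst_ip)."""
    counts = Counter()
    for frame in frames:
        hit = wrong_dest_sender(frame, my_ip)
        if hit is not None:
            counts[hit] += 1
    return counts
```

Sorting the resulting counter with `most_common()` would show which sender and which destination IP dominate, which is the information needed to decide whether the traffic is harmless probing or a misconfiguration.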
@tomaszvil Half of all packets received contain the wrong IP Destination address. Is it possible there is another device on the network with the same MAC address? As FYI I've been searching the internet for a plausible explanation and not finding anything that looks promising. Are you running UPG code, or do you program via the SWIM interface? |
I have UPG code : Maybe not everything is configured perfectly, but all other HOSTs in the network are working properly except HW584 |
@tomaszvil Questions: Do you have two different routers? One for 10.0.0.x and one for 192.168.1.x? Or is it a single router? Is the HW-584 within the DHCP address range of the router(s)? Or is the HW-584 statically assigned with the router(s)? We just need to make sure that you don't have the HW-584 MAC address mapped to a 192.x.x.x IP AND to a 10.x.x.x address. If the HW-584 MAC is statically assigned and only appears one time in the router tables, then the use of 10.x.x.x AND 192.x.x.x on the same ethernet should be just fine. I do the same thing with multiple VLANs although all within a 192.x.x.x address space. So, if the router MAC tables are fine (even if two routers), what else could cause a MAC destination to be messaged with an incorrect IP destination address? I'm not sure, but as I mentioned I also see this on my network but I see it very infrequently. Online research suggests packet corruption could cause this, but I find that unlikely in this case. Perhaps some kind of auto-discovery process running on a host? I don't know what that could be, but I do know there are apps that can scan a network and associate all MAC addresses with their IP addresses, so this kind of process exists in the world. But I think those apps use normal ARP response mechanisms, and that would not send a packet to the HW-584 with an invalid IP address. Perhaps an external scan from the internet making it through your firewall? Unlikely. I will continue researching the possibilities. Let me speculate that all the other HOSTs are also seeing this issue (lots of invalid IP Destination addresses). They may be able to handle it because they have virtually infinite processing power compared to the HW-584, and the only effect would be degradation of network performance. Even the HW-584 might be handling this OK (especially with the Browser Only firmware), as the test you ran only shows an event every two seconds. 
But if these events are greatly increased when we go back to MQTT firmware with MQTT traffic it might be a problem. So, it is worth understanding it even if it ends up being a dead end. |
@jmcvieira1 Is your test still running OK now that the Broadcast Filter is turned off? |
Update |
@jmcvieira1 With the 20240901 TEST version, which has the Broadcast Filter turned on, you will probably find that the Browser has trouble connecting. Usually I could get a connection after only 1 to 3 attempts. But I had one instance when I tried over 10 times and gave up. Then when I came back and tried again after about 20 minutes the device immediately responded. So don't be too surprised. As we move forward we should abandon the Broadcast Filter. |
@tomaszvil I need to retract a part of what I said about my network having 4 VLANs. I DO have 4 VLANs, but with the DD-WRT firmware in my router each of those VLANs are on separate ports of the router. So they do not concurrently share the same ethernet hardware paths. |
@tomaszvil I added some additional debug code to determine the source of the "incorrect IP destination address" counts being generated on a HW-584 device on my network. The debug I added shows the following:
|
In my network I have only one router, Tomasz |
As already mentioned the "Packets dropped due to wrong IP dest address" is curiously high. I've always assumed your network was mostly Linux, but I don't think you ever actually mentioned that to me. Some Linux devices use NetBIOS, but most don't. Maybe Linux devices are generating some similar traffic. You mention that the HW-584 is within the DHCP range - but it should not be since it has a static IP address. It could still work, but the risk is that the address gets assigned to another device by DHCP. I'm puzzled at how DHCP assigns addresses to a PC or Server on your network. If there are two DHCP servers (one for 10.x.x.x and one for 192.x.x.x) when a PC or Server joins the network how do the DHCP servers know which one of them is supposed to assign an address to the new device? Or is it actually the case that all your connected PCs and Servers have been given static addresses in one range or the other? |
Here are two links that talk about DHCP and Static IP Addresses. Since the HW-584 must use a static IP address, then you have two choices: An aspect of DHCP provided IP addresses is that they have a "Lease Time". This means that when the lease time expires the IP address needs to be re-established, and if there are clients (HW-584, MQTT server, etc) that require a static IP assignment but they are in the DHCP range without an IP reservation then they could end up in conflict with another device that claimed the IP address at lease expiration. If it has been a while since you set up your network you may have forgotten that you need to do (a) or (b) above when you set up the MQTT and HA servers and the HW-584 devices. If you don't mind: What is the Router model number? I'll go look up whether it supports IP Reservations. |
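The conflict risk described above (a static address sitting inside the DHCP pool without a reservation) can be checked with a few lines. This is a minimal sketch; the function name, the pool bounds, and the addresses are hypothetical examples, not values from the network under discussion.

```python
import ipaddress


def static_ip_at_risk(static_ip, pool_start, pool_end, reserved=()):
    """Return True if static_ip lies inside the DHCP pool and has no
    reservation, meaning the DHCP server could lease it to another device.
    """
    ip = ipaddress.ip_address(static_ip)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    reserved_ips = {ipaddress.ip_address(r) for r in reserved}
    return start <= ip <= end and ip not in reserved_ips


# Hypothetical pool 192.168.1.100 - 192.168.1.200.
print(static_ip_at_risk("192.168.1.120", "192.168.1.100", "192.168.1.200"))
print(static_ip_at_risk("192.168.1.4", "192.168.1.100", "192.168.1.200"))
```

Either moving the device's address outside the pool or adding a reservation for it makes the check return False.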
@tomaszvil Just checking in to see if you had any further progress or results. 0000000031 Start Diagnostic Log The numbers on the left are a timestamp (seconds since boot). In my network these are all NetBIOS messages. |
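For anyone post-processing a saved copy of the diagnostic log, the timestamp format described above can be parsed as sketched below. This assumes every line starts with a 10-digit, zero-padded seconds-since-boot field followed by a space, as in the single line shown; the function name is hypothetical.

```python
def parse_log_line(line: str):
    """Split a diagnostic log line into (seconds_since_boot, message).

    Assumption: the line begins with a 10-digit zero-padded timestamp.
    """
    stamp, _, message = line.strip().partition(" ")
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError(f"unexpected timestamp field: {stamp!r}")
    return int(stamp), message


seconds, message = parse_log_line("0000000031 Start Diagnostic Log")
print(seconds, message)
```

With the timestamps as integers, the gaps between consecutive entries give the event rate directly.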
Hello
I am using your firmware on the Web_Relay_Con V2.0 HW-584 network module, and I am interested in your project.
After installing several HW-584 modules in the same network, I have a problem with them.
The HW-584 network modules change their configuration on their own, without my intervention: the MAC address, name, IO Type, and Invert Boot State are changed to incorrect values.
The Refresh and Reboot buttons do not restore the settings; only disconnecting the power causes a proper start that restores the settings.
It also happens that a module stops working completely: there is no communication, and only restoring the initial settings (5 s reset) revives the module.
When I read the settings area of the processor, I find seemingly random data stored in the EEPROM.
The HW-584 network modules are connected via the MQTT protocol to Home Assistant.
Only Home Assistant controls my network modules.
I use the latest version of NetworkModule-MQTT-Home.
The router is a Cisco RV042G.
I bought several modules from different sellers, and all of them had the same problem.
Each network module is connected to its own 16 Relay Module (Low Level Trigger).
When I used one HW-584 network module, the settings changed about once a month.
In my project I need to use 5 modules in different locations on the same network.
When there were already 4 HW-584 network modules in my network, the settings changed randomly in different modules, very often even twice in 3 days.
Currently, I have 2 active HW-584 network modules and I have noticed that the settings change less frequently, about once every two weeks, sometimes more often.
I have set an individual IP address and an individual MAC address for each HW-584 network device.
There are a dozen or so computers and about twenty IP cameras in the network.
I appreciate your dedication and contribution to this project and the very detailed documentation.
Please help me solve my problem.
Sorry for my English.
Tomasz