-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I2C bus fault observed when reading RF channel temperature #128
Comments
If you configure booster to power-on channels by default (e.g. save settings with a channel I'll continue running booster locally to see if I can reproduce these faults, but I'm suspicious that it may be related to more RF channels being installed. |
Thanks. I have already made all channels power on by default. I can confirm that Booster stopped putting RF out for 10s or so, the lights went blank, the fans spun up and the lights + RF came back |
If the RF came back up without any user interaction, that indicates that this is likely a |
👍 thanks! |
Alright, I was able to see a watchdog occur on my bench set up using the |
👍 thanks |
@ryan-summers one other data point here...I haven't confirmed if all three Boosters I currently have display this behaviour or not. But, I've noticed that the one we're using has a red LED on one channel that seems to appear after some time (and is cleared by the crash/restart). I haven't had a chance to look at what causes this error, but will open an issue when I do. |
The red LED is indicative of the |
How does that work? Is it a software detect of the current going too high? or do we try to check if the LDO goes into foldback. |
I believe that's a software limit we program into the supply monitoring IC.
I'll have to look over it monday. By all means, open a separate issue for
that one as well
…On Fri, Jan 15, 2021, 18:59 hartytp ***@***.***> wrote:
My guess would be that it's the overcurrent alert tripping.
How does that work? Is it a software detect of the current going too high?
or do we try to check if the LDO goes into foldback.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#128 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACC5O6RQXQ4TVRFTNL5H6WTS2B7ALANCNFSM4WEJ3RBA>
.
|
Doesn't this imply that this issue is a panic and not a watchdog issue? |
It very well might. I'm adding service information reporting (e.g. stored |
ace! |
@hartytp We just merged in changes to add service reporting information. I would ask you to do the following:
Note that if you type Output should be similar to the following (ideally with specified panic info and/or watchdog information): |
As a data point (I'm unaware as to whether or not this is a real issue), I saw a panic report indicating failed communications to the I2C temperature sensor during development. This fault may have been caused by development processor cycling, but I figured I'd list it here just in case. I only saw it right before reflashing booster, and other tests have had booster running for a number of hours without reporting any faults. |
|
So something to do with the temperature code Lines 798 to 803 in 90dd3b8
|
It appears that I2C communications with the temperature sensor MAX6642. This is the I2C sensor on the analog RF channel, so it's the furthest I2C device in the chain. This is identical to the fault that I reported above. I'll take a look at some of the underlying I2C code to see if I see any issues here. A potential workaround is to implement a retry of the communications, but I'd like to get to the underlying fault here. |
Agreed on all fronts. Thanks. My guess is that #130 is the same underlying issue. |
@ryan-summers there is no room for any significant hardware debugging here. Let's just retry. That would also be the correct approach if the sensor is signalling busy through a nak. |
Data corruption would in all likelihood indicate a hardware issue outside of the scope of the firmware development. |
It might do, it's hard to know without further info. Presumably this is the kind of thing where sticking a scope on the bus would give us a lot of information in a relatively time-efficient manner. e.g. if the bus has a bit much capacitance so we can't actually drive it quite as fast as we are, we can discuss whether that's a hardware issue or a firmware issue, but either way it's fixed by slowing things down a little. Either way, the current status is that I have RED LEDs showing up on channels on a daily basis which makes this firmware somewhat unusable in a production environment... |
I'm adding a change to gather more descriptive I2C errors in #139, but it's difficult at this point to determine a correct course of action. There could be a lot of failure modes that result in these symptoms. For example:
Each failure mode above has a different course of correction in my opinion. We do not yet know if #130 is related to these I2C communication issues, although it appears likely, Determining whether #130 is caused by the /ALERT power supply monitor fault or an out-of-temperature detection will also inform this investigation. |
I'll post a device status next time I catch the errors. At the moment they're being cleared by the panics which makes life a bit harder. I might hack something in to remove the panics and make it easier to catch the errors. Either way I'll post something before long. |
@hartytp we have gone through a lot of additional unexpected trouble to work around known and unknown hardware issues, design errors, lacking/incorrect documentation. And as you see we are continuing to support that. We are currently not planning to do any significant hardware debugging. The fact that we would be able to do it doesn't mean that we should or need to do it. I suggest that you or TS/CTI look at potential hardware issues. |
@hartytp I'm currently testing the patch on my local booster and will run the test overnight. If you would like to help out, the PR with the changes is on #151. I've created a PR to update the STM32F4xx-HAL at stm32-rs/stm32f4xx-hal#275 |
Great, thanks. I'll flash that tomorrow morning and let you know how I get on. |
okay, I've set a booster up with all channels set to enable on power on, but one channel manually disabled. If we hit a panic then all green LEDs will be on. I'll keep an eye on this until Monday. Based on past experience, we should expect to see a panic before then if the issue is not resolved |
Hi, as I mentioned I've been running a modified Booster firmware without applying @ryan-summers' #151 or stm32-rs/stm32f4xx-hal#275, with all RF channels disabled, and the Booster didn't panic during the 42 hours of test run (with the MQTT client connected to a broker uninterruptedly). The temperatures reported on the MQTT messages are also normal (e.g. around the room temperature, no extreme values). My modification is as follows: diff --git a/src/rf_channel.rs b/src/rf_channel.rs
index 67b7db3..bf153b0 100644
--- a/src/rf_channel.rs
+++ b/src/rf_channel.rs
@@ -792,10 +792,17 @@ impl RfChannel {
/// Get the temperature of the channel in celsius.
pub fn get_temperature(&mut self) -> f32 {
- self.i2c_devices
- .temperature_monitor
- .get_remote_temperature()
- .unwrap()
+ loop {
+ match self.i2c_devices
+ .temperature_monitor
+ .get_remote_temperature() {
+ Ok(temp_c) => return temp_c,
+ // See: ST document ES0182 Rev 13, Section 2.5.6
+ // (https://www.st.com/resource/en/errata_sheet/dm00037591-stm32f405407xx-and-stm32f415417xx-device-limitations-stmicroelectronics.pdf)
+ Err(max6642::Error::Interface(stm32f4xx_hal::i2c::Error::BUS)) => {},
+ Err(e) => panic!("Error when reading RF channel temperature: {:?}", e)
+ }
+ }
}
/// Set the bias of the channel. Unlike Ryan's PR, my changes only applies to the MAX6642 I2C transactions, but it appears that it is already good enough to stop Booster from panicking. Does that mean it's more likely a hardware issue then? Currently I wouldn't have time to test with all other I2C devices but it might be a good idea for us to aim at eliminating any possible spurious errors across all devices. |
Not necessarily. The errata indicates that the chip erroneously detects bus fault conditions. That means that it's likely that some common characteristic of the MAX6642 connection appears to be more likely to trip this errata than the other buses. I wouldn't go so far as to say that there is an actual hardware issue without further investigation. In #151, I have updated the HAL such that no I2C transactions listen to the BUS_ERROR bit ( |
out of curiosity, what idle I2C transactions are there apart from those temperature sensors? Just the fans? I assumed that the majority of panics are related to the temperature sensors because they are the only chips on the bus with significant traffic. |
I agree with this. Without understanding more about the nature of the bug in the STM32F4 I2C controller it's hard to reason about. I need to check what measurements Greg and others have made. My understanding is that he has checked the I2C bus voltage levels, noise and timings (inc rise/fall time) and not seen anything suspicious. I believe he's also checked the supply to the MAX6642 and, again, not seen anything suspicious. It's always hard to prove a negative (this isn't a hardware issue beyond the SM32 I2C issue) but so far we've done a decent amount of background work and not seen anything suspicious. At present, the best plan I have is (assuming @ryan-summers's fix resolves the panic) to leave Booster logging for an extended period (month or longer) and see if we spot any glitches or anything else suspicious. If not, I'm inclined to say this is all the I2C core (but I'm open to other suggestions). |
However...
|
@hartytp Since the telemetry task routinely publishes the measurement results from calling Lines 944 to 966 in 8247e54
And I also encountered this error before, but only on one out of 4 Booster units that I've tested. So far, I don't see a pattern for it to happen. If I find more clues I'll share them over on #140. And thank you @ryan-summers and @hartytp for sharing your perspectives on the issue. It helps me learn how to deal with unexpected behaviours when many hardware interactions are taking place. |
Of course, thank you! I had forgotten about those. |
@ryan-summers I've just hit
This is with
Am I doing something wrong here, or does this suggest that #151 does not fix the bus errors? |
It should be impossible for the F4 HAL to generate |
any suggestions for how to debug? |
Can you confirm
|
Actually, based on your service log, you're not incorporating the most up-to-date commits from #151 |
@ryan-summers aah, yes, I hadn't noticed that you'd pushed another commit to that branch. I'll pull that as well and see how I get on. |
Running at 50kHz I2C with the latest I2C patch for ~24 hours without issue. That's a good start but not conclusive given how rare the I2C problems are. I'll keep an eye on it |
Let's not mix hypotheses and analyze one thing at a time. You appear to have already confirmed that changing the I2C frequency does not help. |
I'm not sure exactly what point you're making, but my current take on things is this: There appear to be two symptoms: bus errors and NACks both of which manifest with the current master. The observations so far are that with the reduced I2C clock frequency and the #151 applied I have not seen either of these issues after a couple of days of continuous operation. With the reduced I2C clock frequency alone I did see bus errors, I didn't confirm whether the NACK errors were also present. If by Monday I still haven't seen any issues with this configuration I will revert the I2C clock frequency change and see what happens. My hunch is that I will see NACKs but no bus errors, but we won't know until we have the data. |
You did see a BUS with 50 kHz in #128 (comment). |
Indeed I did, which is exactly what I said above: With the reduced I2C clock frequency alone I did see bus errors, I didn't confirm whether the NACK errors were also present.
I'm not sure if you intend that to come across as dismissively as I would naturally read it, but anyway...sure we can move a few of these posts to another issue. |
It's not trying to be dismissive, it's that the issue of NACKs is not covered by this issue, but rather #140. It is undesirable and counter productive to attempt to debug both issues at once, as it is easy to conflate one fix for another. This issue is specifically only in regard to bus errors. |
Man, you are really difficult. I'm only trying to keep things simple and separate the BUS errors from other issues. That's why I'm suggesting to exclude your clock changes from the discussion here. |
Right now the only data I have is with two patches applied simultaneously. Maybe it would have been better if I had applied them one at a time, but I didn't. The data I have is clearly relevant here as well as on the other issue so as far as I'm concerned either are valid places to post it. If you have independently tested and are happy that your patch resolves this issue, that's fine, but again, I haven't explicitly tested that. |
The update to the HAL makes a BUS error impossible to generate, so I am confident that this issue is resolved (as it should no longer be possible to occur). If we ever see a BUS error, this fix was improperly applied and we need a different investigative avenue. I have independently tested this patch locally as well and have not observed BUS errors. |
@hartytp Your data is relevant insofar as it excludes reducing I2C clock speed as a valid fix to this issue. |
Booster still seems to restart intermittently maybe once per hour (judging from the fan noise). This is when it's being left alone
The text was updated successfully, but these errors were encountered: