I2C bus fault observed when reading RF channel temperature #128

Closed · hartytp opened this issue Jan 15, 2021 · 82 comments · Fixed by #151
Labels: bug, hardware, i2c

hartytp commented Jan 15, 2021

Booster still seems to restart intermittently, maybe once per hour (judging from the fan noise). This happens while it's being left alone.

ryan-summers (Member) commented Jan 15, 2021

If you configure booster to power-on channels by default (e.g. save settings with a channel enabled), it will not enable on reboot if a watchdog occurs, because of https://github.com/quartiq/booster/blob/develop/src/rf_channel.rs#L523. This gives us a way to figure out whether this is a watchdog or a panic elsewhere in the code.

I'll continue running booster locally to see if I can reproduce these faults, but I'm suspicious that it may be related to more RF channels being installed.

hartytp (Author) commented Jan 15, 2021

Thanks.

I have already made all channels power on by default. I can confirm that Booster stopped putting RF out for 10 s or so: the lights went blank, the fans spun up, and the lights + RF came back.

ryan-summers (Member):

If the RF came back up without any user interaction, that indicates that this is likely a panic as opposed to a watchdog (which is a good thing, as it should be easier to root-cause). I'll poke around.

hartytp (Author) commented Jan 15, 2021

👍 thanks!

ryan-summers (Member):

Alright, I was able to see a watchdog occur on my bench setup using the debug build when I looped over the channels 4 times (since I have only 2 installed, looping 4 times should be roughly equivalent to scanning 8 channels). I'll do some debugging and analysis on Monday to see what's causing this.

ryan-summers added the "bug" label Jan 15, 2021
ryan-summers self-assigned this Jan 15, 2021
hartytp (Author) commented Jan 15, 2021

👍 thanks

ryan-summers changed the title from "Booster restarting" to "Booster watchdogs encountered with all RF channels installed" Jan 15, 2021
hartytp (Author) commented Jan 15, 2021

@ryan-summers one other data point here... I haven't confirmed whether all three Boosters I currently have display this behaviour. But I've noticed that the one we're using has a red LED on one channel that appears after some time (and is cleared by the crash/restart).

I haven't had a chance to look at what causes this error, but will open an issue when I do.

ryan-summers (Member):

The red LED is indicative of the BLOCKED channel state. This only occurs if the temperature is measured out of range (e.g. over 60 °C or below 5 °C) or if the overcurrent alert pin is low. It is intended to be cleared by device reset. My guess would be that it's the overcurrent alert tripping.
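For illustration, the blocking condition described here boils down to a check like the following (a minimal sketch with hypothetical names, not booster's actual code):

/// Minimal sketch of the BLOCKED policy described above; the names are
/// hypothetical rather than booster's actual API. A channel latches into
/// the blocked state when the measured temperature leaves the 5 to 60 °C
/// window or when the overcurrent /ALERT pin is asserted (active low).
fn should_block(temperature_c: f32, alert_pin_is_low: bool) -> bool {
    !(5.0..=60.0).contains(&temperature_c) || alert_pin_is_low
}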

hartytp (Author) commented Jan 15, 2021

> My guess would be that it's the overcurrent alert tripping.

How does that work? Is it a software detection of the current going too high, or do we try to check whether the LDO goes into foldback?

ryan-summers (Member) commented Jan 15, 2021 via email

hartytp (Author) commented Jan 18, 2021

@ryan-summers

> If you configure booster to power-on channels by default (e.g. save settings with a channel enabled), it will not enable on reboot if a watchdog occurs, because of https://github.com/quartiq/booster/blob/develop/src/rf_channel.rs#L523. This gives us a way to figure out whether this is a watchdog or a panic elsewhere in the code.

Doesn't this imply that this issue is a panic and not a watchdog issue?

ryan-summers (Member) commented Jan 18, 2021

> Doesn't this imply that this issue is a panic and not a watchdog issue?

It very well might. I'm adding service information reporting (e.g. stored panic info and watchdog detection) in #131 so we can determine whether a panic or a watchdog is occurring. If it is a panic, it should give us some information as to what exactly is happening.
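For reference, watchdog detection of this sort can be done on the STM32F4 by inspecting the reset flags in RCC_CSR at boot. A minimal sketch, with the register address and bit positions taken from RM0090 (how #131 actually implements it may differ):

use core::ptr;

/// Report whether the last reset was caused by the independent watchdog
/// by reading RCC_CSR (0x4002_3874 on the STM32F405/407), then clear the
/// reset flags via RMVF so the next boot starts clean.
fn watchdog_reset_detected() -> bool {
    const RCC_CSR: *mut u32 = 0x4002_3874 as *mut u32;
    const IWDGRSTF: u32 = 1 << 29; // independent watchdog reset flag
    const RMVF: u32 = 1 << 24; // write 1 to clear all reset flags
    unsafe {
        let csr = ptr::read_volatile(RCC_CSR);
        ptr::write_volatile(RCC_CSR, csr | RMVF);
        csr & IWDGRSTF != 0
    }
}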

hartytp (Author) commented Jan 18, 2021

ace!

ryan-summers changed the title from "Booster watchdogs encountered with all RF channels installed" to "Booster resets encountered with all RF channels installed" Jan 20, 2021
ryan-summers (Member) commented Jan 20, 2021

@hartytp We just merged in changes to add service reporting information. I would ask you to do the following:

  • Pull latest firmware + flash booster
  • Wait until you believe a reset has occurred
  • Connect to booster's USB port
  • Type `service` to get the service information and post it here

Note that if you type `service`, it will clear any pending panic info or watchdog detection, so it won't persist after a query.

Output should be similar to the following (ideally with specified panic info and/or watchdog information):
[screenshot: example `service` command output]

ryan-summers (Member) commented Jan 20, 2021

As a data point (I'm not sure whether this is a real issue), I saw a panic report indicating failed communications with the I2C temperature sensor during development. This fault may have been caused by processor cycling during development, but I figured I'd list it here just in case. I only saw it right before reflashing booster, and other tests have had booster running for a number of hours without reporting any faults.

hartytp (Author) commented Jan 21, 2021

service
Version             : v0.1.0-57-gb50bfa9
Git revision        : b50bfa9c19ca7ae06459bedd1d30af71fc7618b1
Features            :
Panic Info          : panicked at 'called `Result::unwrap()` on an `Err` value: Interface', src/rf_channel.rs:802:14

Watchdog Detected   : false

hartytp (Author) commented Jan 21, 2021

So it's something to do with the temperature code:

booster/src/rf_channel.rs, lines 798 to 803 in 90dd3b8:

pub fn get_temperature(&mut self) -> f32 {
    self.i2c_devices
        .temperature_monitor
        .get_remote_temperature()
        .unwrap()
}
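The panic location rf_channel.rs:802:14 corresponds to the `.unwrap()` call in this snippet.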

ryan-summers (Member):

It appears that I2C communication with the MAX6642 temperature sensor failed. This is the I2C sensor on the analog RF channel, so it's the furthest I2C device in the chain. This is identical to the fault that I reported above. I'll take a look at some of the underlying I2C code to see if I spot any issues. A potential workaround is to implement a retry of the communications, but I'd like to get to the underlying fault here.

hartytp (Author) commented Jan 21, 2021

> It appears that I2C communication with the MAX6642 temperature sensor failed. This is the I2C sensor on the analog RF channel, so it's the furthest I2C device in the chain. This is identical to the fault that I reported above. I'll take a look at some of the underlying I2C code to see if I spot any issues. A potential workaround is to implement a retry of the communications, but I'd like to get to the underlying fault here.

Agreed on all fronts. Thanks.

My guess is that #130 is the same underlying issue.

jordens (Member) commented Jan 21, 2021

@ryan-summers there is no room for any significant hardware debugging here. Let's just retry. That would also be the correct approach if the sensor is signalling busy through a NACK.
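A bounded retry along those lines might look like the following sketch (illustrative names, not booster's actual code); it covers a momentarily busy sensor while still surfacing an error for a genuinely unresponsive device:

/// Retry a fallible transaction up to `attempts` times, returning the
/// first success or the last error. Illustrative sketch only.
fn retry<T, E>(attempts: usize, mut transaction: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut result = transaction();
    for _ in 1..attempts {
        if result.is_ok() {
            return result;
        }
        result = transaction();
    }
    result
}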

hartytp (Author) commented Jan 21, 2021

@jordens assuming #130 does turn out to be a data corruption issue it won't be solved by retrying, will it?

jordens (Member) commented Jan 21, 2021

Data corruption would in all likelihood indicate a hardware issue outside of the scope of the firmware development.

hartytp (Author) commented Jan 21, 2021

It might, but it's hard to know without further info. Presumably this is the kind of thing where sticking a scope on the bus would give us a lot of information in a relatively time-efficient manner. E.g. if the bus has a bit too much capacitance, so we can't actually drive it quite as fast as we are, we can discuss whether that's a hardware issue or a firmware issue, but either way it's fixed by slowing things down a little.

Either way, the current status is that I have red LEDs showing up on channels on a daily basis, which makes this firmware somewhat unusable in a production environment...

ryan-summers (Member):

I'm adding a change to gather more descriptive I2C errors in #139, but it's difficult at this point to determine a correct course of action. There are a lot of failure modes that could result in these symptoms. For example:

  • there may be a supply disruption on the sensor power rail which causes it to temporarily power down and respond with NACK
  • there may be a temporary power supply failure that causes the I2C bus to hang
  • we could be receiving unexpected NACKs
  • the I2C line may be noisy

Each failure mode above has a different course of correction in my opinion.

We do not yet know if #130 is related to these I2C communication issues, although it appears likely. Determining whether #130 is caused by the /ALERT power supply monitor fault or an out-of-temperature detection will also inform this investigation.

hartytp (Author) commented Jan 21, 2021

> We do not yet know if #130 is related to these I2C communication issues, although it appears likely. Determining whether #130 is caused by the /ALERT power supply monitor fault or an out-of-temperature detection will also inform this investigation.

I'll post a device status next time I catch the errors. At the moment they're being cleared by the panics, which makes life a bit harder. I might hack something in to remove the panics and make it easier to catch the errors. Either way I'll post something before long.

jordens (Member) commented Jan 21, 2021

@hartytp we have gone through a lot of additional unexpected trouble to work around known and unknown hardware issues, design errors, and lacking or incorrect documentation. And as you see, we are continuing to provide that support. We are currently not planning to do any significant hardware debugging. The fact that we would be able to do it doesn't mean that we should or need to do it. I suggest that you or TS/CTI look at potential hardware issues.
The slow uptake of this firmware makes significant additional investments not viable for us.

ryan-summers (Member):

@hartytp I'm currently testing the patch on my local booster and will run the test overnight. If you would like to help out, the PR with the changes is #151. I've created a PR to update the STM32F4xx-HAL at stm32-rs/stm32f4xx-hal#275.

hartytp (Author) commented Mar 17, 2021

Great, thanks. I'll flash that tomorrow morning and let you know how I get on.

hartytp (Author) commented Mar 18, 2021

Okay, I've set a booster up with all channels set to enable on power-on, but one channel manually disabled. If we hit a panic then all green LEDs will be on.

I'll keep an eye on this until Monday. Based on past experience, we should expect to see a panic before then if the issue is not resolved.

HarryMakes (Contributor) commented Mar 19, 2021

Hi, as I mentioned, I've been running a modified Booster firmware without applying @ryan-summers' #151 or stm32-rs/stm32f4xx-hal#275, with all RF channels disabled, and the Booster didn't panic during the 42-hour test run (with the MQTT client connected to a broker without interruption). The temperatures reported in the MQTT messages are also normal (e.g. around room temperature, no extreme values). My modification is as follows:

diff --git a/src/rf_channel.rs b/src/rf_channel.rs
index 67b7db3..bf153b0 100644
--- a/src/rf_channel.rs
+++ b/src/rf_channel.rs
@@ -792,10 +792,17 @@ impl RfChannel {
 
     /// Get the temperature of the channel in celsius.
     pub fn get_temperature(&mut self) -> f32 {
-        self.i2c_devices
-            .temperature_monitor
-            .get_remote_temperature()
-            .unwrap()
+        loop {
+            match self.i2c_devices
+                .temperature_monitor
+                .get_remote_temperature() {
+                    Ok(temp_c) => return temp_c,
+                    // See: ST document ES0182 Rev 13, Section 2.5.6
+                    // (https://www.st.com/resource/en/errata_sheet/dm00037591-stm32f405407xx-and-stm32f415417xx-device-limitations-stmicroelectronics.pdf)
+                    Err(max6642::Error::Interface(stm32f4xx_hal::i2c::Error::BUS)) => {},
+                    Err(e) => panic!("Error when reading RF channel temperature: {:?}", e)
+                }
+        }
     }
 
     /// Set the bias of the channel.

Unlike Ryan's PR, my changes only apply to the MAX6642 I2C transactions, but it appears that this is already enough to stop Booster from panicking. Does that mean it's more likely a hardware issue then? Currently I wouldn't have time to test with all the other I2C devices, but it might be a good idea for us to aim at eliminating any possible spurious errors across all devices.

ryan-summers (Member) commented Mar 19, 2021

> Unlike Ryan's PR, my changes only apply to the MAX6642 I2C transactions, but it appears that this is already enough to stop Booster from panicking. Does that mean it's more likely a hardware issue then?

Not necessarily. The errata indicates that the chip erroneously detects bus fault conditions. It's likely that some common characteristic of the MAX6642 connection makes it more prone to tripping this errata than the other buses. I wouldn't go so far as to say that there is an actual hardware issue without further investigation.

In #151, I have updated the HAL such that no I2C transactions listen to the BUS_ERROR bit (stm32f4xx_hal::i2c::Error::BUS will never occur), as that is what the errata workaround recommends. We will see if that results in erroneous operation in the future, but I have so far seen no faults on Booster in the last 24 hours of testing.
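In spirit, the workaround amounts to treating a set BERR flag as spurious: clear it and carry on rather than aborting the transfer with a bus error. A rough sketch using PAC-style register access (field names as generated from the STM32F405 SVD; the actual HAL change in #151 may differ):

/// If SR1.BERR is set without a real bus problem (ES0182 section 2.5.6),
/// clear the flag and let the transfer continue instead of reporting
/// stm32f4xx_hal::i2c::Error::BUS.
fn ignore_spurious_bus_error(i2c: &stm32f4::stm32f405::i2c1::RegisterBlock) {
    if i2c.sr1.read().berr().bit_is_set() {
        // BERR is rc_w0: write 0 to clear.
        i2c.sr1.modify(|_, w| w.berr().clear_bit());
    }
}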

hartytp (Author) commented Mar 19, 2021

> Unlike Ryan's PR, my changes only apply to the MAX6642 I2C transactions, but it appears that this is already enough to stop Booster from panicking. Does that mean it's more likely a hardware issue then?

Out of curiosity, what idle I2C transactions are there apart from those temperature sensors? Just the fans? I assumed that the majority of panics are related to the temperature sensors because they are the only chips on the bus with significant traffic.

hartytp (Author) commented Mar 19, 2021

> I wouldn't go so far as to say that there is an actual hardware issue without further investigation.

I agree with this. Without understanding more about the nature of the bug in the STM32F4 I2C controller it's hard to reason about.

I need to check what measurements Greg and others have made. My understanding is that he has checked the I2C bus voltage levels, noise and timings (including rise/fall times) and not seen anything suspicious. I believe he's also checked the supply to the MAX6642 and, again, not seen anything suspicious. It's always hard to prove a negative (that this isn't a hardware issue beyond the STM32 I2C issue), but so far we've done a decent amount of background work and not seen anything suspicious.

At present, the best plan I have is (assuming @ryan-summers's fix resolves the panic) to leave Booster logging for an extended period (month or longer) and see if we spot any glitches or anything else suspicious. If not, I'm inclined to say this is all the I2C core (but I'm open to other suggestions).

hartytp (Author) commented Mar 19, 2021

However...

+----------------------+
| Booster Command Help :
+----------------------+
* `reset` - Resets the device
* `dfu` - Resets the device to DFU mode
* `read <PROP>` - Reads the value of PROP. PROP may be [ip-address, broker-address, mac, id, gateway, netmask]
* `write <PROP> <IP>` - Writes the value of <IP> to <PROP>. <PROP> may be [ip-address, broker-address, netmask, gateway] and <IP> must be an IP address (e.g.  192.168.1.1)
* `write id <ID>` - Write the MQTT client ID of the device. <ID> must be 23 or less ASCII characters.
* `service` - Read the service information. Service infromation clears once read.

> service
Version             : v0.2.0-3-g872fcf4
Git revision        : 872fcf4e98da8b0c86ab206af3f6ef30464ca4dd
Features            :
Panic Info          : panicked at 'called `Result::unwrap()` on an `Err` value: Interface(NACK)', src/chassis_fans.rs:110:64

Watchdog Detected   : false

HarryMakes (Contributor):

> Out of curiosity, what idle I2C transactions are there apart from those temperature sensors?

@hartytp Since the telemetry task routinely publishes the measurement results from calling rf_channel::RfChannel::get_status(), perhaps it is often busy monitoring the power via the MCP3221 and ADS7924 as well?

booster/src/rf_channel.rs, lines 944 to 966 in 8247e54:

/// Get the channel status.
pub fn get_status(&mut self, adc: &mut hal::adc::Adc<hal::stm32::ADC3>) -> ChannelStatus {
    let power_measurements = self.get_supply_measurements();
    let status = ChannelStatus {
        reflected_overdrive: self.pins.reflected_overdrive.is_high().unwrap(),
        output_overdrive: self.pins.output_overdrive.is_high().unwrap(),
        alert: self.pins.alert.is_low().unwrap(),
        temperature: self.get_temperature(),
        p28v_current: power_measurements.i_p28v0ch,
        p5v_current: power_measurements.i_p5v0ch,
        p5v_voltage: power_measurements.v_p5v0mp,
        input_power: self.get_input_power(),
        output_power: self.get_output_power(adc),
        reflected_power: self.get_reflected_power(adc),
        reflected_overdrive_threshold: platform::MAXIMUM_REFLECTED_POWER_DBM,
        output_overdrive_threshold: self.get_output_interlock_threshold(),
        bias_voltage: self.get_bias_voltage(),
        state: self.get_state(),
    };
    status
}

> Panic Info : panicked at 'called `Result::unwrap()` on an `Err` value: Interface(NACK)', src/chassis_fans.rs:110:64

I also encountered this error before, but only on one out of 4 Booster units that I've tested. So far, I don't see a pattern in when it happens. If I find more clues I'll share them over on #140.

And thank you @ryan-summers and @hartytp for sharing your perspectives on the issue. It helps me learn how to deal with unexpected behaviours when many hardware interactions are taking place.

hartytp (Author) commented Mar 19, 2021

> @hartytp Since the telemetry task routinely publishes the measurement results from calling rf_channel::RfChannel::get_status(), perhaps it is often busy monitoring the power via the MCP3221 and ADS7924 as well?

Of course, thank you! I had forgotten about those.

hartytp (Author) commented Mar 24, 2021

@ryan-summers I've just hit

service
Version             : v0.2.0-3-g872fcf4
Git revision        : 872fcf4e98da8b0c86ab206af3f6ef30464ca4dd-dirty
Features            :
Panic Info          : panicked at 'called `Result::unwrap()` on an `Err` value: Interface(BUS)', src/rf_channel.rs:798:14

This is with

 git log
commit 872fcf4e98da8b0c86ab206af3f6ef30464ca4dd (HEAD -> rs/issue-128/bus-errors, origin/rs/issue-128/bus-errors)
Author: Ryan Summers <[email protected]>
Date:   Wed Mar 17 17:46:00 2021 +0100

    Adding CHANGELOG file

commit 8e7b50060f77d0867f7d4dfbb66db7fad399776c
Author: Ryan Summers <[email protected]>
Date:   Wed Mar 17 17:45:14 2021 +0100

    Updating HAL dependency

commit 8247e54f46465844383d730d1dc695c1cda66de9 (origin/develop, origin/HEAD)
Merge: 3c77019 eb2c887
Author: Ryan Summers <[email protected]>
Date:   Tue Feb 2 17:27:58 2021 +0100

    Merge pull request #149 from quartiq/master

    Re-merging master into develop

commit eb2c887cb0e3fd9a7a40db5056660538c5fbb5d1 (tag: v0.2.0)
Merge: 5d08a80 3c77019
Author: Ryan Summers <[email protected]>
Date:   Tue Feb 2 17:18:53 2021 +0100

    Merge pull request #148 from quartiq/develop

    Release 0.2.0
git diff
diff --git a/src/main.rs b/src/main.rs
index 612cb16..0638c08 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -215,7 +215,7 @@ const APP: () = {
                 let i2c = {
                     let scl = gpiob.pb6.into_alternate_af4_open_drain();
                     let sda = gpiob.pb7.into_alternate_af4_open_drain();
-                    hal::i2c::I2c::i2c1(c.device.I2C1, (scl, sda), 100.khz(), clocks)
+                    hal::i2c::I2c::i2c1(c.device.I2C1, (scl, sda), 50.khz(), clocks)
                 };

                 tca9548::Tca9548::default(i2c, &mut i2c_mux_reset, &mut delay).unwrap()

Am I doing something wrong here, or does this suggest that #151 does not fix the bus errors?

ryan-summers (Member) commented Mar 24, 2021

It should be impossible for the F4 HAL to generate Error::BUS at this point, so there's likely some dependency misconfiguration going on here.

hartytp (Author) commented Mar 24, 2021

Any suggestions for how to debug?

ryan-summers (Member):

Can you confirm Cargo.lock contains the proper git hash for stm32f4xx-hal? It should be along the lines of:

[[package]]
name = "stm32f4xx-hal"
version = "0.8.3"
source = "git+https://github.com/stm32-rs/stm32f4xx-hal#c8b11b200ac3c5f9be802f8f9e39755b1cb5933a"
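If the pinned hash is stale, running `cargo update -p stm32f4xx-hal` should re-resolve the git dependency to the latest commit on the tracked branch.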

ryan-summers (Member):

Actually, based on your service log, you're not incorporating the most up-to-date commits from #151.

hartytp (Author) commented Mar 24, 2021

@ryan-summers aah, yes, I hadn't noticed that you'd pushed another commit to that branch. I'll pull that as well and see how I get on.

hartytp (Author) commented Mar 25, 2021

I've been running at 50 kHz I2C with the latest I2C patch for ~24 hours without issue. That's a good start, but not conclusive given how rare the I2C problems are. I'll keep an eye on it.

jordens (Member) commented Mar 26, 2021

Let's not mix hypotheses and analyze one thing at a time. You appear to have already confirmed that changing the I2C frequency does not help.

hartytp (Author) commented Mar 26, 2021

> Let's not mix hypotheses and analyze one thing at a time. You appear to have already confirmed that changing the I2C frequency does not help.

I'm not sure exactly what point you're making, but my current take on things is this:

There appear to be two symptoms, bus errors and NACKs, both of which manifest with the current master.

The observations so far are that with the reduced I2C clock frequency and #151 applied, I have not seen either of these issues after a couple of days of continuous operation. With the reduced I2C clock frequency alone I did see bus errors; I didn't confirm whether the NACK errors were also present.

If by Monday I still haven't seen any issues with this configuration I will revert the I2C clock frequency change and see what happens. My hunch is that I will see NACKs but no bus errors, but we won't know until we have the data.

jordens (Member) commented Mar 26, 2021

You did see a BUS error with 50 kHz in #128 (comment).
This issue is about BUS errors. It would be good to move your clock speed hunch over to the other issue, if you want to pursue that.

hartytp (Author) commented Mar 26, 2021

> You did see a BUS error with 50 kHz in #128 (comment).

Indeed I did, which is exactly what I said above: "With the reduced I2C clock frequency alone I did see bus errors; I didn't confirm whether the NACK errors were also present."

> It would be good to move your clock speed hunch over to the other issue, if you want to pursue that.

I'm not sure if you intend that to come across as dismissively as I would naturally read it, but anyway... sure, we can move a few of these posts to another issue.

ryan-summers (Member):

> I'm not sure if you intend that to come across as dismissively as I would naturally read it, but anyway... sure, we can move a few of these posts to another issue.

It's not meant to be dismissive; it's that the issue of NACKs is not covered by this issue, but rather by #140. It is undesirable and counterproductive to attempt to debug both issues at once, as it is easy to conflate one fix with another. This issue is specifically only in regard to bus errors.

jordens (Member) commented Mar 26, 2021

Man, you are really difficult. I'm only trying to keep things simple and separate the BUS errors from other issues. That's why I'm suggesting to exclude your clock changes from the discussion here.

ryan-summers (Member):

@hartytp I have just merged the fix for this issue - please continue discussion of the lower 50 kHz I2C clock frequency in #140, as that is intended to fix NACKs encountered with the fan controllers.

hartytp (Author) commented Mar 26, 2021

Right now the only data I have is with two patches applied simultaneously. Maybe it would have been better if I had applied them one at a time, but I didn't. The data I have is clearly relevant here as well as on the other issue, so as far as I'm concerned either is a valid place to post it.

If you have independently tested and are happy that your patch resolves this issue, that's fine, but again, I haven't explicitly tested that.

ryan-summers (Member):

The update to the HAL makes a BUS error impossible to generate, so I am confident that this issue is resolved. If we ever see a BUS error again, the fix was improperly applied and we need a different investigative avenue.

I have independently tested this patch locally as well and have not observed BUS errors.

jordens (Member) commented Mar 26, 2021

@hartytp Your data is relevant insofar as it excludes reducing the I2C clock speed as a valid fix for this issue.
@ryan-summers's data supports the hypothesis that #151 is indeed a valid fix for this issue.
