Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

u-boot: Fix the SPI and eMMC hang issue #570

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Jan Kiszka <[email protected]>
Date: Mon, 30 Oct 2023 17:20:29 +0100
Subject: [PATCH] spi: cadence-quadspi: Fix error message on stuck busy state

We are not iterating CQSPI_REG_RETRY, we are waiting 'timeout' ms, since
day 1.

Signed-off-by: Jan Kiszka <[email protected]>
Reviewed-by: Stefan Roese <[email protected]>
Reviewed-by: Jagan Teki <[email protected]>
---
drivers/spi/cadence_qspi_apb.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/spi/cadence_qspi_apb.c b/drivers/spi/cadence_qspi_apb.c
index 9ce2c0f254f3..d033184aa466 100644
--- a/drivers/spi/cadence_qspi_apb.c
+++ b/drivers/spi/cadence_qspi_apb.c
@@ -171,8 +171,7 @@ static unsigned int cadence_qspi_wait_idle(void *reg_base)
}

/* Timeout, still in busy mode. */
- printf("QSPI: QSPI is still busy after poll for %d times.\n",
- CQSPI_REG_RETRY);
+ printf("QSPI: QSPI is still busy after poll for %d ms.\n", timeout);
return 0;
}

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Ronald Wahl <[email protected]>
Date: Wed, 11 Dec 2024 21:51:04 +0100
Subject: [PATCH] spi: cadence-quadspi: fix potential malfunction after ~49
days uptime

The get_timer function returns an unsigned long which may be calculated
from the ARM system counter. This counter is reset only on a cold reset.
U-boot divides this counter down to a 1000 Hz counter that will cross
the 32bit barrier after a bit more than 49 days. Assigning the value to
an unsigned int will truncate it on 64bit systems.
Passing this truncated value back to the get_timer function will return
a very large value that is certainly larger than the timeout and so will
go down the error path and besides stopping U-Boot will lead to messages
like

"SPI: QSPI is still busy after poll for 5000 ms."

Signed-off-by: Ronald Wahl <[email protected]>
Cc: Vignesh R <[email protected]>
Cc: Pratyush Yadav <[email protected]>
---
drivers/spi/cadence_qspi_apb.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/spi/cadence_qspi_apb.c b/drivers/spi/cadence_qspi_apb.c
index d033184aa466..ecbd6f9d147d 100644
--- a/drivers/spi/cadence_qspi_apb.c
+++ b/drivers/spi/cadence_qspi_apb.c
@@ -152,9 +152,9 @@ static int cadence_qspi_set_protocol(struct cadence_spi_priv *priv,
/* Return 1 if idle, otherwise return 0 (busy). */
static unsigned int cadence_qspi_wait_idle(void *reg_base)
{
- unsigned int start, count = 0;
+ unsigned long start, count = 0;
/* timeout in unit of ms */
- unsigned int timeout = 5000;
+ unsigned long timeout = 5000;

start = get_timer(0);
for ( ; get_timer(start) < timeout ; ) {
@@ -171,7 +171,7 @@ static unsigned int cadence_qspi_wait_idle(void *reg_base)
}

/* Timeout, still in busy mode. */
- printf("QSPI: QSPI is still busy after poll for %d ms.\n", timeout);
+ printf("QSPI: QSPI is still busy after poll for %lu ms.\n", timeout);
return 0;
}

Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Ronald Wahl <[email protected]>
Date: Wed, 11 Dec 2024 21:52:00 +0100
Subject: [PATCH] mmc: Fix potential timer value truncation

On 64bit systems the timer value might be truncated to a 32bit value
causing malfunctions. For example on ARM the timer might start from 0
again only after a cold reset. The 32bit overflow occurs after a bit
more than 49 days (1000 Hz counter) so booting after that time may lead
to a surprise because the board might become stuck requiring a cold
reset.

Signed-off-by: Ronald Wahl <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Jaehoon Chung <[email protected]>
---
drivers/mmc/mmc.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/mmc.c b/drivers/mmc/mmc.c
index 31cfda288587..2779302b843e 100644
--- a/drivers/mmc/mmc.c
+++ b/drivers/mmc/mmc.c
@@ -713,7 +713,7 @@ static int mmc_send_op_cond(struct mmc *mmc)
{
int err, i;
int timeout = 1000;
- uint start;
+ ulong start;

/* Some cards seem to need this */
mmc_go_idle(mmc);
@@ -808,7 +808,8 @@ int mmc_send_ext_csd(struct mmc *mmc, u8 *ext_csd)
static int __mmc_switch(struct mmc *mmc, u8 set, u8 index, u8 value,
bool send_status)
{
- unsigned int status, start;
+ ulong start;
+ unsigned int status;
struct mmc_cmd cmd;
int timeout_ms = DEFAULT_CMD6_TIMEOUT_MS;
bool is_part_switch = (set == EXT_CSD_CMD_SET_NORMAL) &&
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ Signed-off-by: Li Hua Qian <[email protected]>
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/mmc.c b/drivers/mmc/mmc.c
index dff0ff89a801..99a92f44bef4 100644
index 2779302b843e..7554730a0489 100644
--- a/drivers/mmc/mmc.c
+++ b/drivers/mmc/mmc.c
@@ -651,8 +651,16 @@ static int sd_send_op_cond(struct mmc *mmc, bool uhs_en)
Expand Down
7 changes: 4 additions & 3 deletions recipes-bsp/u-boot/u-boot-iot2050_2023.10.bb
Original file line number Diff line number Diff line change
Expand Up @@ -23,9 +23,10 @@ SRC_URI += " \
file://0009-dts-iot2050-Sync-kernel-dts-to-u-boot.patch \
file://0010-dts-iot2050-Support-new-IOT2050-SM-variant.patch \
file://0011-arm-dts-iot2050-Disable-lock-step-mode-for-all-iot20.patch \
file://0012-qspi-iot2050-Add-temporary-workaround-for-the-QSPI-i.patch \
file://0013-driver-iot2050-Add-a-temporary-workaround-for-the-eMMC.patch \
file://0014-driver-iot2050-Add-a-temporary-workaround-for-the-SD.patch \
file://0012-spi-cadence-quadspi-Fix-error-message-on-stuck-busy-.patch \
file://0013-spi-cadence-quadspi-fix-potential-malfunction-after-.patch \
file://0014-mmc-Fix-potential-timer-value-truncation.patch \
file://0015-driver-iot2050-Add-a-temporary-workaround-for-the-SD.patch \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need patch 15, given patch 14?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually during my speeded-up reproducing (by using a u16 start), I have not observed the SD card issue, this echoes the code exactly, no wrong sized timer variables are used in the SD driver part.

So I think the SD issue is a different one, so I keep the patch 15.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, I actually tested the SD on an advanced device, IIRC, the SD card issue is reported on a basic device, let me try it on the basic device then let's see.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested on both PG1 & 2 basic devices, SD card booting does not have such issue pattern. So better to keep the patch 15.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the mmc fails when also SPI failed
  • patch 15 also deals with timeouts
  • the patched code also used an "int" for holding the timeout
  • patch 14 changes some uint to and ulong in the same code file to address an overflow issue

Isn't all that screaming for a closer look? I suspect sd_send_op_cond is just missing the same fix as __mmc_switch and mmc_send_op_cond receive via patch 14. Please do so, upstream first.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The int timeout in sd_send_op_cond does not bind to any get_timer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, but it does not change the key point: Either, patch 15 is no obsolete because the root cause was found and fixed, or we should look for any remaining timer overflows that caused that issues. It is more than likely that the same root cause applies to "our" mmc issue as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's also what I thought, however, after a deep dive into the SD card calling stack (not only, I've checked almost every get_timer calling in the u-boot -- indeed we are in lucky, there are few u32 cases, and most of them are command which does not matter in our issue...), I found there is no other place to overflow the timer, besides the reported place, so this SD card issue is either a false alarm or a different issue, in this case, better to keep the patch 15, since that patch won't cause anything harmful even if it does not fix the issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tune get_time to return an overflowing value until the mmc issue reoccurs and debug the root cause / validate that it is fixed by patch 14. Just keeping the hack of patch 15 is not really the way forward anymore, I would say.

"

SRC_URI[sha256sum] = "e00e6c6f014e046101739d08d06f328811cebcf5ae101348f409cbbd55ce6900"
Expand Down
Loading