Debug test suite unexpected failures against SiFive HiFive1 board #1047
Comments
@TommyMurphyTM1234 I don't have a SiFive HiFive1 but maybe we can debug this... It may take a few experiments.
I suggest we start by answering this question. In #869 @JanMatCodasip established a baseline against riscv-openocd commit e0dd44a. Could you build riscv-openocd from e0dd44a (mentioned above) and run the test suite against the current riscv-tests TOT? I want to make sure that there were no incompatible changes in the riscv-tests repository first. Also, could you please attach the logs from the run where you get "all but one test failing"? Maybe I can try and figure out the reason from the logs alone. |
Thanks for the suggestions @aap-sc. I built OpenOCD from e0dd44a:
and used the top of the tree RISC-V test suite and that gave the expected results:
In case they're needed I've attached the zipped logs for these tests: logs.zip Note that I did not use |
Sorry - I should have posted the logs in the first place. Just to clarify, this is using OpenOCD built from the top of the tree:
plus this fix for the 0.11 segfault: And the RISC-V tests from the top of the tree. Here are the results that I get:
And here are the zipped logs for these tests: logs.zip If you need any further information or tests run let me know. |
@TommyMurphyTM1234 I've glanced at the logs. Looks like there is a common issue with these tests:
I don't see this error in your previous logs. I'd like to request a few adjustments to how you run the tests:
I don't remember what
|
So basically I suggest doing a kind of bisection (with a few "educated guesses" as bisection steps, instead of a proper bisection). Just with |
Notes:
|
@TommyMurphyTM1234 good, this is something. And if you don't mind, could you run the tests with the same command line option (-d3) against the baseline (openocd e0dd44a, riscv-tests TOT)? I want to make sure that -d3 does not affect the run. |
Zipped logs: logs.zip |
@JanMatCodasip and I have access to a HiFive1 A01 board as well. I ran the current riscv-openocd master (3991492; EDIT: wrong branch, disregard) and riscv-tests master (6b1d7372d951ed75811e0a09c0fe9e065c141c2d) and got this result:
With accompanying logs: For completeness, I am running GCC (and GDB) version 12.2.0-3 (xpack) |
Thanks @MarekVCodasip.
I'm confused - how come you didn't hit this segfault: necessitating the manual integration of the patch to fix it?
At a glance these results look correct given that the fail/exception tests would presumably be masked if
What do you get for the Flash tests?
I wonder if some of my issues arise from the use of a virtual machine and maybe I should try on a physical machine instead? |
No idea. Could be that the undefined behavior manifests differently.
Logs: Given the large number of exceptions, I suspected one of the tests puts the target into some weird state OpenOCD cannot recover it from. Running the test suite again returns just exceptions without a single pass. Pressing the physical reset button on the board made some tests pass again. |
That's not what your logs say:
I can't find that commit (04154af) in this repo: |
Sorry, I have made a mistake. This was the |
Yes, it all now fails with exceptions built off commit 3991492
I will investigate the master branch issue later. I think we should revise it in the whole repository if it is needed at all. |
@MarekVCodasip could you please double-check @TommyMurphyTM1234's results when running the test suite against b7e7a03? My guess is that the massive failures on current TOT should be fixed by #1045. However, it looks like the regression on v0.11 happened a while ago. |
I've opened a new issue for this: |
I think so too.
@MarekVCodasip's logs show this:
I'm not sure why this doesn't have the "Segmentation fault" error message but it seems to exit/crash at the same point. |
For said commit (and riscv-tests master) it seems that I also get a full set of exceptions:
I did manually reset the board, just in case, before running the test suite. Logs: logs.zip |
I still need to get back to this but I'm curious as to why you suggest a manual/educated guess bisection approach rather than just using |
Well, I'm just worried that a proper bisect will take too much time and it will be easier to just guess the offending commit. If someone (like you :)) has the bandwidth to do a proper bisect, you are very much welcome.
I'll need some time to look at these logs and try to figure out what went wrong and then glance at the changes we had. This may take some time. |
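For reference, the "proper bisect" being weighed here is a binary search over the commit range between the known good and known bad builds. The following is only a sketch of that logic in Python, not of git itself: the commit names, the `fake_test` predicate and the position of the first bad commit are all hypothetical, and it assumes a single regression (every commit after the first bad one also fails), which may not hold if more than one issue overlaps in the range.

```python
from typing import Callable, Sequence


def bisect_first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good index, hi: known bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical linear history: placeholder names, not real commits.
history = ["good-base", "c1", "c2", "c3", "c4", "bad-tip"]
first_bad_index = 3  # assumption for the demo only
tested = []


def fake_test(commit: str) -> bool:
    """Stand-in for 'build OpenOCD at this commit and run the test suite'."""
    tested.append(commit)
    return history.index(commit) >= first_bad_index


culprit = bisect_first_bad(history, fake_test)
```

Each step halves the remaining range, so the number of builds and test runs grows with the logarithm of the number of commits, which is the cost being traded off against an educated guess.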
As I mentioned elsewhere it seems to me a bad idea that the test suite doesn't seem to reset the target before each test.
So all tests bar Then if I run
But if I reset the board manually using the red button I get this:
So it looks like one test failing due to some anomaly/exception may cause subsequent tests to fail. |
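The cascade described above can be illustrated with a small model. This is a sketch under stated assumptions, not the riscv-tests implementation: `FakeTarget`, `run_suite`, the `wedged` flag and the test names are all hypothetical, and "reset" here simply clears the flag rather than doing anything a real board or OpenOCD would do.

```python
class FakeTarget:
    """Hypothetical stand-in for a board; `wedged` models leftover bad state."""

    def __init__(self):
        self.wedged = False

    def reset(self):
        # Models pressing the reset button: clears whatever a test left behind.
        self.wedged = False


def run_suite(target, tests, reset_before_each=False):
    """Run tests in order and return {name: "pass" | "fail" | "exception"}."""
    results = {}
    for name, test in tests:
        if reset_before_each:
            target.reset()
        if target.wedged:
            # The test cannot even start from a sane state.
            results[name] = "exception"
            continue
        results[name] = "pass" if test(target) else "fail"
    return results


def well_behaved(target):
    return True


def leaves_target_wedged(target):
    target.wedged = True  # fails and leaves the target in an anomalous state
    return False


tests = [("A", well_behaved), ("B", leaves_target_wedged), ("C", well_behaved)]
without_reset = run_suite(FakeTarget(), tests)
with_reset = run_suite(FakeTarget(), tests, reset_before_each=True)
```

In this model, test C reports an exception only when no reset happens between tests, even though C itself is fine. The flip side is that a per-test reset would also hide the fact that B left the target in a bad state.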
I restricted the testing to just two tests - With the known good e0dd44a commit I get:
With the known bad commit b7e7a03 I get this:
I then used git bisect to pinpoint this commit:
The failure log with a build from commit 3f1339f is here: 20240422-183238-HiFive1-DebugBreakpoint.log I'll continue to look into this in more detail later. |
@TommyMurphyTM1234, AFAIR the issue with 3f1339f is #996 and it was addressed by #997. Please, try ec28cf0. |
Thanks @en-sc - yes - I was just going through the details/issues/PRs and noticed #997 alright.
At a glance of the log I can't really tell what goes wrong to cause the exception...
|
As I found before... If I run the
But if I do a
I probably need to capture the verbose log for the failing case... |
For some reason I don't see this in the verbose log:
and the halt address is different in the non verbose test:
versus the verbose test:
So I don't know if |
Similar tests but using telnet instead of GDB in case this helps at all... Without
With
|
@TommyMurphyTM1234, @MarekVCodasip, I would suggest to proceed by running
|
@TommyMurphyTM1234, I would also suggest to uncomment the line you have mentioned to reset the target (https://github.com/riscv-software-src/riscv-tests/blob/6b1d7372d951ed75811e0a09c0fe9e065c141c2d/debug/targets/SiFive/HiFive1.cfg#L29). |
Thanks @en-sc - I will follow up on that ASAP. In the meantime I modified
and the results look better but still not as expected:
This is not a proposed fix but simply an experiment. Could there have been some change along the way that nullified some sort of implicit or explicit reset and thus caused problems with the tests? As I've said before though, at a high level it would seem logical to me that a test suite would ensure up front for each test that the target is in a known good state - and, for example, not in an anomalous state from a previous test or previous use - to ensure that the test is running from a clean slate. I don't know why none of the tests seem to do this. |
I will try that but my recollection is that I did before and it didn't improve matters. Edit: yes - if I uncomment the then even the
Only if I do |
|
Thanks! I will look into this, though I don't remember such changes off the top of my head.
This is true. However, the current behavior makes it possible to check that the target is left in a consistent state after a disconnect, so I'm not sure this should be done without a considerable extension to the test suite intended to cover such cases. |
@TommyMurphyTM1234 thanks for the logs. I think these should help. A little off topic, If you will, while we are on it - a few notes regarding the
So the initialization script initializes flash memory, performs examine (via init), halts the target and then performs some flash magic (via
Strictly speaking this sequence is incorrect, since after the reset command execution starts from the reset vector, which is not necessarily the same as the entry point of the application.
Just FYI. |
When I used
I don't really understand this.
Maybe so but, in practice, the above sequence resulted in most or all of the test programs executing correctly from Seems to me that the As I mentioned above that was just an experiment and not a suggestion of any sort of fix but I thought that it might be relevant that things worked "better" with
Why would the RAM based tests (which were the ones that I was focusing on here) be doing anything at all with flash? |
This is very strange.
I mean
Well it should. Either the reset vector on your board matches the entry point, or, if you read the PC via GDB, you just read a cached copy.
When we call |
OK - but I think that we can forget about the reset issues (which I've always found confusing in the OpenOCD context to be honest) for the moment here other than to note that with a build of OpenOCD at commit e0dd44a the test suite runs as expected but with a build of any subsequent commit it does not (although From the various logs/data that I have collected so far I unfortunately have no clearer idea where the problem lies that causes the test suite to fail. |
@TommyMurphyTM1234, I think I have found the issue! Will post an explanation and a patch ASAP. |
So the other issue with 3f1339f is redirection from |
Should be addressed by #1054.
@TommyMurphyTM1234, could you please check the second configuration (TOT RISC-V OpenOCD + #1046 + #1054)? |
Thanks a lot @en-sc.
This looks a lot better although the results don't completely tally with expected fail/exception cases as currently detailed in the HiFive1 exclusion file (which itself I need to review):
And I don't know why that Python error cropped up. Thanks a lot - I would have struggled to identify the root cause myself. 👍 |
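Checking whether results "tally" with an exclusion list amounts to two set comparisons: tests that did not pass but were not expected to fail, and tests that passed although they were listed as expected failures. The sketch below assumes the results and the exclusion list have already been parsed into a dict and a set; the actual file formats are not shown in this thread, and the test names are placeholders.

```python
def compare_with_exclusions(observed, expected_non_pass):
    """Split results into unexpected failures and unexpected passes.

    observed: dict mapping test name -> "pass", "fail" or "exception"
    expected_non_pass: set of test names expected to fail or raise
    """
    unexpected_failures = sorted(
        name for name, result in observed.items()
        if result != "pass" and name not in expected_non_pass
    )
    unexpected_passes = sorted(
        name for name, result in observed.items()
        if result == "pass" and name in expected_non_pass
    )
    return unexpected_failures, unexpected_passes


# Placeholder data, not taken from any real run.
observed = {"TestA": "pass", "TestB": "exception", "TestC": "fail", "TestD": "pass"}
expected_non_pass = {"TestC", "TestD"}
unexpected_failures, unexpected_passes = compare_with_exclusions(
    observed, expected_non_pass
)
```

Unexpected passes are worth listing too, since they suggest the exclusion list itself may be out of date.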
Just for completeness ... the following tests are unexpected failures/exceptions with respect to the current exclusions file:
I will look into these and maybe compare the logs against the previous known good OpenOCD at commit e0dd44a to try to diagnose them in case there is still some other OpenOCD issue at play here. But I think that this issue can be marked fixed by these PRs once merged: |
Quick update ... it looks to me like the Perhaps this merits its own issue but I'll post further diagnostic info here for now in a bit. Edit: ok - I split this out into a separate issue: |
New issue created to split off from this PR discussion:
Running the debug test suite:
against the SiFive HiFive1 board results in all but one test failing.
This does not tally with what I believe are the expected results:
(Ignore the minor discrepancy in the total number of tests in some of these transcripts as this is not of significance in this context).
Note also that the HiFive1 specific exclusion list can also be used to refine these results:
I looked at the first unexpected failure on the DebugBreakpoint test. Reproducing it by manually running OpenOCD and GDB and feeding the relevant commands I see that if I do a GDB si single instruction step from _start then the program immediately goes off into the weeds at address 0x2040016c.
However if I do monitor reset init early on in the debug session then this does not happen:
Admittedly this is only one of the many unexpected failure cases.
However it may suggest that some sort of reset at the start of this (and other?) test(s) may be required?
But why did the test suite run OK before but not now?
Have changes to OpenOCD in the meantime caused some problem that results in almost all tests failing?
Why is the reset command commented out here?
Uncommenting it doesn't really help though - other than making the test fail more quickly.
DebugBreakpoint test failure log: HiFive1-DebugBreakpoint.log
DebugBreakpoint test failure log (with reset enabled): HiFive1-DebugBreakpoint-with-reset.log