Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AP_Scripting: fixed hard fault on handling of lua exception #27056

Merged
merged 2 commits into from
May 14, 2024

Conversation

tridge
Copy link
Contributor

@tridge tridge commented May 13, 2024

this PR adds a test that exercises rapid fault handling and fixes it by shifting the save of the extra registers before the setjmp call and adding save/restore of the first 16 fp registers
The test was derived from the networking web server. The reason the web server triggered this issue is it continues running after a fault, and continually calls async functions (server side lua calls) that may fault. Most lua scripts when they fault will stop, so only have one chance to trigger the bug. The web server gets lots of chances per second to trigger the bug.
without the fix this lua script hard faults in a few seconds
This bug has been confirmed to happen with 4.3.x and 4.4.x as well as current stable 4.5.x

@tridge tridge changed the title AP_Scripting: added example script that causes a hard fault AP_Scripting: fixed hard fault on handling of lua exception May 13, 2024
@tridge tridge added the ChibiOS label May 13, 2024
@IamPete1
Copy link
Member

IamPete1 commented May 13, 2024

I have done some testing on a CubeOrangePlus. Both with the example script in this PR and the math.random(1,0) with a scripting restart from #24231

Master:
random is OK and the example fails.

This PR:
Both random and example are OK.

Master with both the stash and pop removed:
Both random and example are OK.

Master with stash, pop and O0 optimize removed from lua:
Both random and example are OK.

Master with stash, pop and O0 optimize removed from lua and from the scripting thread:
example is OK, random fails.

So, from my testing its part of the last fix that causes this issue.

@tridge tridge force-pushed the pr-scripting-fault-fix branch from 1d60829 to c831b98 Compare May 13, 2024 21:08
@tridge
Copy link
Contributor Author

tridge commented May 13, 2024

@IamPete1 as discussed, I think the save of the extra fp registers is needed. A proper test suite for setjmp/longjmp would be nice to have!

@tpwrules
Copy link
Contributor

Registers s0-s15 are caller-saved, so theoretically luaD_rawrunprotected can do whatever it wants and does not need to preserve them. This assumes that it does not itself use them, which it shouldn't.

@tridge
Copy link
Contributor Author

tridge commented May 14, 2024

Sid and I did some validation of the fix like this:
image
we added code that setup the fp registers with known values and validated that before this PR it does not restore correctly and after the PR it does

@tridge
Copy link
Contributor Author

tridge commented May 14, 2024

Registers s0-s15 are caller-saved, so theoretically luaD_rawrunprotected can do whatever it wants and does not need to preserve them

can you point me at a reference for that?

@tridge
Copy link
Contributor Author

tridge commented May 14, 2024

this function with O0 opt:

Dump of assembler code for function luaD_rawrunprotected:
   0x080ff01c <+0>:     push    {r7, lr}
   0x080ff01e <+2>:     sub     sp, #120        ; 0x78
   0x080ff020 <+4>:     add     r7, sp, #0
   0x080ff022 <+6>:     str     r0, [r7, #12]
   0x080ff024 <+8>:     str     r1, [r7, #8]
   0x080ff026 <+10>:    str     r2, [r7, #4]
   0x080ff028 <+12>:    ldr     r3, [r7, #12]
   0x080ff02a <+14>:    ldrh.w  r3, [r3, #110]  ; 0x6e
   0x080ff02e <+18>:    strh.w  r3, [r7, #118]  ; 0x76
   0x080ff032 <+22>:    movs    r3, #0
   0x080ff034 <+24>:    str     r3, [r7, #112]  ; 0x70
   0x080ff036 <+26>:    ldr     r3, [r7, #12]
   0x080ff038 <+28>:    ldr     r3, [r3, #48]   ; 0x30
   0x080ff03a <+30>:    str     r3, [r7, #16]
   0x080ff03c <+32>:    ldr     r3, [r7, #12]
   0x080ff03e <+34>:    add.w   r2, r7, #16
   0x080ff042 <+38>:    str     r2, [r3, #48]   ; 0x30
   0x080ff044 <+40>:    mov.w   r9, #572662306  ; 0x22222222
   0x080ff048 <+44>:    vmov    s3, r9
   0x080ff04c <+48>:    mov.w   r9, #1145324612 ; 0x44444444
   0x080ff050 <+52>:    vmov    s17, r9
   0x080ff054 <+56>:    add.w   r3, r7, #16
   0x080ff058 <+60>:    adds    r3, #4
   0x080ff05a <+62>:    mov     r0, r3
   0x080ff05c <+64>:    bl      0x8021314 <setjmp>
   0x080ff060 <+68>:    mov     r3, r0
   0x080ff062 <+70>:    cmp     r3, #0
   0x080ff064 <+72>:    bne.n   0x80ff082 <luaD_rawrunprotected+102>
   0x080ff066 <+74>:    vpush   {s16-s31}
   0x080ff06a <+78>:    mov.w   r9, #858993459  ; 0x33333333
   0x080ff06e <+82>:    vmov    s3, r9
   0x080ff072 <+86>:    mov.w   r9, #2576980377 ; 0x99999999
   0x080ff076 <+90>:    vmov    s17, r9
   0x080ff07a <+94>:    ldr     r3, [r7, #8]
   0x080ff07c <+96>:    ldr     r1, [r7, #4]
   0x080ff07e <+98>:    ldr     r0, [r7, #12]
   0x080ff080 <+100>:   blx     r3
   0x080ff082 <+102>:   vpop    {s16-s31}
=> 0x080ff086 <+106>:   ldr     r2, [r7, #16]

with optimisation:

Dump of assembler code for function luaD_rawrunprotected:
   0x080ff12c <+0>:     cmp     r1, r2
   0x080ff12e <+2>:     it      cc
   0x080ff130 <+4>:     strcc   r2, [r3, #4]
   0x080ff132 <+6>:     ldr     r3, [r5, #108]  ; 0x6c
   0x080ff134 <+8>:     blx     r3
   0x080ff136 <+10>:    bl      0x813f3f8 <coerce_to_uint32_t(lua_State*, int)+44>
   0x080ff13a <+14>:    movs    r0, r0
   0x080ff13c <+16>:    push    {r0, r1, r2, r3, r4, r5, r6, lr}
   0x080ff13e <+18>:    ldr     r2, [r1, #0]
   0x080ff140 <+20>:    mov     r4, r1
   0x080ff142 <+22>:    mov     r5, r0
   0x080ff144 <+24>:    ldr     r3, [r2, #0]
   0x080ff146 <+26>:    subs    r1, r3, #1
   0x080ff148 <+28>:    str     r1, [r2, #0]
   0x080ff14a <+30>:    cbz     r3, 0x80ff17a <luaD_rawrunprotected+78>
   0x080ff14c <+32>:    ldr     r2, [r4, #0]
   0x080ff14e <+34>:    ldr     r3, [r2, #4]
   0x080ff150 <+36>:    adds    r1, r3, #1
   0x080ff152 <+38>:    str     r1, [r2, #4]
   0x080ff154 <+40>:    ldrb    r6, [r3, #0]
   0x080ff156 <+42>:    ldr     r3, [r4, #52]   ; 0x34
   0x080ff158 <+44>:    cbz     r3, 0x80ff184 <luaD_reallocstack+4>
   0x080ff15a <+46>:    mov     r0, r3
   0x080ff15c <+48>:    movs    r1, #116        ; 0x74
   0x080ff15e <+50>:    str     r3, [sp, #12]
   0x080ff160 <+52>:    bl      0x815723c <strncasecmp+56>
   0x080ff164 <+56>:    ldr     r3, [sp, #12]
   0x080ff166 <+58>:    cbnz    r0, 0x80ff184 <luaD_reallocstack+4>
   0x080ff168 <+60>:    ldr     r2, [pc, #60]   ; (0x80ff1a8 <luaD_reallocstack+40>)
   0x080ff16a <+62>:    mov     r0, r5
   0x080ff16c <+64>:    adds    r1, r2, #5
   0x080ff16e <+66>:    bl      0x8102484 <luaO_pushvfstring+76>
   0x080ff172 <+70>:    movs    r1, #3
   0x080ff174 <+72>:    mov     r0, r5
   0x080ff176 <+74>:    bl      0x80ff0e0 <f_parser+40>
   0x080ff17a <+78>:    ldr     r0, [r4, #0]
   0x080ff17c <+80>:    bl      0x81093d0 <__tcf_1(void*)+12>
End of assembler dump.

@tridge
Copy link
Contributor Author

tridge commented May 14, 2024

Sid found the ref:
image
we don't need to save s0 to s15

the register save must happen before the setjmp() call, which means
outside of the LUAI_TRY() macro. We also should be saving all 32
floating point registers
@tridge tridge force-pushed the pr-scripting-fault-fix branch from c831b98 to ea5418e Compare May 14, 2024 02:01
@tridge
Copy link
Contributor Author

tridge commented May 14, 2024

I test flew this today on a CubeOrange quadplane with 4 lua scripts

@tridge
Copy link
Contributor Author

tridge commented May 14, 2024

CI is being very slow, but this PR has passed CI in my personal repo:
https://github.com/tridge/ardupilot/actions?page=2&query=branch%3Apr-scripting-fault-fix
merging

@tridge tridge merged commit 188df13 into ArduPilot:master May 14, 2024
92 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 4.5.3-beta1
Status: 4.4.4-beta1
Development

Successfully merging this pull request may close these issues.

6 participants