Scripting: backport memory handling changes to 4.6 #28680

tridge · 2024-11-19T20:35:36Z

this is a backport of this PR:
#26517
I think this is important for 4.6, as users are relying on scripting more commonly now, and we have a vulnerability where an allocation in one heap cannot be expanded if that heap is exhausted, even if there is plenty of memory in the other heap. The user has no way of knowing how the scripting allocation is spread across heaps, so they have no way of knowing whether their script could fail with an out-of-memory condition in flight.

make functions for lua heap allocation suitable for use in all non-ChibiOS HALs

this allows for new heaps to be added at runtime for lua scripting if you run out of memory while armed

we want the lua garbage collector to be used to re-use memory where possible. This implements a suggestion from Thomas to avoid heap expansion unless the last allocation failed

implement in AP_MultiHeap instead

this is a standalone (no-HAL based) implementation of MultiHeap
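The "avoid heap expansion unless the last allocation failed" rule described above can be illustrated with a minimal sketch. This is not the AP_MultiHeap code: the class `ToyExpandingHeap`, its methods, and the `armed` argument are hypothetical names made up for the illustration, and the heap is modelled as byte counters only. The idea shown is that the first failed allocation just reports failure, giving the Lua garbage collector a chance to free memory, and only a second consecutive failure while armed grows the pool.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a scripting heap that tracks byte counts only.
// All names here are hypothetical; this is not the AP_MultiHeap API.
class ToyExpandingHeap {
public:
    ToyExpandingHeap(size_t initial_bytes, size_t expand_step_bytes)
        : capacity(initial_bytes),
          step(expand_step_bytes > 0 ? expand_step_bytes : 1) {}

    // Returns true if the request was satisfied.
    bool allocate(size_t size, bool armed) {
        if (size <= capacity - used) {
            used += size;
            last_failed = false;
            return true;
        }
        if (!last_failed || !armed) {
            // First failure (or disarmed): do not expand. The caller is
            // expected to run the garbage collector and retry.
            last_failed = true;
            return false;
        }
        // Second consecutive failure while armed: GC did not help, so
        // grow by enough whole steps to cover the shortfall.
        const size_t shortfall = size - (capacity - used);
        const size_t steps = (shortfall + step - 1) / step;
        capacity += steps * step;
        expansions++;
        used += size;
        last_failed = false;
        return true;
    }

    // Models memory returned by the garbage collector.
    void release(size_t size) { used -= (size <= used) ? size : used; }

    size_t capacity_bytes() const { return capacity; }
    size_t expansion_count() const { return expansions; }

private:
    size_t capacity;
    size_t step;
    size_t used = 0;
    size_t expansions = 0;
    bool last_failed = false;
};
```

With a 100 byte pool and a 64 byte step, a request that fails once and then succeeds after `release()` never expands the pool; a request that fails twice in a row while armed adds one step. Whether the real implementation sizes its expansion this way is an assumption of the sketch.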

IamPete1 · 2024-11-19T21:51:08Z

I'm not sure we need to rush to backport this. We have had various issues with scripting allocation in the past, I would feel more comfortable if it were to be in master for a while.

I don't think I would even call this a vulnerability fix. This is really just a "get out of jail free" for not setting your heap size large enough. It is very unlikely that this would hit existing scripts that have been well tested. Of course memory usage does increase a little with each binding we add, so it is theoretically possible that a script that was only just running on 4.5 does not run on 4.6, but we have also saved a bunch of memory recently due to a binding rework.

I think this is a feature that is more a benefit to developers and those writing scripts than users. Even then, I don't recall the last time I actually had to increase from the default heap size (admittedly using H7s almost exclusively).

tpwrules · 2024-11-20T03:22:29Z

Note that this does also fix an issue which would cause an allocation expansion to fail if the first sub-heap is full, no matter how large your heap is. I don't know how large that first sub-heap typically is on H7, but it's surely getting smaller. Backporting just that change wouldn't really be possible; all the rework is necessary to make that not an issue, and adding the auto expansion on top is then pretty trivial.

I am a bit wary of back-porting something so big and critical as well but there is justification for those with important Lua scripts.

tridge · 2024-11-20T04:57:46Z

> I don't think I would even call this a vulnerability fix. This is really just a "get out of jail free" for not setting your heap size large enough

it doesn't matter how large you set SCR_HEAP_SIZE. It can be set to 400k and fail at 100k due to the way the allocation works between heaps. When scripting calls change_size() on a pointer, the allocation will fail if the new size is not available within the current heap. That current heap can be quite small, and the chances of that are rising as other subsystems take more of the 512k AXI SRAM on the H7.
What this means is that safety critical scripts, which are getting more common, can fail unexpectedly.
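To make the failure mode concrete, here is a minimal sketch under stated assumptions. The types `SubHeap`, `Block` and `ToyMultiHeap` and both `change_size_*` methods are hypothetical and model sub-heaps as byte counters only (no fragmentation, no real memory). It contrasts a resize that must fit inside the owning sub-heap with one that falls back to allocating in any sub-heap and moving the block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: a sub-heap is just a capacity and a usage count.
struct SubHeap {
    size_t capacity;
    size_t used;
    size_t free_bytes() const { return capacity - used; }
};

// A block remembers which sub-heap owns it (-1 means invalid).
struct Block {
    int heap;
    size_t size;
};

class ToyMultiHeap {
public:
    explicit ToyMultiHeap(const std::vector<size_t> &capacities) {
        for (size_t c : capacities) {
            heaps.push_back({c, 0});
        }
    }

    // First-fit allocation across the sub-heaps.
    Block allocate(size_t size) {
        for (size_t i = 0; i < heaps.size(); i++) {
            if (heaps[i].free_bytes() >= size) {
                heaps[i].used += size;
                return {int(i), size};
            }
        }
        return {-1, 0};
    }

    void deallocate(Block &b) {
        if (b.heap >= 0) {
            heaps[b.heap].used -= b.size;
        }
        b = {-1, 0};
    }

    // Resize that only succeeds if the owning sub-heap has room.
    bool change_size_same_heap_only(Block &b, size_t new_size) {
        if (b.heap < 0) {
            return false;
        }
        SubHeap &h = heaps[b.heap];
        if (new_size <= b.size || h.free_bytes() >= new_size - b.size) {
            h.used = h.used - b.size + new_size;
            b.size = new_size;
            return true;
        }
        return false;
    }

    // Resize that falls back to any sub-heap with enough free space.
    bool change_size_any_heap(Block &b, size_t new_size) {
        if (change_size_same_heap_only(b, new_size)) {
            return true;
        }
        Block moved = allocate(new_size);
        if (moved.heap < 0) {
            return false;
        }
        // A real allocator would copy the old contents across here.
        deallocate(b);
        b = moved;
        return true;
    }

    size_t total_free() const {
        size_t total = 0;
        for (const SubHeap &h : heaps) {
            total += h.free_bytes();
        }
        return total;
    }

private:
    std::vector<SubHeap> heaps;
};
```

With sub-heaps of 100 and 400 bytes, an 80 byte block lands in the first sub-heap. Growing it to 150 bytes fails with the same-heap-only resize even though 420 bytes are free in total, while the any-heap variant moves it to the second sub-heap.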

tridge · 2024-11-20T07:10:12Z

@IamPete1 when would you want it to go in? It isn't clear from your comment

rmackay9 · 2024-11-20T07:19:15Z

Hi @tpwrules, how do you feel about putting an approved on this to add your vote to merging?

IamPete1 · 2024-11-20T10:04:26Z

> @IamPete1 when would you want it to go in? It isn't clear from your comment

I think it should be in master for a month or two, maybe aim for 4.6.1. Just seems too soon to go straight into a release with only the testing that was done on the original PR.

IamPete1 · 2024-11-20T16:34:08Z

> What this means is that safety critical scripts, which are getting more common, can fail unexpectedly.

Did we get any reports of this happening in the wild?

tridge · 2024-11-20T21:58:31Z

> Did we get any reports of this happening in the wild?

we have no easy way of knowing for sure, as even with the scripting logging we can't really tell what the memory layout is. We have had some cases though where I suspect it is happening. For example, on a recent partner onboarding call they reported they could not get the quicktune lua script to run on CubeOrangePlus, so they had (quite independently of the effort in the current PR) worked on porting quicktune to C++ themselves. That one really surprised me as CubeOrangePlus uses the primary H7 memory layout, which keeps the 512k AXI SRAM region away from being chewed up by other things like the main firmware data.
The boards where we are most likely to see this are ones which don't have a clean AXI SRAM when scripting starts. For example, the 6X (CUAV, Holybro, ARK etc) put the AXI SRAM as primary memory, which means by the time scripting starts a very large chunk of that is gone, leaving only much smaller chunks to allocate from. This stems from the 6X needing to use the "ALT_RAM_MAP" due to a limitation of the PX4 bootloader, which needs to be shipped on the boards due to trademark issues.
The situation is even worse on F7, and I have spent a lot of frustrating time trying to help users get the quicktune lua to run at all on an F7 quadplane with CAN enabled. The 512k of RAM should be plenty, but it just always gets an out of memory error. I suspect that is caused by this issue, but I don't have any proof of that, just a lack of any other explanation.

tridge · 2024-11-20T22:28:22Z

> I think it should be in master for a month or two, maybe aim for 4.6.1

that violates another one of our principles of trying to avoid major changes in the point releases. We do it sometimes, but we try to avoid it if we can as it violates user expectations of the release process.

rmackay9 · 2024-11-21T00:22:22Z

I think if we're eventually going to merge this to 4.6 then the sooner the better in order to get more beta testers using it.

Probably the best way to reduce the risk is to have devs pore over it and try to spot issues, so I look forward to @tpwrules finishing his review. The second line of defense is the alpha and then beta testing, so maximising the time it spends in beta testing should add safety for our general community. Those testing the beta are knowingly taking on extra risks, so they should be ready for any issues and are more likely to report back.

timtuxworth · 2024-11-21T00:41:58Z

We are still in Beta on 4.6 if I'm not wrong - now is the time to do it. IMO we should put this in for Beta 2, so it becomes part of the 4.6 release.

rmackay9 · 2024-11-24T23:58:07Z

@tpwrules, you didn't have a chance to look over this again did you? Sorry, I'm just not personally qualified to review it but as a release manager I would like to get it into 4.6 earlier rather than later. No real pressure though of course, we've all got things to do

tpwrules

Looks good, tried it on Cube Orange in sim on HW without problems.

Let's get this in and baking.

rmackay9

Now that @tpwrules has approved, I'm OK as well

tridge added 21 commits November 20, 2024 07:33

AP_Scripting: cleanup debug option handling

b585ba2

AP_Scripting: added ability to expand heap at runtime if armed

ba62673

AP_HAL: rework heap allocation functions

7f1fcd4

make functions for lua heap allocation suitable for use in all non-ChibiOS HALs

AP_Common: allow expansion of heaps in MultiHeap

d5309fd

this allows for new heaps to be added at runtime for lua scripting if you run out of memory while armed

AP_HAL_ChibiOS: implement new scripting heap APIs

2cc10fe

AP_HAL_ESP32: implement new scripting heap APIs

77a1d53

AP_HAL_Linux: implement new scripting heap APIs

fd8bbff

AP_HAL_QURT: implement new scripting heap APIs

f36fff3

AP_HAL_SITL: implement new scripting heap APIs

3e34d6b

AP_Scripting: added warning on heap expansion

8454f59

AP_Common: added last_failed for leveraging lua GC

17d67fe

we want the lua garbage collector to be used to re-use memory where possible. This implements a suggestion from Thomas to avoid heap expansion unless the last allocation failed

AP_HAL: removed heap APIs

5aadef9

implement in AP_MultiHeap instead

AP_HAL_ChibiOS: removed heap APIs

cf27f8a

AP_HAL_ESP32: removed heap APIs

e01dec5

AP_HAL_SITL: removed heap APIs

3d5b0c4

AP_Common: removed old MultiHeap code

4c41036

AP_MultiHeap: added library

d676497

this is a standalone (no-HAL based) implementation of MultiHeap

AP_Scripting: use AP_MultiHeap

9eeb818

waf: added AP_MultiHeap

65ddeb0

AP_Periph: fixed build with scripting

dc399b2

AP_MultiHeap: added simple unit test

d7d53c2

tridge added the Scripting label Nov 19, 2024

tridge added the DevCallEU label Nov 20, 2024

peterbarker removed the DevCallEU label Nov 20, 2024

tridge added the DevCallTopic label Nov 24, 2024

tpwrules approved these changes Nov 25, 2024

rmackay9 approved these changes Nov 25, 2024

tridge merged commit 04b8d36 into ArduPilot:ArduPilot-4.6 Nov 25, 2024
100 checks passed

IamPete1 removed the DevCallTopic label Nov 25, 2024
