Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripting: backport memory handling changes to 4.6 #28680

Merged
merged 21 commits into from
Nov 25, 2024

Conversation

tridge
Copy link
Contributor

@tridge tridge commented Nov 19, 2024

this is a backport of this PR:
#26517
I think this is important for 4.6 as users are more commonly relying on scripting now, and we have a vulnerability where an allocation in one heap cannot be expanded if that heap is exhausted even if there is plenty of memory in the other heap. The user has no way of knowing how the scripting allocation is spread across heaps, so they have no way of knowing if their script could fail with an out of memory condition in flight

make functions for lua heap allocation suitable for use in all
non-ChibiOS HALs
this allows for new heaps to be added at runtime for lua scripting if
you run out of memory while armed
we want the lua garbage collector to be used to re-use memory where
possible. This implements a suggestion from Thomas to avoid heap
expansion unless the last allocation failed
implement in AP_MultiHeap instead
this is a standalone (no-HAL based) implementation of MultiHeap
@IamPete1
Copy link
Member

I'm not sure we need to rush to backport this. We have had various issues with scripting allocation in the past, I would feel more comfortable if it were to be in master for a while.

I don't think I would even call this a vulnerability fix. This is really just a "get out of jail free" for not setting your heap size large enough. This is very unlikely that this would hit existing scripts that have been well tested (of course memory usage does increase a little with each binding we add, so it is theoretically possible that a script that was only just running on 4.5 does not run on 4.6 (but we have also saved a bunch of memory recently due to a binding rework)).

I think this is a feature that is more a benefit to developers and those writing scripts than users. Even then, I don't recall the last time I actually had to increase from the default heap size (admittedly using H7s almost exclusively).

@tpwrules
Copy link
Contributor

Note that this does also fix an issue which would cause an allocation expansion to fail if the first sub-heap is full no matter how large your heap is. I don't know how large that first sub-heap is typically on H7, but it's surely getting smaller. Backporting just that change wouldn't really be possible, all the rework is necessary to make that not an issue, then adding the auto expansion on top is pretty trivial.

I am a bit wary of back-porting something so big and critical as well but there is justification for those with important Lua scripts.

@tridge
Copy link
Contributor Author

tridge commented Nov 20, 2024

I don't think I would even call this a vulnerability fix. This is really just a "get out of jail free" for not setting your heap size large enough

it doesn't matter how large you set SCR_HEAP_SIZE. It can be set to 400k and fail at 100k due to the way the allocation works between heaps. When scripting calls change_size() on a pointer if the new size is not available within the current heap then the allocation will fail. That current heap can be quite small, and the chances of that is rising as other subsystems take more of the 512k AXI SRAM on the H7.
What this means is that safety critical scripts, which are getting more common, can fail unexpectedly.

@tridge
Copy link
Contributor Author

tridge commented Nov 20, 2024

@IamPete1 when would you want it to go in? It isn't clear from your comment

@rmackay9
Copy link
Contributor

Hi @tpwrules, how do you feel about putting an approved on this to add your vote to merging?

@IamPete1
Copy link
Member

@IamPete1 when would you want it to go in? It isn't clear from your comment

I think it should be in master for a month or two, maybe aim for 4.6.1. Just seems two soon to go straight into a release with only the testing that was done on the original PR.

@IamPete1
Copy link
Member

What this means is that safety critical scripts, which are getting more common, can fail unexpectedly.

Did we get any reports of this happening in the wild?

@tridge
Copy link
Contributor Author

tridge commented Nov 20, 2024

Did we get any reports of this happening in the wild?

we have no easy way of knowing for sure, as even with the scripting logging we can't really tell what the memory layout is. We have had some cases though where I suspect it is happening, for example on a recent partner onboarding call they reported they could not get the quicktune lua script to run on CubeOrangePlus, so they had (quite independently of the effort in the current PR) worked on porting quicktune to C++ themselves. That one really surprised me as CubeOrangePlus uses the primary H7 memory layout, which keeps the 512k AXI SRAM region away from being chewed up by other things like the main firmware data.
The boards where we are most likely to see this are ones which don't have a clean AXI SRAM when scripting starts. For example, the 6X (CUAV, Holybro, ARK etc) put the AXI SRAM as primary memory, which means by the time scripting starts a very large chunk of that is gone, leaving only much smaller chunks to allocate from. This stems from the 6X needing to use the "ALT_RAM_MAP" due to a limitation of the PX4 bootloader, which needs to be shipped on the boards due to trademark issues.
The situation is even worse on F7, and I have spent a lot of frustrating time trying to help users get the quicktune lua to run at all on a F7 quadplane with CAN enabled. The 512k of ram should be plenty, but it just always gets an out of memory error. I suspect that is caused by this issue, but I don't have any proof of that, just a lack of any other explanation.

@tridge
Copy link
Contributor Author

tridge commented Nov 20, 2024

I think it should be in master for a month or two, maybe aim for 4.6.1

that violates another one of our principles of trying to avoid major changes in the point releases. We do it sometimes, but we try to avoid it if we can as it violates user expectations of the release process.

@rmackay9
Copy link
Contributor

I think if we're eventually going to merge this to 4.6 then the sooner the better in order to get more beta testers using it.

Probably the best way to reduce the risk is to have devs pour over it and try and spot issues so I look forward to @tpwrules finishing his review. The second line of defense is the alpha and then beta testing and so maximising the time it spends in beta testing should add safety for our general community. Those testing the beta are knowingly taking on extra risks so they should be ready for any issues and are more likely to report back

@timtuxworth
Copy link
Contributor

We are still in Beta on 4.6 if I'm not wrong - now is the time to do it. IMO we should put this in for Beta 2, so it becomes part of the 4.6 release.

@rmackay9
Copy link
Contributor

@tpwrules, you didn't have a chance to look over this again did you? Sorry, I'm just not personally qualified to review it but as a release manager I would like to get it into 4.6 earlier rather than later. No real pressure though of course, we've all got things to do

Copy link
Contributor

@tpwrules tpwrules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, tried it on Cube Orange in sim on HW without problems.

Let's get this in and baking.

Copy link
Contributor

@rmackay9 rmackay9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that @tpwrules has approved, I'm OK as well

@tridge tridge merged commit 04b8d36 into ArduPilot:ArduPilot-4.6 Nov 25, 2024
100 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Pending
Development

Successfully merging this pull request may close these issues.

6 participants