Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large file/low memory mode #586

Open
sevaa opened this issue Dec 7, 2024 · 14 comments
Open

Large file/low memory mode #586

sevaa opened this issue Dec 7, 2024 · 14 comments

Comments

@sevaa
Copy link
Contributor

sevaa commented Dec 7, 2024

Won't be the first time users are calling for a low memory mode for handling DWARF in extra large binaries. See also: #540, #576, #479; #481 for a slurp alternative.

I think the low memory mode should be explicitly opt in - I don't like to impose size thresholds on other people, let the library consumers do it. The large-file flag will be off by default, would go as an optional parameter into ELFFile.get_dwarf_info() and into the DWARFInfo() constructor, and will be kept there for all the parsing logic to see.

Eliminating the mandatory caching of CUs and DIEs would be rather trivial (but might break some API, or make them crazily slow - I'll have to try and see). But we still have the problem of slurping whole sections into memory. We kind of discussed it in #557, and settled that memory mapping is the way to go, however, how would you prefer to deal with transforms? If there is compression and/or relocation in DWARF sections, would you rather:

  • throw an exception that the binary is not compatible with the low memory mode
  • do the transforms anyway, even if it will eat a bunch of memory
  • let the user choose by either:
    • introducing another optional, false by default boolean flag
    • making the large-file flag a three way one (True|False|None or an enum)

Also, how would you like to test the large-file mode?

OBTW, any reason why _read_dwarf_section() writes the section data (a bytes) into BytesIO rather than constructing the BytesIO around the data?

@eliben
Copy link
Owner

eliben commented Dec 16, 2024

It's tricky. I worry this will significantly complicate the library just for the sake of some edge use cases. I wonder if this should be in scope at all...

OBTW, any reason why _read_dwarf_section() writes the section data (a bytes) into BytesIO rather than constructing the BytesIO around the data?

Not that I can think of now. You can try changing it, if all tests pass we should be fine

@sevaa
Copy link
Contributor Author

sevaa commented Dec 17, 2024

Looks like you are not not too hot for this. I sympathize; breaking well tested, working code on purpose is a mental block for me also. Okay, let's try to scale back the impact.

How about a no-slurp (memory-map instead) mode but with the caches intact? This way, the only bits that have to change are _read_dwarf_section(), the debug section stream will be a custom, BytesIO-like object instead of BytesIO proper (similar to what we have in #481), and a couple of spots that rely on the result of read() being bytes will have to be patched. Another side effect will be that the binary will have to remain open while DWARFInfo is alive (because of the memory mapping); to do this right, we'd have to slap the context management methods on DWARFInfo.

@eliben
Copy link
Owner

eliben commented Dec 20, 2024

That could be worth trying, to see what savings we get with large DWARF data and what performance effect it will have in normal mode

@sevaa
Copy link
Contributor Author

sevaa commented Dec 20, 2024

The questions from the original post still stand.

@eliben
Copy link
Owner

eliben commented Dec 20, 2024

We can start with the simplest option "throw an exception that the binary is not compatible with the low memory mode", and work from there.

Also, how would you like to test the large-file mode?

We can start by just testing this mode on small files, no? We don't necessarily need very large files in the test suite

If there are additional questions, please point them out explicitly, I did my best to locate them :)

@sevaa
Copy link
Contributor Author

sevaa commented Dec 20, 2024

I'm wondering more along the lines of how does this fit into the autotest. Running the readelf/dwarfdump tests twice - once in no-slurp mode - would provide sufficient coverage, but would also be time consuming.

@eliben
Copy link
Owner

eliben commented Dec 20, 2024

I'm wondering more along the lines of how does this fit into the autotest. Running the readelf/dwarfdump tests twice - once in no-slurp mode - would provide sufficient coverage, but would also be time consuming.

I'm not too worried about this, TBH.

First of all, running everything twice may be not too bad because it's all perfectly parallelizable.
Second, we don't really have to add all the tests in no-slurp mode, since the actual delta exercised between the two is small, it may be enough to re-run a small subset of tests in this mode.

@sevaa
Copy link
Contributor Author

sevaa commented Dec 20, 2024

How would you prefer to identify the subset of test files to be run in no-slurp mode? I am not thrilled about introducing another folder with copies of the corpus binaries. Symlinks are not particularly source control friendly. A hard coded file list in a Python source smells too.

@eliben
Copy link
Owner

eliben commented Dec 21, 2024

We already have some mentioning of specific test file names in the full tests, so a hard-coded list would be a reasonable start.

I'm trying to be pragmatic here. Maybe we can design a whole higher-level system for our tests where we can define exceptions like this (special skips or logic, re-running certain test files with certain modes, etc.) but right now I don't think it's worth the complexity. We can reconsider later if we see the complexity of this logic is growing and is starting to become a problem. Since it's just the internal tests and not a public API, I feel confident about being able to tweak and refactor it as we go.

@sevaa
Copy link
Contributor Author

sevaa commented Dec 31, 2024

I have a working implementation now. API wise, it's an extra flag to get_dwarf_info, and an extra method to check for transforms. On the subset of the corpus that allows straightforward memory mapping of the DWARF sections - 10 binaries - the readelf test passes. From my analysis, all the return values from the API are identical to those in the regular mode.

I didn't do any comparisons of memory consumption on extra large binaries, since I don't have any. On small binaries, I suspect memory mapping is more or less equivalent to slurping anyway. The whole purpose of this endeavor was fixing OOM on large binaries, and we don't have any repro on that. Any idea how to approach this?

@eliben
Copy link
Owner

eliben commented Jan 2, 2025

Cool!

Reports of problems with large files pop up in issues once in a while; maybe worth checking with these folks if they can upload the giant binaries they were having issues with somewhere?

Another option is build something like LLVM from source; IIRC the binaries (e.g. clang) are pretty sizable, especially when built in debug mode.

@sevaa
Copy link
Contributor Author

sevaa commented Jan 5, 2025

Put together a sample binary with about 50 MB of DWARF data, ran with some profiling. After a full CU/DIE scroll, memory usage is dominated by the caches in the CU/DWARFInfo object - the process memory usage goes to 2.5 GB.

Looks like the memory consumption problem won't be solved by memory mapping alone.

That said, merely nuking DWARFInfo._cu_cache followed by a forced garbage collection frees most of it.

@eliben
Copy link
Owner

eliben commented Jan 6, 2025

Thanks for checking. Interesting results. 2.6 GiB of memory usage for caching of a 50 MiB file seems excessive; maybe it can be optimized?

It may be that we should just keep pyelftools for the smaller binaries and not worry about large mode too much; it's a lot of complexity and may need very deep changes to support. But maybe some work can be done to make the caches more efficient, thus increasing the size "horizon" pyelftools supports out of the box.

@sevaa
Copy link
Contributor Author

sevaa commented Jan 22, 2025

The memory usage mostly checks out. After scrolling through the binary, the total size of Python objects in the garbage collector is 1.9 GB. Out of those:

  • OrderedDict (attributes in DIE objects): - 841,102,576
  • AttributeValue - 607,439,096
  • ListContainer (DIE caches in CU objects) - 105,619,392
  • DIE - 57,624,144
  • bytes (values of string valued attributes) - 58,329,710

That accounts for 1.6 GB. Measured via sys.getsizeof().

Now, the next question I will be trying to answer, were CU._dielist/_diemap not there, how much of the API would need to be significantly reimplemented. The DWARFInfo._cu_cache/_cu_offsets_map are superficial as it is, they were added relatively recently, and CUs don't link to each other as much as DIEs do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants