Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Rust specific strings from binaries #791 #836

Merged
merged 44 commits into from
Aug 23, 2023

Conversation

Arker123
Copy link
Collaborator

This pull request introduces a new feature to extract Rust specific strings from binaries.
Solves: #791

@Arker123
Copy link
Collaborator Author

There are still several tasks pending, such as updating references from source code, comments, etc. Therefore, I'm marking this pull request as a draft for now.

@mr-tz
Copy link
Collaborator

mr-tz commented Jul 18, 2023

On first glance this appears to be identical or very close to the Go extract script. Can we just reuse that code or functions from it, so we don't repeat the same code twice?

@Arker123
Copy link
Collaborator Author

On first glance this appears to be identical or very close to the Go extract script. Can we just reuse that code or functions from it, so we don't repeat the same code twice?

Agreed! Let's consolidate the code and reuse the functions from the Go extract script in a separate PR after merging this one. We can create a file something like language/utils.py to house the shared functions. Thanks for the suggestion!

@Arker123
Copy link
Collaborator Author

The current algorithm is based on pull request #791 (comment), and it's progressing well.
TODO: Extract references from the .text segment as well and improve code style and comments.

On a positive note, the algorithm appears to be splitting the strings quite effectively, as shown in the attached image.
image
image

floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
@Arker123
Copy link
Collaborator Author

Arker123 commented Aug 5, 2023

TODO: Refactor comments and type hints

@Arker123 Arker123 marked this pull request as ready for review August 5, 2023 09:00
@Arker123
Copy link
Collaborator Author

Arker123 commented Aug 5, 2023

The current outcome appears promising.
image

floss/language/rust/coverage.py Outdated Show resolved Hide resolved
floss/language/rust/coverage.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
.github/mypy/mypy.ini Outdated Show resolved Hide resolved
floss/language/rust/coverage.py Outdated Show resolved Hide resolved
floss/language/rust/coverage.py Outdated Show resolved Hide resolved
floss/language/rust/coverage.py Outdated Show resolved Hide resolved
floss/language/rust/coverage.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
@Arker123
Copy link
Collaborator Author

Or we can consider adding a new field "offset_end" to the StaticString class instead of managing both ref_data and Static_string separately like in latest commit? I'd like to know your opinion.

floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/language/rust/extract.py Outdated Show resolved Hide resolved
floss/results.py Outdated Show resolved Hide resolved
* simplify rdata section extraction

* add initial Rust extract tests

* simplify rust string extraction
Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I think I didn't properly format my changes (in the PR to the fork)

@Arker123
Copy link
Collaborator Author

No problem at all! Thanks for the help! 😄

Comment on lines 108 to 110
xrefs_lea = find_lea_xrefs(pe)
xrefs_push = find_push_xrefs(pe)
xrefs_mov = find_mov_xrefs(pe)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to do all of these all the time (for both architectures or specific to 32/64 bit?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Push and mov operations are only specific to the i386 architecture, and they have been handled in the latest commit. Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please share some current coverage output for 32 and 64 bit samples?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, both are at approximately 90% coverage.
coverage32.txt
coverage64.txt

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you repeat this for a few random samples and just share the stats here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On random samples from VT:-
result.txt
Average:- 94.5%
Low:- 88%
High:- 99%

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, that looks pretty promising!

code = section.get_data()

if pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]:
xrefs: Iterable[VA] = [] # no push instructions on amd64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be sure string references would never be pushed? probably, but can you add some context/references around this? the comment is a little misleading currently

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've carefully checked amd64 binary files, and as of now, I haven't come across any mov or push instructions related to string references. However, considering the complexity of the Instruction set, I can't rule out the possibility of some rare cases escaping my notice. To ensure completeness, I've included this in the latest commit.

code = section.get_data()

if pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]:
xrefs: Iterable[VA] = [] # no mov instructions on amd64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Comment on lines 467 to 468
# It is 0 in case of Rust extraction because we want to extract all strings from binary file
# while in case of Go extraction, we want to extract only large strings (len > 2800) from binary file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's only leave the respective comments where this is called, so please remove here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

@mr-tz mr-tz merged commit 567fc95 into mandiant:master Aug 23, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants