-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract Rust specific strings from binaries #791 #836
Conversation
There are still several tasks pending, such as updating references from source code, comments, etc. Therefore, I'm marking this pull request as a draft for now. |
On first glance this appears to be identical or very close to the Go extract script. Can we just reuse that code or functions from it, so we don't repeat the same code twice? |
Agreed! Let's consolidate the code and reuse the functions from the Go extract script in a separate PR after merging this one. We can create a file something like |
The current algorithm is based on pull request #791 (comment), and it's progressing well. On a positive note, the algorithm appears to be splitting the strings quite effectively, as shown in the attached image. |
TODO: Refactor comments and type hints |
Co-authored-by: Moritz <[email protected]>
Or we can consider adding a new field "offset_end" to the StaticString class instead of managing both ref_data and Static_string separately like in latest commit? I'd like to know your opinion. |
* simplify rdata section extraction * add initial Rust extract tests * simplify rust string extraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I think I didn't properly format my changes (in the PR to the fork)
No problem at all! Thanks for the help! 😄 |
floss/language/rust/extract.py
Outdated
xrefs_lea = find_lea_xrefs(pe) | ||
xrefs_push = find_push_xrefs(pe) | ||
xrefs_mov = find_mov_xrefs(pe) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to do all of these all the time (for both architectures or specific to 32/64 bit?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Push and mov operations are only specific to the i386 architecture, and they have been handled in the latest commit. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please share some current coverage output for 32 and 64 bit samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, both are at approximately 90% coverage.
coverage32.txt
coverage64.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you repeat this for a few random samples and just share the stats here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On random samples from VT:-
result.txt
Average:- 94.5%
Low:- 88%
High:- 99%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, that looks pretty promising!
Co-authored-by: Moritz <[email protected]>
floss/language/utils.py
Outdated
code = section.get_data() | ||
|
||
if pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]: | ||
xrefs: Iterable[VA] = [] # no push instructions on amd64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be sure string references would never be pushed? probably, but can you add some context/references around this? the comment is a little misleading currently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've carefully checked amd64 binary files, and as of now, I haven't come across any mov or push instructions related to string references. However, considering the complexity of the Instruction set, I can't rule out the possibility of some rare cases escaping my notice. To ensure completeness, I've included this in the latest commit.
floss/language/utils.py
Outdated
code = section.get_data() | ||
|
||
if pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]: | ||
xrefs: Iterable[VA] = [] # no mov instructions on amd64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
floss/language/utils.py
Outdated
# It is 0 in case of Rust extraction because we want to extract all strings from binary file | ||
# while in case of Go extraction, we want to extract only large strings (len > 2800) from binary file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's only leave the respective comments where this is called, so please remove here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done :)
This pull request introduces a new feature to extract Rust specific strings from binaries.
Solves: #791