Extract Rust specific strings from binaries #791 #836

Merged: 44 commits, Aug 23, 2023
Commits
a75661e
Initial implementation of Rust specific strings
Arker123 Jul 17, 2023
e7c7595
New algorithm
Arker123 Jul 28, 2023
8a394bb
code style
Arker123 Jul 28, 2023
e9ca68e
Implemented separation of references from .text segment
Arker123 Aug 5, 2023
45978ea
Added rust coverage script
Arker123 Aug 5, 2023
4cbffaf
Introduced shared functions into language/utils.py
Arker123 Aug 5, 2023
f128d19
Refractored Go and Rust extraction files
Arker123 Aug 5, 2023
80dce99
Removed unused functions
Arker123 Aug 5, 2023
13c8920
Modularized code into separate functions
Arker123 Aug 5, 2023
27958fb
Merge remote-tracking branch 'origin/master' into rust-strings
Arker123 Aug 5, 2023
e074722
Refractored comments and type hints
Arker123 Aug 5, 2023
dbf7ad1
Tweaks
Arker123 Aug 5, 2023
bbd3d53
Update coverage Script
Arker123 Aug 11, 2023
4839543
Tweaks
Arker123 Aug 11, 2023
3ebd075
Minor fixes
Arker123 Aug 11, 2023
226486e
code style
Arker123 Aug 11, 2023
c46410e
Apply suggestions from code review
Arker123 Aug 14, 2023
8fabe4b
Tweaks
Arker123 Aug 14, 2023
8bd3711
Minor fixes
Arker123 Aug 14, 2023
74f3a91
code style
Arker123 Aug 14, 2023
2d5bf95
Update coverage script
Arker123 Aug 15, 2023
76d5f84
Update coverage script
Arker123 Aug 17, 2023
b02fc6a
Tweaks
Arker123 Aug 17, 2023
39e814c
Apply suggestions from code review
Arker123 Aug 19, 2023
02288d7
Tweaks
Arker123 Aug 19, 2023
797e5e3
Minor fixes
Arker123 Aug 19, 2023
657d497
Design Tweaks
Arker123 Aug 21, 2023
73afe8b
Refractored Design
Arker123 Aug 21, 2023
267862e
Improved Design
Arker123 Aug 22, 2023
9fe75c7
Further Improvised Design
Arker123 Aug 22, 2023
a67f9f2
Tweaks
Arker123 Aug 22, 2023
07a7558
Design Tweaks
Arker123 Aug 22, 2023
5a6fdb6
Updated Design Structure
Arker123 Aug 22, 2023
1650f8b
Cleanup
Arker123 Aug 22, 2023
6cdccb3
Rust updates (#7)
mr-tz Aug 22, 2023
62405fe
Added push and mov xrefs for i386 arch and test updates
Arker123 Aug 22, 2023
c98450d
Tweaks
Arker123 Aug 23, 2023
57fc902
Update floss/language/go/coverage.py
Arker123 Aug 23, 2023
ef27592
Add push and mov for amd64
Arker123 Aug 23, 2023
1909255
Merge branch 'rust-strings' of https://github.com/Arker123/flare-flos…
Arker123 Aug 23, 2023
6011ea7
Update Comments
Arker123 Aug 23, 2023
890ba55
Tweaks
Arker123 Aug 23, 2023
df20ec1
Comment Tweaks
Arker123 Aug 23, 2023
2fdb823
Tweaks
Arker123 Aug 23, 2023
11 changes: 7 additions & 4 deletions floss/language/rust/extract.py
Collaborator

can you please share some current coverage output for 32 and 64 bit samples?

Collaborator Author

Currently, both are at approximately 90% coverage.
coverage32.txt
coverage64.txt

Collaborator

can you repeat this for a few random samples and just share the stats here?

Collaborator Author

On random samples from VT:
result.txt
Average: 94.5%
Low: 88%
High: 99%

Collaborator

great, that looks pretty promising!

@@ -8,8 +8,9 @@

 import pefile
 import binary2strings as b2s

 from floss.results import StaticString, StringEncoding
-from floss.language.utils import find_lea_xrefs, get_struct_string_candidates
+from floss.language.utils import find_lea_xrefs, find_mov_xrefs, find_push_xrefs, get_struct_string_candidates

 logger = logging.getLogger(__name__)

@@ -68,7 +69,7 @@ def split_strings(static_strings: List[StaticString], address: int) -> None:
return


-def extract_rust_strings(sample: str, min_length: int) -> List[StaticString]:
+def extract_rust_strings(sample: pathlib.Path, min_length: int) -> List[StaticString]:
     """
     Extract Rust strings from a sample
     """
@@ -104,9 +105,11 @@ def get_string_blob_strings(pe: pefile.PE, min_length: int) -> Iterable[StaticSt
     static_strings = filter_and_transform_utf8_strings(strings, start_rdata)

     struct_string_addrs = map(lambda c: c.address, get_struct_string_candidates(pe))
-    xrefs = find_lea_xrefs(pe)
+    xrefs_lea = find_lea_xrefs(pe)
+    xrefs_push = find_push_xrefs(pe)
+    xrefs_mov = find_mov_xrefs(pe)
Collaborator

do we want to do all of these all the time (for both architectures or specific to 32/64 bit?)

Collaborator Author

Push and mov references are specific to the i386 architecture; this has been handled in the latest commit. Thanks


-    for addr in itertools.chain(struct_string_addrs, xrefs):
+    for addr in itertools.chain(struct_string_addrs, xrefs_lea, xrefs_push, xrefs_mov):
         address = addr - image_base - virtual_address + pointer_to_raw_data

         if not (start_rdata <= address < end_rdata):
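
The address arithmetic in this hunk maps a referenced virtual address to a file offset inside one section (the variable names suggest `.rdata`). A minimal sketch of that mapping follows. The function name `va_to_file_offset` and the three header values are hypothetical and are not read from the test sample; they are chosen only so that their combination gives the constant difference of 0x402000 between virtual address and file offset that the 32-bit test cases in this PR imply.

```python
def va_to_file_offset(va: int, image_base: int, virtual_address: int, pointer_to_raw_data: int) -> int:
    """Map a virtual address to a file offset within a single section.

    Mirrors `addr - image_base - virtual_address + pointer_to_raw_data` from the diff.
    """
    rva = va - image_base                      # address relative to the image base
    offset_in_section = rva - virtual_address  # position inside the section
    return offset_in_section + pointer_to_raw_data


# Hypothetical header values (assumption, NOT taken from the sample).
IMAGE_BASE = 0x400000
SECTION_VA = 0xB0000
SECTION_RAW = 0xAE000

push_target = 0x4B0850  # from "push offset unk_4B0850" in the tests
mov_target = 0x4BC2A0   # from "mov ecx, offset unk_4BC2A0" in the tests

print(hex(va_to_file_offset(push_target, IMAGE_BASE, SECTION_VA, SECTION_RAW)))
print(hex(va_to_file_offset(mov_target, IMAGE_BASE, SECTION_VA, SECTION_RAW)))
```

With these assumed values the two results equal the offsets asserted in `test_push` and `test_mov_jmp` (0xAE850 and 0xBA2A0).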
103 changes: 103 additions & 0 deletions floss/language/utils.py
@@ -160,6 +160,109 @@ def find_lea_xrefs(pe: pefile.PE) -> Iterable[VA]:
yield xref


+def find_i386_push_xrefs(buf: bytes) -> Iterable[VA]:
+    """
+    scan the given data found at the given base address
+    to find all the 32-bit PUSH instructions,
+    extracting the target virtual address.
+    """
+    push_insn_re = re.compile(
+        rb"""
+        (
+            \x68    # 68 aa aa 00 00    push 0xaaaa
+        )
+        (?P<address>....)
+        """,
+        re.DOTALL + re.VERBOSE,
+    )

+    for match in push_insn_re.finditer(buf):
+        address_bytes = match.group("address")
+        address = struct.unpack("<I", address_bytes)[0]

+        yield address


+def find_push_xrefs(pe: pefile.PE) -> Iterable[VA]:
+    """
+    scan the executable sections of the given PE file
+    for PUSH instructions that reference valid memory addresses,
+    yielding the virtual addresses.
+    """
+    low, high = get_image_range(pe)

+    for section in pe.sections:
+        if not section.IMAGE_SCN_MEM_EXECUTE:
+            continue

+        code = section.get_data()

+        if pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]:
+            xrefs: Iterable[VA] = []  # no push instructions on amd64
Collaborator

can we be sure string references would never be pushed? probably, but can you add some context/references around this? the comment is a little misleading currently

Collaborator Author

I've carefully checked amd64 binaries and so far haven't come across any mov or push instructions that reference strings. However, given the complexity of the instruction set, I can't rule out rare cases that escaped my notice. For completeness, I've included this in the latest commit.

+        elif pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_I386"]:
+            xrefs = find_i386_push_xrefs(code)
+        else:
+            raise ValueError("unhandled architecture")

+        for xref in xrefs:
+            if low <= xref < high:
+                yield xref
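
To illustrate what the PUSH scan yields, here is a small standalone sketch run on the instruction bytes quoted in the comments of `tests/test_language_extract_rust.py`. It reimplements the same idea with a compact pattern rather than calling the PR's functions. The names `PUSH_IMM32_RE` and `scan_push_imm32` and the image range are hypothetical; in the PR the range comes from `get_image_range(pe)`.

```python
import re
import struct
from typing import Iterator

# Same idea as the PR's pattern: opcode 0x68 followed by a 4-byte immediate.
PUSH_IMM32_RE = re.compile(rb"\x68(?P<address>.{4})", re.DOTALL)


def scan_push_imm32(buf: bytes) -> Iterator[int]:
    """Yield the little-endian imm32 operand of every `68 xx xx xx xx` match."""
    for match in PUSH_IMM32_RE.finditer(buf):
        (address,) = struct.unpack("<I", match.group("address"))
        yield address


# Bytes of the two PUSH instructions shown in the test comments.
code = bytes.fromhex("68 50 08 4B 00 68 5B 08 4B 00")

# Hypothetical image range (assumption, not taken from the sample).
low, high = 0x400000, 0x4C8000

# Keep only operands that fall inside the image, as find_push_xrefs does.
xrefs = [va for va in scan_push_imm32(code) if low <= va < high]
print([hex(va) for va in xrefs])
```

Because this is a byte-pattern match rather than disassembly, a 0x68 byte inside another instruction or inside data would also match; the range check is what narrows the candidates.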


+def find_i386_mov_xrefs(buf: bytes) -> Iterable[VA]:
+    """
+    scan the given data found at the given base address
+    to find all the 32-bit MOV instructions,
+    extracting the target virtual address.
+    """
+    mov_insn_re = re.compile(
+        rb"""
+        (
+              \xB9    # b9 aa aa 00 00    mov ecx,0xaaaa
+            | \xBB    # bb aa aa 00 00    mov ebx,0xaaaa
+            | \xBA    # ba aa aa 00 00    mov edx,0xaaaa
+            | \xB8    # b8 aa aa 00 00    mov eax,0xaaaa
+            | \xBE    # be aa aa 00 00    mov esi,0xaaaa
+            | \xBF    # bf aa aa 00 00    mov edi,0xaaaa
+        )
+        (?P<address>....)
+        """,
+        re.DOTALL + re.VERBOSE,
+    )

+    for match in mov_insn_re.finditer(buf):
+        address_bytes = match.group("address")
+        address = struct.unpack("<I", address_bytes)[0]

+        yield address


+def find_mov_xrefs(pe: pefile.PE) -> Iterable[VA]:
+    """
+    scan the executable sections of the given PE file
+    for MOV instructions that reference valid memory addresses,
+    yielding the virtual addresses.
+    """
+    low, high = get_image_range(pe)

+    for section in pe.sections:
+        if not section.IMAGE_SCN_MEM_EXECUTE:
+            continue

+        code = section.get_data()

+        if pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]:
+            xrefs: Iterable[VA] = []  # no mov instructions on amd64
Collaborator

same comment as above

+        elif pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_I386"]:
+            xrefs = find_i386_mov_xrefs(code)
+        else:
+            raise ValueError("unhandled architecture")

+        for xref in xrefs:
+            if low <= xref < high:
+                yield xref
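
The range check at the end of `find_mov_xrefs` matters because `mov r32, imm32` is also used for plain constants. The sketch below runs a compact equivalent of the MOV pattern over the three instructions quoted in the `test_mov_jmp` comments. The names `MOV_IMM32_RE` and `scan_mov_imm32` and the image range are hypothetical; the PR obtains the range from `get_image_range(pe)`.

```python
import re
import struct
from typing import List

# B8..BF encode `mov r32, imm32`; like the PR's pattern, this covers
# eax, ecx, edx, ebx, esi and edi and leaves out esp (BC) and ebp (BD).
MOV_IMM32_RE = re.compile(rb"[\xB8\xB9\xBA\xBB\xBE\xBF](?P<address>.{4})", re.DOTALL)


def scan_mov_imm32(buf: bytes) -> List[int]:
    """Return the little-endian imm32 operand of every matched MOV."""
    return [struct.unpack("<I", m.group("address"))[0] for m in MOV_IMM32_RE.finditer(buf)]


code = bytes.fromhex(
    "BA 1A 00 00 00"   # mov edx, 1Ah (a constant, not an address)
    " B9 A0 C2 4B 00"  # mov ecx, offset unk_4BC2A0
    " E9 93 F8 FF FF"  # jmp loc_46A8EC
)

candidates = scan_mov_imm32(code)

# Hypothetical image range (assumption, not taken from the sample).
low, high = 0x400000, 0x4C8000
xrefs = [va for va in candidates if low <= va < high]

print([hex(va) for va in candidates])
print([hex(va) for va in xrefs])
```

Here the scan produces two candidates, 0x1A and 0x4BC2A0, and only the second survives the range filter. The value 0x1A is 26, which is the length of "DW_AT_SUN_return_value_ptr", so it is presumably the string length being loaded alongside the pointer.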


 def get_max_section_size(pe: pefile.PE) -> int:
     """get the size of the largest section, as seen on disk."""
     return max(map(lambda s: s.SizeOfRawData, pe.sections))
30 changes: 28 additions & 2 deletions tests/test_language_extract_rust.py
@@ -1,6 +1,7 @@
import pathlib

import pytest

from floss.results import StaticString, StringEncoding
from floss.language.rust.extract import extract_rust_strings

@@ -42,8 +43,6 @@ def test_data_string_offset(request, string, offset, encoding, rust_strings):
 @pytest.mark.parametrize(
     "string,offset,encoding,rust_strings",
     [
-        # TODO
-        # pytest.param("hello world", 0xA03E1, StringEncoding.UTF8, "rust_strings32"),
         # .text:0000000140021155 4C 8D 05 2C DA 09 lea r8, aAccesserror ; "AccessError"
         # .text:000000014002115C 48 8D 74 24 20 lea rsi, [rsp+38h+var_18]
         # .text:0000000140021161 41 B9 0B 00 00 00 mov r9d, 11
@@ -53,3 +52,30 @@ def test_data_string_offset(request, string, offset, encoding, rust_strings):
 )
 def test_lea_mov(request, string, offset, encoding, rust_strings):
     assert StaticString(string=string, offset=offset, encoding=encoding) in request.getfixturevalue(rust_strings)


+@pytest.mark.parametrize(
+    "string,offset,encoding,rust_strings",
+    [
+        # .text:0041EF8C 68 50 08 4B 00 push offset unk_4B0850 ; "AccessError"
+        # .text:0041EFB8 68 5B 08 4B 00 push offset unk_4B085B "already destroyed"
+        pytest.param("AccessError", 0xAE850, StringEncoding.UTF8, "rust_strings32"),
+        pytest.param("already destroyed", 0xAE85B, StringEncoding.UTF8, "rust_strings32"),
+    ],
+)
+def test_push(request, string, offset, encoding, rust_strings):
+    assert StaticString(string=string, offset=offset, encoding=encoding) in request.getfixturevalue(rust_strings)


+@pytest.mark.parametrize(
+    "string,offset,encoding,rust_strings",
+    [
+        # .text:0046B04A BA 1A 00 00 00 mov edx, 1Ah ; jumptable 0046A19C case 8752
+        # .text:0046B04F B9 A0 C2 4B 00 mov ecx, offset unk_4BC2A0
+        # .text:0046B054 E9 93 F8 FF FF jmp loc_46A8EC ; jumptable 0046A1CA case 0
+        pytest.param("DW_AT_SUN_return_value_ptr", 0xBA2A0, StringEncoding.UTF8, "rust_strings32"),
+        pytest.param("DW_AT_SUN_c_vla", 0xBA2BA, StringEncoding.UTF8, "rust_strings32"),
+    ],
+)
+def test_mov_jmp(request, string, offset, encoding, rust_strings):
+    assert StaticString(string=string, offset=offset, encoding=encoding) in request.getfixturevalue(rust_strings)