Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More i386 xrefs #899

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion floss/language/rust/extract.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
find_mov_xrefs,
find_push_xrefs,
get_rdata_section,
get_raw_xrefs_rdata_i386,
get_struct_string_candidates,
)

Expand Down Expand Up @@ -168,7 +169,8 @@ def get_string_blob_strings(pe: pefile.PE, min_length: int) -> Iterable[StaticSt
xrefs_lea = find_lea_xrefs(pe)
xrefs_push = find_push_xrefs(pe)
xrefs_mov = find_mov_xrefs(pe)
xrefs = itertools.chain(struct_string_addrs, xrefs_lea, xrefs_push, xrefs_mov)
xrefs_raw_rdata = get_raw_xrefs_rdata_i386(pe, rdata_section.get_data())
xrefs = itertools.chain(struct_string_addrs, xrefs_lea, xrefs_push, xrefs_mov, xrefs_raw_rdata)

elif pe.FILE_HEADER.Machine == pefile.MACHINE_TYPE["IMAGE_FILE_MACHINE_AMD64"]:
xrefs_lea = find_lea_xrefs(pe)
Expand Down
42 changes: 42 additions & 0 deletions floss/language/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -473,6 +473,48 @@ def get_struct_string_candidates(pe: pefile.PE) -> Iterable[StructString]:
# dozens of seconds or more (suspect many minutes).


def get_raw_xrefs_rdata_i386(pe: pefile.PE, buf: bytes) -> Iterable[VA]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a few tests for these strings

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_raw_xrefs_rdata_i386(pe: pefile.PE, buf: bytes) -> Iterable[VA]:
def get_raw_xrefs_rdata(pe: pefile.PE, buf: bytes) -> Iterable[VA]:

This routine doesn't seem limited to i386 so lets remove that from the function name. Otherwise, we should add a check to the PE architecture to restrict it to i386.

If the data are virtual addresses (rather than RVAs), we could additionally use relocation entries to find pointers and/or verify this data is in fact a pointer.

"""
scan for raw xrefs in .rdata section.
raw xrefs are 32-bit absolute addresses to strings in .rdata section (i386).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raw xrefs are 32-bit absolute addresses to strings in .rdata section (i386).

This routine doesn't validate that the destination is string-like data. Should it? If not, lets remove this part of the documentation.

They are not encoded as struct String instances.

example:
.rdata:004D6234 dd offset unk_4C85C9
.rdata:004D6238 dd offset unk_4C85C3
.rdata:004D623C dd offset unk_4C85BB
.rdata:004D6240 dd offset unk_4C85B3

From the disassembly, they are called as follows:
.text:00498E56 push ds:off_4D61E0[ecx*4]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does the length get stored?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not; the raw xrefs are stored without explicit length information. Lengths are not included in this context. Do we need them, or is there a specific reason for considering length storage?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I don't think we need them. but I wondered how Go is able to use the string data without an associated length.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I couldn't find any specific information for Go, but I did come across a similar approach in Rust. I've kept it in the utils.py file, considering the possibility that we might encounter a similar scenario in the future when exploring other languages.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still curious how this data can be used as a string without the length being stored somewhere.


The above are not struct String instances, but are references to strings in .rdata section.
They can be used to divide the string blobs into smaller chunks.
"""
format = "I"

if not buf:
return

low, high = get_image_range(pe)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears to be the range of the entire file, not the .rdata section. please update the logic or documentation to make things consistent.


# using array module as a high-performance way to access the data as fixed-sized words.
words = iter(array.array(format, buf))

last = next(words)
for current in words:
address = last
last = current

if address == 0x0:
continue

if not (low <= address < high):
continue

yield address
Arker123 marked this conversation as resolved.
Show resolved Hide resolved


def get_extract_stats(
pe: pefile, all_ss_strings: List[StaticString], lang_strings: List[StaticString], min_len: int, min_blob_len=0
) -> float:
Expand Down
18 changes: 18 additions & 0 deletions tests/test_language_extract_rust.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,3 +80,21 @@ def test_push(request, string, offset, encoding, rust_strings):
)
def test_mov_jmp(request, string, offset, encoding, rust_strings):
assert StaticString(string=string, offset=offset, encoding=encoding) in request.getfixturevalue(rust_strings)


@pytest.mark.parametrize(
"string,offset,encoding,rust_strings",
[
# .rdata:004BFA48 dd offset unk_4BA13A
# .rdata:004BFA4C dd offset unk_4BA100
pytest.param("Invalid branch target in DWARF expression", 0xB813A, StringEncoding.UTF8, "rust_strings32"),
pytest.param(
"Expected to find an FDE pointer, but found a CIE pointer instead.",
0xB8163,
StringEncoding.UTF8,
"rust_strings32",
),
],
)
def test_raw_xrefs(request, string, offset, encoding, rust_strings):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think its good that we have integration tests that show the whole system working together to find the strings. I think we should also have some tests for the specific routines that you added, so we can verify their behavior directly. something like test_get_raw_xrefs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, what should be the best approach to that? As there are not just test_get_raw_xrefs, but also others such as find_i386_push_xrefs, find_lea_xrefs, etc., should we test them too in a separate file or another PR? What are your thoughts?

assert StaticString(string=string, offset=offset, encoding=encoding) in request.getfixturevalue(rust_strings)