Exercise `MFTINDX.py` with test image and extracted MFT #83

ajnelson-nist · 2023-10-19T15:02:26Z

This patch series has three effects:

Fixing a few minor typing issues.
Replacing usage of array.array as function parameters' types, instead using MutableSequence[int].
1. Changing several methods of yielding byte-strings to instead use bytes() as a converter. This is a side-effect of transitioning type-system guarantees from array.array to MutableSequence[int].
Adding a test-set for MFTINDX.py, generating results for two of its three input forms and four of its run modes.
1. A TODO is left for adding INDX snippets for analysis.

Each patch has a log message documenting its changes.

On trying to run `MFTINDX.py` against the sample disk image in this repository, an error was raised, with this (trimmed) call stack: ``` File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed File "...[snip].../indxparse/MFTINDX.py", line 284, in <module> buf: array.array[Any], TypeError: 'type' object is not subscriptable ``` (Using Python 3.9.) This type signature was an attempt to satisfy a review note from `mypy --strict`, that `array.array` is a generic and needed a further type specialization, e.g. as visible with `mypy --strict indxparse/BinaryParser.py` (which is known to also raise other unrelated issues at the moment): ``` error: Missing type parameters for generic type "array" [type-arg] ``` This type signature specialization was not runtime-tested before implementing, and turns out to be incorrect. A StackOverflow post suggested replacing `array.array` in type signatures with `collections.abc.MutableSequence`, which is also a generic but can be specialized. This patch replaces `array.array` (with and without further type specialization) in type signatures with `MutableSequence[int]`, because a single member of a `bytes` object and most other `bytes`-like objects is an integer. While this is not entirely true for `array.array`, all of the usage of `array.array` in this code base is of the `"B"` form, unsigned byte, so `MutableSequence[int]` is appropriate. Note that `array.array` is still used for variables' assignment; it is just not used in type signatures anymore. A side-effect of switching away from `array.array` in type signatures is that `MutableSequence` does not satisfy the `Buffer` protocol expected from `struct.unpack`. `.tobytes()` is also not guaranteed to be available. `mypy` highlights these issues without `--strict`. The `bytes()` built-in method is used to cast all values now only guaranteed to be `MutableSequence[int]`. The `Block.unpack_guid` method was modified to remove `ord`, which happens to not be type-compatible with `MutableSequence[int]` when using `map`. Disclaimer: Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose. References: * https://stackoverflow.com/a/67775675 Signed-off-by: Alex Nelson <[email protected]>

`types.MethodType` in Python 3 takes 2 arguments to instantiate, per line 447 of the current `types.pyi` in `typeshed`. I'm not sure when this switched from 3 arguments as written down to 2; using `git blame` to step back in the history of that line only goes back to `337abed05a` in 2015, and doesn't show a specific time where the "name" parameter this patch removes was expected. This patch removes the third parameter because it raises a runtime error. References: * https://github.com/python/typeshed/blob/21fcd8960f1dae5ec4563dd99860d0918efe5cff/stdlib/types.pyi#L447 Signed-off-by: Alex Nelson <[email protected]>

…r string Use of an empty character string with a "B" `array` raises a runtime error, with the stack trace ending with this: ``` TypeError: cannot use a str to initialize an array with typecode 'B' ``` Signed-off-by: Alex Nelson <[email protected]>

This is added in partial satisfaction of Issue 41. A follow-on patch will regenerate Make-managed files. References: * williballenthin#41 Signed-off-by: Alex Nelson <[email protected]>

Signed-off-by: Alex Nelson <[email protected]>

ajnelson-nist · 2023-10-19T15:08:51Z

Apologies for the noise - I tested offline with Python 3.9, but forgot to test 3.8. Version 1 of this PR turns out to work in only 3.9 and later.

williballenthin · 2023-10-19T15:09:49Z

indxparse/BinaryParser.py

@@ -229,7 +230,7 @@ def align(offset, alignment):
    return offset + (alignment - (offset % alignment))


-def dosdate(dosdate: array.array, dostime: array.array) -> datetime:
+def dosdate(dosdate: MutableSequence[int], dostime: MutableSequence[int]) -> datetime:


i would really like to work towards annotations that better differentiate between mutable vs immutable data. when i see a function that takes a MutableSequence then i expect it may change the data. for these purely functional parsing routines i think this is the wrong hint. most of these routines should take bytes (or Sequence[byte] or similar abstraction). maybe we should introduce casts in a few key places to make the hints work better.

thoughts?

i think collections.abc.Buffer (3.12+) and typing.ByteString (3.9-) are probably the appropriate hints.

https://docs.python.org/3/library/typing.html#typing.ByteString

I have been finding this to be a challenging type annotation exercise. I am happy to explore options. I'll be back to you soon, but probably not before your end of day.

Update: I've added two patches. The first fixes the Python 3.8 issue. If you'd like me to rework the series so CI passes on each commit, let me know and I'll file another PR instead of force-pushing to overwrite.

The second patch applies your suggestion to differentiate mutable vs. immutable data. It induced some non-trivial updates to the code; that patch's log message document what happened.

I also think the patch ended up suggesting some code-style revisions, e.g. reviewing whether the read_* functions still need the offset parameter with bytes() used before calling; and, reviewing read_byte. But I think these are style matters and don't necessarily need to be addressed now, so I'm marking this PR as ready for review & merge if you agree and are fine with the half-passing early commits.

williballenthin · 2023-10-19T15:14:48Z

indxparse/BinaryParser.py

@@ -335,13 +336,13 @@ def read_byte(
    - `OverrunBufferException`
    """
    try:
-        return struct.unpack_from("<B", buf, offset)[0]
+        return struct.unpack_from("<B", bytes(buf), offset)[0]


these calls to bytes(buf) make mypy happy but they aren't side effect free: i think they'll realize the mmap/array/whatever into a copy of all the data. for large byte arrays, this could be problematic.

How would you feel about a slice on buf before feeding into bytes()?

Inlining slicing ended up making the most sense to me.

The `typing` module offers a symbol that supports subscripting as a type signature without confusing the runtime; but, it was deprecated in Python 3.9. This patch was tested in Pythons 3.8 and 3.12. References: * https://docs.python.org/3.12/library/typing.html#typing.MutableSequence Signed-off-by: Alex Nelson <[email protected]>

This patch reviews usage of `MutableSequence[int]`, and restricts functions that should not modify their inputs (e.g. `BinaryParser.read_dword`) to use a read-only argument signature instead of the read-write `MutableSequence[int]`. An initial suggestion of `typing.ByteString` and/or `collections.abc.Buffer` led to review of their definitions for whether they are read-only or read-write types. The end result is that for "Read-write" function parameters, `MutableSequence[int]` is still used, and for "Read-only" parameters, `bytes` is used. The foundations for this decision are that the `bytes` class is immutable, `bytearray` is mutable, and the suggested types `ByteString` and `Buffer` end up supporting `bytearray`. Prior review of `typeshed`'s documentation on "readable buffers" and "writable buffers" indicate there is no "read-only Buffer" generic type currently in Python. `collections.abc.Buffer` is the union of `bytes`, `bytearray`, and `memoryview`. Because the union includes `bytes` and `bytearray`, `Buffer` can be mutable. (This definition was found in `typing_extensions`; source block cited in references.) `memoryview` may reference an object that supports the buffer protocol. This includes `bytes` and `bytearray`. (This mutability/immutability detail is documented in both 3.8 and 3.12; 3.8 is cited below.) `typing.ByteString` is defined as the same union as `collections.abc.Buffer`, so it can be mutable. Given all the above, this patch implements "read-only" buffer restrictions on function parameters with the `bytes` type. A consequence of using types to restrict read-only vs. read-write is that several call paths to the read-only functions start from read-write buffers. This patch handles this by using `bytes` to cast slices of the mutable buffers, done to satisfy type restrictions and to avoid casting larger buffers than necessary. No effects were observed on Make-managed files. References: * https://docs.python.org/3/library/stdtypes.html#bytearray-objects * https://docs.python.org/3/library/stdtypes.html#bytes-objects * https://docs.python.org/3.8/library/stdtypes.html#memoryview * https://github.com/python/typeshed/blob/21fcd8960f1dae5ec4563dd99860d0918efe5cff/stdlib/_typeshed/__init__.pyi#L239-L247 * https://github.com/python/typing_extensions/blob/04f98954ba63a5e8a09c12171be24785298276b6/src/typing_extensions.py#L2522-L2548 Requested-by: Willi Ballenthin <[email protected]> Signed-off-by: Alex Nelson <[email protected]>

ajnelson-nist · 2023-10-25T14:00:23Z

CI continues to pass after the merge of #84 .

williballenthin · 2023-10-25T14:43:05Z

indxparse/BinaryParser.py

@@ -335,13 +341,13 @@ def read_byte(
    - `OverrunBufferException`
    """
    try:
-        return struct.unpack_from("<B", buf, offset)[0]
+        return struct.unpack_from("<B", bytes(buf), offset)[0]


i'm still concerned about these calls to bytes since they may make copies of a large buffer protocol object (like a mmap) unnecessarily.

i understand these calls are to satisfy mypy. could we replace them with typing.cast(bytes)?

I agree, cast() seems to be the right call here. I've pushed two patches to adopt that strategy.

No effects were observed on Make-managed files. Signed-off-by: Alex Nelson <[email protected]>

This patch avoids potential copies of large buffers from feeding through `bytes()`, using `cast(bytes, _)` instead to satisfy type review. The slicing added in `fbe65927` has been reverted, because it had been added to focus the data being fed into `bytes()` calls. No effects were observed on Make-managed files. References: * https://docs.python.org/3/library/typing.html#typing.cast Requested-by: Willi Ballenthin <[email protected]> Signed-off-by: Alex Nelson <[email protected]>

williballenthin

ready to merge?

lgtm

williballenthin · 2023-10-25T20:54:27Z

indxparse/BinaryParser.py

@@ -675,7 +681,7 @@ def unpack_byte(self, offset: int) -> int:
        Throws:
        - `OverrunBufferException`
        """
-        return read_byte(self._buf, self._offset + offset)
+        return read_byte(cast(bytes, self._buf), self._offset + offset)


looks great!

williballenthin · 2023-10-25T20:55:20Z

indxparse/MFT.py

@@ -69,7 +74,7 @@ def __str__(self):
 class FixupBlock(Block):
    def __init__(
        self,
-        buf: array.array,
+        buf: MutableSequence[int],


this is an instance where the data needs to be mutable.

ajnelson-nist · 2023-10-25T21:22:34Z

Yes, I think this is ready to merge.

ajnelson-nist · 2023-11-01T11:53:20Z

Thank you!

ajnelson-nist added 5 commits October 18, 2023 18:07

Exercise MFTINDX.py

08794ea

This is added in partial satisfaction of Issue 41. A follow-on patch will regenerate Make-managed files. References: * williballenthin#41 Signed-off-by: Alex Nelson <[email protected]>

Regenerate Make-managed files

b263c97

Signed-off-by: Alex Nelson <[email protected]>

williballenthin reviewed Oct 19, 2023

View reviewed changes

ajnelson-nist added 2 commits October 19, 2023 14:49

ajnelson-nist marked this pull request as ready for review October 19, 2023 19:40

ajnelson-nist mentioned this pull request Oct 25, 2023

Expand Python version testing to include 3.12 #84

Merged

Merge branch 'master' into exercise_mftindx_py

2205d8e

williballenthin reviewed Oct 25, 2023

View reviewed changes

ajnelson-nist added 2 commits October 25, 2023 14:25

Remove bytes() calls that are redundant with type signatures

df0d122

No effects were observed on Make-managed files. Signed-off-by: Alex Nelson <[email protected]>

williballenthin approved these changes Oct 25, 2023

View reviewed changes

williballenthin merged commit 038e8ec into williballenthin:master Nov 1, 2023
2 checks passed

ajnelson-nist deleted the exercise_mftindx_py branch November 1, 2023 11:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exercise `MFTINDX.py` with test image and extracted MFT #83

Exercise `MFTINDX.py` with test image and extracted MFT #83

ajnelson-nist commented Oct 19, 2023

ajnelson-nist commented Oct 19, 2023

williballenthin Oct 19, 2023

williballenthin Oct 19, 2023

ajnelson-nist Oct 19, 2023

ajnelson-nist Oct 19, 2023

williballenthin Oct 19, 2023 •

edited

Loading

ajnelson-nist Oct 19, 2023

ajnelson-nist Oct 19, 2023

ajnelson-nist commented Oct 25, 2023

williballenthin Oct 25, 2023

ajnelson-nist Oct 25, 2023

williballenthin left a comment

williballenthin Oct 25, 2023

williballenthin Oct 25, 2023

ajnelson-nist Oct 25, 2023

ajnelson-nist commented Oct 25, 2023

ajnelson-nist commented Nov 1, 2023

Exercise MFTINDX.py with test image and extracted MFT #83

Exercise MFTINDX.py with test image and extracted MFT #83

Conversation

ajnelson-nist commented Oct 19, 2023

ajnelson-nist commented Oct 19, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

williballenthin Oct 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajnelson-nist commented Oct 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

williballenthin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajnelson-nist commented Oct 25, 2023

ajnelson-nist commented Nov 1, 2023

Exercise `MFTINDX.py` with test image and extracted MFT #83

Exercise `MFTINDX.py` with test image and extracted MFT #83

williballenthin Oct 19, 2023 •

edited

Loading