-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exercise MFTINDX.py
with test image and extracted MFT
#83
Exercise MFTINDX.py
with test image and extracted MFT
#83
Conversation
On trying to run `MFTINDX.py` against the sample disk image in this repository, an error was raised, with this (trimmed) call stack: ``` File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed File "...[snip].../indxparse/MFTINDX.py", line 284, in <module> buf: array.array[Any], TypeError: 'type' object is not subscriptable ``` (Using Python 3.9.) This type signature was an attempt to satisfy a review note from `mypy --strict`, that `array.array` is a generic and needed a further type specialization, e.g. as visible with `mypy --strict indxparse/BinaryParser.py` (which is known to also raise other unrelated issues at the moment): ``` error: Missing type parameters for generic type "array" [type-arg] ``` This type signature specialization was not runtime-tested before implementing, and turns out to be incorrect. A StackOverflow post suggested replacing `array.array` in type signatures with `collections.abc.MutableSequence`, which is also a generic but can be specialized. This patch replaces `array.array` (with and without further type specialization) in type signatures with `MutableSequence[int]`, because a single member of a `bytes` object and most other `bytes`-like objects is an integer. While this is not entirely true for `array.array`, all of the usage of `array.array` in this code base is of the `"B"` form, unsigned byte, so `MutableSequence[int]` is appropriate. Note that `array.array` is still used for variables' assignment; it is just not used in type signatures anymore. A side-effect of switching away from `array.array` in type signatures is that `MutableSequence` does not satisfy the `Buffer` protocol expected from `struct.unpack`. `.tobytes()` is also not guaranteed to be available. `mypy` highlights these issues without `--strict`. The `bytes()` built-in method is used to cast all values now only guaranteed to be `MutableSequence[int]`. The `Block.unpack_guid` method was modified to remove `ord`, which happens to not be type-compatible with `MutableSequence[int]` when using `map`. Disclaimer: Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose. References: * https://stackoverflow.com/a/67775675 Signed-off-by: Alex Nelson <[email protected]>
`types.MethodType` in Python 3 takes 2 arguments to instantiate, per line 447 of the current `types.pyi` in `typeshed`. I'm not sure when this switched from 3 arguments as written down to 2; using `git blame` to step back in the history of that line only goes back to `337abed05a` in 2015, and doesn't show a specific time where the "name" parameter this patch removes was expected. This patch removes the third parameter because it raises a runtime error. References: * https://github.com/python/typeshed/blob/21fcd8960f1dae5ec4563dd99860d0918efe5cff/stdlib/types.pyi#L447 Signed-off-by: Alex Nelson <[email protected]>
…r string Use of an empty character string with a "B" `array` raises a runtime error, with the stack trace ending with this: ``` TypeError: cannot use a str to initialize an array with typecode 'B' ``` Signed-off-by: Alex Nelson <[email protected]>
This is added in partial satisfaction of Issue 41. A follow-on patch will regenerate Make-managed files. References: * williballenthin#41 Signed-off-by: Alex Nelson <[email protected]>
Signed-off-by: Alex Nelson <[email protected]>
Apologies for the noise - I tested offline with Python 3.9, but forgot to test 3.8. Version 1 of this PR turns out to work in only 3.9 and later. |
indxparse/BinaryParser.py
Outdated
@@ -229,7 +230,7 @@ def align(offset, alignment): | |||
return offset + (alignment - (offset % alignment)) | |||
|
|||
|
|||
def dosdate(dosdate: array.array, dostime: array.array) -> datetime: | |||
def dosdate(dosdate: MutableSequence[int], dostime: MutableSequence[int]) -> datetime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would really like to work towards annotations that better differentiate between mutable vs immutable data. when i see a function that takes a MutableSequence
then i expect it may change the data. for these purely functional parsing routines i think this is the wrong hint. most of these routines should take bytes
(or Sequence[byte]
or similar abstraction). maybe we should introduce casts in a few key places to make the hints work better.
thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think collections.abc.Buffer
(3.12+) and typing.ByteString
(3.9-) are probably the appropriate hints.
https://docs.python.org/3/library/typing.html#typing.ByteString
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been finding this to be a challenging type annotation exercise. I am happy to explore options. I'll be back to you soon, but probably not before your end of day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I've added two patches. The first fixes the Python 3.8 issue. If you'd like me to rework the series so CI passes on each commit, let me know and I'll file another PR instead of force-pushing to overwrite.
The second patch applies your suggestion to differentiate mutable vs. immutable data. It induced some non-trivial updates to the code; that patch's log message document what happened.
I also think the patch ended up suggesting some code-style revisions, e.g. reviewing whether the read_*
functions still need the offset
parameter with bytes()
used before calling; and, reviewing read_byte
. But I think these are style matters and don't necessarily need to be addressed now, so I'm marking this PR as ready for review & merge if you agree and are fine with the half-passing early commits.
indxparse/BinaryParser.py
Outdated
@@ -335,13 +336,13 @@ def read_byte( | |||
- `OverrunBufferException` | |||
""" | |||
try: | |||
return struct.unpack_from("<B", buf, offset)[0] | |||
return struct.unpack_from("<B", bytes(buf), offset)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these calls to bytes(buf)
make mypy happy but they aren't side effect free: i think they'll realize the mmap/array/whatever into a copy of all the data. for large byte arrays, this could be problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about a slice on buf
before feeding into bytes()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inlining slicing ended up making the most sense to me.
The `typing` module offers a symbol that supports subscripting as a type signature without confusing the runtime; but, it was deprecated in Python 3.9. This patch was tested in Pythons 3.8 and 3.12. References: * https://docs.python.org/3.12/library/typing.html#typing.MutableSequence Signed-off-by: Alex Nelson <[email protected]>
This patch reviews usage of `MutableSequence[int]`, and restricts functions that should not modify their inputs (e.g. `BinaryParser.read_dword`) to use a read-only argument signature instead of the read-write `MutableSequence[int]`. An initial suggestion of `typing.ByteString` and/or `collections.abc.Buffer` led to review of their definitions for whether they are read-only or read-write types. The end result is that for "Read-write" function parameters, `MutableSequence[int]` is still used, and for "Read-only" parameters, `bytes` is used. The foundations for this decision are that the `bytes` class is immutable, `bytearray` is mutable, and the suggested types `ByteString` and `Buffer` end up supporting `bytearray`. Prior review of `typeshed`'s documentation on "readable buffers" and "writable buffers" indicate there is no "read-only Buffer" generic type currently in Python. `collections.abc.Buffer` is the union of `bytes`, `bytearray`, and `memoryview`. Because the union includes `bytes` and `bytearray`, `Buffer` can be mutable. (This definition was found in `typing_extensions`; source block cited in references.) `memoryview` may reference an object that supports the buffer protocol. This includes `bytes` and `bytearray`. (This mutability/immutability detail is documented in both 3.8 and 3.12; 3.8 is cited below.) `typing.ByteString` is defined as the same union as `collections.abc.Buffer`, so it can be mutable. Given all the above, this patch implements "read-only" buffer restrictions on function parameters with the `bytes` type. A consequence of using types to restrict read-only vs. read-write is that several call paths to the read-only functions start from read-write buffers. This patch handles this by using `bytes` to cast slices of the mutable buffers, done to satisfy type restrictions and to avoid casting larger buffers than necessary. No effects were observed on Make-managed files. References: * https://docs.python.org/3/library/stdtypes.html#bytearray-objects * https://docs.python.org/3/library/stdtypes.html#bytes-objects * https://docs.python.org/3.8/library/stdtypes.html#memoryview * https://github.com/python/typeshed/blob/21fcd8960f1dae5ec4563dd99860d0918efe5cff/stdlib/_typeshed/__init__.pyi#L239-L247 * https://github.com/python/typing_extensions/blob/04f98954ba63a5e8a09c12171be24785298276b6/src/typing_extensions.py#L2522-L2548 Requested-by: Willi Ballenthin <[email protected]> Signed-off-by: Alex Nelson <[email protected]>
CI continues to pass after the merge of #84 . |
indxparse/BinaryParser.py
Outdated
@@ -335,13 +341,13 @@ def read_byte( | |||
- `OverrunBufferException` | |||
""" | |||
try: | |||
return struct.unpack_from("<B", buf, offset)[0] | |||
return struct.unpack_from("<B", bytes(buf), offset)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm still concerned about these calls to bytes
since they may make copies of a large buffer protocol object (like a mmap) unnecessarily.
i understand these calls are to satisfy mypy. could we replace them with typing.cast(bytes)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, cast()
seems to be the right call here. I've pushed two patches to adopt that strategy.
No effects were observed on Make-managed files. Signed-off-by: Alex Nelson <[email protected]>
This patch avoids potential copies of large buffers from feeding through `bytes()`, using `cast(bytes, _)` instead to satisfy type review. The slicing added in `fbe65927` has been reverted, because it had been added to focus the data being fed into `bytes()` calls. No effects were observed on Make-managed files. References: * https://docs.python.org/3/library/typing.html#typing.cast Requested-by: Willi Ballenthin <[email protected]> Signed-off-by: Alex Nelson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready to merge?
lgtm
@@ -675,7 +681,7 @@ def unpack_byte(self, offset: int) -> int: | |||
Throws: | |||
- `OverrunBufferException` | |||
""" | |||
return read_byte(self._buf, self._offset + offset) | |||
return read_byte(cast(bytes, self._buf), self._offset + offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great!
@@ -69,7 +74,7 @@ def __str__(self): | |||
class FixupBlock(Block): | |||
def __init__( | |||
self, | |||
buf: array.array, | |||
buf: MutableSequence[int], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an instance where the data needs to be mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Yes, I think this is ready to merge. |
Thank you! |
This patch series has three effects:
array.array
as function parameters' types, instead usingMutableSequence[int]
.bytes()
as a converter. This is a side-effect of transitioning type-system guarantees fromarray.array
toMutableSequence[int]
.MFTINDX.py
, generating results for two of its three input forms and four of its run modes.Each patch has a log message documenting its changes.