Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exercise MFTINDX.py with test image and extracted MFT #83

Merged

Conversation

ajnelson-nist
Copy link
Contributor

This patch series has three effects:

  1. Fixing a few minor typing issues.
  2. Replacing usage of array.array as function parameters' types, instead using MutableSequence[int].
    1. Changing several methods of yielding byte-strings to instead use bytes() as a converter. This is a side-effect of transitioning type-system guarantees from array.array to MutableSequence[int].
  3. Adding a test-set for MFTINDX.py, generating results for two of its three input forms and four of its run modes.
    1. A TODO is left for adding INDX snippets for analysis.

Each patch has a log message documenting its changes.

On trying to run `MFTINDX.py` against the sample disk image in this
repository, an error was raised, with this (trimmed) call stack:

```
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "...[snip].../indxparse/MFTINDX.py", line 284, in <module>
    buf: array.array[Any],
TypeError: 'type' object is not subscriptable
```

(Using Python 3.9.)

This type signature was an attempt to satisfy a review note from
`mypy --strict`, that `array.array` is a generic and needed a further
type specialization, e.g. as visible with
`mypy --strict indxparse/BinaryParser.py` (which is known to also raise
other unrelated issues at the moment):

```
error: Missing type parameters for generic type "array"  [type-arg]
```

This type signature specialization was not runtime-tested before
implementing, and turns out to be incorrect.

A StackOverflow post suggested replacing `array.array` in type
signatures with `collections.abc.MutableSequence`, which is also a
generic but can be specialized.

This patch replaces `array.array` (with and without further type
specialization) in type signatures with `MutableSequence[int]`, because
a single member of a `bytes` object and most other `bytes`-like objects
is an integer.  While this is not entirely true for `array.array`, all
of the usage of `array.array` in this code base is of the `"B"` form,
unsigned byte, so `MutableSequence[int]` is appropriate.  Note that
`array.array` is still used for variables' assignment; it is just not
used in type signatures anymore.

A side-effect of switching away from `array.array` in type signatures is
that `MutableSequence` does not satisfy the `Buffer` protocol expected
from `struct.unpack`.  `.tobytes()` is also not guaranteed to be
available.  `mypy` highlights these issues without `--strict`.  The
`bytes()` built-in method is used to cast all values now only guaranteed
to be `MutableSequence[int]`.

The `Block.unpack_guid` method was modified to remove `ord`, which
happens to not be type-compatible with `MutableSequence[int]` when using
`map`.

Disclaimer:
Participation by NIST in the creation of the documentation of mentioned
software is not intended to imply a recommendation or endorsement by the
National Institute of Standards and Technology, nor is it intended to
imply that any specific software is necessarily the best available for
the purpose.

References:
* https://stackoverflow.com/a/67775675

Signed-off-by: Alex Nelson <[email protected]>
`types.MethodType` in Python 3 takes 2 arguments to instantiate, per
line 447 of the current `types.pyi` in `typeshed`.  I'm not sure when
this switched from 3 arguments as written down to 2; using `git blame`
to step back in the history of that line only goes back to `337abed05a`
in 2015, and doesn't show a specific time where the "name" parameter
this patch removes was expected.

This patch removes the third parameter because it raises a runtime
error.

References:
* https://github.com/python/typeshed/blob/21fcd8960f1dae5ec4563dd99860d0918efe5cff/stdlib/types.pyi#L447

Signed-off-by: Alex Nelson <[email protected]>
…r string

Use of an empty character string with a "B" `array` raises a runtime
error, with the stack trace ending with this:

```
TypeError: cannot use a str to initialize an array with typecode 'B'
```

Signed-off-by: Alex Nelson <[email protected]>
This is added in partial satisfaction of Issue 41.

A follow-on patch will regenerate Make-managed files.

References:
* williballenthin#41

Signed-off-by: Alex Nelson <[email protected]>
@ajnelson-nist
Copy link
Contributor Author

Apologies for the noise - I tested offline with Python 3.9, but forgot to test 3.8. Version 1 of this PR turns out to work in only 3.9 and later.

@@ -229,7 +230,7 @@ def align(offset, alignment):
return offset + (alignment - (offset % alignment))


def dosdate(dosdate: array.array, dostime: array.array) -> datetime:
def dosdate(dosdate: MutableSequence[int], dostime: MutableSequence[int]) -> datetime:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would really like to work towards annotations that better differentiate between mutable vs immutable data. when i see a function that takes a MutableSequence then i expect it may change the data. for these purely functional parsing routines i think this is the wrong hint. most of these routines should take bytes (or Sequence[byte] or similar abstraction). maybe we should introduce casts in a few key places to make the hints work better.

thoughts?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think collections.abc.Buffer (3.12+) and typing.ByteString (3.9-) are probably the appropriate hints.

https://docs.python.org/3/library/typing.html#typing.ByteString

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been finding this to be a challenging type annotation exercise. I am happy to explore options. I'll be back to you soon, but probably not before your end of day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've added two patches. The first fixes the Python 3.8 issue. If you'd like me to rework the series so CI passes on each commit, let me know and I'll file another PR instead of force-pushing to overwrite.

The second patch applies your suggestion to differentiate mutable vs. immutable data. It induced some non-trivial updates to the code; that patch's log message document what happened.

I also think the patch ended up suggesting some code-style revisions, e.g. reviewing whether the read_* functions still need the offset parameter with bytes() used before calling; and, reviewing read_byte. But I think these are style matters and don't necessarily need to be addressed now, so I'm marking this PR as ready for review & merge if you agree and are fine with the half-passing early commits.

@@ -335,13 +336,13 @@ def read_byte(
- `OverrunBufferException`
"""
try:
return struct.unpack_from("<B", buf, offset)[0]
return struct.unpack_from("<B", bytes(buf), offset)[0]
Copy link
Owner

@williballenthin williballenthin Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these calls to bytes(buf) make mypy happy but they aren't side effect free: i think they'll realize the mmap/array/whatever into a copy of all the data. for large byte arrays, this could be problematic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about a slice on buf before feeding into bytes()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlining slicing ended up making the most sense to me.

The `typing` module offers a symbol that supports subscripting as a type
signature without confusing the runtime; but, it was deprecated in Python 3.9.

This patch was tested in Pythons 3.8 and 3.12.

References:
* https://docs.python.org/3.12/library/typing.html#typing.MutableSequence

Signed-off-by: Alex Nelson <[email protected]>
This patch reviews usage of `MutableSequence[int]`, and restricts
functions that should not modify their inputs (e.g.
`BinaryParser.read_dword`) to use a read-only argument signature instead
of the read-write `MutableSequence[int]`.

An initial suggestion of `typing.ByteString` and/or
`collections.abc.Buffer` led to review of their definitions for whether
they are read-only or read-write types.  The end result is that for
"Read-write" function parameters, `MutableSequence[int]` is still used,
and for "Read-only" parameters, `bytes` is used.

The foundations for this decision are that the `bytes` class is
immutable, `bytearray` is mutable, and the suggested types `ByteString`
and `Buffer` end up supporting `bytearray`.  Prior review of
`typeshed`'s documentation on "readable buffers" and "writable buffers"
indicate there is no "read-only Buffer" generic type currently in
Python.

`collections.abc.Buffer` is the union of `bytes`, `bytearray`, and
`memoryview`.  Because the union includes `bytes` and `bytearray`,
`Buffer` can be mutable.  (This definition was found in
`typing_extensions`; source block cited in references.)

`memoryview` may reference an object that supports the buffer protocol.
This includes `bytes` and `bytearray`.  (This mutability/immutability
detail is documented in both 3.8 and 3.12; 3.8 is cited below.)

`typing.ByteString` is defined as the same union as
`collections.abc.Buffer`, so it can be mutable.

Given all the above, this patch implements "read-only" buffer
restrictions on function parameters with the `bytes` type.

A consequence of using types to restrict read-only vs. read-write is
that several call paths to the read-only functions start from read-write
buffers.  This patch handles this by using `bytes` to cast slices of the
mutable buffers, done to satisfy type restrictions and to avoid casting
larger buffers than necessary.

No effects were observed on Make-managed files.

References:
* https://docs.python.org/3/library/stdtypes.html#bytearray-objects
* https://docs.python.org/3/library/stdtypes.html#bytes-objects
* https://docs.python.org/3.8/library/stdtypes.html#memoryview
* https://github.com/python/typeshed/blob/21fcd8960f1dae5ec4563dd99860d0918efe5cff/stdlib/_typeshed/__init__.pyi#L239-L247
* https://github.com/python/typing_extensions/blob/04f98954ba63a5e8a09c12171be24785298276b6/src/typing_extensions.py#L2522-L2548

Requested-by: Willi Ballenthin <[email protected]>
Signed-off-by: Alex Nelson <[email protected]>
@ajnelson-nist ajnelson-nist marked this pull request as ready for review October 19, 2023 19:40
@ajnelson-nist
Copy link
Contributor Author

CI continues to pass after the merge of #84 .

@@ -335,13 +341,13 @@ def read_byte(
- `OverrunBufferException`
"""
try:
return struct.unpack_from("<B", buf, offset)[0]
return struct.unpack_from("<B", bytes(buf), offset)[0]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm still concerned about these calls to bytes since they may make copies of a large buffer protocol object (like a mmap) unnecessarily.

i understand these calls are to satisfy mypy. could we replace them with typing.cast(bytes)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, cast() seems to be the right call here. I've pushed two patches to adopt that strategy.

No effects were observed on Make-managed files.

Signed-off-by: Alex Nelson <[email protected]>
This patch avoids potential copies of large buffers from feeding through
`bytes()`, using `cast(bytes, _)` instead to satisfy type review.

The slicing added in `fbe65927` has been reverted, because it had been
added to focus the data being fed into `bytes()` calls.

No effects were observed on Make-managed files.

References:
* https://docs.python.org/3/library/typing.html#typing.cast

Requested-by: Willi Ballenthin <[email protected]>
Signed-off-by: Alex Nelson <[email protected]>
Copy link
Owner

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready to merge?

lgtm

@@ -675,7 +681,7 @@ def unpack_byte(self, offset: int) -> int:
Throws:
- `OverrunBufferException`
"""
return read_byte(self._buf, self._offset + offset)
return read_byte(cast(bytes, self._buf), self._offset + offset)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@@ -69,7 +74,7 @@ def __str__(self):
class FixupBlock(Block):
def __init__(
self,
buf: array.array,
buf: MutableSequence[int],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an instance where the data needs to be mutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@ajnelson-nist
Copy link
Contributor Author

Yes, I think this is ready to merge.

@williballenthin williballenthin merged commit 038e8ec into williballenthin:master Nov 1, 2023
2 checks passed
@ajnelson-nist
Copy link
Contributor Author

Thank you!

@ajnelson-nist ajnelson-nist deleted the exercise_mftindx_py branch November 1, 2023 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants