Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: parse_wheel_filename permits non-normalized versions. #873

Open
di opened this issue Feb 13, 2025 · 12 comments · May be fixed by #874
Open

Bug: parse_wheel_filename permits non-normalized versions. #873

di opened this issue Feb 13, 2025 · 12 comments · May be fixed by #874

Comments

@di
Copy link
Member

di commented Feb 13, 2025

https://packaging.python.org/en/latest/specifications/binary-distribution-format/ says:

Version numbers should be normalised according to the Version specifier specification.

Currently, parse_wheel_filename will raise InvalidWheelFilename for some invalid filenames, but https://packaging.python.org/en/latest/specifications/version-specifiers/#normalization has a long list of version normalizations that parse_wheel_filename does not enforce:

>>> from packaging.utils import parse_wheel_filename
>>> parse_wheel_filename('foo-01.0.0-py3-none-any.whl')
('foo', <Version('1.0.0')>, (), frozenset({<py3-none-any @ 140365619033920>}))
@notatallshaw
Copy link
Member

notatallshaw commented Feb 13, 2025

Does packaging have any sort of deprecation process for this sort of change? i.e. a public function that throws an error where it didn't used to.

Pip is migrating to use parse_wheel_filename as the primary way to parse wheel filenames in 25.1 and I'm concerned on impact to users.

I'm still learning how pip migrates to new packaging versions, I guess we could patch packaging to throw a warning instead of an exception for an appropriate deprecation period.

@di
Copy link
Member Author

di commented Feb 13, 2025

@notatallshaw You may be interested in this branch I am working on to provide a better API here around creating/parsing filenames, including "strict" and "non-strict" parsing: main...di:packaging:filename-api

From a cursory glance, it appears that most build backends are producing wheels with normalized filenames, so I wouldn't expect this to be too disruptive, but I could also be convinced to leave parse_wheel_filename alone until we have a better way to introduce this change.

@brettcannon
Copy link
Member

Does packaging have any sort of deprecation process for this sort of change?

We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.

@notatallshaw
Copy link
Member

notatallshaw commented Feb 13, 2025

From a cursory glance, it appears that most build backends are producing wheels with normalized filenames, so I wouldn't expect this to be too disruptive, but I could also be convinced to leave parse_wheel_filename alone until we have a better way to introduce this change.

I don't want to hold up packaging if this is a clear win for most use cases. But a bit of background:

Currently pip is using a custom regex to parse wheel filenames: https://github.com/pypa/pip/blob/25.0.1/src/pip/_internal/models/wheel.py#L21

There was a use case pip explicitly supported where _ was allowed in the version part, so we entered a depreciation cycle. So far I've seen two reports of users hitting this (pypa/pip#12938 (comment) and quic/aimet#3743)

These two use cases suggest the two scenarios are users hand rolling their own wheel filenames on private indexes / filesystems and perhaps an edge case in scikit-build-core, again for a project not uploaded to PyPI.

In general when pip is consuming wheel filenames I would rather it be non-strict in cases where the filename can be unambiguously normalized. But I also suppose we could lengthen the depreciation cycle and flag all issues to users (outside the explicitly supported _ in the version part). I would need to do some investigation on what the difference is between this additional strictness here and pip's current regex.

@di
Copy link
Member Author

di commented Feb 13, 2025

We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.

Taking a look at all filenames uploaded in the last month:

COPY (
    SELECT
        fe.time AS timestamp,
        fe.additional->>'filename' AS filename
    FROM
        file_events fe
    WHERE
        fe.time >= NOW() - INTERVAL '1 month'
        AND fe.tag = 'file:add'
        AND fe.additional->>'filename' LIKE '%.whl'
    ORDER BY fe.time
) TO STDOUT WITH (FORMAT CSV, HEADER);

Analyzing with:

import csv

from packaging.utils import InvalidWheelFilename, parse_wheel_filename

total_filenames = 0
exception_count = 0

with open("out.csv") as file:
    reader = csv.DictReader(file)  # Assumes CSV has a header

    for row in reader:
        total_filenames += 1
        filename = row["filename"]
        try:
            parse_wheel_filename(filename)
        except InvalidWheelFilename:
            exception_count += 1

percentage_exceptions = (exception_count / total_filenames) * 100
print(f"Total filenames processed: {total_filenames}")
print(f"Total exceptions: {exception_count}")
print(f"Percentage of exceptions: {percentage_exceptions:.2f}%")

With this branch, we get:

Total filenames processed: 210589
Total exceptions: 24
Percentage of exceptions: 0.01%

The invalid filenames in question are:

anki-25.01b1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.01b1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.01b1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.01b1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.01b1-cp39-abi3-win_amd64.whl
anki-25.01rc1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.01rc1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.01rc1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.01rc1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.01rc1-cp39-abi3-win_amd64.whl
anki-25.02-cp39-abi3-macosx_12_0_arm64.whl
anki-25.02-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.02-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.02-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.02-cp39-abi3-win_amd64.whl
anki-25.02rc1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.02rc1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.02rc1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.02rc1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.02rc1-cp39-abi3-win_amd64.whl
aqt-25.01b1-py3-none-any.whl
aqt-25.01rc1-py3-none-any.whl
aqt-25.02-py3-none-any.whl
aqt-25.02rc1-py3-none-any.whl

So it's really only two projects that are producing wheels with non-normalized versions, not 24 projects.

I can expand the analysis to a larger timeframe if anyone would like?

@di
Copy link
Member Author

di commented Feb 13, 2025

Also, seems reasonable to add a strict flag to parse_wheel_filename that disables this check, which pip could optionally set to handle their own deprecation period.

@notatallshaw
Copy link
Member

notatallshaw commented Feb 13, 2025

We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.

So it's really only two projects that are producing wheels with non-normalized versions

While helpful information, my experience supporting pip is that it's users who are not uploading to PyPI that are caught by changes to more strictly comply with the spec.

Also, seems reasonable to add a strict flag to parse_wheel_filename that disables this check, which pip could optionally set to handle their own deprecation period.

I think this would make it easier, and I definitely like the look of the filename API you linked earlier.

All that said, please don't consider it a show stopper, looking at all this code has made me think I need to re-review this and check additional scenarios before migrating to parse_wheel_filename, as I said we can always lengthen the depreciation cycle or patch packaging.

@di
Copy link
Member Author

di commented Feb 13, 2025

All that said, please don't consider it a show stopper, looking at all this code has made me think I need to re-review this and check additional scenarios before migrating to parse_wheel_filename, as I said we can always lengthen the depreciation cycle or patch packaging.

If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library. Note that parse_wheel_filename also doesn't enforce that the project name is normalized (a much more common issue) and that it's partner parse_sdist_filename has a bunch of issues: #527.

@notatallshaw
Copy link
Member

notatallshaw commented Feb 13, 2025

If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library. Note that parse_wheel_filename also doesn't enforce that the project name is normalized

We started the depreciation in pip 24.3 due to be complete in pip 25.1 (the next release), I'll discuss with the other pip maintainers, thanks.

and that it's partner parse_sdist_filename has a bunch of issues

I've not looked at sdists in general yet, that seemed like a much bigger can of worms. The only place I think we're using parse_sdist_filename right now is to filter filenames when a user points pip to a file directory to find packages, I added that before I understood that sdist and wheel names were tricky, fortunately that was done in pip 24.0 (a year ago) and we've not seen any complaints.

@brettcannon
Copy link
Member

If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library.

What API are you thinking of?

@di
Copy link
Member Author

di commented Feb 13, 2025

What API are you thinking of?

The one I mentioned here:

@notatallshaw You may be interested in this branch I am working on to provide a better API here around creating/parsing filenames, including "strict" and "non-strict" parsing: main...di:packaging:filename-api

@di
Copy link
Member Author

di commented Feb 13, 2025

I'm also OK waiting on that (or something like that) to land instead of merging #874, we can add this check to PyPI in the interim instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants