Make bdist filename verification more strict #14602

Closed
wayphinder opened this issue Sep 20, 2023 · 5 comments
@wayphinder

What's the problem this feature will solve?

It’s currently possible to upload duplicate distributions like the following:

Requests-1.0.0-py3-none-any.whl
requests-1.0.0-py3-none-any.whl
requests-01.0.0-py3-none-any.whl

When installing requests==1.0.0, Poetry would select the first distribution and pip the last.
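
To make the collision concrete, here is a minimal sketch of why those three filenames all resolve to the same project and version. It assumes names are compared after the usual normalization (lowercase, runs of `-`, `_` and `.` collapsed) and uses a deliberately simplified stand-in for PEP 440 version normalization that only strips leading zeros from numeric segments. The helper names are hypothetical and are not taken from Warehouse, pip or Poetry.

```python
import re


def normalize_name(name: str) -> str:
    # Lowercase and collapse runs of "-", "_" and "." into a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


def simplify_version(version: str) -> str:
    # Simplified stand-in for PEP 440 normalization (assumption): only strips
    # leading zeros from purely numeric dotted segments, e.g. "01.0.0" -> "1.0.0".
    return ".".join(str(int(p)) if p.isdigit() else p for p in version.split("."))


def identity(filename: str) -> tuple:
    # A wheel filename starts with "{name}-{version}-...".
    name, version, _rest = filename.split("-", 2)
    return normalize_name(name), simplify_version(version)


filenames = [
    "Requests-1.0.0-py3-none-any.whl",
    "requests-1.0.0-py3-none-any.whl",
    "requests-01.0.0-py3-none-any.whl",
]
# All three filenames collapse to one (name, version) pair.
identities = {identity(f) for f in filenames}
```

Because the set ends up with a single entry, an installer has no principled way to prefer one file over another, which would explain why different tools pick different files.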

Describe the solution you'd like
For sdists this will be fixed by #12245, and the same type of enforcement can be applied to bdists: enforce the normalization rules for the filenames, and verify that the version in the filename matches the version in the metadata.
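
A rough sketch of what such a check could look like for wheels is shown below. It is not Warehouse's implementation: `filename_problems` and its helpers are hypothetical names, the regular expression only covers the basic `{name}-{version}[-{build}]-{python}-{abi}-{platform}.whl` shape, and version normalization is reduced to stripping leading zeros rather than implementing PEP 440 in full.

```python
import re

# Basic wheel filename shape; the optional build tag must start with a digit.
WHEEL_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-]+)(-(?P<build>\d[^-]*))?"
    r"-(?P<python>[^-]+)-(?P<abi>[^-]+)-(?P<platform>[^-]+)\.whl$"
)


def expected_name(project: str) -> str:
    # In wheel filenames, runs of "-", "_" and "." become "_" and the
    # result is lowercased.
    return re.sub(r"[-_.]+", "_", project).lower()


def simplify_version(version: str) -> str:
    # Simplified stand-in for PEP 440 normalization (assumption).
    return ".".join(str(int(p)) if p.isdigit() else p for p in version.split("."))


def filename_problems(filename: str, meta_name: str, meta_version: str) -> list:
    match = WHEEL_RE.match(filename)
    if match is None:
        return ["not a valid wheel filename"]
    problems = []
    if match["name"] != expected_name(meta_name):
        problems.append("name is not normalized")
    if match["version"] != simplify_version(meta_version):
        problems.append("version is not normalized or does not match metadata")
    return problems
```

For example, `filename_problems("requests-01.0.0-py3-none-any.whl", "requests", "1.0.0")` reports the version problem, while the fully normalized filename returns an empty list.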

Build numbers would still enable bdist filenames that are otherwise duplicates. I have not looked closely enough at tags to know if they could also be used to make duplicate distributions.

It could be possible to implement a check for duplicate distributions before the normalization rules are enforced for both bdists and sdists.
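
As a sketch of that duplicate check, the snippet below compares a new wheel filename against the files already in a release by a normalized key. The function names are hypothetical, the version handling is a simplified stand-in for PEP 440 normalization, and build numbers and tags are treated as opaque strings, so two files that differ only in build number are still considered distinct, as noted above.

```python
import re


def duplicate_key(filename: str) -> tuple:
    # Assumes a wheel filename:
    # {name}-{version}[-{build}]-{python}-{abi}-{platform}.whl
    stem = filename[: -len(".whl")]
    name, version, *rest = stem.split("-")
    name = re.sub(r"[-_.]+", "_", name).lower()
    # Simplified version normalization (assumption): strip leading zeros only.
    version = ".".join(str(int(p)) if p.isdigit() else p for p in version.split("."))
    # Build number and tags are kept verbatim as part of the key.
    return (name, version, tuple(rest))


def conflicting_files(new_filename: str, existing: list) -> list:
    # Existing files that would be indistinguishable from the new upload.
    key = duplicate_key(new_filename)
    return [f for f in existing if duplicate_key(f) == key]


existing = ["requests-1.0.0-py3-none-any.whl"]
conflicts = conflicting_files("Requests-1.0.0-py3-none-any.whl", existing)
```

A check along these lines could reject only uploads that collide with an existing file, without yet rejecting every non-normalized filename.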

Additional context
Related meta issue: #12316

@wayphinder added the feature request and requires triaging labels Sep 20, 2023
@di added the bug 🐛 label and removed the requires triaging and feature request labels Sep 20, 2023
@di
Member

di commented Sep 20, 2023

Marking this as a bug because this seems to be in violation of the current binary distribution specification, which says:

In distribution names, ...uppercase characters should be replaced with corresponding lowercase ones.

and:

Version numbers should be normalised according to PEP 440.

Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid, determine which build backends (if any) are producing invalid filenames, and attempt to fix them to produce normalized filenames if possible.

We should probably also do a standard deprecation/warning period before blocking upload for these.

@TheDutchDevil

Apologies for barging in here, but I came across the issue and this part of the question intrigued me:

Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid

So I took a stab at this through the BigQuery instance (I have no clue if this is a complete record of package uploads, but according to the table metadata it was updated on October 2nd) and built a simple group-by on project name, version and lowercase filename to select all instances of duplicate filename uploads in 2023. A CSV of the query results can be found here.

For reference this is the query:

SELECT b.name, b.version, b.filename
FROM `bigquery-public-data.pypi.distribution_metadata` AS b
RIGHT JOIN (
  SELECT name, version, COUNT(name) AS versions, MIN(filename) AS lower_filename
  FROM `bigquery-public-data.pypi.distribution_metadata`
  WHERE
    upload_time > TIMESTAMP(DATE "2023-01-01") AND
    packagetype = 'bdist_wheel'
  GROUP BY name, version, LOWER(filename)
  HAVING versions > 1
) AS a
ON LOWER(a.lower_filename) = LOWER(b.filename)

@dimbleby

Results linked above include false positives, e.g. the first two rows are both "start_ocr-0.0.3-py3-none-any.whl" (with no difference between them), and indeed there is only one such file at https://pypi.org/project/start-ocr/0.0.3/#files

Nevertheless, this mistake is a thing that really happens: I'm here because I ran across https://pypi.org/project/Pymem/1.13.1/#files, which has both "pymem-1.13.1-py3-none-any.whl" and "Pymem-1.13.1-py3-none-any.whl"
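
One way to separate the two cases, sketched here on a handful of rows rather than on the real table, is to group by the lowercased filename but count distinct exact filenames. This assumes the false positives come from the same file appearing in more than one row of the dataset; the sample rows and the `case_duplicates` helper are illustrative only.

```python
from collections import defaultdict


def case_duplicates(rows):
    # rows: iterable of (name, version, filename) tuples.
    groups = defaultdict(set)
    for name, version, filename in rows:
        # Using a set means a repeated identical row cannot inflate the count.
        groups[(name, version, filename.lower())].add(filename)
    # Keep only groups with at least two distinct spellings of the filename.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


rows = [
    ("start-ocr", "0.0.3", "start_ocr-0.0.3-py3-none-any.whl"),
    ("start-ocr", "0.0.3", "start_ocr-0.0.3-py3-none-any.whl"),  # repeated row
    ("Pymem", "1.13.1", "pymem-1.13.1-py3-none-any.whl"),
    ("Pymem", "1.13.1", "Pymem-1.13.1-py3-none-any.whl"),
]
duplicates = case_duplicates(rows)
```

With these rows only the Pymem release is reported. In the SQL query, the corresponding change would presumably be to count distinct filenames instead of rows.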

@di
Member

di commented Aug 7, 2024

Thanks for doing the analysis! The number of duplicates is quite low, but we should really be checking for the occurrence of invalid filenames, even if there isn't a duplicate. I think there will probably be many more.
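
A minimal sketch of that broader measurement follows: it flags every filename whose name or version part is not in normalized form, whether or not a sibling duplicate exists. The helpers are hypothetical, the input is a hand-written sample rather than real upload data, and the version check only looks for leading zeros instead of applying full PEP 440 normalization.

```python
import re


def is_normalized(filename: str) -> bool:
    # A wheel filename starts with "{name}-{version}-...".
    name, version, _rest = filename.split("-", 2)
    # Name must already be lowercase with runs of "_" and "." collapsed to "_".
    name_ok = name == re.sub(r"[-_.]+", "_", name).lower()
    # Simplified version check (assumption): no leading zeros in numeric segments.
    version_ok = all(p == str(int(p)) for p in version.split(".") if p.isdigit())
    return name_ok and version_ok


def invalid_fraction(filenames):
    # Returns the share of non-normalized filenames and the offending names.
    if not filenames:
        return 0.0, []
    invalid = [f for f in filenames if not is_normalized(f)]
    return len(invalid) / len(filenames), invalid


sample = [
    "Requests-1.0.0-py3-none-any.whl",
    "requests-1.0.0-py3-none-any.whl",
    "requests-01.0.0-py3-none-any.whl",
    "Pymem-1.13.1-py3-none-any.whl",
    "pymem-1.13.1-py3-none-any.whl",
]
fraction, invalid = invalid_fraction(sample)
```

Run over a real list of recent uploads, the same shape of check would give the proportion of filenames that stricter validation would reject.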

@di
Member

di commented Feb 13, 2025

Closing this: we will enforce the wheel name with #17378 once that is unblocked, and the version normalization issue is an upstream bug in packaging: pypa/packaging#873.

@di di closed this as completed Feb 13, 2025