Make bdist filename verification more strict #14602
Comments
Marking this as a bug because this seems to be in violation of the current binary distribution specification, which says:
and:
Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid, determine which build backends (if any) are producing invalid filenames, and attempt to fix them to produce normalized filenames if possible. We should probably also do a standard deprecation/warning period before blocking upload for these.
Apologies for barging in here, but I came across the issue and this part of the question intrigued me:
So I took a stab at this through the BigQuery instance (I have no clue if this is a complete record of package uploads, but according to the metadata of the table it was updated on October 2nd) and built a simple group-by on project name, version and lowercase filename to select all instances of duplicate filename uploads in 2023. A csv of the query results can be found here. For reference, this is the query:

    SELECT b.name, b.version, b.filename
    FROM `bigquery-public-data.pypi.distribution_metadata` AS b
    RIGHT JOIN (
      SELECT name, version, COUNT(name) AS versions, MIN(filename) AS lower_filename
      FROM `bigquery-public-data.pypi.distribution_metadata`
      WHERE upload_time > TIMESTAMP(DATE "2023-01-01")
        AND packagetype = 'bdist_wheel'
      GROUP BY name, version, LOWER(filename)
      HAVING versions > 1
    ) AS a
    ON LOWER(a.lower_filename) = LOWER(b.filename)
The results linked above include false positives: e.g. the first two rows are both "start_ocr-0.0.3-py3-none-any.whl" (with no difference between them), and indeed there is only one such file at https://pypi.org/project/start-ocr/0.0.3/#files. Nevertheless, this mistake is a thing that really happens. I'm here because I ran across https://pypi.org/project/Pymem/1.13.1/#files, which has both "pymem-1.13.1-py3-none-any.whl" and "Pymem-1.13.1-py3-none-any.whl".
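For anyone who wants to reproduce the filtering locally, here is a rough Python sketch of the same idea: group filenames case-insensitively, but drop exact repeats first so that identical rows (like the start_ocr ones) are not reported. The function name `case_insensitive_duplicates` and the sample rows are made up for illustration; this is not what Warehouse does.

```python
from collections import defaultdict


def case_insensitive_duplicates(filenames):
    """Return groups of filenames that differ only by letter case.

    Hypothetical helper for illustration, not part of Warehouse.
    """
    groups = defaultdict(set)
    for filename in filenames:
        # Using a set drops exact repeats, so two identical rows
        # do not count as a duplicate pair.
        groups[filename.lower()].add(filename)
    return {
        key: sorted(names)
        for key, names in groups.items()
        if len(names) > 1
    }


rows = [
    "start_ocr-0.0.3-py3-none-any.whl",
    "start_ocr-0.0.3-py3-none-any.whl",  # exact repeat: ignored
    "pymem-1.13.1-py3-none-any.whl",
    "Pymem-1.13.1-py3-none-any.whl",     # differs only by case: reported
]
duplicates = case_insensitive_duplicates(rows)
print(duplicates)
```

With these sample rows only the Pymem pair should be reported, since the two start_ocr rows are byte-for-byte identical.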
Thanks for doing the analysis! The number of duplicates is quite low, but we should really be checking for the occurrence of invalid filenames, even if there isn't a duplicate. I think there will probably be many more.
Closing this: we will enforce the wheel name with #17378 once that is unblocked, and the version normalization issue is an upstream bug in
What's the problem this feature will solve?
It’s currently possible to upload duplicate distributions like the following
When installing requests==1.0.0, Poetry would select the first distribution and pip the last.
Describe the solution you'd like
For sdists this will be fixed by #12245, and the same type of enforcement can be made for bdists, i.e. enforce the normalization rules for the filenames, as well as verifying that the version in the filename matches the version in the metadata.
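A minimal sketch of what such a check could look like, under a few assumptions: the name escaping follows my reading of the binary distribution format (runs of "-", "_" and "." become a single "_", letters lowercased), and the version comparison is a plain string match rather than a proper version normalization. The names `normalize_wheel_name` and `check_wheel_filename` are hypothetical, not Warehouse APIs.

```python
import re


def normalize_wheel_name(name):
    """Escape a distribution name for use in a wheel filename.

    Assumption: runs of '-', '_' and '.' collapse to one '_',
    and the result is lowercased.
    """
    return re.sub(r"[-_.]+", "_", name).lower()


def check_wheel_filename(filename, metadata_version):
    """Return a list of problems; an empty list means the checks passed."""
    if not filename.endswith(".whl"):
        return ["not a .whl file"]
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform gives 5 or 6 components.
    if len(parts) not in (5, 6):
        return [f"expected 5 or 6 components, got {len(parts)}"]
    name, version = parts[0], parts[1]
    problems = []
    if name != normalize_wheel_name(name):
        problems.append(f"name {name!r} is not normalized")
    # Assumption: plain string equality; a real check would normalize
    # both versions before comparing them.
    if version != metadata_version:
        problems.append(
            f"filename version {version!r} != metadata version {metadata_version!r}"
        )
    return problems


print(check_wheel_filename("Pymem-1.13.1-py3-none-any.whl", "1.13.1"))
print(check_wheel_filename("pymem-1.13.1-py3-none-any.whl", "1.13.1"))
```

The first call should report the non-normalized name and the second should return an empty list.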
Build numbers would still enable bdist filenames that are otherwise duplicates. I have not looked closely enough at tags to know if they could also be used to make duplicate distributions.
It could be possible to implement a check for duplicate distributions before the normalization rules are enforced for both bdists and sdists.
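As an illustration of such a duplicate check, the sketch below compares an incoming wheel filename against the files already in a release by a normalized key instead of by the raw string. The build tag is kept in the key, so files that differ only by build number are still treated as distinct, as noted above. `wheel_key` and `is_duplicate_upload` are hypothetical names, the name escaping is the same assumption as in the spec reading above, and the version is compared as-is without normalization.

```python
import re


def wheel_key(filename):
    """Split a wheel filename into a comparable tuple (hypothetical helper)."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) == 5:
        parts.insert(2, "")  # no build tag present
    if len(parts) != 6:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    name, version, build, python_tag, abi_tag, platform_tag = parts
    # Assumption: only the name is normalized here; the version is kept as-is.
    normalized_name = re.sub(r"[-_.]+", "_", name).lower()
    return (normalized_name, version, build, python_tag, abi_tag, platform_tag)


def is_duplicate_upload(new_filename, existing_filenames):
    """True if an existing file has the same key as the new upload."""
    new_key = wheel_key(new_filename)
    return any(wheel_key(existing) == new_key for existing in existing_filenames)


existing = ["pymem-1.13.1-py3-none-any.whl"]
print(is_duplicate_upload("Pymem-1.13.1-py3-none-any.whl", existing))
print(is_duplicate_upload("pymem-1.13.1-1-py3-none-any.whl", existing))
```

The first upload collides with the existing file once the name is normalized, while the second carries a build tag and so is not flagged.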
Additional context
Related meta issue: #12316