Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repo: allow_missing: Compare hash_info.value instead of hash_info #9839

Closed

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented Aug 11, 2023

Closes #9818

@daavoo daavoo added bugfix fixes bug A: pipelines Related to the pipelines feature labels Aug 11, 2023
@daavoo daavoo self-assigned this Aug 11, 2023
@daavoo daavoo linked an issue Aug 11, 2023 that may be closed by this pull request
@daavoo daavoo requested a review from a team August 11, 2023 14:19
@@ -314,7 +314,8 @@ def _changed_deps(
if status:
if allow_missing and status[str(dep)] == "deleted":
if upstream and any(
dep.fs_path == out.fs_path and dep.hash_info != out.hash_info
dep.fs_path == out.fs_path
and dep.hash_info.value != out.hash_info.value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla do you think it is ok?

See the added test / linked issue for the scenario, is there something different we should do for 2.X .dvc files that are tracked as deps in 3.X dvc.lock files?

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

Comment on lines +317 to +318
dep.fs_path == out.fs_path
and dep.hash_info.value != out.hash_info.value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not look into hash_info.value. 2.0 and 3.0 hashes are not compatible.

Copy link
Contributor Author

@daavoo daavoo Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know the theory:

#9818 (comment)

Do we care in practice (today, not if we introduce different hashing algorithm)?

I mean, isn't the same chance of a collision as today with 2 different 3.0 md5 hashes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even without considering the collision, if I am understanding the branch of this code correctly, there is a chance of hashes being mismatched and returning modified, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo hash_info should be compared as a whole, comparing only values without name looks like a hack. Could you elaborate on why you need to do it this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess new output is getting hashed as 3.x and deps is hashed as 2.x format?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would be better to just ask to upgrade (e.g. with dvc commit) #9818 (comment) Otherwise we have to compute both hashes (which is undesirable) or hack like it is done in this PR, which makes it straight up error-prone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo hash_info should be compared as a whole, comparing only values without name looks like a hack.

It doesn't look, it is a hack 😄

Could you elaborate on why you need to do it this way?

It is in the added test / linked issue.

If there is 2.X .dvc referenced as dep in a 3.X dvc.lock, the hash_info comparison fails because of has_info.name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a chance of hashes being mismatched and returning modified, no?

Not sure I follow, sorry.
This is what happens before the P.R: an equivalent folder is mismatched because of hash_info.name but only when the content is missing. If you pull the data, the same .dvc and dvc.lock report it as not modified

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% 🎉

Comparison is base (a051b8b) 90.81% compared to head (f4bb83d) 90.83%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9839      +/-   ##
==========================================
+ Coverage   90.81%   90.83%   +0.01%     
==========================================
  Files         473      473              
  Lines       35982    35988       +6     
  Branches     5189     5189              
==========================================
+ Hits        32678    32688      +10     
+ Misses       2713     2710       -3     
+ Partials      591      590       -1     
Files Changed Coverage Δ
dvc/stage/__init__.py 91.66% <ø> (ø)
tests/func/repro/test_repro_allow_missing.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skshetry skshetry changed the title repo: allow_missing: Compare hash_info.value instead of has_info. repo: allow_missing: Compare hash_info.value instead of hash_info. Aug 11, 2023
@skshetry skshetry changed the title repo: allow_missing: Compare hash_info.value instead of hash_info. repo: allow_missing: Compare hash_info.value instead of hash_info Aug 11, 2023
@daavoo daavoo marked this pull request as draft August 15, 2023 08:48
@daavoo
Copy link
Contributor Author

daavoo commented Aug 15, 2023

Discussed to follow a different approach:

  • dvc commit --force / dvc repro should update the .dvc file using the new hash.

@daavoo daavoo closed this Aug 15, 2023
@skshetry skshetry deleted the 9818-dvc-repro-dry-allow-missing-fails-on-missing-data branch September 24, 2023 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: pipelines Related to the pipelines feature bugfix fixes bug
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

dvc repro --dry --allow-missing: fails on missing data
4 participants