Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repo: allow_missing: Compare hash_info.value instead of hash_info #9839

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion dvc/stage/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -314,7 +314,8 @@ def _changed_deps(
if status:
if allow_missing and status[str(dep)] == "deleted":
if upstream and any(
dep.fs_path == out.fs_path and dep.hash_info != out.hash_info
dep.fs_path == out.fs_path
and dep.hash_info.value != out.hash_info.value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla do you think it is ok?

See the added test / linked issue for the scenario, is there something different we should do for 2.X .dvc files that are tracked as deps in 3.X dvc.lock files?

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

Comment on lines +317 to +318
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not look into hash_info.value. 2.0 and 3.0 hashes are not compatible.

Copy link
Contributor Author

@daavoo daavoo Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know the theory:

#9818 (comment)

Do we care in practice (today, not if we introduce different hashing algorithm)?

I mean, isn't the same chance of a collision as today with 2 different 3.0 md5 hashes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even without considering the collision, if I am understanding the branch of this code correctly, there is a chance of hashes being mismatched and returning modified, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo hash_info should be compared as a whole, comparing only values without name looks like a hack. Could you elaborate on why you need to do it this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess new output is getting hashed as 3.x and deps is hashed as 2.x format?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would be better to just ask to upgrade (e.g. with dvc commit) #9818 (comment) Otherwise we have to compute both hashes (which is undesirable) or hack like it is done in this PR, which makes it straight up error-prone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo hash_info should be compared as a whole, comparing only values without name looks like a hack.

It doesn't look, it is a hack 😄

Could you elaborate on why you need to do it this way?

It is in the added test / linked issue.

If there is 2.X .dvc referenced as dep in a 3.X dvc.lock, the hash_info comparison fails because of has_info.name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a chance of hashes being mismatched and returning modified, no?

Not sure I follow, sorry.
This is what happens before the P.R: an equivalent folder is mismatched because of hash_info.name but only when the content is missing. If you pull the data, the same .dvc and dvc.lock report it as not modified

for stage in upstream
for out in stage.outs
):
Expand Down
53 changes: 53 additions & 0 deletions tests/func/repro/test_repro_allow_missing.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,59 @@ def test_repro_allow_missing(tmp_dir, dvc):
assert not ret


def test_repro_allow_missing_dry_with_dvc_2_file(tmp_dir, dvc):
"""https://github.com/iterative/dvc/issues/9818"""

# A .dvc file in 2.X format
(tmp_dir / "data.dvc").dump(
{
"outs": [
{
"md5": "14d187e749ee5614e105741c719fa185.dir",
"size": 18999874,
"nfiles": 183,
"path": "data",
}
]
}
)

(tmp_dir / "dvc.lock").dump(
{
"schema": "2.0",
"stages": {
"cat-data-foo": {
"cmd": "cat data/foo > foo",
"deps": [
{
# The file as dependency in 3.X format
"path": "data",
"hash": "md5",
"md5": "14d187e749ee5614e105741c719fa185.dir",
"size": 18999874,
"nfiles": 183,
}
],
"outs": [
{
"path": "foo",
"hash": "md5",
"md5": "069f9d4b9c418b935f7bec4a094ba535",
"size": 30,
}
],
}
},
}
)
dvc.stage.add(
name="cat-data-foo", cmd="cat data/foo > foo", deps=["data"], outs=["foo"]
)

ret = dvc.reproduce(allow_missing=True, dry=True)
assert not ret


def test_repro_allow_missing_and_pull(tmp_dir, dvc, mocker, local_remote):
tmp_dir.gen("fixed", "fixed")
dvc.stage.add(name="create-foo", cmd="echo foo > foo", deps=["fixed"], outs=["foo"])
Expand Down