Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zdtm: Distinguish between fail and crash of dump #2376

Merged
merged 2 commits into from
Jun 8, 2024

Conversation

bsach64
Copy link
Member

@bsach64 bsach64 commented Mar 28, 2024

This PR distinguishes criu dump crash from criu dump fail in the zdtm.py script. Now, zdtm reports criu dump crash as a test failure.

Very similiar to the work done in this PR (#2140)

But, to detect a rpc crash we see that if the response from the RPC is invalid, then we assume
the RPC crashed.

For detecting the crash for the CLI we can check the Popen.returncode
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.returncode

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

Fixes: #350

@codecov-commenter
Copy link

codecov-commenter commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.57%. Comparing base (00d7cdc) to head (6298e80).

❗ Current head 6298e80 differs from pull request most recent head 4bfaf6e. Consider uploading reports for the commit 4bfaf6e to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2376      +/-   ##
============================================
- Coverage     70.73%   70.57%   -0.16%     
============================================
  Files           136      135       -1     
  Lines         32940    33449     +509     
============================================
+ Hits          23299    23606     +307     
- Misses         9641     9843     +202     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

test/zdtm.py Outdated Show resolved Hide resolved
@bsach64 bsach64 force-pushed the distinguish-crash-fail branch from 62ae140 to 4bfaf6e Compare March 29, 2024 21:47
test/zdtm.py Outdated Show resolved Hide resolved
@Snorch
Copy link
Member

Snorch commented Apr 16, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Adds a exit_signal static method to criu_cli, criu_config and criu_rpc
used to detect a crash.

Fixes: checkpoint-restore#350

Signed-off-by: Bhavik Sachdev <[email protected]>
@bsach64 bsach64 force-pushed the distinguish-crash-fail branch from 4bfaf6e to 2c9e91c Compare April 17, 2024 15:30
@bsach64
Copy link
Member Author

bsach64 commented Apr 19, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch,
I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.
On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh.
There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

@Snorch
Copy link
Member

Snorch commented Apr 22, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

@bsach64 bsach64 force-pushed the distinguish-crash-fail branch from 1400985 to 2590899 Compare April 24, 2024 10:18
@bsach64
Copy link
Member Author

bsach64 commented Apr 24, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want.
If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail. I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :)
Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

@Snorch
Copy link
Member

Snorch commented Apr 26, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

test/jenkins/criu-fault.sh Outdated Show resolved Hide resolved
@bsach64 bsach64 force-pushed the distinguish-crash-fail branch from 2590899 to c787b51 Compare April 26, 2024 07:55
Copy link
Member

@Snorch Snorch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@bsach64
Copy link
Member Author

bsach64 commented Apr 26, 2024

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

Sorry, the misunderstanding was on my part. You are correct; the tests previously passed as they were re-run without the fault injection. Thank you for all your help @Snorch :)

@Snorch
Copy link
Member

Snorch commented Apr 26, 2024

Hm, test still does not pass with && fail as though we don't call fail function we still return non-zero exit code and with set -e it fails the script.

This seem to do the right trick:

if ./test/zdtm.py run -t zdtm/static/vfork00 --fault 136 --report report -f h ; then
        fail
fi

Also maybe someone else has better ideas?

@bsach64 bsach64 force-pushed the distinguish-crash-fail branch from c787b51 to d471052 Compare April 26, 2024 09:15
Copy link

A friendly reminder that this PR had no activity for 30 days.

@bsach64
Copy link
Member Author

bsach64 commented Jun 5, 2024

A friendly reminder that this PR had no activity for 30 days.

Can we remove the "stale-pr" label?

@adrianreber
Copy link
Member

A friendly reminder that this PR had no activity for 30 days.

Can we remove the "stale-pr" label?

Removed.

@adrianreber adrianreber added the no-auto-close Don't auto-close as a stale issue label Jun 5, 2024
@avagin avagin merged commit ced120a into checkpoint-restore:criu-dev Jun 8, 2024
36 of 39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Distinguish criu dump fail from criu dump crash in zdtm
6 participants