zdtm: Distinguish between fail and crash of dump #2376

bsach64 · 2024-03-28T18:58:30Z

This PR distinguishes criu dump crash from criu dump fail in the zdtm.py script. Now, zdtm reports criu dump crash as a test failure.

Very similiar to the work done in this PR (#2140)

But, to detect a rpc crash we see that if the response from the RPC is invalid, then we assume
the RPC crashed.

For detecting the crash for the CLI we can check the Popen.returncode
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.returncode

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

Fixes: #350

codecov-commenter · 2024-03-28T19:25:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.57%. Comparing base (00d7cdc) to head (6298e80).

❗ Current head 6298e80 differs from pull request most recent head 4bfaf6e. Consider uploading reports for the commit 4bfaf6e to get more accurate results

Additional details and impacted files

@@             Coverage Diff              @@
##           criu-dev    #2376      +/-   ##
============================================
- Coverage     70.73%   70.57%   -0.16%     
============================================
  Files           136      135       -1     
  Lines         32940    33449     +509     
============================================
+ Hits          23299    23606     +307     
- Misses         9641     9843     +202

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

test/zdtm.py

Snorch · 2024-04-16T10:52:25Z

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Adds a exit_signal static method to criu_cli, criu_config and criu_rpc used to detect a crash. Fixes: checkpoint-restore#350 Signed-off-by: Bhavik Sachdev <[email protected]>

bsach64 · 2024-04-19T19:33:32Z

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch,
I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.
On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh.
There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Snorch · 2024-04-22T12:15:01Z

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

bsach64 · 2024-04-24T11:41:31Z

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want.
If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail. I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :)
Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

Snorch · 2024-04-26T05:32:40Z

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

test/jenkins/criu-fault.sh

Snorch

Looks good.

bsach64 · 2024-04-26T08:11:51Z

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

Sorry, the misunderstanding was on my part. You are correct; the tests previously passed as they were re-run without the fault injection. Thank you for all your help @Snorch :)

Snorch · 2024-04-26T08:59:36Z

Hm, test still does not pass with && fail as though we don't call fail function we still return non-zero exit code and with set -e it fails the script.

This seem to do the right trick:

if ./test/zdtm.py run -t zdtm/static/vfork00 --fault 136 --report report -f h ; then
        fail
fi

Also maybe someone else has better ideas?

Signed-off-by: Bhavik Sachdev <[email protected]>

github-actions · 2024-05-27T00:42:41Z

A friendly reminder that this PR had no activity for 30 days.

bsach64 · 2024-06-05T15:42:04Z

A friendly reminder that this PR had no activity for 30 days.

Can we remove the "stale-pr" label?

adrianreber · 2024-06-05T15:43:29Z

A friendly reminder that this PR had no activity for 30 days.

Can we remove the "stale-pr" label?

Removed.

rst0git reviewed Mar 29, 2024

View reviewed changes

test/zdtm.py Outdated Show resolved Hide resolved

bsach64 force-pushed the distinguish-crash-fail branch from 62ae140 to 4bfaf6e Compare March 29, 2024 21:47

Snorch reviewed Apr 16, 2024

View reviewed changes

test/zdtm.py Outdated Show resolved Hide resolved

zdtm: Distinguish between fail and crash of dump

2c9e91c

Adds a exit_signal static method to criu_cli, criu_config and criu_rpc used to detect a crash. Fixes: checkpoint-restore#350 Signed-off-by: Bhavik Sachdev <[email protected]>

bsach64 force-pushed the distinguish-crash-fail branch from 4bfaf6e to 2c9e91c Compare April 17, 2024 15:30

bsach64 force-pushed the distinguish-crash-fail branch from 1400985 to 2590899 Compare April 24, 2024 10:18

Snorch reviewed Apr 26, 2024

View reviewed changes

test/jenkins/criu-fault.sh Outdated Show resolved Hide resolved

bsach64 force-pushed the distinguish-crash-fail branch from 2590899 to c787b51 Compare April 26, 2024 07:55

Snorch approved these changes Apr 26, 2024

View reviewed changes

test/dump-crash: check code path when dump crashes

d471052

Signed-off-by: Bhavik Sachdev <[email protected]>

bsach64 force-pushed the distinguish-crash-fail branch from c787b51 to d471052 Compare April 26, 2024 09:15

github-actions bot added the stale-pr label May 27, 2024

adrianreber removed the stale-pr label Jun 5, 2024

adrianreber added the no-auto-close Don't auto-close as a stale issue label Jun 5, 2024

avagin merged commit ced120a into checkpoint-restore:criu-dev Jun 8, 2024
36 of 39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zdtm: Distinguish between fail and crash of dump #2376

zdtm: Distinguish between fail and crash of dump #2376

bsach64 commented Mar 28, 2024

codecov-commenter commented Mar 28, 2024 •

edited

Loading

Snorch commented Apr 16, 2024

bsach64 commented Apr 19, 2024

Snorch commented Apr 22, 2024

bsach64 commented Apr 24, 2024

Snorch commented Apr 26, 2024

Snorch left a comment

bsach64 commented Apr 26, 2024

Snorch commented Apr 26, 2024 •

edited

Loading

github-actions bot commented May 27, 2024

bsach64 commented Jun 5, 2024

adrianreber commented Jun 5, 2024

zdtm: Distinguish between fail and crash of dump #2376

zdtm: Distinguish between fail and crash of dump #2376

Conversation

bsach64 commented Mar 28, 2024

codecov-commenter commented Mar 28, 2024 • edited Loading

Codecov Report

Snorch commented Apr 16, 2024

bsach64 commented Apr 19, 2024

Snorch commented Apr 22, 2024

bsach64 commented Apr 24, 2024

Snorch commented Apr 26, 2024

Snorch left a comment

Choose a reason for hiding this comment

bsach64 commented Apr 26, 2024

Snorch commented Apr 26, 2024 • edited Loading

github-actions bot commented May 27, 2024

bsach64 commented Jun 5, 2024

adrianreber commented Jun 5, 2024

codecov-commenter commented Mar 28, 2024 •

edited

Loading

Snorch commented Apr 26, 2024 •

edited

Loading