
fio crashes with "failed to unlock overlap check mutex, err: 0:success" #1807

Open
NickiXsight opened this issue Aug 21, 2024 · 2 comments

@NickiXsight

Description of the bug:
When running a large fio workload that involves multiple jobs with verify, we run with serialize_overlap=1; fio then aborts at some point with the error in the title.

After looking into the code, I see four problems:

  1. All mutex operation errors report errno instead of the pthread_mutex_* return code, which is wrong: the pthread_mutex functions do not set errno (at least on Debian x86_64), they return the error code directly. That is why the message shows "0:success".
  2. The same error message is reported from several functions, which makes it hard to identify which one actually fired.
  3. After adding a macro wrapper that logs __func__, __FILE__, and __LINE__ (see the sketch after this list), I found that td_io_queue in ioengines.c emits the message. Indeed, it unlocks the overlap-check mutex BEFORE the request is finally enqueued, so the engine can still return FIO_Q_BUSY, in which case io_workqueue_fn in rate-submit.c submits the request again even though the lock has already been released.
  4. The issue is not only the lock: when FIO_Q_BUSY is returned, the io_u is cleared from its td, so another worker can allocate the same LBA range. check_overlap should therefore be executed again before td_io_queue is called a second time.
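For illustration, a minimal sketch of the kind of wrapper described in points 1 and 3. The macro name and the use of plain fprintf are mine, not fio's actual logging API; the point is that it reports the pthread return code (not errno) and tags each message with its call site. The error-checking mutex in main() makes the failure reproducible, since unlocking a mutex you don't hold then reliably returns EPERM as the return value while errno stays 0:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper (not fio's logging API): pthread_mutex_unlock()
 * returns the error code directly and does not set errno, so report the
 * return value, tagged with the call site. */
#define MUTEX_UNLOCK_CHECKED(mtx) do { \
        int ret_ = pthread_mutex_unlock(mtx); \
        if (ret_) \
                fprintf(stderr, "%s:%d (%s): failed to unlock mutex: %d:%s\n", \
                        __FILE__, __LINE__, __func__, ret_, strerror(ret_)); \
} while (0)

int main(void)
{
        pthread_mutexattr_t attr;
        pthread_mutex_t mtx;

        /* Error-checking mutex: unlocking it without holding it returns
         * EPERM via the return value, while errno remains untouched. */
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&mtx, &attr);

        MUTEX_UNLOCK_CHECKED(&mtx); /* prints e.g. "...: 1:Operation not permitted" */
        return 0;
}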

Environment: Debian x86_64

fio version: fio-3.37-86-g7bc1

Reproduction steps
[write-and-verify]
rw=randwrite
bs=4k
direct=1
ioengine=libaio
iodepth=128
verify=crc32c
verify_backlog=100000
verify_dump=1
verify_fatal=1
verify_async=4
serialize_overlap=1
io_submit_mode=offload
blocksize_range=4k-8k
runtime=6000
size=512m
numjobs=10
filename=/dev/nvme0n8:/dev/nvme0n7:/dev/nvme0n6:/dev/nvme0n5:/dev/nvme0n4:/dev/nvme0n3:/dev/nvme0n2:/dev/nvme0n1

@axboe (Owner) commented Aug 21, 2024

Since you already did the full analysis, care to send a fix for this?

@NickiXsight (Author) commented Aug 21, 2024

Items 1 and 2 are really simple; I can create a PR tomorrow. But 3 and 4 require more attention: just fixing the lock/unlock scheme is not enough. I have to create a good test that proves an overlap conflict really can happen during the requeue, and then, if I push check_overlap into the requeue loop (see the sketch below), measure the performance impact of all this.
I'll open a PR soon and we can discuss it.
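For concreteness, a rough sketch of what re-running the overlap check in the requeue loop could look like. The names follow the issue text (io_workqueue_fn, td_io_queue, check_overlap); this is a sketch of the idea, not fio's actual code:

/* Inside rate-submit.c:io_workqueue_fn(), assuming serialize_overlap=1.
 * td_io_queue() can release the overlap-check mutex before the engine
 * actually accepts the request, so a FIO_Q_BUSY requeue must not assume
 * the LBA range is still protected. */
ret = td_io_queue(td, io_u);
while (ret == FIO_Q_BUSY) {
        /* The io_u was dropped from the td's in-flight tracking when
         * BUSY was returned, so another worker may have claimed the
         * same LBA range in the meantime. Re-acquire the overlap
         * protection before submitting again. */
        check_overlap(io_u);
        ret = td_io_queue(td, io_u);
}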
