btrfs: avoid deadlock when reading a partial uptodate folio

[BUG]
This deadlock is only possible after the out-of-tree patch "btrfs: allow
buffered write to skip full page if it's sector aligned".
For now it is impossible to hit the deadlock; the reason is explained in
the [CAUSE] section.

If the sector size is smaller than the page size, and we allow btrfs to
avoid reading the full page because the buffered write range is sector
aligned, we can hit a hang with generic/095 runs:

  __switch_to+0xf8/0x168
  __schedule+0x328/0x8a8
  schedule+0x54/0x140
  io_schedule+0x44/0x68
  folio_wait_bit_common+0x198/0x3f8
  __folio_lock+0x24/0x40
  extent_write_cache_pages+0x2e0/0x4c0 [btrfs]
  btrfs_writepages+0x94/0x158 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  __filemap_fdatawrite_range+0x6c/0xa8
  filemap_fdatawrite_range+0x1c/0x30
  btrfs_start_ordered_extent+0x264/0x2e0 [btrfs]
  btrfs_lock_and_flush_ordered_range+0x8c/0x160 [btrfs]
  __get_extent_map+0xa0/0x220 [btrfs]
  btrfs_do_readpage+0x1bc/0x5d8 [btrfs]
  btrfs_read_folio+0x50/0xa0 [btrfs]
  filemap_read_folio+0x54/0x110
  filemap_update_page+0x2e0/0x3b8
  filemap_get_pages+0x228/0x4d8
  filemap_read+0x11c/0x3b8
  btrfs_file_read_iter+0x74/0x90 [btrfs]
  new_sync_read+0xd0/0x1d0
  vfs_read+0x1a0/0x1f0

There is also a minimal fio reproducer, extracted from that test case,
that triggers the deadlock:

  [global]
  bs=8k
  iodepth=1
  randrepeat=1
  size=256k
  directory=$mnt
  numjobs=1

  [job1]
  ioengine=sync
  bs=512
  direct=1
  rw=randread
  filename=file1

  [job2]
  ioengine=libaio
  rw=randwrite
  direct=1
  filename=file1

  [job3]
  ioengine=posixaio
  rw=randwrite
  filename=file1

[CAUSE]
The above call trace shows that, during the folio read, a writeback is
triggered on the same folio.

Since the folio is locked during btrfs_do_readpage(), the writeback will
never be able to lock the folio. The read is therefore waiting on itself,
which causes the deadlock.

The root cause is a little complex. The system uses 64K pages with a 4K
sector size:

1) The folio has its range [48K, 64K) marked dirty by buffered write

    0          16K         32K         48K         64K
    |                                  |///////////|
                                            \- sector Uptodate|Dirty

2) Writeback finished for [48K, 64K), but ordered extent not yet finished

    0          16K         32K         48K         64K
    |                                  |///////////|
                                            \- sector Uptodate
                                               extent map PINNED
                                               OE still here

3) The folio is released from page cache

   This can be triggered by direct IO through the following call chain:

    iomap_dio_rw()
    \- kiocb_invalidate_pages()
       \- filemap_invalidate_pages()
          \- invalidate_inode_pages2_range()
             \- invalidate_complete_folio2()
                \- filemap_release_folio()
                   \- btrfs_release_folio()
                      \- __btrfs_release_folio()
                         \- try_release_extent_mapping()

   Since there is no extent state with the EXTENT_LOCKED flag in the folio
   range, btrfs allows the folio to be released.

   Now there is no folio->private to record which blocks are uptodate,
   but the extent map and OE are still here.

    0          16K         32K         48K         64K
    |                                  |///////////|
                                            \- extent map PINNED
                                               OE still here

4) Buffered write dirtied range [0, 16K)

   Since it's sector aligned, btrfs didn't read the full folio from disk.

    0          16K         32K         48K         64K
    |//////////|                       |///////////|
       \- sector Uptodate|Dirty             \- extent map PINNED
                                               OE still here

5) Read on the folio is triggered

   For the range [0, 16K), since it's already uptodate, btrfs skips this
   range.

   For the range [16K, 48K), btrfs submits the read from disk.

   The problem comes with the range [48K, 64K), where the following call
   chain happens:

    btrfs_do_readpage()
    \- __get_extent_map()
       \- btrfs_lock_and_flush_ordered_range()
          \- btrfs_start_ordered_extent()
             \- filemap_fdatawrite_range()
                \- btrfs_writepages()
                   \- extent_write_cache_pages()
                      \- folio_lock()

   Since the folio indeed has dirty sectors in range [0, 16K), the range
   will be written back.

   But the folio is already locked by the folio read, so the writeback
   will never be able to lock the folio, leading to the deadlock.
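
The five steps above can be replayed in a small userspace model. This is
only a sketch under stated assumptions: struct folio_model and all the
helpers below are hypothetical names invented for illustration, not btrfs
code, and the model only tracks per-block Uptodate/Dirty bits plus one
unfinished ordered extent range. It assumes the caller of
read_would_deadlock() holds the folio lock, as btrfs_do_readpage() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FOLIO_SIZE  (64u * 1024)
#define SECTOR_SIZE (4u * 1024)

/* Hypothetical stand-in for the per-folio block state (not btrfs code). */
struct folio_model {
	uint16_t uptodate;          /* bit i set: block i is uptodate */
	uint16_t dirty;             /* bit i set: block i is dirty */
	uint32_t oe_start, oe_end;  /* unfinished OE, half-open byte range */
};

/* Bitmask of the 4K blocks covering the sector aligned range [start, end). */
static uint16_t range_mask(uint32_t start, uint32_t end)
{
	assert(start % SECTOR_SIZE == 0 && end % SECTOR_SIZE == 0);
	assert(start < end && end <= FOLIO_SIZE);
	uint32_t first = start / SECTOR_SIZE;
	uint32_t nr = (end - start) / SECTOR_SIZE;
	return (uint16_t)(((1u << nr) - 1) << first);
}

/* Steps 1 and 4: sector aligned buffered write that skips the folio read. */
static void buffered_write(struct folio_model *f, uint32_t start, uint32_t end)
{
	uint16_t m = range_mask(start, end);
	f->uptodate |= m;
	f->dirty |= m;
}

/* Step 2: writeback finished for the range, but the OE is still pending. */
static void finish_writeback(struct folio_model *f, uint32_t start, uint32_t end)
{
	f->dirty &= (uint16_t)~range_mask(start, end);
	f->oe_start = start;
	f->oe_end = end;
}

/* Step 3: folio released, the per-block state is lost, the OE stays. */
static void release_folio(struct folio_model *f)
{
	f->uptodate = 0;
	f->dirty = 0;
}

/*
 * Step 5: the read must flush the OE range if any block it still has to
 * read overlaps the OE. That flush needs the folio lock whenever the
 * folio has dirty blocks, and the reader already holds that lock.
 */
static bool read_would_deadlock(const struct folio_model *f)
{
	uint16_t need_read = (uint16_t)~f->uptodate;
	uint16_t oe = (f->oe_start < f->oe_end) ?
		      range_mask(f->oe_start, f->oe_end) : 0;
	bool must_flush_oe = (need_read & oe) != 0;

	return must_flush_oe && f->dirty != 0;
}
```

In this model the read only becomes a self-deadlock after step 4: before
the folio is released, the OE blocks are still uptodate and are skipped,
and right after the release there is nothing dirty for writeback to lock.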

This sequence can only happen if all the following conditions are met:

- The sector size is smaller than the page size
  Otherwise we won't have mixed dirty blocks in the same folio we're
  reading.

- We allow the buffered write to skip the folio read if it's sector
  aligned
  This is done by the incoming patch
  "btrfs: allow buffered write to skip full page if it's sector aligned".

  The ultimate goal of that patch is to reduce unnecessary reads for
  sector size < page size cases, and to pass generic/563.

  Otherwise the folio will be read from disk during the buffered write,
  before it is marked dirty, and the deadlock will not be triggered.

[FIX]
Break step 5) of the above case by passing an optional @locked_folio
into btrfs_start_ordered_extent() and
btrfs_lock_and_flush_ordered_range().

If we get such a locked folio, skip the writeback for ranges of that
folio.

Here we also do extra asserts to make sure the target range is already
not dirty, or the ordered extent we wait for will never be able to
finish, since part of the ordered extent is never submitted.

So far only the call site inside __get_extent_map() is passing the new
parameter.

Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
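
The idea in [FIX] of skipping writeback for the locked folio can be
sketched as plain range arithmetic. This is a hypothetical userspace
illustration, not the code from this patch: struct byte_range and
writeback_ranges_skipping_folio() are invented names, and the extra
"range is not dirty" asserts mentioned above are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, end). Hypothetical helper type. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Compute which parts of the ordered extent range @oe still need
 * writeback when the caller already holds the lock of @locked_folio.
 *
 * With no locked folio the whole range is written back. Otherwise only
 * the parts before and after the locked folio are returned, so that
 * writeback never tries to lock the folio the caller is holding.
 *
 * Returns the number of ranges stored in @out (0, 1 or 2).
 */
static int writeback_ranges_skipping_folio(struct byte_range oe,
					   const struct byte_range *locked_folio,
					   struct byte_range out[2])
{
	int nr = 0;

	assert(oe.start < oe.end);
	if (!locked_folio) {
		out[nr++] = oe;
		return nr;
	}
	assert(locked_folio->start < locked_folio->end);

	/* Part of the OE in front of the locked folio. */
	if (oe.start < locked_folio->start) {
		out[nr].start = oe.start;
		out[nr].end = min_u64(oe.end, locked_folio->start);
		nr++;
	}
	/* Part of the OE behind the locked folio. */
	if (oe.end > locked_folio->end) {
		out[nr].start = max_u64(oe.start, locked_folio->end);
		out[nr].end = oe.end;
		nr++;
	}
	return nr;
}
```

For the scenario in [CAUSE], the ordered extent [48K, 64K) lies entirely
inside the locked folio [0, 64K), so the sketch returns no ranges and no
writeback is started on the folio being read.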