
Commit 00e48f1

dgchinner authored and guojinhui-liam committed
xfs, iomap: limit individual ioend chain lengths in writeback
commit ebb7fb1 upstream.

Trond Myklebust reported soft lockups in XFS IO completion such as this:

  watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
  CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
  Workqueue: xfs-conv/md127 xfs_end_io [xfs]
  RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
  Call Trace:
   wake_up_page_bit+0x8a/0x110
   iomap_finish_ioend+0xd7/0x1c0
   iomap_finish_ioends+0x7f/0xb0
   xfs_end_ioend+0x6b/0x100 [xfs]
   xfs_end_io+0xb9/0xe0 [xfs]
   process_one_work+0x1a7/0x360
   worker_thread+0x1fa/0x390
   kthread+0x116/0x130
   ret_from_fork+0x35/0x40

Ioends are processed as an atomic completion unit when all the chained bios in the ioend have completed their IO. Logically contiguous ioends can also be merged and completed as a single, larger unit. Both of these things can be problematic because the bio chains per ioend and the size of the merged ioends processed as a single completion are unbound.

If we have a large sequential dirty region in the page cache, write_cache_pages() will keep feeding us sequential pages and we will keep mapping them into ioends and bios until we get a dirty page at a non-sequential file offset. These large sequential runs will result in bio and ioend chaining to optimise the IO patterns. The pages under writeback are pinned within these chains until the submission chaining is broken, allowing the entire chain to be completed. This can result in huge chains being processed in IO completion context.

We get deep bio chaining if we have large contiguous physical extents: we keep adding pages to the current bio until it is full, then chain a new bio to keep adding pages for writeback. Hence we can build bio chains that map millions of pages and tens of gigabytes of RAM if the page cache contains big enough contiguous dirty file regions. This long bio chain pins those pages until the final bio in the chain completes and the ioend can iterate all the chained bios and complete them.

OTOH, if we have a physically fragmented file, we end up submitting one ioend per physical fragment, each with a small bio or bio chain attached to it. We do not chain these at IO submission time; instead we chain them at completion time based on file offset via iomap_ioend_try_merge(). Hence we can end up with unbound ioend chains being built via completion merging.

XFS can then do COW remapping or unwritten extent conversion on that merged chain, which involves walking an extent fragment at a time and running a transaction to modify the physical extent information. IOWs, we merge all the discontiguous ioends together into a contiguous file range, only to then process them individually as discontiguous extents. This extent manipulation is computationally expensive and can run in a tight loop, so merging logically contiguous but physically discontiguous ioends gains us nothing except hiding the fact that we broke the ioends up into individual physical extents at submission and then need to loop over those individual physical extents at completion.

Hence we need mechanisms to limit ioend sizes and to break up completion processing of large merged ioend chains:

1. Bio chains per ioend need to be bound in length. Pure overwrites go straight to iomap_finish_ioend() in softirq context with the exact bio chain attached to the ioend by submission. Hence the only way to prevent long holdoffs here is to bound ioend submission sizes, because we can't reschedule in softirq context.

2. iomap_finish_ioends() has to handle unbound merged ioend chains correctly. This relies on any one call to iomap_finish_ioend() being bound in runtime so that cond_resched() can be issued regularly as the long ioend chain is processed. i.e. this relies on mechanism #1 limiting individual ioend sizes in order to work correctly.

3. Filesystems have to loop over the merged ioends to process physical extent manipulations. This means they can loop internally, so we break merging at physical extent boundaries to let the filesystem easily insert reschedule points between individual extent manipulations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Li Lei <lilei.777@bytedance.com>
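A minimal userspace C sketch of the batching pattern in mechanism #2, for illustration only: struct ioend, finish_one() and finish_chain() here are hypothetical stand-ins, sched_yield() approximates the kernel's cond_resched(), and BATCH_PAGES mirrors the patch's IOEND_BATCH_SIZE * 8 threshold.

#include <sched.h>

#define BATCH_PAGES	(4096 * 8)	/* IOEND_BATCH_SIZE * 8 in the patch */

struct ioend {				/* hypothetical stand-in */
	struct ioend *next;		/* completion chain link */
	unsigned int pages;		/* pages attached to this ioend */
};

/* Complete one ioend; bounded work because each ioend is size-capped. */
static unsigned int finish_one(struct ioend *io)
{
	/* ...end page writeback on each attached page... */
	return io->pages;
}

/*
 * Walk an unbounded chain of merged ioends, yielding the CPU whenever
 * enough pages have completed since the last yield. Each finish_one()
 * call is bounded (mechanism #1), so yields occur regularly (mechanism #2).
 */
static void finish_chain(struct ioend *chain)
{
	unsigned int completed = 0;

	for (struct ioend *io = chain; io; io = io->next) {
		if (completed > BATCH_PAGES) {
			sched_yield();	/* cond_resched() in the kernel */
			completed = 0;
		}
		completed += finish_one(io);
	}
}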
1 parent 5be345e commit 00e48f1

File tree: 3 files changed, +65 -5 lines


fs/iomap/buffered-io.c

Lines changed: 48 additions & 4 deletions
@@ -21,6 +21,8 @@
 
 #include "../internal.h"
 
+#define IOEND_BATCH_SIZE	4096
+
 /*
  * Structure allocated for each page or THP when block size < page size
  * to track sub-page uptodate status and I/O completions.
@@ -1030,7 +1032,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
  * state, release holds on bios, and finally free up memory. Do not use the
  * ioend after this.
  */
-static void
+static u32
 iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 {
 	struct inode *inode = ioend->io_inode;
@@ -1039,6 +1041,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 	u64 start = bio->bi_iter.bi_sector;
 	loff_t offset = ioend->io_offset;
 	bool quiet = bio_flagged(bio, BIO_QUIET);
+	u32 page_count = 0;
 
 	for (bio = &ioend->io_inline_bio; bio; bio = next) {
 		struct bio_vec *bv;
@@ -1054,9 +1057,11 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 		next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bv, bio, iter_all)
+		bio_for_each_segment_all(bv, bio, iter_all) {
 			iomap_finish_page_writeback(inode, bv->bv_page, error,
 					bv->bv_len);
+			page_count++;
+		}
 		bio_put(bio);
 	}
 	/* The ioend has been freed by bio_put() */
@@ -1066,20 +1071,36 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			"%s: writeback error on inode %lu, offset %lld, sector %llu",
 			inode->i_sb->s_id, inode->i_ino, offset, start);
 	}
+	return page_count;
 }
 
+/*
+ * Ioend completion routine for merged bios. This can only be called from task
+ * contexts as merged ioends can be of unbound length. Hence we have to break up
+ * the writeback completions into manageable chunks to avoid long scheduler
+ * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
+ * good batch processing throughput without creating adverse scheduler latency
+ * conditions.
+ */
 void
 iomap_finish_ioends(struct iomap_ioend *ioend, int error)
 {
 	struct list_head tmp;
+	u32 completions;
+
+	might_sleep();
 
 	list_replace_init(&ioend->io_list, &tmp);
-	iomap_finish_ioend(ioend, error);
+	completions = iomap_finish_ioend(ioend, error);
 
 	while (!list_empty(&tmp)) {
+		if (completions > IOEND_BATCH_SIZE * 8) {
+			cond_resched();
+			completions = 0;
+		}
 		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
 		list_del_init(&ioend->io_list);
-		iomap_finish_ioend(ioend, error);
+		completions += iomap_finish_ioend(ioend, error);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_finish_ioends);
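For a sense of scale, assuming 4kB pages (an assumption, not stated in the patch), the per-ioend cap and the reschedule threshold above bound the work done between cond_resched() calls roughly as follows:

/* Illustrative bounds, assuming 4kB pages; names invented for the example. */
#define ASSUMED_PAGE_SIZE	4096UL
#define IOEND_BATCH_SIZE	4096UL

/* Max data per individual ioend (cap applied at submission, further below): */
#define IOEND_MAX_BYTES		(IOEND_BATCH_SIZE * ASSUMED_PAGE_SIZE)	/* 16 MiB */

/* Max pages completed between cond_resched() calls in the loop above: */
#define RESCHED_PAGES		(IOEND_BATCH_SIZE * 8)			/* 32768 pages */
#define RESCHED_MAX_BYTES	(RESCHED_PAGES * ASSUMED_PAGE_SIZE)	/* 128 MiB */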
@@ -1100,6 +1121,18 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
 		return false;
 	if (ioend->io_offset + ioend->io_size != next->io_offset)
 		return false;
+	/*
+	 * Do not merge physically discontiguous ioends. The filesystem
+	 * completion functions will have to iterate the physical
+	 * discontiguities even if we merge the ioends at a logical level, so
+	 * we don't gain anything by merging physical discontiguities here.
+	 *
+	 * We cannot use bio->bi_iter.bi_sector here as it is modified during
+	 * submission so does not point to the start sector of the bio at
+	 * completion.
+	 */
+	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
+		return false;
 	return true;
 }
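The new contiguity check compares 512-byte sector addresses: io_size is in bytes, so io_size >> 9 converts it to sectors. A worked example with hypothetical values (the function and its numbers are invented for illustration):

/* Worked example of the check above; sector_t counts 512-byte units. */
static unsigned long long merge_target(void)
{
	unsigned long long io_sector = 1000;	/* start sector of this ioend */
	unsigned long io_size = 1048576;	/* 1 MiB written by this ioend */

	/* 1048576 >> 9 == 2048 sectors, so the ioend ends at sector 3048;
	 * a logically adjacent ioend merges only if it starts exactly there.
	 */
	return io_sector + (io_size >> 9);	/* 3048 */
}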

@@ -1201,8 +1234,10 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
 	ioend->io_flags = wpc->iomap.flags;
 	ioend->io_inode = inode;
 	ioend->io_size = 0;
+	ioend->io_pages = 0;
 	ioend->io_offset = offset;
 	ioend->io_bio = bio;
+	ioend->io_sector = sector;
 	return ioend;
 }

@@ -1243,6 +1278,13 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
 		return false;
 	if (sector != bio_end_sector(wpc->ioend->io_bio))
 		return false;
+	/*
+	 * Limit ioend bio chain lengths to minimise IO completion latency. This
+	 * also prevents long tight loops ending page writeback on all the
+	 * folios in the ioend.
+	 */
+	if (wpc->ioend->io_pages >= IOEND_BATCH_SIZE)
+		return false;
 	return true;
 }

@@ -1328,6 +1370,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 				&submit_list);
 		count++;
 	}
+	if (count)
+		wpc->ioend->io_pages++;
 
 	WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
 	WARN_ON_ONCE(!PageLocked(page));

fs/xfs/xfs_aops.c

Lines changed: 15 additions & 1 deletion
@@ -136,7 +136,20 @@ xfs_end_ioend(
 	memalloc_nofs_restore(nofs_flag);
 }
 
-/* Finish all pending io completions. */
+/*
+ * Finish all pending IO completions that require transactional modifications.
+ *
+ * We try to merge physical and logically contiguous ioends before completion to
+ * minimise the number of transactions we need to perform during IO completion.
+ * Both unwritten extent conversion and COW remapping need to iterate and modify
+ * one physical extent at a time, so we gain nothing by merging physically
+ * discontiguous extents here.
+ *
+ * The ioend chain length that we can be processing here is largely unbound in
+ * length and we may have to perform significant amounts of work on each ioend
+ * to complete it. Hence we have to be careful about holding the CPU for too
+ * long in this loop.
+ */
 void
 xfs_end_io(
 	struct work_struct *work)
@@ -157,6 +170,7 @@ xfs_end_io(
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp);
 		xfs_end_ioend(ioend);
+		cond_resched();
 	}
 }
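Mechanism #3 in context: since merging now stops at physical extent boundaries, each xfs_end_ioend() call covers a single physical extent, and the worker can reschedule between extents. The resulting loop shape, consolidated from the hunk above with annotations (simplified; locking and the surrounding function omitted):

	while (!list_empty(&tmp)) {
		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
		list_del_init(&ioend->io_list);
		iomap_ioend_try_merge(ioend, &tmp);	/* contiguous extents only */
		xfs_end_ioend(ioend);			/* transaction(s) for one extent */
		cond_resched();				/* bound CPU holdoff per extent */
	}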

include/linux/iomap.h

Lines changed: 2 additions & 0 deletions
@@ -257,9 +257,11 @@ struct iomap_ioend {
 	struct list_head	io_list;	/* next ioend in chain */
 	u16			io_type;
 	u16			io_flags;	/* IOMAP_F_* */
+	u32			io_pages;	/* folios added to ioend */
 	struct inode		*io_inode;	/* file being written to */
 	size_t			io_size;	/* size of the extent */
 	loff_t			io_offset;	/* offset in the file */
+	sector_t		io_sector;	/* start sector of ioend */
 	struct bio		*io_bio;	/* bio being built */
 	struct bio		io_inline_bio;	/* MUST BE LAST! */
 };
