cpuset/isolation: Honour kthreads preferred affinity #359
base: linus-master_base
Upstream branch: e9a6fb0
f699346 to 83d3e2f
Upstream branch: 6da43bb
ead9b69 to 7fcd861
83d3e2f to 00d5e5c
Upstream branch: f824272
7fcd861 to a8ee8ab
00d5e5c to d782508
Upstream branch: f824272
a8ee8ab to ee17bc1
d782508 to 6099a4d
Upstream branch: e7c375b
ee17bc1 to 94aed2b
6099a4d to 5121c4d
Upstream branch: e7c375b
94aed2b to a1bcf0b
5121c4d to 4458758
Upstream branch: 8b69055
a1bcf0b to aaef43f
4458758 to 6f43942
HK_TYPE_DOMAIN will soon integrate cpuset isolated partitions and therefore be made modifiable at runtime. Synchronize against the cpumask update using RCU. The RCU locked section includes both the housekeeping CPU target election for the PCI probe work and the work enqueue. This way the housekeeping update side will simply need to flush the pending related works after updating the housekeeping mask in order to make sure that no PCI work ever executes on an isolated CPU. This part will be handled in a subsequent patch. Signed-off-by: Frederic Weisbecker <[email protected]>
1) Commit 2b8272f ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug") was added to fix an issue where the hotplug control task (BP) was throttled between CPUHP_AP_IDLE_DEAD and CPUHP_HRTIMERS_PREPARE, waiting in the hrtimer blindspot for the bandwidth callback queued on the dead CPU. 2) Later on, commit 38685e2 ("cpu/hotplug: Don't offline the last non-isolated CPU") hooked into the target selection of the workqueue-offloaded CPU down process to prevent destroying the last CPU domain. 3) Finally, commit 5c0930c ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") entirely removed the conditions for the race exposed and partially fixed in 1). Offloading the CPU down process to a workqueue on another CPU then becomes unnecessary, but the last CPU belonging to scheduler domains must still remain online. Therefore revert the now obsolete commit 2b8272f and move the housekeeping check under the write-held cpu_hotplug_lock. Since HK_TYPE_DOMAIN will include both isolcpus and cpuset isolated partitions, the hotplug lock will synchronize against concurrent cpuset partition updates. Signed-off-by: Frederic Weisbecker <[email protected]>
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at runtime. In order to synchronize against memcg workqueue to make sure that no asynchronous draining is pending or executing on a newly made isolated CPU, target and queue a drain work under the same RCU critical section. Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a memcg workqueue flush will also be issued in a further change to make sure that no work remains pending after a CPU has been made isolated. Signed-off-by: Frederic Weisbecker <[email protected]>
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at runtime. In order to synchronize against vmstat workqueue to make sure that no asynchronous vmstat work is pending or executing on a newly made isolated CPU, target and queue a vmstat work under the same RCU read side critical section. Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a vmstat workqueue flush will also be issued in a further change to make sure that no work remains pending after a CPU has been made isolated. Signed-off-by: Frederic Weisbecker <[email protected]>
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs but also cpuset isolated partitions. Housekeeping still needs a way to record what was initially passed to isolcpus= in order to keep these CPUs isolated after a cpuset isolated partition is modified or destroyed while containing some of them. Create a new HK_TYPE_DOMAIN_BOOT to keep track of those. Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Phil Auld <[email protected]>
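The relationship between the boot-time record and the runtime mask can be illustrated with a small userspace model. This is only a sketch under assumptions: CPUs are modelled as bits of a 64-bit word, and `struct hk_sketch`, `hk_domain_boot()`, `hk_domain()` and `cpuset_update_isolated()` are hypothetical names, not the kernel's cpumask or housekeeping API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not kernel code: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

struct hk_sketch {
    cpumask_sketch_t possible;        /* every CPU in the system */
    cpumask_sketch_t boot_isolated;   /* isolcpus= CPUs, fixed after boot */
    cpumask_sketch_t cpuset_isolated; /* cpuset isolated partitions, runtime */
};

/* Models the HK_TYPE_DOMAIN_BOOT housekeeping mask: all but isolcpus= CPUs. */
static cpumask_sketch_t hk_domain_boot(const struct hk_sketch *hk)
{
    return hk->possible & ~hk->boot_isolated;
}

/* Models the HK_TYPE_DOMAIN housekeeping mask: all but boot and cpuset isolated CPUs. */
static cpumask_sketch_t hk_domain(const struct hk_sketch *hk)
{
    return hk->possible & ~(hk->boot_isolated | hk->cpuset_isolated);
}

/* Replace the set of cpuset isolated CPUs; the boot record is never touched. */
static void cpuset_update_isolated(struct hk_sketch *hk, cpumask_sketch_t isolated)
{
    assert((isolated & ~hk->possible) == 0);
    hk->cpuset_isolated = isolated;
}
```

In this model, isolating CPUs 2-5 through cpuset on a system booted with CPUs 2-3 in isolcpus= and then destroying the partition leaves CPUs 2-3 isolated, because the boot record is kept separately from the runtime set.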
boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use the official version. Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Phil Auld <[email protected]> Reviewed-by: Chen Ridong <[email protected]>
…TYPE_DOMAIN_BOOT Make sure /sys/devices/system/cpu/isolated only prints what was passed through the isolcpus= parameter before HK_TYPE_DOMAIN will also integrate cpuset isolated partitions. Signed-off-by: Frederic Weisbecker <[email protected]>
The RPS cpumask can be overridden through sysfs/sysctl. The boot defined isolated CPUs are then excluded from that cpumask. However HK_TYPE_DOMAIN will soon integrate cpuset isolated CPU updates, and the RPS infrastructure needs more thought to be able to propagate such changes and synchronize against them. Keep handling only what was passed through "isolcpus=" for now. Signed-off-by: Frederic Weisbecker <[email protected]>
The block subsystem prevents running the workqueue to isolated CPUs, including those defined by cpuset isolated partitions. Since HK_TYPE_DOMAIN will soon contain both and be subject to runtime modifications, synchronize against housekeeping using the relevant lock. For full support of cpuset changes, the block subsystem may need to propagate changes to isolated cpumask through the workqueue in the future. Signed-off-by: Frederic Weisbecker <[email protected]>
cpuset modifies partitions, including isolated, while holding the cpu hotplug lock read-held. This means that write-holding the CPU hotplug lock is safe to synchronize against housekeeping cpumask changes. Provide a lockdep check to validate that. Signed-off-by: Frederic Weisbecker <[email protected]>
cpuset modifies partitions, including isolated, while holding the cpuset mutex. This means that holding the cpuset mutex is safe to synchronize against housekeeping cpumask changes. Provide a lockdep check to validate that. Signed-off-by: Frederic Weisbecker <[email protected]>
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset. A synchronization mechanism is then needed to synchronize the updates with the housekeeping cpumask readers. Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping cpumask will be modified, the update side will wait for an RCU grace period and propagate the change to interested subsystem when deemed necessary. Signed-off-by: Frederic Weisbecker <[email protected]>
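The reader/updater split described here can be approximated in userspace with C11 atomics. This is an analogy only: it shows the pointer publication step, not a grace period, and `hk_domain_ptr`, `hk_read()` and `hk_publish()` are hypothetical names. In the kernel the corresponding roles would presumably be played by the RCU primitives (`rcu_dereference()`, `rcu_assign_pointer()`, `synchronize_rcu()`), which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not kernel code: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/* Currently published housekeeping mask; NULL until the first publish. */
static _Atomic(const cpumask_sketch_t *) hk_domain_ptr;

/* Reader side: load the published pointer once and use that snapshot. */
static cpumask_sketch_t hk_read(void)
{
    const cpumask_sketch_t *mask =
        atomic_load_explicit(&hk_domain_ptr, memory_order_acquire);

    assert(mask != NULL);
    return *mask;
}

/*
 * Update side: publish a fully initialised new mask and return the old
 * pointer. A real updater must wait until no reader can still hold the
 * old pointer before reusing or freeing it; that wait is not modelled.
 */
static const cpumask_sketch_t *hk_publish(const cpumask_sketch_t *new_mask)
{
    return atomic_exchange_explicit(&hk_domain_ptr, new_mask,
                                    memory_order_acq_rel);
}
```

The point of the design is that readers never see a half-updated mask: they either observe the old object or the new one, and the updater decides when the old one may go away.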
Until now, HK_TYPE_DOMAIN only included boot defined isolated CPUs passed through the isolcpus= boot option. Users also interested in the runtime defined isolated CPUs set through cpuset must use different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc. There are several drawbacks to that approach: 1) Most interested subsystems want to know about all isolated CPUs, not just those defined at boot time. 2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with concurrent cpuset changes. 3) Further cpuset modifications are not propagated to subsystems. Solve 1) and 2) by centralizing all isolated CPUs within the HK_TYPE_DOMAIN housekeeping cpumask. Subsystems can rely on RCU to synchronize against concurrent changes. The propagation mentioned in 3) will be handled in further patches. Signed-off-by: Frederic Weisbecker <[email protected]>
…change The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against the memcg workqueue and make sure that no asynchronous draining is still pending or executing on a newly isolated CPU, the housekeeping subsystem must flush the memcg workqueues. However the memcg workqueues can't be flushed easily since they are queued to the main per-CPU workqueue pool. Solve this by creating a memcg specific pool, and provide and use the appropriate flushing API. Acked-by: Shakeel Butt <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
… change The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against the vmstat workqueue and make sure that no asynchronous vmstat work is still pending or executing on a newly isolated CPU, the housekeeping subsystem must flush the vmstat workqueues. This involves flushing the whole mm_percpu_wq workqueue, which is shared with LRU drain, a welcome side effect. Signed-off-by: Frederic Weisbecker <[email protected]>
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against PCI probe works and make sure that no asynchronous probing is still pending or executing on a newly isolated CPU, the housekeeping subsystem must flush the PCI probe works. However the PCI probe works can't be flushed easily since they are queued to the main per-CPU workqueue pool. Solve this with creating a PCI probe-specific pool and provide and use the appropriate flushing API. Signed-off-by: Frederic Weisbecker <[email protected]>
…eeping Until now, cpuset would propagate isolated partition changes to workqueues so that unbound workers get properly reaffined. Since housekeeping now centralizes, synchronizes and propagates isolation cpumask changes, perform the work from that subsystem for consolidation and consistency purposes. For simplification purposes, the target function is adapted to take the new housekeeping mask instead of the isolated mask. Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
The set of cpuset isolated CPUs is now included in HK_TYPE_DOMAIN housekeeping cpumask. There is no usecase left interested in just checking what is isolated by cpuset and not by the isolcpus= kernel boot parameter. Signed-off-by: Frederic Weisbecker <[email protected]>
It doesn't make sense to use nohz_full without also isolating the related CPUs from the domain topology, either through the use of isolcpus= or cpuset isolated partitions. And now HK_TYPE_DOMAIN includes all kinds of domain isolated CPUs. This means that HK_TYPE_KERNEL_NOISE (of which HK_TYPE_TICK is only an alias) should always be a subset of HK_TYPE_DOMAIN. Therefore if a CPU is not HK_TYPE_DOMAIN, it shouldn't be HK_TYPE_KERNEL_NOISE either. Testing the former is then enough. Simplify cpu_is_isolated() accordingly. Signed-off-by: Frederic Weisbecker <[email protected]>
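The simplification argument can be checked on a toy model. The sketch below is written in terms of isolated CPU sets (not housekeeping masks) and assumes one bit per CPU in a 64-bit word; `isolation_config_is_sane()`, `cpu_is_isolated_full()` and `cpu_is_isolated_simplified()` are hypothetical helpers, not the kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not kernel code: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/* A configuration is sane when every nohz_full CPU is also domain isolated. */
static bool isolation_config_is_sane(cpumask_sketch_t nohz_full,
                                     cpumask_sketch_t domain_isolated)
{
    return (nohz_full & ~domain_isolated) == 0;
}

/* Two-sided test: the CPU is isolated if it belongs to either set. */
static bool cpu_is_isolated_full(unsigned int cpu, cpumask_sketch_t nohz_full,
                                 cpumask_sketch_t domain_isolated)
{
    assert(cpu < 64);
    return ((nohz_full | domain_isolated) >> cpu) & 1;
}

/* Simplified test: under a sane configuration, domain isolation alone decides. */
static bool cpu_is_isolated_simplified(unsigned int cpu,
                                       cpumask_sketch_t domain_isolated)
{
    assert(cpu < 64);
    return (domain_isolated >> cpu) & 1;
}
```

Whenever `isolation_config_is_sane()` holds, the two tests agree for every CPU, which is the property the commit relies on to drop the second check.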
It doesn't make sense to use nohz_full without also isolating the related CPUs from the domain topology, either through the use of isolcpus= or cpuset isolated partitions. And now HK_TYPE_DOMAIN includes all kinds of domain isolated CPUs. This means that HK_TYPE_KERNEL_NOISE (of which HK_TYPE_WQ is only an alias) should always be a subset of HK_TYPE_DOMAIN. Therefore sane configurations verify: HK_TYPE_KERNEL_NOISE | HK_TYPE_DOMAIN == HK_TYPE_DOMAIN Simplify the PCI probe target election accordingly. Signed-off-by: Frederic Weisbecker <[email protected]>
The kthreads preferred affinity related fields use "hotplug" as the base of their naming because the affinity management was initially deemed to deal with CPU hotplug. The scope of this role is going to broaden now and also deal with cpuset isolated partition updates. Switch the naming accordingly. Signed-off-by: Frederic Weisbecker <[email protected]>
The managed affinity list currently contains only unbound kthreads that have affinity preferences. Unbound kthreads globally affine by default are outside of the list because their affinity is automatically managed by the scheduler (through the fallback housekeeping mask) and by cpuset. However in order to preserve the preferred affinity of kthreads, cpuset will delegate the isolated partition update propagation to the housekeeping and kthread code. Prepare for that with including all unbound kthreads in the managed affinity list. Signed-off-by: Frederic Weisbecker <[email protected]>
The unbound kthreads affinity management performed by cpuset is going to be imported to the kthread core code for consolidation purposes. Treat kthreadd just like any other kthread. Signed-off-by: Frederic Weisbecker <[email protected]>
Unbound kthreads want to run neither on nohz_full CPUs nor on domain isolated CPUs. And since nohz_full implies domain isolation, checking the latter is enough to verify both. Therefore exclude kthreads from domain isolation. Signed-off-by: Frederic Weisbecker <[email protected]>
Tasks that have all their allowed CPUs offline don't want their affinity to fallback on either nohz_full CPUs or on domain isolated CPUs. And since nohz_full implies domain isolation, checking the latter is enough to verify both. Therefore exclude domain isolation from fallback task affinity. Signed-off-by: Frederic Weisbecker <[email protected]>
…eping Currently the user can set up isolated CPUs via cpuset and nohz_full in such a way that leaves no housekeeping CPU (i.e. no CPU that is neither domain isolated nor nohz_full). This can be a problem for other subsystems (e.g. the timer wheel migration). Prevent this configuration by blocking any assignment that would cause the union of domain isolated CPUs and nohz_full CPUs to cover all CPUs. Acked-by: Frederic Weisbecker <[email protected]> Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Gabriele Monaco <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
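The rejected configuration reduces to a one-line mask test. A minimal sketch, assuming one bit per CPU in a 64-bit word; `isolation_update_allowed()` is a hypothetical helper and does not reflect how cpuset actually implements the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not kernel code: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/*
 * Accept a new set of domain isolated CPUs only if at least one CPU
 * remains that is neither domain isolated nor nohz_full.
 */
static bool isolation_update_allowed(cpumask_sketch_t possible,
                                     cpumask_sketch_t nohz_full,
                                     cpumask_sketch_t new_domain_isolated)
{
    cpumask_sketch_t housekeeping = possible & ~(nohz_full | new_domain_isolated);

    return housekeeping != 0;
}
```

On a 4-CPU system with CPUs 0-1 in nohz_full, isolating CPUs 2-3 would be refused in this model, while isolating only CPU 2 would be accepted because CPU 3 stays available for housekeeping.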
When none of the allowed CPUs of a task are online, it gets migrated to the fallback cpumask, which is all the non nohz_full CPUs. However just like nohz_full CPUs, domain isolated CPUs don't want to be disturbed by tasks that have lost their CPU affinities. And since nohz_full relies on domain isolation to work correctly, the housekeeping mask of domain isolated CPUs should always be a superset of the housekeeping mask of nohz_full CPUs (there can be CPUs that are domain isolated but not nohz_full, OTOH there shouldn't be nohz_full CPUs that are not domain isolated): HK_TYPE_DOMAIN | HK_TYPE_KERNEL_NOISE == HK_TYPE_DOMAIN Therefore use HK_TYPE_DOMAIN as the appropriate fallback target for tasks, and since this cpumask can be modified at runtime, make sure that the CPUs supporting 32-bit on ARM64 mismatched systems are not isolated by cpusets. Signed-off-by: Frederic Weisbecker <[email protected]>
When cpuset isolated partitions get updated, unbound kthreads are indiscriminately made affine to all non isolated CPUs, regardless of their individual affinity preferences. For example kswapd is a per-node kthread that prefers to be affine to the node it refers to. Whenever an isolated partition is created, updated or deleted, kswapd's node affinity is going to be broken if any CPU in the related node is not isolated, because kswapd will be affine globally. Fix this by letting the consolidated kthread managed affinity code do the affinity update on behalf of cpuset. Signed-off-by: Frederic Weisbecker <[email protected]>
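The difference between a global reaffine and one that honours preferences can be sketched as a mask intersection. This is a toy model under assumptions: one bit per CPU in a 64-bit word, a hypothetical `kthread_effective_affinity()` helper, and a fallback to the whole housekeeping mask when the intersection is empty, which is an assumption of this sketch rather than something stated in the commit message.

```c
#include <stdint.h>

/* Hypothetical model, not kernel code: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/*
 * Restrict a kthread's preferred CPUs (for kswapd, the CPUs of its node)
 * to the current housekeeping CPUs. If none of the preferred CPUs is a
 * housekeeping CPU, fall back to every housekeeping CPU (assumed policy).
 */
static cpumask_sketch_t kthread_effective_affinity(cpumask_sketch_t preferred,
                                                   cpumask_sketch_t housekeeping)
{
    cpumask_sketch_t effective = preferred & housekeeping;

    return effective ? effective : housekeeping;
}
```

With node 0 covering CPUs 0-3 and CPUs 2-5 isolated on an 8-CPU system, a node 0 kthread stays on CPUs 0-1 in this model instead of spreading to CPUs 6-7 on another node.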
…) call It may not appear obvious why kthread_affine_node() is not called before the kthread creation completion instead of after the first wake-up. The reason is that kthread_affine_node() applies a default affinity behaviour that only takes place if no affinity preference has already been passed by the kthread creation call site. Add a comment to clarify that. Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
The documentation of this new API has been overlooked during its introduction. Fill the gap. Signed-off-by: Frederic Weisbecker <[email protected]>
Upstream branch: fd95357
Signed-off-by: Frederic Weisbecker <[email protected]>
aaef43f to 66dc4cb
Pull request for series with
subject: cpuset/isolation: Honour kthreads preferred affinity
version: 4
url: https://patchwork.kernel.org/project/linux-block/list/?series=1020109