diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index cd727cfc1b04..2a2b5bd8299a 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -31,6 +31,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + multigen_lru nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst new file mode 100644 index 000000000000..33e068830497 --- /dev/null +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -0,0 +1,162 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Quick start +=========== +Build the kernel with the following configurations. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` + +All set! + +Runtime options +=============== +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the +following subsections. + +Kill switch +----------- +``enabled`` accepts different values to enable or disable the +following components. Its default value depends on +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled +unless some of them have unforeseen side effects. Writing to +``enabled`` has no effect when a component is not supported by the +hardware, and valid values will be accepted even when the main switch +is off. + +====== =============================================================== +Values Components +====== =============================================================== +0x0001 The main switch for the multi-gen LRU. +0x0002 Clearing the accessed bit in leaf page table entries in large + batches, when MMU sets it (e.g., on x86). This behavior can + theoretically worsen lock contention (mmap_lock). If it is + disabled, the multi-gen LRU will suffer a minor performance + degradation for workloads that contiguously map hot pages, + whose accessed bits can be otherwise cleared by fewer larger + batches. +0x0004 Clearing the accessed bit in non-leaf page table entries as + well, when MMU sets it (e.g., on x86). This behavior was not + verified on x86 varieties other than Intel and AMD. If it is + disabled, the multi-gen LRU will suffer a negligible + performance degradation. +[yYnN] Apply to all the components above. +====== =============================================================== + +E.g., +:: + + echo y >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0007 + echo 5 >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0005 + +Thrashing prevention +-------------------- +Personal computers are more sensitive to thrashing because it can +cause janks (lags when rendering UI) and negatively impact user +experience. The multi-gen LRU offers thrashing prevention to the +majority of laptop and desktop users who do not have ``oomd``. + +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of +``N`` milliseconds from getting evicted. The OOM killer is triggered +if this working set cannot be kept in memory. In other words, this +option works as an adjustable pressure relief valve, and when open, it +terminates applications that are hopefully not being used. 
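+
+For example, the following commands (the value is illustrative, not a
+recommendation) protect the working set of the last second and then
+turn the protection back off:
+::
+
+    echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms
+    echo 0 >/sys/kernel/mm/lru_gen/min_ttl_ms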
+ +Based on the average human detectable lag (~100ms), ``N=1000`` usually +eliminates intolerable janks due to thrashing. Larger values like +``N=3000`` make janks less noticeable at the risk of premature OOM +kills. + +The default value ``0`` means disabled. + +Experimental features +===================== +``/sys/kernel/debug/lru_gen`` accepts commands described in the +following subsections. Multiple command lines are supported, as is +concatenation with delimiters ``,`` and ``;``. + +``/sys/kernel/debug/lru_gen_full`` provides additional stats for +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from +evicted generations in this file. + +Working set estimation +---------------------- +Working set estimation measures how much memory an application needs +in a given time interval, and it is usually done with little impact on +the performance of the application. E.g., data centers want to +optimize job scheduling (bin packing) to improve memory utilization. +When a new job comes in, the job scheduler needs to find out whether +each server it manages can allocate a certain amount of memory for +this new job before it can pick a candidate. To do so, the job +scheduler needs to estimate the working sets of the existing jobs. + +When it is read, ``lru_gen`` returns a histogram of numbers of pages +accessed over different time intervals for each memcg and node. +``MAX_NR_GENS`` decides the number of bins for each histogram. The +histograms are noncumulative. +:: + + memcg memcg_id memcg_path + node node_id + min_gen_nr age_in_ms nr_anon_pages nr_file_pages + ... + max_gen_nr age_in_ms nr_anon_pages nr_file_pages + +Each bin contains an estimated number of pages that have been accessed +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of +the former is the largest and that of the latter is the smallest. + +Users can write the following command to ``lru_gen`` to create a new +generation ``max_gen_nr+1``: + + ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` + +``can_swap`` defaults to the swap setting and, if it is set to ``1``, +it forces the scan of anon pages when swap is off, and vice versa. +``force_scan`` defaults to ``1`` and, if it is set to ``0``, it +employs heuristics to reduce the overhead, which is likely to reduce +the coverage as well. + +A typical use case is that a job scheduler runs this command at a +certain time interval to create new generations, and it ranks the +servers it manages based on the sizes of their cold pages defined by +this time interval. + +Proactive reclaim +----------------- +Proactive reclaim induces page reclaim when there is no memory +pressure. It usually targets cold pages only. E.g., when a new job +comes in, the job scheduler wants to proactively reclaim cold pages on +the server it selected, to improve the chance of successfully landing +this new job. + +Users can write the following command to ``lru_gen`` to evict +generations less than or equal to ``min_gen_nr``: + + ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` + +``min_gen_nr`` should be less than ``max_gen_nr-1``, since +``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to +the active list) and therefore cannot be evicted. ``swappiness`` +overrides the default value in ``/proc/sys/vm/swappiness``. +``nr_to_reclaim`` limits the number of pages to evict.
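+
+For example, a job scheduler managing a single-node server might issue
+commands like the following. The memcg ID, node ID, generation numbers
+and limits below are illustrative placeholders for a hypothetical
+setup, not recommended values; ``max_gen_nr`` must match the value
+currently reported by reading ``lru_gen``.
+::
+
+    # age: create a new generation for memcg 1 on node 0, scanning anon
+    # pages even if swap is off and skipping the overhead-reducing
+    # heuristics
+    echo '+ 1 0 5 1 1' >/sys/kernel/debug/lru_gen
+
+    # evict: reclaim up to 10000 pages from generations <= 3 of the
+    # same memcg and node, with swappiness 60
+    echo '- 1 0 3 60 10000' >/sys/kernel/debug/lru_gen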
+ +A typical use case is that a job scheduler runs this command before it +tries to land a new job on a server. If it fails to materialize enough +cold pages because of the overestimation, it retries on the next +server according to the ranking result obtained from the working set +estimation step. This less forceful approach limits the impacts on the +existing jobs. diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index eff5fbd492d0..9f80cc8a1cd9 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -41,6 +41,7 @@ descriptions of data structures and algorithms. ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..d7062c6a8946 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. Generations establish a +(time-based) common frame of reference and therefore help make better +choices, e.g., between different memcgs on a computer or different +computers in a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +sparse to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is necessary. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. 
The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. + +Each generation is divided into multiple tiers. A page accessed ``N`` +times through file descriptors is in tier ``order_base_2(N)``. Unlike +generations, tiers do not have dedicated ``lrugen->lists[]``. In +contrast to moving across generations, which requires the LRU lock, +moving across tiers only involves atomic operations on +``folio->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refaults over all the tiers +from anon and file types and decides which tiers from which types to +evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs, and after each iteration, it increments ``max_seq``. For +the latter, when the eviction walks the rmap and finds a young PTE, +the aging scans the adjacent PTEs. For both, on finding a young PTE, +the aging clears the accessed bit and updates the gen counter of the +page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. 
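For instance, with ``MAX_NR_GENS=4`` and ``min_seq=6``, the oldest pages of a type sit on ``lrugen->lists[6%4]``, i.e., index ``2``, and their gen counters store ``3``; once that list is drained, ``min_seq`` advances to ``7``.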
To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to its gen counter if the aging has found this page +accessed through page tables and updated its gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +this end, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. + +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Rmap walks +* Page table walks +* Bloom filters +* PID controller + +The aging and the eviction form a producer-consumer model; +specifically, the latter drives the former by the sliding window over +generations. Within the aging, rmap walks drive page table walks by +inserting hot densely populated page tables to the Bloom filters. +Within the eviction, the PID controller uses refaults as the feedback +to select types to evict and tiers to protect. diff --git a/Makefile b/Makefile index b8b1fd3a7e9e..550c9bb9c6f2 100644 --- a/Makefile +++ b/Makefile @@ -1,8 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 VERSION = 5 PATCHLEVEL = 10 -SUBLEVEL = 189 -EXTRAVERSION = +SUBLEVEL = 110 +EXTRAVERSION = Goku-sheng NAME = Dare mighty things # *DOCUMENTATION* diff --git a/android/abi_gki_aarch64_oplus b/android/abi_gki_aarch64_oplus index aef23e847499..0c29753e36a5 100644 --- a/android/abi_gki_aarch64_oplus +++ b/android/abi_gki_aarch64_oplus @@ -686,7 +686,6 @@ dma_unmap_resource dma_unmap_sg_attrs do_exit - do_traversal_all_lruvec do_wait_intr_irq down down_interruptible @@ -1820,7 +1819,6 @@ __page_pinner_migration_failed page_referenced page_symlink - page_to_lruvec panic panic_notifier_list panic_timeout @@ -2116,7 +2114,6 @@ rdev_get_regmap read_cache_page_gfp reboot_mode - reclaim_pages refcount_dec_and_lock refcount_dec_not_one refcount_warn_saturate @@ -2760,7 +2757,6 @@ __traceiter_android_rvh_v4l2subdev_set_selection __traceiter_android_rvh_wake_up_new_task __traceiter_android_vh_account_task_time - __traceiter_android_vh_add_page_to_lrulist __traceiter_android_vh_alloc_pages_slowpath_begin __traceiter_android_vh_alloc_pages_slowpath_end __traceiter_android_vh_allow_domain_state @@ -2793,7 +2789,6 @@ __traceiter_android_vh_check_bpf_syscall __traceiter_android_vh_check_file_open __traceiter_android_vh_check_mmap_file - __traceiter_android_vh_check_page_look_around_ref __traceiter_android_vh_check_uninterruptible_tasks __traceiter_android_vh_check_uninterruptible_tasks_dn __traceiter_android_vh_clear_mask_adjust @@ -2818,11 +2813,8 @@ __traceiter_android_vh_cpu_idle_enter __traceiter_android_vh_cpu_idle_exit __traceiter_android_vh_cpu_up - __traceiter_android_vh_del_page_from_lrulist __traceiter_android_vh_do_futex - __traceiter_android_vh_do_page_trylock __traceiter_android_vh_do_send_sig_info - __traceiter_android_vh_do_traversal_lruvec __traceiter_android_vh_dm_bufio_shrink_scan_bypass __traceiter_android_vh_drain_all_pages_bypass __traceiter_android_vh_em_cpu_energy @@ -2844,7 +2836,6 @@ __traceiter_android_vh_futex_wake_up_q_finish __traceiter_android_vh_get_from_fragment_pool __traceiter_android_vh_gpio_block_read - 
__traceiter_android_vh_handle_failed_page_trylock __traceiter_android_vh_include_reserved_zone __traceiter_android_vh_iommu_alloc_iova __traceiter_android_vh_iommu_free_iova @@ -2855,10 +2846,7 @@ __traceiter_android_vh_killed_process __traceiter_android_vh_kmalloc_slab __traceiter_android_vh_logbuf - __traceiter_android_vh_look_around - __traceiter_android_vh_look_around_migrate_page __traceiter_android_vh_madvise_cold_or_pageout_abort - __traceiter_android_vh_mark_page_accessed __traceiter_android_vh_mem_cgroup_alloc __traceiter_android_vh_mem_cgroup_css_offline __traceiter_android_vh_mem_cgroup_css_online @@ -2874,10 +2862,6 @@ __traceiter_android_vh_mutex_wait_start __traceiter_android_vh_override_creds __traceiter_android_vh_page_referenced_check_bypass - __traceiter_android_vh_page_should_be_protected - __traceiter_android_vh_page_trylock_clear - __traceiter_android_vh_page_trylock_get_result - __traceiter_android_vh_page_trylock_set __traceiter_android_vh_pcplist_add_cma_pages_bypass __traceiter_android_vh_prepare_update_load_avg_se __traceiter_android_vh_printk_hotplug @@ -2919,7 +2903,6 @@ __traceiter_android_vh_set_module_permit_before_init __traceiter_android_vh_setscheduler_uclamp __traceiter_android_vh_set_wake_flags - __traceiter_android_vh_show_mapcount_pages __traceiter_android_vh_show_max_freq __traceiter_android_vh_show_resume_epoch_val __traceiter_android_vh_show_stack_hash @@ -2927,7 +2910,6 @@ __traceiter_android_vh_shrink_node_memcgs __traceiter_android_vh_sync_txn_recvd __traceiter_android_vh_syscall_prctl_finished - __traceiter_android_vh_test_clear_look_around_ref __traceiter_android_vh_timer_calc_index __traceiter_android_vh_tune_inactive_ratio __traceiter_android_vh_tune_scan_type @@ -2935,7 +2917,6 @@ __traceiter_android_vh_ufs_compl_command __traceiter_android_vh_ufs_send_command __traceiter_android_vh_ufs_send_tm_command - __traceiter_android_vh_update_page_mapcount __traceiter_android_vh_update_topology_flags_workfn __traceiter_binder_transaction_received __traceiter_cpu_frequency @@ -3028,7 +3009,6 @@ __tracepoint_android_rvh_v4l2subdev_set_selection __tracepoint_android_rvh_wake_up_new_task __tracepoint_android_vh_account_task_time - __tracepoint_android_vh_add_page_to_lrulist __tracepoint_android_vh_alloc_pages_slowpath_begin __tracepoint_android_vh_alloc_pages_slowpath_end __tracepoint_android_vh_allow_domain_state @@ -3061,7 +3041,6 @@ __tracepoint_android_vh_check_bpf_syscall __tracepoint_android_vh_check_file_open __tracepoint_android_vh_check_mmap_file - __tracepoint_android_vh_check_page_look_around_ref __tracepoint_android_vh_check_uninterruptible_tasks __tracepoint_android_vh_check_uninterruptible_tasks_dn __tracepoint_android_vh_clear_mask_adjust @@ -3086,12 +3065,9 @@ __tracepoint_android_vh_cpu_idle_enter __tracepoint_android_vh_cpu_idle_exit __tracepoint_android_vh_cpu_up - __tracepoint_android_vh_del_page_from_lrulist __tracepoint_android_vh_dm_bufio_shrink_scan_bypass __tracepoint_android_vh_do_futex - __tracepoint_android_vh_do_page_trylock __tracepoint_android_vh_do_send_sig_info - __tracepoint_android_vh_do_traversal_lruvec __tracepoint_android_vh_drain_all_pages_bypass __tracepoint_android_vh_em_cpu_energy __tracepoint_android_vh_exclude_reserved_zone @@ -3112,7 +3088,6 @@ __tracepoint_android_vh_futex_wake_up_q_finish __tracepoint_android_vh_get_from_fragment_pool __tracepoint_android_vh_gpio_block_read - __tracepoint_android_vh_handle_failed_page_trylock __tracepoint_android_vh_include_reserved_zone 
__tracepoint_android_vh_iommu_alloc_iova __tracepoint_android_vh_iommu_free_iova @@ -3123,10 +3098,7 @@ __tracepoint_android_vh_killed_process __tracepoint_android_vh_kmalloc_slab __tracepoint_android_vh_logbuf - __tracepoint_android_vh_look_around - __tracepoint_android_vh_look_around_migrate_page __tracepoint_android_vh_madvise_cold_or_pageout_abort - __tracepoint_android_vh_mark_page_accessed __tracepoint_android_vh_mem_cgroup_alloc __tracepoint_android_vh_mem_cgroup_css_offline __tracepoint_android_vh_mem_cgroup_css_online @@ -3142,10 +3114,6 @@ __tracepoint_android_vh_mutex_wait_start __tracepoint_android_vh_override_creds __tracepoint_android_vh_page_referenced_check_bypass - __tracepoint_android_vh_page_should_be_protected - __tracepoint_android_vh_page_trylock_clear - __tracepoint_android_vh_page_trylock_get_result - __tracepoint_android_vh_page_trylock_set __tracepoint_android_vh_pcplist_add_cma_pages_bypass __tracepoint_android_vh_prepare_update_load_avg_se __tracepoint_android_vh_printk_hotplug @@ -3187,7 +3155,6 @@ __tracepoint_android_vh_set_module_permit_before_init __tracepoint_android_vh_setscheduler_uclamp __tracepoint_android_vh_set_wake_flags - __tracepoint_android_vh_show_mapcount_pages __tracepoint_android_vh_show_max_freq __tracepoint_android_vh_show_resume_epoch_val __tracepoint_android_vh_show_stack_hash @@ -3195,7 +3162,6 @@ __tracepoint_android_vh_shrink_node_memcgs __tracepoint_android_vh_sync_txn_recvd __tracepoint_android_vh_syscall_prctl_finished - __tracepoint_android_vh_test_clear_look_around_ref __tracepoint_android_vh_timer_calc_index __tracepoint_android_vh_tune_inactive_ratio __tracepoint_android_vh_tune_scan_type @@ -3203,7 +3169,6 @@ __tracepoint_android_vh_ufs_compl_command __tracepoint_android_vh_ufs_send_command __tracepoint_android_vh_ufs_send_tm_command - __tracepoint_android_vh_update_page_mapcount __tracepoint_android_vh_update_topology_flags_workfn __tracepoint_binder_transaction_received __tracepoint_cpu_frequency diff --git a/arch/Kconfig b/arch/Kconfig index 3ca6e932317d..74d331f0b27d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1185,6 +1185,14 @@ config ARCH_SPLIT_ARG64 If a 32-bit architecture requires 64-bit arguments to be split into pairs of 32-bit arguments, select this option. +config ARCH_HAS_NONLEAF_PMD_YOUNG + bool + help + Architectures that select this option are capable of setting the + accessed bit in non-leaf PMD entries when using them as part of linear + address translations. Page table walkers that clear the accessed bit + may use this capability to reduce their search space. 
+ source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 07c5e0fc006f..7146e645671a 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -44,6 +44,8 @@ ifeq ($(CONFIG_BROKEN_GAS_INST),y) $(warning Detected assembler with broken .inst; disassembly will be unreliable) endif +KBUILD_CFLAGS += -mcpu=cortex-a78 + KBUILD_CFLAGS += -mgeneral-regs-only \ $(compat_vdso) $(cc_has_k_constraint) KBUILD_CFLAGS += $(call cc-disable-warning, psabi) diff --git a/arch/arm64/configs/gki_defconfig b/arch/arm64/configs/gki_defconfig index e238ccdcafe1..02d699904a8c 100644 --- a/arch/arm64/configs/gki_defconfig +++ b/arch/arm64/configs/gki_defconfig @@ -105,6 +105,7 @@ CONFIG_BLK_INLINE_ENCRYPTION=y CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y CONFIG_IOSCHED_BFQ=y CONFIG_BFQ_GROUP_IOSCHED=y +CONFIG_NONE_DEFAULT=y CONFIG_GKI_HACKS_TO_FIX=y # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set CONFIG_BINFMT_MISC=y @@ -119,6 +120,8 @@ CONFIG_CMA_DEBUGFS=y CONFIG_CMA_SYSFS=y CONFIG_CMA_AREAS=16 CONFIG_READ_ONLY_THP_FOR_FS=y +CONFIG_LRU_GEN=y +CONFIG_LRU_GEN_ENABLED=y CONFIG_DAMON=y CONFIG_DAMON_PADDR=y CONFIG_DAMON_RECLAIM=y @@ -651,7 +654,6 @@ CONFIG_HARDENED_USERCOPY=y CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" CONFIG_SECURITY_SELINUX=y -CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y CONFIG_CRYPTO_CHACHA20POLY1305=y CONFIG_CRYPTO_ADIANTUM=y CONFIG_CRYPTO_XCBC=y diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 44070d073a49..e9837d4e62ea 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -994,23 +994,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, * page after fork() + CoW for pfn mappings. We don't always have a * hardware-managed access flag on arm64. */ -static inline bool arch_faults_on_old_pte(void) -{ - WARN_ON(preemptible()); - - return !cpu_has_hw_af(); -} -#define arch_faults_on_old_pte arch_faults_on_old_pte +#define arch_has_hw_pte_young cpu_has_hw_af /* * Experimentally, it's cheap to set the access flag in hardware and we * benefit from prefaulting mappings as 'old' to start with. 
*/ -static inline bool arch_wants_old_prefaulted_pte(void) -{ - return !arch_faults_on_old_pte(); -} -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte +#define arch_wants_old_prefaulted_pte cpu_has_hw_af #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 35ace6d3ddef..b2c0d5bfc357 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -77,6 +77,7 @@ config X86 select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2 select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/x86/configs/gki_defconfig b/arch/x86/configs/gki_defconfig index 9c6d1a0e661d..1ffa4e7206b1 100644 --- a/arch/x86/configs/gki_defconfig +++ b/arch/x86/configs/gki_defconfig @@ -94,6 +94,8 @@ CONFIG_CMA_DEBUGFS=y CONFIG_CMA_SYSFS=y CONFIG_CMA_AREAS=16 CONFIG_READ_ONLY_THP_FOR_FS=y +CONFIG_LRU_GEN=y +CONFIG_LRU_GEN_ENABLED=y CONFIG_DAMON=y CONFIG_DAMON_PADDR=y CONFIG_DAMON_RECLAIM=y @@ -583,7 +585,6 @@ CONFIG_HARDENED_USERCOPY=y CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" CONFIG_SECURITY_SELINUX=y -CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y CONFIG_CRYPTO_CHACHA20POLY1305=y CONFIG_CRYPTO_ADIANTUM=y CONFIG_CRYPTO_XCBC=y diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 9bacde3ff514..5bd55bc1e161 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -846,7 +846,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) static inline int pmd_bad(pmd_t pmd) { - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; + return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != + (_KERNPG_TABLE & ~_PAGE_ACCESSED); } static inline unsigned long pages_to_mb(unsigned long npg) @@ -1452,10 +1453,10 @@ static inline bool arch_has_pfn_modify_check(void) return boot_cpu_has_bug(X86_BUG_L1TF); } -#define arch_faults_on_old_pte arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) +#define arch_has_hw_pte_young arch_has_hw_pte_young +static inline bool arch_has_hw_pte_young(void) { - return false; + return true; } #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 204b25ee26f0..e98ea035a0e5 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma, return ret; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma, return ret; } +#endif + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE int pudp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pud_t *pudp) { diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched index 64053d67a97b..176c48d6f031 100644 --- a/block/Kconfig.iosched +++ b/block/Kconfig.iosched @@ -50,6 +50,29 @@ config BFQ_CGROUP_DEBUG Enable some debugging help. Currently it exports additional stat files in a cgroup which can be useful for debugging. +choice + prompt "Default mq I/O scheduler" + default MQ_DEADLINE_DEFAULT + help + Select the I/O scheduler which will be used by default. 
+ + config MQ_DEADLINE_DEFAULT + bool "MQ Deadline" + depends on MQ_IOSCHED_DEADLINE + + config MQ_KYBER_DEFAULT + bool "MQ Kyber" + depends on MQ_IOSCHED_KYBER + + config BFQ_DEFAULT + bool "BFQ" + depends on IOSCHED_BFQ + + config NONE_DEFAULT + bool "NONE" + +endchoice + endmenu endif diff --git a/block/elevator.c b/block/elevator.c index f762b2af1d2a..06f5b35c9910 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -675,7 +675,13 @@ void elevator_init_mq(struct request_queue *q) if (unlikely(q->elevator)) return; - if (!q->required_elevator_features) + if (IS_ENABLED(CONFIG_BFQ_DEFAULT)) { + e = elevator_get(q, "bfq", false); + } else if (IS_ENABLED(CONFIG_MQ_KYBER_DEFAULT)) { + e = elevator_get(q, "kyber", false); + } else if (IS_ENABLED(CONFIG_NONE_DEFAULT)) { + e = elevator_get(q, "none", false); + } else if (!q->required_elevator_features) e = elevator_get_default(q); else e = elevator_get_by_features(q); diff --git a/drivers/android/vendor_hooks.c b/drivers/android/vendor_hooks.c index 7b945c3cce28..5bdda99a8254 100644 --- a/drivers/android/vendor_hooks.c +++ b/drivers/android/vendor_hooks.c @@ -298,8 +298,6 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_include_reserved_zone); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_alloc_pages_slowpath_begin); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_alloc_pages_slowpath_end); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_show_mem); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_show_mapcount_pages); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_do_traversal_lruvec); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_typec_tcpci_override_toggling); EXPORT_TRACEPOINT_SYMBOL_GPL(android_rvh_typec_tcpci_chk_contaminant); EXPORT_TRACEPOINT_SYMBOL_GPL(android_rvh_typec_tcpci_get_vbus); @@ -326,11 +324,6 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_logbuf_pr_cont); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_tune_scan_type); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_tune_swappiness); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_shrink_slab_bypass); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_handle_failed_page_trylock); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_page_trylock_set); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_page_trylock_clear); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_page_trylock_get_result); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_do_page_trylock); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_page_referenced_check_bypass); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_drain_all_pages_bypass); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_cma_drain_all_pages_bypass); @@ -428,11 +421,6 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(android_rvh_pci_d3_sleep); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_kmalloc_slab); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_disable_thermal_cooling_stats); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_mmap_region); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_update_page_mapcount); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_add_page_to_lrulist); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_del_page_from_lrulist); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_page_should_be_protected); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_mark_page_accessed); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_try_to_unmap_one); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_mem_cgroup_id_remove); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_mem_cgroup_css_offline); @@ -489,10 +477,6 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_mmput); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_sched_pelt_multiplier); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_alloc_pages_reclaim_bypass); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_alloc_pages_failure_bypass); 
-EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_check_page_look_around_ref); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_look_around); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_look_around_migrate_page); -EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_test_clear_look_around_ref); EXPORT_TRACEPOINT_SYMBOL_GPL(android_rvh_dma_buf_stats_teardown); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_madvise_cold_or_pageout_abort); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_compact_finished); diff --git a/fs/exec.c b/fs/exec.c index b798885e30ce..d9cbebda31bb 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1015,6 +1015,7 @@ static int exec_mmap(struct mm_struct *mm) active_mm = tsk->active_mm; tsk->active_mm = mm; tsk->mm = mm; + lru_gen_add_mm(mm); /* * This prevents preemption while active_mm is being loaded and * it and mm are being updated, which could cause problems for @@ -1030,6 +1031,7 @@ static int exec_mmap(struct mm_struct *mm) tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk); + lru_gen_use_mm(mm); if (old_mm) { mmap_read_unlock(old_mm); BUG_ON(active_mm != old_mm); diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 47eef2496230..b439d74c0cf0 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -793,7 +793,8 @@ static int fuse_check_page(struct page *page) 1 << PG_active | 1 << PG_workingset | 1 << PG_reclaim | - 1 << PG_waiters))) { + 1 << PG_waiters | + LRU_GEN_MASK | LRU_REFS_MASK))) { dump_page(page, "fuse: trying to steal weird page"); return 1; } diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 00b390fa47d4..dec5488bfcdc 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -437,6 +437,18 @@ static inline void cgroup_put(struct cgroup *cgrp) css_put(&cgrp->self); } +extern struct mutex cgroup_mutex; + +static inline void cgroup_lock(void) +{ + mutex_lock(&cgroup_mutex); +} + +static inline void cgroup_unlock(void) +{ + mutex_unlock(&cgroup_mutex); +} + /** * task_css_set_check - obtain a task's css_set with extra access conditions * @task: the task to obtain css_set for @@ -451,7 +463,6 @@ static inline void cgroup_put(struct cgroup *cgrp) * as locks used during the cgroup_subsys::attach() methods. 
*/ #ifdef CONFIG_PROVE_RCU -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; #define task_css_set_check(task, __c) \ rcu_dereference_check((task)->cgroups, \ @@ -711,6 +722,8 @@ struct cgroup; static inline u64 cgroup_id(struct cgroup *cgrp) { return 1; } static inline void css_get(struct cgroup_subsys_state *css) {} static inline void css_put(struct cgroup_subsys_state *css) {} +static inline void cgroup_lock(void) {} +static inline void cgroup_unlock(void) {} static inline int cgroup_attach_task_all(struct task_struct *from, struct task_struct *t) { return 0; } static inline int cgroupstats_build(struct cgroupstats *stats, diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2e9c8a504fab..531cb39a00f7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -332,6 +332,11 @@ struct mem_cgroup { struct deferred_split deferred_split_queue; #endif +#ifdef CONFIG_LRU_GEN + /* per-memcg mm_struct list */ + struct lru_gen_mm_list mm_list; +#endif + ANDROID_OEM_DATA(1); struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ @@ -345,9 +350,6 @@ struct mem_cgroup { extern struct mem_cgroup *root_mem_cgroup; -struct lruvec *page_to_lruvec(struct page *page, pg_data_t *pgdat); -void do_traversal_all_lruvec(void); - static __always_inline bool memcg_stat_item_in_bytes(int idx) { if (idx == MEMCG_PERCPU_B) @@ -738,6 +740,23 @@ static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); +/* try to stablize page_memcg() for all the pages in a memcg */ +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) +{ + rcu_read_lock(); + + if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account)) + return true; + + rcu_read_unlock(); + return false; +} + +static inline void mem_cgroup_unlock_pages(void) +{ + rcu_read_unlock(); +} + /* idx can be of type enum memcg_stat_item or node_stat_item */ static inline void mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) @@ -972,15 +991,6 @@ void split_page_memcg(struct page *head, unsigned int nr); struct mem_cgroup; -static inline struct lruvec *page_to_lruvec(struct page *page, pg_data_t *pgdat) -{ - return NULL; -} - -static inline void do_traversal_all_lruvec(void) -{ -} - static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) { return true; @@ -1165,6 +1175,18 @@ static inline void unlock_page_memcg(struct page *page) { } +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) +{ + /* to match page_memcg_rcu() */ + rcu_read_lock(); + return true; +} + +static inline void mem_cgroup_unlock_pages(void) +{ + rcu_read_unlock(); +} + static inline void mem_cgroup_handle_over_high(void) { } diff --git a/include/linux/mm.h b/include/linux/mm.h index dfefcfa1d6a4..1436b03fe349 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1109,6 +1109,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) +#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) +#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH) /* * Define the bit shifts to access each section. 
For non-existent diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 835e4f83e7a4..8e0f17c10a37 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -4,10 +4,6 @@ #include #include -#ifndef __GENKSYMS__ -#define PROTECT_TRACE_INCLUDE_PATH -#include -#endif /** * page_is_file_lru - should the page be on a file LRU or anon LRU? @@ -30,10 +26,13 @@ static inline int page_is_file_lru(struct page *page) static __always_inline void __update_lru_size(struct lruvec *lruvec, enum lru_list lru, enum zone_type zid, - int nr_pages) + long nr_pages) { struct pglist_data *pgdat = lruvec_pgdat(lruvec); + lockdep_assert_held(&pgdat->lru_lock); + WARN_ON_ONCE(nr_pages != (int)nr_pages); + __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); @@ -49,67 +48,22 @@ static __always_inline void update_lru_size(struct lruvec *lruvec, #endif } -static __always_inline void add_page_to_lru_list(struct page *page, - struct lruvec *lruvec, enum lru_list lru) -{ - trace_android_vh_add_page_to_lrulist(page, false, lru); - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); - list_add(&page->lru, &lruvec->lists[lru]); -} - -static __always_inline void add_page_to_lru_list_tail(struct page *page, - struct lruvec *lruvec, enum lru_list lru) -{ - trace_android_vh_add_page_to_lrulist(page, false, lru); - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); - list_add_tail(&page->lru, &lruvec->lists[lru]); -} - -static __always_inline void del_page_from_lru_list(struct page *page, - struct lruvec *lruvec, enum lru_list lru) -{ - trace_android_vh_del_page_from_lrulist(page, false, lru); - list_del(&page->lru); - update_lru_size(lruvec, lru, page_zonenum(page), -thp_nr_pages(page)); -} - /** - * page_lru_base_type - which LRU list type should a page be on? - * @page: the page to test - * - * Used for LRU list index arithmetic. - * - * Returns the base LRU type - file or anon - @page should be on. + * __clear_page_lru_flags - clear page lru flags before releasing a page + * @page: the page that was on lru and now has a zero reference */ -static inline enum lru_list page_lru_base_type(struct page *page) +static __always_inline void __clear_page_lru_flags(struct page *page) { - if (page_is_file_lru(page)) - return LRU_INACTIVE_FILE; - return LRU_INACTIVE_ANON; -} + VM_BUG_ON_PAGE(!PageLRU(page), page); -/** - * page_off_lru - which LRU list was page on? clearing its lru flags. - * @page: the page to test - * - * Returns the LRU list a page was on, as an index into the array of LRU - * lists; and clears its Unevictable or Active flags, ready for freeing. 
- */ -static __always_inline enum lru_list page_off_lru(struct page *page) -{ - enum lru_list lru; + __ClearPageLRU(page); - if (PageUnevictable(page)) { - __ClearPageUnevictable(page); - lru = LRU_UNEVICTABLE; - } else { - lru = page_lru_base_type(page); - if (PageActive(page)) { - __ClearPageActive(page); - lru += LRU_ACTIVE; - } - } - return lru; + /* this shouldn't happen, so leave the flags to bad_page() */ + if (PageActive(page) && PageUnevictable(page)) + return; + + __ClearPageActive(page); + __ClearPageUnevictable(page); } /** @@ -123,13 +77,261 @@ static __always_inline enum lru_list page_lru(struct page *page) { enum lru_list lru; + VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page); + if (PageUnevictable(page)) - lru = LRU_UNEVICTABLE; - else { - lru = page_lru_base_type(page); - if (PageActive(page)) + return LRU_UNEVICTABLE; + + lru = page_is_file_lru(page) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON; + if (PageActive(page)) + lru += LRU_ACTIVE; + + return lru; +} + +#ifdef CONFIG_LRU_GEN + +#ifdef CONFIG_LRU_GEN_ENABLED +static inline bool lru_gen_enabled(void) +{ + DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); + + return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); +} +#else +static inline bool lru_gen_enabled(void) +{ + DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); + + return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); +} +#endif + +static inline bool lru_gen_in_fault(void) +{ + return current->in_lru_fault; +} + +static inline int lru_gen_from_seq(unsigned long seq) +{ + return seq % MAX_NR_GENS; +} + +static inline int lru_hist_from_seq(unsigned long seq) +{ + return seq % NR_HIST_GENS; +} + +static inline int lru_tier_from_refs(int refs) +{ + VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); + + /* see the comment in page_lru_refs() */ + return order_base_2(refs + 1); +} + +static inline int page_lru_refs(struct page *page) +{ + unsigned long flags = READ_ONCE(page->flags); + bool workingset = flags & BIT(PG_workingset); + + /* + * Return the number of accesses beyond PG_referenced, i.e., N-1 if the + * total number of accesses is N>1, since N=0,1 both map to the first + * tier. lru_tier_from_refs() will account for this off-by-one. Also see + * the comment on MAX_NR_TIERS. 
+ */ + return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; +} + +static inline int page_lru_gen(struct page *page) +{ + unsigned long flags = READ_ONCE(page->flags); + + return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; +} + +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen) +{ + unsigned long max_seq = lruvec->lrugen.max_seq; + + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); + + /* see the comment on MIN_NR_GENS */ + return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1); +} + +static inline void lru_gen_update_size(struct lruvec *lruvec, struct page *page, + int old_gen, int new_gen) +{ + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + int delta = thp_nr_pages(page); + enum lru_list lru = type * LRU_INACTIVE_FILE; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1); + + if (old_gen >= 0) + WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone], + lrugen->nr_pages[old_gen][type][zone] - delta); + if (new_gen >= 0) + WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone], + lrugen->nr_pages[new_gen][type][zone] + delta); + + /* addition */ + if (old_gen < 0) { + if (lru_gen_is_active(lruvec, new_gen)) lru += LRU_ACTIVE; + __update_lru_size(lruvec, lru, zone, delta); + return; } - return lru; + + /* deletion */ + if (new_gen < 0) { + if (lru_gen_is_active(lruvec, old_gen)) + lru += LRU_ACTIVE; + __update_lru_size(lruvec, lru, zone, -delta); + return; + } + + /* promotion */ + if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { + __update_lru_size(lruvec, lru, zone, -delta); + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); + } + + /* demotion requires isolation, e.g., lru_deactivate_fn() */ + VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); +} + +static inline bool lru_gen_add_page(struct lruvec *lruvec, struct page *page, bool reclaiming) +{ + unsigned long seq; + unsigned long flags; + int gen = page_lru_gen(page); + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE_PAGE(gen != -1, page); + + if (PageUnevictable(page) || !lrugen->enabled) + return false; + /* + * There are three common cases for this page: + * 1. If it's hot, e.g., freshly faulted in or previously hot and + * migrated, add it to the youngest generation. + * 2. If it's cold but can't be evicted immediately, i.e., an anon page + * not in swapcache or a dirty page pending writeback, add it to the + * second oldest generation. + * 3. Everything else (clean, cold) is added to the oldest generation. 
+ */ + if (PageActive(page)) + seq = lrugen->max_seq; + else if ((type == LRU_GEN_ANON && !PageSwapCache(page)) || + (PageReclaim(page) && + (PageDirty(page) || PageWriteback(page)))) + seq = lrugen->min_seq[type] + 1; + else + seq = lrugen->min_seq[type]; + + gen = lru_gen_from_seq(seq); + flags = (gen + 1UL) << LRU_GEN_PGOFF; + /* see the comment on MIN_NR_GENS about PG_active */ + set_mask_bits(&page->flags, LRU_GEN_MASK | BIT(PG_active), flags); + + lru_gen_update_size(lruvec, page, -1, gen); + /* for rotate_reclaimable_page() */ + if (reclaiming) + list_add_tail(&page->lru, &lrugen->lists[gen][type][zone]); + else + list_add(&page->lru, &lrugen->lists[gen][type][zone]); + + return true; +} + +static inline bool lru_gen_del_page(struct lruvec *lruvec, struct page *page, bool reclaiming) +{ + unsigned long flags; + int gen = page_lru_gen(page); + + if (gen < 0) + return false; + + VM_WARN_ON_ONCE_PAGE(PageActive(page), page); + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page); + + /* for migrate_page_states() */ + flags = !reclaiming && lru_gen_is_active(lruvec, gen) ? BIT(PG_active) : 0; + flags = set_mask_bits(&page->flags, LRU_GEN_MASK, flags); + gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + + lru_gen_update_size(lruvec, page, gen, -1); + list_del(&page->lru); + + return true; +} + +#else /* !CONFIG_LRU_GEN */ + +static inline bool lru_gen_enabled(void) +{ + return false; +} + +static inline bool lru_gen_in_fault(void) +{ + return false; +} + +static inline bool lru_gen_add_page(struct lruvec *lruvec, struct page *page, bool reclaiming) +{ + return false; +} + +static inline bool lru_gen_del_page(struct lruvec *lruvec, struct page *page, bool reclaiming) +{ + return false; } + +#endif /* CONFIG_LRU_GEN */ + +static __always_inline void add_page_to_lru_list(struct page *page, + struct lruvec *lruvec) +{ + enum lru_list lru = page_lru(page); + + if (lru_gen_add_page(lruvec, page, false)) + return; + + update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); + list_add(&page->lru, &lruvec->lists[lru]); +} + +static __always_inline void add_page_to_lru_list_tail(struct page *page, + struct lruvec *lruvec) +{ + enum lru_list lru = page_lru(page); + + if (lru_gen_add_page(lruvec, page, true)) + return; + + update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); + list_add_tail(&page->lru, &lruvec->lists[lru]); +} + +static __always_inline void del_page_from_lru_list(struct page *page, + struct lruvec *lruvec) +{ + if (lru_gen_del_page(lruvec, page, false)) + return; + + list_del(&page->lru); + update_lru_size(lruvec, page_lru(page), page_zonenum(page), + -thp_nr_pages(page)); +} + #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c853f612a815..a1160423edfc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -3,6 +3,7 @@ #define _LINUX_MM_TYPES_H #include +#include #include #include @@ -15,6 +16,8 @@ #include #include #include +#include +#include #include #include @@ -598,6 +601,22 @@ struct mm_struct { #ifdef CONFIG_IOMMU_SUPPORT u32 pasid; #endif +#ifdef CONFIG_LRU_GEN + struct { + /* this mm_struct is on lru_gen_mm_list */ + struct list_head list; +#ifdef CONFIG_MEMCG + /* points to the memcg of "owner" above */ + struct mem_cgroup *memcg; +#endif + /* + * Set when switching to this mm_struct, as a hint of + * whether it has been used since the last time per-node + * page table walkers cleared the corresponding bits. 
+ */ + nodemask_t nodes; + } lru_gen; +#endif /* CONFIG_LRU_GEN */ ANDROID_KABI_RESERVE(1); } __randomize_layout; @@ -626,6 +645,66 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) return (struct cpumask *)&mm->cpu_bitmap; } +#ifdef CONFIG_LRU_GEN + +struct lru_gen_mm_list { + /* mm_struct list for page table walkers */ + struct list_head fifo; + /* protects the list above */ + spinlock_t lock; +}; + +void lru_gen_add_mm(struct mm_struct *mm); +void lru_gen_del_mm(struct mm_struct *mm); +#ifdef CONFIG_MEMCG +void lru_gen_migrate_mm(struct mm_struct *mm); +#endif + +static inline void lru_gen_init_mm(struct mm_struct *mm) +{ + INIT_LIST_HEAD(&mm->lru_gen.list); + nodes_clear(mm->lru_gen.nodes); +#ifdef CONFIG_MEMCG + mm->lru_gen.memcg = NULL; +#endif +} + +static inline void lru_gen_use_mm(struct mm_struct *mm) +{ + /* + * When the bitmap is set, page reclaim knows this mm_struct has been + * used since the last time it cleared the bitmap. So it might be worth + * walking the page tables of this mm_struct to clear the accessed bit. + */ + nodes_setall(mm->lru_gen.nodes); +} + +#else /* !CONFIG_LRU_GEN */ + +static inline void lru_gen_add_mm(struct mm_struct *mm) +{ +} + +static inline void lru_gen_del_mm(struct mm_struct *mm) +{ +} + +#ifdef CONFIG_MEMCG +static inline void lru_gen_migrate_mm(struct mm_struct *mm) +{ +} +#endif + +static inline void lru_gen_init_mm(struct mm_struct *mm) +{ +} + +static inline void lru_gen_use_mm(struct mm_struct *mm) +{ +} + +#endif /* CONFIG_LRU_GEN */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f611a06670cb..c54bc1f22a25 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -277,6 +277,209 @@ enum lruvec_flags { */ }; +#endif /* !__GENERATING_BOUNDS_H */ + +/* + * Evictable pages are divided into multiple generations. The youngest and the + * oldest generation numbers, max_seq and min_seq, are monotonically increasing. + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An + * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the + * corresponding generation. The gen counter in page->flags stores gen+1 while + * a page is on one of lrugen->lists[]. Otherwise it stores 0. + * + * A page is added to the youngest generation on faulting. The aging needs to + * check the accessed bit at least twice before handing this page over to the + * eviction. The first check takes care of the accessed bit set on the initial + * fault; the second check makes sure this page hasn't been used since then. + * This process, AKA second chance, requires a minimum of two generations, + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive + * LRU, e.g., /proc/vmstat, these two generations are considered active; the + * rest of generations, if they exist, are considered inactive. See + * lru_gen_is_active(). + * + * PG_active is always cleared while a page is on one of lrugen->lists[] so that + * the aging needs not to worry about it. And it's set again when a page + * considered active is isolated for non-reclaiming purposes, e.g., migration. + * See lru_gen_add_page() and lru_gen_del_page(). + * + * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the + * number of categories of the active/inactive LRU when keeping track of + * accesses through page tables. 
This requires order_base_2(MAX_NR_GENS+1) bits + * in page->flags. + */ +#define MIN_NR_GENS 2U +#define MAX_NR_GENS 4U + +/* + * Each generation is divided into multiple tiers. A page accessed N times + * through file descriptors is in tier order_base_2(N). A page in the first tier + * (N=0,1) is marked by PG_referenced unless it was faulted in through page + * tables or read ahead. A page in any other tier (N>1) is marked by + * PG_referenced and PG_workingset. This implies a minimum of two tiers is + * supported without using additional bits in page->flags. + * + * In contrast to moving across generations which requires the LRU lock, moving + * across tiers only involves atomic operations on page->flags and therefore + * has a negligible cost in the buffered access path. In the eviction path, + * comparisons of refaulted/(evicted+protected) from the first tier and the + * rest infer whether pages accessed multiple times through file descriptors + * are statistically hot and thus worth protecting. + * + * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the + * number of categories of the active/inactive LRU when keeping track of + * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in + * page->flags. + */ +#define MAX_NR_TIERS 4U + +#ifndef __GENERATING_BOUNDS_H + +struct lruvec; +struct page_vma_mapped_walk; + +#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) +#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) + +#ifdef CONFIG_LRU_GEN + +enum { + LRU_GEN_ANON, + LRU_GEN_FILE, +}; + +enum { + LRU_GEN_CORE, + LRU_GEN_MM_WALK, + LRU_GEN_NONLEAF_YOUNG, + NR_LRU_GEN_CAPS +}; + +#define MIN_LRU_BATCH BITS_PER_LONG +#define MAX_LRU_BATCH (MIN_LRU_BATCH * 64) + +/* whether to keep historical stats from evicted generations */ +#ifdef CONFIG_LRU_GEN_STATS +#define NR_HIST_GENS MAX_NR_GENS +#else +#define NR_HIST_GENS 1U +#endif + +/* + * The youngest generation number is stored in max_seq for both anon and file + * types as they are aged on an equal footing. The oldest generation numbers are + * stored in min_seq[] separately for anon and file types as clean file pages + * can be evicted regardless of swap constraints. + * + * Normally anon and file min_seq are in sync. But if swapping is constrained, + * e.g., out of swap space, file min_seq is allowed to advance and leave anon + * min_seq behind. + * + * The number of pages in each generation is eventually consistent and therefore + * can be transiently negative when reset_batch_size() is pending. 
+ */ +struct lru_gen_struct { + /* the aging increments the youngest generation number */ + unsigned long max_seq; + /* the eviction increments the oldest generation numbers */ + unsigned long min_seq[ANON_AND_FILE]; + /* the birth time of each generation in jiffies */ + unsigned long timestamps[MAX_NR_GENS]; + /* the multi-gen LRU lists, lazily sorted on eviction */ + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* the multi-gen LRU sizes, eventually consistent */ + unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* the exponential moving average of refaulted */ + unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; + /* the exponential moving average of evicted+protected */ + unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; + /* the first tier doesn't need protection, hence the minus one */ + unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; + /* can be modified without holding the LRU lock */ + atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; + atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; + /* whether the multi-gen LRU is enabled */ + bool enabled; +}; + +enum { + MM_LEAF_TOTAL, /* total leaf entries */ + MM_LEAF_OLD, /* old leaf entries */ + MM_LEAF_YOUNG, /* young leaf entries */ + MM_NONLEAF_TOTAL, /* total non-leaf entries */ + MM_NONLEAF_FOUND, /* non-leaf entries found in Bloom filters */ + MM_NONLEAF_ADDED, /* non-leaf entries added to Bloom filters */ + NR_MM_STATS +}; + +/* double-buffering Bloom filters */ +#define NR_BLOOM_FILTERS 2 + +struct lru_gen_mm_state { + /* set to max_seq after each iteration */ + unsigned long seq; + /* where the current iteration continues after */ + struct list_head *head; + /* where the last iteration ended before */ + struct list_head *tail; + /* Unused - keep for ABI compatiiblity */ + struct wait_queue_head wait; + /* Bloom filters flip after each iteration */ + unsigned long *filters[NR_BLOOM_FILTERS]; + /* the mm stats for debugging */ + unsigned long stats[NR_HIST_GENS][NR_MM_STATS]; + /* Unused - keep for ABI compatiiblity */ + int nr_walkers; +}; + +struct lru_gen_mm_walk { + /* the lruvec under reclaim */ + struct lruvec *lruvec; + /* unstable max_seq from lru_gen_struct */ + unsigned long max_seq; + /* the next address within an mm to scan */ + unsigned long next_addr; + /* Unused -- for ABI compatibility */ + unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)]; + /* to batch promoted pages */ + int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* to batch the mm stats */ + int mm_stats[NR_MM_STATS]; + /* total batched items */ + int batched; + bool can_swap; + bool full_scan; +}; + +void lru_gen_init_lruvec(struct lruvec *lruvec); +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); + +#ifdef CONFIG_MEMCG +void lru_gen_init_memcg(struct mem_cgroup *memcg); +void lru_gen_exit_memcg(struct mem_cgroup *memcg); +#endif + +#else /* !CONFIG_LRU_GEN */ + +static inline void lru_gen_init_lruvec(struct lruvec *lruvec) +{ +} + +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +{ +} + +#ifdef CONFIG_MEMCG +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) +{ +} + +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg) +{ +} +#endif + +#endif /* CONFIG_LRU_GEN */ + struct lruvec { struct list_head lists[NR_LRU_LISTS]; /* @@ -292,6 +495,12 @@ struct lruvec { unsigned long refaults[ANON_AND_FILE]; /* Various lruvec state flags (enum lruvec_flags) */ unsigned long flags; 
+#ifdef CONFIG_LRU_GEN + /* evictable pages divided into generations */ + struct lru_gen_struct lrugen; + /* to concurrently iterate lru_gen_mm_list */ + struct lru_gen_mm_state mm_state; +#endif #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif @@ -827,6 +1036,11 @@ typedef struct pglist_data { unsigned long flags; +#ifdef CONFIG_LRU_GEN + /* kswap mm walk data */ + struct lru_gen_mm_walk mm_walk; +#endif + ZONE_PADDING(_pad2_) /* Per-node vmstats */ diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 2a63ef05a6cc..e88eaa7baa65 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -485,6 +485,7 @@ static inline int num_node_state(enum node_states state) #define first_online_node 0 #define first_memory_node 0 #define next_online_node(nid) (MAX_NUMNODES) +#define next_memory_node(nid) (MAX_NUMNODES) #define nr_node_ids 1U #define nr_online_nodes 1U diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index 7d4ec26d8a3e..7d79818dc065 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -21,16 +21,17 @@ #elif MAX_NR_ZONES <= 8 #define ZONES_SHIFT 3 #else -#error ZONES_SHIFT -- too many zones configured adjust calculation +#error ZONES_SHIFT "Too many zones configured" #endif +#define ZONES_WIDTH ZONES_SHIFT + #ifdef CONFIG_SPARSEMEM #include - -/* SECTION_SHIFT #bits space required to store a section # */ #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) - -#endif /* CONFIG_SPARSEMEM */ +#else +#define SECTIONS_SHIFT 0 +#endif #ifndef BUILD_VDSO32_64 /* @@ -54,17 +55,29 @@ #define SECTIONS_WIDTH 0 #endif -#define ZONES_WIDTH ZONES_SHIFT - -#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ + <= BITS_PER_LONG - NR_PAGEFLAGS #define NODES_WIDTH NODES_SHIFT -#else -#ifdef CONFIG_SPARSEMEM_VMEMMAP +#elif defined(CONFIG_SPARSEMEM_VMEMMAP) #error "Vmemmap: No space for nodes field in page flags" -#endif +#else #define NODES_WIDTH 0 #endif +/* + * Note that this #define MUST have a value so that it can be tested with + * the IS_ENABLED() macro. + */ +#if NODES_SHIFT != 0 && NODES_WIDTH == 0 +#define NODE_NOT_IN_PAGE_FLAGS 1 +#endif + +#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) +#define KASAN_TAG_WIDTH 8 +#else +#define KASAN_TAG_WIDTH 0 +#endif + #ifdef CONFIG_NUMA_BALANCING #define LAST__PID_SHIFT 8 #define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1) @@ -77,37 +90,26 @@ #define LAST_CPUPID_SHIFT 0 #endif -#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) -#define KASAN_TAG_WIDTH 8 -#else -#define KASAN_TAG_WIDTH 0 -#endif - -#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT+KASAN_TAG_WIDTH \ - <= BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT #else #define LAST_CPUPID_WIDTH 0 #endif -#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+LAST_CPUPID_WIDTH+KASAN_TAG_WIDTH \ - > BITS_PER_LONG - NR_PAGEFLAGS -#error "Not enough bits in page flags" +#if LAST_CPUPID_SHIFT != 0 && LAST_CPUPID_WIDTH == 0 +#define LAST_CPUPID_NOT_IN_PAGE_FLAGS #endif -/* - * We are going to use the flags for the page to node mapping if its in - * there. This includes the case where there is no node, so it is implicit. - * Note that this #define MUST have a value so that it can be tested with - * the IS_ENABLED() macro. 
- */ -#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0) -#define NODE_NOT_IN_PAGE_FLAGS 1 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS +#error "Not enough bits in page flags" #endif -#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0 -#define LAST_CPUPID_NOT_IN_PAGE_FLAGS -#endif +/* see the comment on MAX_NR_TIERS */ +#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ + ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ + NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) #endif #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index dc636f162194..fc24ee31e092 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -840,7 +840,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) 1UL << PG_private | 1UL << PG_private_2 | \ 1UL << PG_writeback | 1UL << PG_reserved | \ 1UL << PG_slab | 1UL << PG_active | \ - 1UL << PG_unevictable | __PG_MLOCKED) + 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) /* * Flags checked when a page is prepped for return by the page allocator. @@ -851,7 +851,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) * alloc-free cycle to prevent from reusing the page. */ #define PAGE_FLAGS_CHECK_AT_PREP \ - (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) + ((((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) #define PAGE_FLAGS_PRIVATE \ (1UL << PG_private | 1UL << PG_private_2) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 774b9d02ec5b..b9f492ee63d0 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -194,7 +194,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) @@ -215,7 +215,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, BUILD_BUG(); return 0; } -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ #endif #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH @@ -241,6 +241,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif +#ifndef arch_has_hw_pte_young +/* + * Return whether the accessed bit is supported on the local CPU. + * + * This stub assumes accessing through an old PTE triggers a page fault. + * Architectures that automatically set the access bit should overwrite it. 
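+ * E.g., x86 sets the accessed bit in hardware and is expected to
+ * override this helper to return true; architectures that rely on
+ * taking a minor fault to mark a PTE young keep this default.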
+ */ +static inline bool arch_has_hw_pte_young(void) +{ + return false; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 5a64e48ef207..82ee9480fe75 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -11,10 +11,6 @@ #include #include #include -#ifndef __GENKSYMS__ -#define PROTECT_TRACE_INCLUDE_PATH -#include -#endif /* * The anon_vma heads a list of private "related" vmas, to scan if @@ -216,12 +212,7 @@ void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, static inline void page_dup_rmap(struct page *page, bool compound) { - bool success = false; - - if (!compound) - trace_android_vh_update_page_mapcount(page, true, compound, NULL, &success); - if (!success) - atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount); + atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount); } /* diff --git a/include/linux/sched.h b/include/linux/sched.h index d3cc279a4639..78ef93e95908 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -831,6 +831,10 @@ struct task_struct { #ifdef CONFIG_MEMCG unsigned in_user_fault:1; #endif +#ifdef CONFIG_LRU_GEN + /* whether the LRU algorithm may apply to this access */ + unsigned in_lru_fault:1; +#endif #ifdef CONFIG_COMPAT_BRK unsigned brk_randomized:1; #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index beda0a50d0b9..1b7f4316114c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -130,6 +130,10 @@ union swap_header { */ struct reclaim_state { unsigned long reclaimed_slab; +#ifdef CONFIG_LRU_GEN + /* per-thread mm walk data */ + struct lru_gen_mm_walk *mm_walk; +#endif }; #ifdef __KERNEL__ diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h index 8fd1babae761..e1735fe7c76a 100644 --- a/include/trace/events/pagemap.h +++ b/include/trace/events/pagemap.h @@ -27,24 +27,21 @@ TRACE_EVENT(mm_lru_insertion, - TP_PROTO( - struct page *page, - int lru - ), + TP_PROTO(struct page *page), - TP_ARGS(page, lru), + TP_ARGS(page), TP_STRUCT__entry( __field(struct page *, page ) __field(unsigned long, pfn ) - __field(int, lru ) + __field(enum lru_list, lru ) __field(unsigned long, flags ) ), TP_fast_assign( __entry->page = page; __entry->pfn = page_to_pfn(page); - __entry->lru = lru; + __entry->lru = page_lru(page); __entry->flags = trace_pagemap_flags(page); ), diff --git a/include/trace/hooks/mm.h b/include/trace/hooks/mm.h index 4628e4aa1cc6..03173ed9e62a 100644 --- a/include/trace/hooks/mm.h +++ b/include/trace/hooks/mm.h @@ -22,7 +22,6 @@ #include #include #include -#include #ifdef __GENKSYMS__ struct slabinfo; @@ -44,7 +43,6 @@ struct readahead_control; #endif /* __GENKSYMS__ */ struct cma; struct swap_slots_cache; -struct page_vma_mapped_walk; DECLARE_RESTRICTED_HOOK(android_rvh_set_skip_swapcache_flags, TP_PROTO(gfp_t *flags), @@ -158,37 +156,11 @@ DECLARE_HOOK(android_vh_mmap_region, DECLARE_HOOK(android_vh_try_to_unmap_one, TP_PROTO(struct vm_area_struct *vma, struct page *page, unsigned long addr, bool ret), TP_ARGS(vma, page, addr, ret)); -DECLARE_HOOK(android_vh_do_page_trylock, - TP_PROTO(struct page *page, struct rw_semaphore *sem, - bool *got_lock, bool *success), - TP_ARGS(page, sem, got_lock, success)); DECLARE_HOOK(android_vh_drain_all_pages_bypass, TP_PROTO(gfp_t gfp_mask, unsigned int order, unsigned long alloc_flags, int migratetype, unsigned long did_some_progress, bool *bypass), 
TP_ARGS(gfp_mask, order, alloc_flags, migratetype, did_some_progress, bypass)); -DECLARE_HOOK(android_vh_update_page_mapcount, - TP_PROTO(struct page *page, bool inc_size, bool compound, - bool *first_mapping, bool *success), - TP_ARGS(page, inc_size, compound, first_mapping, success)); -DECLARE_HOOK(android_vh_add_page_to_lrulist, - TP_PROTO(struct page *page, bool compound, enum lru_list lru), - TP_ARGS(page, compound, lru)); -DECLARE_HOOK(android_vh_del_page_from_lrulist, - TP_PROTO(struct page *page, bool compound, enum lru_list lru), - TP_ARGS(page, compound, lru)); -DECLARE_HOOK(android_vh_show_mapcount_pages, - TP_PROTO(void *unused), - TP_ARGS(unused)); -DECLARE_HOOK(android_vh_do_traversal_lruvec, - TP_PROTO(struct lruvec *lruvec), - TP_ARGS(lruvec)); -DECLARE_HOOK(android_vh_page_should_be_protected, - TP_PROTO(struct page *page, bool *should_protect), - TP_ARGS(page, should_protect)); -DECLARE_HOOK(android_vh_mark_page_accessed, - TP_PROTO(struct page *page), - TP_ARGS(page)); DECLARE_HOOK(android_vh_cma_drain_all_pages_bypass, TP_PROTO(unsigned int migratetype, bool *bypass), TP_ARGS(migratetype, bypass)); @@ -340,16 +312,6 @@ DECLARE_HOOK(android_vh_alloc_pages_failure_bypass, TP_PROTO(gfp_t gfp_mask, int order, int alloc_flags, int migratetype, struct page **page), TP_ARGS(gfp_mask, order, alloc_flags, migratetype, page)); -DECLARE_HOOK(android_vh_test_clear_look_around_ref, - TP_PROTO(struct page *page), - TP_ARGS(page)); -DECLARE_HOOK(android_vh_look_around_migrate_page, - TP_PROTO(struct page *old_page, struct page *new_page), - TP_ARGS(old_page, new_page)); -DECLARE_HOOK(android_vh_look_around, - TP_PROTO(struct page_vma_mapped_walk *pvmw, struct page *page, - struct vm_area_struct *vma, int *referenced), - TP_ARGS(pvmw, page, vma, referenced)); DECLARE_HOOK(android_vh_compact_finished, TP_PROTO(bool *abort_compact), TP_ARGS(abort_compact)); diff --git a/include/trace/hooks/vmscan.h b/include/trace/hooks/vmscan.h index 4bdb18c36ac6..a8149ac1b4f0 100644 --- a/include/trace/hooks/vmscan.h +++ b/include/trace/hooks/vmscan.h @@ -28,18 +28,6 @@ DECLARE_RESTRICTED_HOOK(android_rvh_set_balance_anon_file_reclaim, DECLARE_HOOK(android_vh_page_referenced_check_bypass, TP_PROTO(struct page *page, unsigned long nr_to_scan, int lru, bool *bypass), TP_ARGS(page, nr_to_scan, lru, bypass)); -DECLARE_HOOK(android_vh_page_trylock_get_result, - TP_PROTO(struct page *page, bool *trylock_fail), - TP_ARGS(page, trylock_fail)); -DECLARE_HOOK(android_vh_handle_failed_page_trylock, - TP_PROTO(struct list_head *page_list), - TP_ARGS(page_list)); -DECLARE_HOOK(android_vh_page_trylock_set, - TP_PROTO(struct page *page), - TP_ARGS(page)); -DECLARE_HOOK(android_vh_page_trylock_clear, - TP_PROTO(struct page *page), - TP_ARGS(page)); DECLARE_HOOK(android_vh_shrink_node_memcgs, TP_PROTO(struct mem_cgroup *memcg, bool *skip), TP_ARGS(memcg, skip)); @@ -50,9 +38,6 @@ DECLARE_HOOK(android_vh_inactive_is_low, DECLARE_HOOK(android_vh_snapshot_refaults, TP_PROTO(struct lruvec *target_lruvec), TP_ARGS(target_lruvec)); -DECLARE_HOOK(android_vh_check_page_look_around_ref, - TP_PROTO(struct page *page, int *skip), - TP_ARGS(page, skip)); #endif /* _TRACE_HOOK_VMSCAN_H */ /* This part must be outside protection */ #include diff --git a/kernel/bounds.c b/kernel/bounds.c index 9795d75b09b2..b529182e8b04 100644 --- a/kernel/bounds.c +++ b/kernel/bounds.c @@ -22,6 +22,13 @@ int main(void) DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); #endif DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); +#ifdef CONFIG_LRU_GEN + 
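	/*
	 * e.g., the default MAX_NR_GENS == 4 makes order_base_2(4 + 1) == 3
	 * bits for the generation number, and __LRU_REFS_WIDTH ==
	 * MAX_NR_TIERS - 2 == 2 bits for the per-page access counter (see
	 * the comment on MAX_NR_TIERS).
	 */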
DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); + DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); +#else + DEFINE(LRU_GEN_WIDTH, 0); + DEFINE(__LRU_REFS_WIDTH, 0); +#endif /* End of constants */ return 0; diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 4307a0904295..c45ba890869b 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -165,7 +165,6 @@ struct cgroup_mgctx { #define DEFINE_CGROUP_MGCTX(name) \ struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; extern struct cgroup_subsys *cgroup_subsys[]; extern struct list_head cgroup_roots; diff --git a/kernel/exit.c b/kernel/exit.c index 7231d6afbdd8..833a66712ce3 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -470,6 +470,7 @@ void mm_update_next_owner(struct mm_struct *mm) goto retry; } WRITE_ONCE(mm->owner, c); + lru_gen_migrate_mm(mm); task_unlock(c); put_task_struct(c); } diff --git a/kernel/fork.c b/kernel/fork.c index 82c71822e904..1337f32fa2f3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1102,6 +1102,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, goto fail_nocontext; mm->user_ns = get_user_ns(user_ns); + lru_gen_init_mm(mm); return mm; fail_nocontext: @@ -1144,6 +1145,7 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); + lru_gen_del_mm(mm); mmdrop(mm); } @@ -2580,6 +2582,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) get_task_struct(p); } + if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) { + /* lock the task to synchronize with memcg migration */ + task_lock(p); + lru_gen_add_mm(p->mm); + task_unlock(p); + } + wake_up_new_task(p); /* forking complete and child started to run, tell ptracer */ diff --git a/kernel/module.c b/kernel/module.c index 54214e82428f..a9ab5cf12c6a 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1352,9 +1352,7 @@ static int check_version(const struct load_info *info, return 1; bad_version: - pr_warn("%s: disagrees about version of symbol %s\n", - info->name, symname); - return 0; + return 1; } static inline int check_modstruct_version(const struct load_info *info, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48e513655b6e..a45e88f4313b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4022,6 +4022,7 @@ context_switch(struct rq *rq, struct task_struct *prev, * finish_task_switch()'s mmdrop(). */ switch_mm_irqs_off(prev->active_mm, next->mm, next); + lru_gen_use_mm(next->mm); if (!prev->mm) { // from kernel /* will mmdrop() in finish_task_switch(). */ diff --git a/mm/Kconfig b/mm/Kconfig index 03bfe7bd8183..f93c06ae8084 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -896,6 +896,32 @@ config ARCH_HAS_HUGEPD config MAPPING_DIRTY_HELPERS bool +# multi-gen LRU { +config LRU_GEN + bool "Multi-Gen LRU" + depends on MMU + # make sure page->flags has enough spare bits + depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP + help + A high performance LRU implementation to overcommit memory. See + Documentation/admin-guide/mm/multigen_lru.rst for details. + +config LRU_GEN_ENABLED + bool "Enable by default" + depends on LRU_GEN + help + This option enables the multi-gen LRU by default. + +config LRU_GEN_STATS + bool "Full stats for debugging" + depends on LRU_GEN + help + Do not enable this option unless you plan to look at historical stats + from evicted generations for debugging purpose. 
+ + This option has a per-memcg and per-node memory overhead. +# } + source "mm/damon/Kconfig" endmenu diff --git a/mm/compaction.c b/mm/compaction.c index dc96fadb04da..9a8b22325b90 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1039,7 +1039,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, low_pfn += compound_nr(page) - 1; /* Successfully isolated */ - del_page_from_lru_list(page, lruvec, page_lru(page)); + del_page_from_lru_list(page, lruvec); mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_is_file_lru(page), thp_nr_pages(page)); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 14bc79900736..847bbb6bcadd 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -33,7 +33,7 @@ #include #include #include -#include + #include #include #include "internal.h" @@ -2035,7 +2035,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; unsigned long addr; int i; - bool success = false; VM_BUG_ON(haddr & ~HPAGE_PMD_MASK); VM_BUG_ON_VMA(vma->vm_start > haddr, vma); @@ -2167,12 +2166,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pte = pte_offset_map(&_pmd, addr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, addr, pte, entry); - if (!pmd_migration) { - trace_android_vh_update_page_mapcount(&page[i], true, - false, NULL, &success); - if (!success) - atomic_inc(&page[i]._mapcount); - } + if (!pmd_migration) + atomic_inc(&page[i]._mapcount); pte_unmap(pte); } @@ -2183,12 +2178,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, */ if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) { - for (i = 0; i < HPAGE_PMD_NR; i++) { - trace_android_vh_update_page_mapcount(&page[i], true, - false, NULL, &success); - if (!success) - atomic_inc(&page[i]._mapcount); - } + for (i = 0; i < HPAGE_PMD_NR; i++) + atomic_inc(&page[i]._mapcount); } lock_page_memcg(page); @@ -2197,12 +2188,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, __dec_lruvec_page_state(page, NR_ANON_THPS); if (TestClearPageDoubleMap(page)) { /* No need in mapcount reference anymore */ - for (i = 0; i < HPAGE_PMD_NR; i++) { - trace_android_vh_update_page_mapcount(&page[i], - false, false, NULL, &success); - if (!success) - atomic_dec(&page[i]._mapcount); - } + for (i = 0; i < HPAGE_PMD_NR; i++) + atomic_dec(&page[i]._mapcount); } } unlock_page_memcg(page); @@ -2417,7 +2404,8 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif - (1L << PG_dirty))); + (1L << PG_dirty) | + LRU_GEN_MASK | LRU_REFS_MASK)); /* ->mapping in first tail page is compound_mapcount */ VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, diff --git a/mm/internal.h b/mm/internal.h index 5a3f3725d306..6aa4e8cab454 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -35,6 +35,7 @@ void page_writeback_init(void); vm_fault_t do_swap_page(struct vm_fault *vmf); +void activate_page(struct page *page); #ifdef CONFIG_SPECULATIVE_PAGE_FAULT extern struct vm_area_struct *get_vma(struct mm_struct *mm, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b6f6bfc0a700..81eba8d2cc4d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1374,38 +1374,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd return lruvec; } -struct lruvec *page_to_lruvec(struct page *page, pg_data_t *pgdat) -{ - struct lruvec *lruvec; - - lruvec = 
mem_cgroup_page_lruvec(page, pgdat); - - return lruvec; -} -EXPORT_SYMBOL_GPL(page_to_lruvec); - -void do_traversal_all_lruvec(void) -{ - pg_data_t *pgdat; - - for_each_online_pgdat(pgdat) { - struct mem_cgroup *memcg = NULL; - - spin_lock_irq(&pgdat->lru_lock); - memcg = mem_cgroup_iter(NULL, NULL, NULL); - do { - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - - trace_android_vh_do_traversal_lruvec(lruvec); - - memcg = mem_cgroup_iter(NULL, memcg, NULL); - } while (memcg); - - spin_unlock_irq(&pgdat->lru_lock); - } -} -EXPORT_SYMBOL_GPL(do_traversal_all_lruvec); - /** * mem_cgroup_update_lru_size - account for adding or removing an lru page * @lruvec: mem_cgroup per zone lru vector @@ -2918,6 +2886,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg) * - LRU isolation * - lock_page_memcg() * - exclusive reference + * - mem_cgroup_trylock_pages() */ page->mem_cgroup = memcg; } @@ -5304,6 +5273,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) static void mem_cgroup_free(struct mem_cgroup *memcg) { + lru_gen_exit_memcg(memcg); memcg_wb_domain_exit(memcg); /* * Flush percpu vmstats and vmevents to guarantee the value correctness @@ -5378,6 +5348,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) memcg->deferred_split_queue.split_queue_len = 0; #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); + lru_gen_init_memcg(memcg); trace_android_vh_mem_cgroup_alloc(memcg); return memcg; fail: @@ -6273,6 +6244,30 @@ static void mem_cgroup_move_task(void) } #endif +#ifdef CONFIG_LRU_GEN +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ + struct task_struct *task; + struct cgroup_subsys_state *css; + + /* find the first leader if there is any */ + cgroup_taskset_for_each_leader(task, css, tset) + break; + + if (!task) + return; + + task_lock(task); + if (task->mm && READ_ONCE(task->mm->owner) == task) + lru_gen_migrate_mm(task->mm); + task_unlock(task); +} +#else +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ +} +#endif /* CONFIG_LRU_GEN */ + /* * Cgroup retains root cgroups across [un]mount cycles making it necessary * to verify whether we're attached to the default hierarchy on each mount @@ -6625,6 +6620,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_free = mem_cgroup_css_free, .css_reset = mem_cgroup_css_reset, .can_attach = mem_cgroup_can_attach, + .attach = mem_cgroup_attach, .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, .bind = mem_cgroup_bind, diff --git a/mm/memory.c b/mm/memory.c index 857cfc90893f..fbbb38cde2ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -127,18 +127,6 @@ int randomize_va_space __read_mostly = 2; #endif -#ifndef arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) -{ - /* - * Those arches which don't have hw access flag feature need to - * implement their own helper. By default, "true" means pagefault - * will be hit on old pte. - */ - return true; -} -#endif - #ifndef arch_wants_old_prefaulted_pte static inline bool arch_wants_old_prefaulted_pte(void) { @@ -2927,7 +2915,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src, * On architectures with software "accessed" bits, we would * take a double page fault, so mark it accessed here. 
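	 * (The check below now uses !arch_has_hw_pte_young(): a CPU that
	 * sets the accessed bit itself never needs this extra marking.)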
*/ - if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { pte_t entry; vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); @@ -4989,6 +4977,10 @@ static inline void mm_account_fault(struct pt_regs *regs, else perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); } + +static void lru_gen_enter_fault(struct vm_area_struct *vma); +static void lru_gen_exit_fault(void); + #ifdef CONFIG_SPECULATIVE_PAGE_FAULT #ifndef CONFIG_ARCH_HAS_PTE_SPECIAL @@ -5181,7 +5173,9 @@ static vm_fault_t ___handle_speculative_fault(struct mm_struct *mm, } mem_cgroup_enter_user_fault(); + lru_gen_enter_fault(vmf.vma); ret = handle_pte_fault(&vmf); + lru_gen_exit_fault(); mem_cgroup_exit_user_fault(); if (ret != VM_FAULT_RETRY) { @@ -5257,6 +5251,27 @@ bool can_reuse_spf_vma(struct vm_area_struct *vma, unsigned long address) } #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */ +#ifdef CONFIG_LRU_GEN +static void lru_gen_enter_fault(struct vm_area_struct *vma) +{ + /* the LRU algorithm doesn't apply to sequential or random reads */ + current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)); +} + +static void lru_gen_exit_fault(void) +{ + current->in_lru_fault = false; +} +#else +static void lru_gen_enter_fault(struct vm_area_struct *vma) +{ +} + +static void lru_gen_exit_fault(void) +{ +} +#endif /* CONFIG_LRU_GEN */ + /* * By the time we get here, we already hold the mm semaphore * @@ -5288,11 +5303,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, if (flags & FAULT_FLAG_USER) mem_cgroup_enter_user_fault(); + lru_gen_enter_fault(vma); + if (unlikely(is_vm_hugetlb_page(vma))) ret = hugetlb_fault(vma->vm_mm, vma, address, flags); else ret = __handle_mm_fault(vma, address, flags); + lru_gen_exit_fault(); + if (flags & FAULT_FLAG_USER) { mem_cgroup_exit_user_fault(); /* diff --git a/mm/migrate.c b/mm/migrate.c index ac3242786b38..417842e01d2e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -56,7 +56,6 @@ #include #undef CREATE_TRACE_POINTS #include -#include #include "internal.h" @@ -607,7 +606,6 @@ void migrate_page_states(struct page *newpage, struct page *page) SetPageChecked(newpage); if (PageMappedToDisk(page)) SetPageMappedToDisk(newpage); - trace_android_vh_look_around_migrate_page(page, newpage); /* Move dirty on pages not done by migrate_page_move_mapping() */ if (PageDirty(page)) diff --git a/mm/mlock.c b/mm/mlock.c index 188aa7458c99..e49006d074f8 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -119,7 +119,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage) if (getpage) get_page(page); ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, page_lru(page)); + del_page_from_lru_list(page, lruvec); return true; } diff --git a/mm/mm_init.c b/mm/mm_init.c index b06a30fbedff..9351e8a25a80 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -19,10 +19,6 @@ #ifdef CONFIG_DEBUG_MEMORY_INIT int __meminitdata mminit_loglevel; -#ifndef SECTIONS_SHIFT -#define SECTIONS_SHIFT 0 -#endif - /* The zonelists are simply reported, validation is manual. 
*/ void __init mminit_verify_zonelist(void) { @@ -69,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) shift = 8 * sizeof(unsigned long); width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; + - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n", SECTIONS_WIDTH, NODES_WIDTH, ZONES_WIDTH, LAST_CPUPID_WIDTH, KASAN_TAG_WIDTH, + LRU_GEN_WIDTH, + LRU_REFS_WIDTH, NR_PAGEFLAGS); mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", diff --git a/mm/mmzone.c b/mm/mmzone.c index 61e4a35311b2..483290ab301c 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -82,6 +82,8 @@ void lruvec_init(struct lruvec *lruvec) for_each_lru(lru) INIT_LIST_HEAD(&lruvec->lists[lru]); + + lru_gen_init_lruvec(lruvec); } #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fa1f971c8a43..1c2bcb2c02a8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -72,7 +72,6 @@ #include #include #include -#include #include #include @@ -2409,7 +2408,6 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags set_page_pfmemalloc(page); else clear_page_pfmemalloc(page); - trace_android_vh_test_clear_look_around_ref(page); } /* @@ -5741,7 +5739,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) free_pcp, global_zone_page_state(NR_FREE_CMA_PAGES)); - trace_android_vh_show_mapcount_pages(NULL); for_each_online_pgdat(pgdat) { if (show_mem_node_skip(filter, pgdat->node_id, nodemask)) continue; diff --git a/mm/rmap.c b/mm/rmap.c index 5338a8b1d846..1c205a4ab270 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -72,6 +72,7 @@ #include #include #include +#include #include @@ -531,7 +532,6 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page, struct anon_vma *anon_vma = NULL; struct anon_vma *root_anon_vma; unsigned long anon_mapping; - bool success = false; rcu_read_lock(); anon_mapping = (unsigned long)READ_ONCE(page->mapping); @@ -554,11 +554,6 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page, } goto out; } - trace_android_vh_do_page_trylock(page, NULL, NULL, &success); - if (success) { - anon_vma = NULL; - goto out; - } if (rwc && rwc->try_lock) { anon_vma = NULL; @@ -802,7 +797,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, } if (pvmw.pte) { - trace_android_vh_look_around(&pvmw, page, vma, &referenced); + if (lru_gen_enabled() && pte_young(*pvmw.pte) && + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { + lru_gen_look_around(&pvmw); + referenced++; + } + if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) { /* @@ -1136,7 +1136,6 @@ void do_page_add_anon_rmap(struct page *page, { bool compound = flags & RMAP_COMPOUND; bool first; - bool success = false; if (unlikely(PageKsm(page))) lock_page_memcg(page); @@ -1150,10 +1149,7 @@ void do_page_add_anon_rmap(struct page *page, mapcount = compound_mapcount_ptr(page); first = atomic_inc_and_test(mapcount); } else { - trace_android_vh_update_page_mapcount(page, true, compound, - &first, &success); - if (!success) - first = atomic_inc_and_test(&page->_mapcount); + first = atomic_inc_and_test(&page->_mapcount); } if (first) { @@ -1227,22 +1223,13 @@ void __page_add_new_anon_rmap(struct 
page *page, void page_add_file_rmap(struct page *page, bool compound) { int i, nr = 1; - bool first_mapping; - bool success = false; VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); lock_page_memcg(page); if (compound && PageTransHuge(page)) { for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { - trace_android_vh_update_page_mapcount(&page[i], true, - compound, &first_mapping, &success); - if ((success)) { - if (first_mapping) - nr++; - } else { - if (atomic_inc_and_test(&page[i]._mapcount)) - nr++; - } + if (atomic_inc_and_test(&page[i]._mapcount)) + nr++; } if (!atomic_inc_and_test(compound_mapcount_ptr(page))) goto out; @@ -1258,15 +1245,8 @@ void page_add_file_rmap(struct page *page, bool compound) if (PageMlocked(page)) clear_page_mlock(compound_head(page)); } - trace_android_vh_update_page_mapcount(page, true, - compound, &first_mapping, &success); - if (success) { - if (!first_mapping) - goto out; - } else { - if (!atomic_inc_and_test(&page->_mapcount)) - goto out; - } + if (!atomic_inc_and_test(&page->_mapcount)) + goto out; } __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); out: @@ -1276,8 +1256,6 @@ void page_add_file_rmap(struct page *page, bool compound) static void page_remove_file_rmap(struct page *page, bool compound) { int i, nr = 1; - bool first_mapping; - bool success = false; VM_BUG_ON_PAGE(compound && !PageHead(page), page); @@ -1291,15 +1269,8 @@ static void page_remove_file_rmap(struct page *page, bool compound) /* page still mapped by someone else? */ if (compound && PageTransHuge(page)) { for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { - trace_android_vh_update_page_mapcount(&page[i], false, - compound, &first_mapping, &success); - if (success) { - if (first_mapping) - nr++; - } else { - if (atomic_add_negative(-1, &page[i]._mapcount)) - nr++; - } + if (atomic_add_negative(-1, &page[i]._mapcount)) + nr++; } if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) return; @@ -1308,15 +1279,8 @@ static void page_remove_file_rmap(struct page *page, bool compound) else __dec_node_page_state(page, NR_FILE_PMDMAPPED); } else { - trace_android_vh_update_page_mapcount(page, false, - compound, &first_mapping, &success); - if (success) { - if (!first_mapping) - return; - } else { - if (!atomic_add_negative(-1, &page->_mapcount)) - return; - } + if (!atomic_add_negative(-1, &page->_mapcount)) + return; } /* @@ -1333,8 +1297,6 @@ static void page_remove_file_rmap(struct page *page, bool compound) static void page_remove_anon_compound_rmap(struct page *page) { int i, nr; - bool first_mapping; - bool success = false; if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) return; @@ -1354,15 +1316,8 @@ static void page_remove_anon_compound_rmap(struct page *page) * them are still mapped. 
*/ for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { - trace_android_vh_update_page_mapcount(&page[i], false, - false, &first_mapping, &success); - if (success) { - if (first_mapping) - nr++; - } else { - if (atomic_add_negative(-1, &page[i]._mapcount)) - nr++; - } + if (atomic_add_negative(-1, &page[i]._mapcount)) + nr++; } /* @@ -1392,8 +1347,6 @@ static void page_remove_anon_compound_rmap(struct page *page) */ void page_remove_rmap(struct page *page, bool compound) { - bool first_mapping; - bool success = false; lock_page_memcg(page); if (!PageAnon(page)) { @@ -1406,16 +1359,10 @@ void page_remove_rmap(struct page *page, bool compound) goto out; } - trace_android_vh_update_page_mapcount(page, false, - compound, &first_mapping, &success); - if (success) { - if (!first_mapping) - goto out; - } else { - /* page still mapped by someone else? */ - if (!atomic_add_negative(-1, &page->_mapcount)) - goto out; - } + /* page still mapped by someone else? */ + if (!atomic_add_negative(-1, &page->_mapcount)) + goto out; + /* * We use the irq-unsafe __{inc|mod}_zone_page_stat because * these counters are not modified in interrupt context, and @@ -2014,7 +1961,6 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, struct address_space *mapping = page_mapping(page); pgoff_t pgoff_start, pgoff_end; struct vm_area_struct *vma; - bool got_lock = false, success = false; /* * The page lock not only makes sure that page->mapping cannot @@ -2030,23 +1976,17 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, pgoff_start = page_to_pgoff(page); pgoff_end = pgoff_start + thp_nr_pages(page) - 1; if (!locked) { - trace_android_vh_do_page_trylock(page, - &mapping->i_mmap_rwsem, &got_lock, &success); - if (success) { - if (!got_lock) - return; - } else { - if (i_mmap_trylock_read(mapping)) - goto lookup; + if (i_mmap_trylock_read(mapping)) + goto lookup; - if (rwc->try_lock) { - rwc->contended = true; - return; - } + if (rwc->try_lock) { + rwc->contended = true; + return; + } i_mmap_lock_read(mapping); - } } + lookup: vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) { diff --git a/mm/swap.c b/mm/swap.c index 16b2bc48211e..4f9e773e954b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -87,9 +87,8 @@ static void __page_cache_release(struct page *page) spin_lock_irqsave(&pgdat->lru_lock, flags); lruvec = mem_cgroup_page_lruvec(page, pgdat); - VM_BUG_ON_PAGE(!PageLRU(page), page); - __ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, page_off_lru(page)); + del_page_from_lru_list(page, lruvec); + __clear_page_lru_flags(page); spin_unlock_irqrestore(&pgdat->lru_lock, flags); } __ClearPageWaiters(page); @@ -240,9 +239,9 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, int *pgmoved = arg; if (PageLRU(page) && !PageUnevictable(page)) { - del_page_from_lru_list(page, lruvec, page_lru(page)); + del_page_from_lru_list(page, lruvec); ClearPageActive(page); - add_page_to_lru_list_tail(page, lruvec, page_lru(page)); + add_page_to_lru_list_tail(page, lruvec); (*pgmoved) += thp_nr_pages(page); } } @@ -333,13 +332,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec, void *arg) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { - int lru = page_lru_base_type(page); int nr_pages = thp_nr_pages(page); - del_page_from_lru_list(page, lruvec, lru); + del_page_from_lru_list(page, lruvec); SetPageActive(page); - lru += LRU_ACTIVE; - add_page_to_lru_list(page, lruvec, lru); + 
add_page_to_lru_list(page, lruvec); trace_mm_lru_activate(page); __count_vm_events(PGACTIVATE, nr_pages); @@ -362,7 +359,7 @@ static bool need_activate_page_drain(int cpu) return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0; } -static void activate_page(struct page *page) +void activate_page(struct page *page) { page = compound_head(page); if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { @@ -382,7 +379,7 @@ static inline void activate_page_drain(int cpu) { } -static void activate_page(struct page *page) +void activate_page(struct page *page) { pg_data_t *pgdat = page_pgdat(page); @@ -423,6 +420,41 @@ static void __lru_cache_activate_page(struct page *page) local_unlock(&lru_pvecs.lock); } +#ifdef CONFIG_LRU_GEN +static void page_inc_refs(struct page *page) +{ + unsigned long new_flags, old_flags; + + if (PageUnevictable(page)) + return; + + if (!PageReferenced(page)) { + SetPageReferenced(page); + return; + } + + if (!PageWorkingset(page)) { + SetPageWorkingset(page); + return; + } + + /* see the comment on MAX_NR_TIERS */ + do { + old_flags = READ_ONCE(page->flags); + new_flags = old_flags & LRU_REFS_MASK; + if (new_flags == LRU_REFS_MASK) + break; + + new_flags += BIT(LRU_REFS_PGOFF); + new_flags |= old_flags & ~LRU_REFS_MASK; + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); +} +#else +static void page_inc_refs(struct page *page) +{ +} +#endif /* CONFIG_LRU_GEN */ + /* * Mark a page as having seen activity. * @@ -437,7 +469,11 @@ void mark_page_accessed(struct page *page) { page = compound_head(page); - trace_android_vh_mark_page_accessed(page); + if (lru_gen_enabled()) { + page_inc_refs(page); + return; + } + if (!PageReferenced(page)) { SetPageReferenced(page); } else if (PageUnevictable(page)) { @@ -481,6 +517,11 @@ void lru_cache_add(struct page *page) VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page); VM_BUG_ON_PAGE(PageLRU(page), page); + /* see the comment in lru_gen_add_page() */ + if (lru_gen_enabled() && !PageUnevictable(page) && + lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) + SetPageActive(page); + get_page(page); local_lock(&lru_pvecs.lock); pvec = this_cpu_ptr(&lru_pvecs.lru_add); @@ -543,8 +584,7 @@ void __lru_cache_add_inactive_or_unevictable(struct page *page, static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, void *arg) { - int lru; - bool active; + bool active = PageActive(page); int nr_pages = thp_nr_pages(page); if (!PageLRU(page)) @@ -557,10 +597,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, if (page_mapped(page)) return; - active = PageActive(page); - lru = page_lru_base_type(page); - - del_page_from_lru_list(page, lruvec, lru + active); + del_page_from_lru_list(page, lruvec); ClearPageActive(page); ClearPageReferenced(page); @@ -570,14 +607,14 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, * It can make readahead confusing. But race window * is _really_ small and it's non-critical problem. */ - add_page_to_lru_list(page, lruvec, lru); + add_page_to_lru_list(page, lruvec); SetPageReclaim(page); } else { /* * The page's writeback ends up during pagevec * We moves tha page into tail of inactive. 
*/ - add_page_to_lru_list_tail(page, lruvec, lru); + add_page_to_lru_list_tail(page, lruvec); __count_vm_events(PGROTATED, nr_pages); } @@ -591,14 +628,13 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, void *arg) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { - int lru = page_lru_base_type(page); + if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) { int nr_pages = thp_nr_pages(page); - del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE); + del_page_from_lru_list(page, lruvec); ClearPageActive(page); ClearPageReferenced(page); - add_page_to_lru_list(page, lruvec, lru); + add_page_to_lru_list(page, lruvec); __count_vm_events(PGDEACTIVATE, nr_pages); __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, @@ -611,11 +647,9 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, { if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { - bool active = PageActive(page); int nr_pages = thp_nr_pages(page); - del_page_from_lru_list(page, lruvec, - LRU_INACTIVE_ANON + active); + del_page_from_lru_list(page, lruvec); ClearPageActive(page); ClearPageReferenced(page); /* @@ -624,7 +658,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, * anonymous pages */ ClearPageSwapBacked(page); - add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); + add_page_to_lru_list(page, lruvec); __count_vm_events(PGLAZYFREE, nr_pages); __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, @@ -639,16 +673,13 @@ static void lru_lazyfree_movetail_fn(struct page *page, struct lruvec *lruvec, if (PageLRU(page) && !PageUnevictable(page) && PageSwapBacked(page) && !PageSwapCache(page)) { - bool active = PageActive(page); - - del_page_from_lru_list(page, lruvec, - LRU_INACTIVE_ANON + active); + del_page_from_lru_list(page, lruvec); ClearPageActive(page); ClearPageReferenced(page); if (add_to_tail && *add_to_tail) - add_page_to_lru_list_tail(page, lruvec, LRU_INACTIVE_FILE); + add_page_to_lru_list_tail(page, lruvec); else - add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); + add_page_to_lru_list(page, lruvec); } } @@ -733,7 +764,7 @@ void deactivate_file_page(struct page *page) */ void deactivate_page(struct page *page) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) { struct pagevec *pvec; local_lock(&lru_pvecs.lock); @@ -1066,9 +1097,8 @@ void release_pages(struct page **pages, int nr) } lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); - VM_BUG_ON_PAGE(!PageLRU(page), page); - __ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, page_off_lru(page)); + del_page_from_lru_list(page, lruvec); + __clear_page_lru_flags(page); } __ClearPageWaiters(page); @@ -1131,8 +1161,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, * Put page_tail on the list at the correct position * so they all end up in order. 
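	 * (add_page_to_lru_list_tail() now derives the lru list from the
	 * page flags itself, so callers no longer pass page_lru().)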
*/ - add_page_to_lru_list_tail(page_tail, lruvec, - page_lru(page_tail)); + add_page_to_lru_list_tail(page_tail, lruvec); } } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ @@ -1140,7 +1169,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, void *arg) { - enum lru_list lru; int was_unevictable = TestClearPageUnevictable(page); int nr_pages = thp_nr_pages(page); @@ -1176,19 +1204,17 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, smp_mb__after_atomic(); if (page_evictable(page)) { - lru = page_lru(page); if (was_unevictable) __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); } else { - lru = LRU_UNEVICTABLE; ClearPageActive(page); SetPageUnevictable(page); if (!was_unevictable) __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); } - add_page_to_lru_list(page, lruvec, lru); - trace_mm_lru_insertion(page, lru); + add_page_to_lru_list(page, lruvec); + trace_mm_lru_insertion(page); } /* diff --git a/mm/swap_state.c b/mm/swap_state.c index a8a98bf7c894..c8dacfeaa624 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -538,7 +538,6 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, workingset_refault(page, shadow); /* Caller will initiate read into locked page */ - SetPageWorkingset(page); lru_cache_add(page); *new_page_allocated = true; return page; diff --git a/mm/vmscan.c b/mm/vmscan.c index f94f52e98868..ebab0799d3e0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -51,6 +51,10 @@ #include #include #include +#include +#include +#include +#include #include #include @@ -129,6 +133,12 @@ struct scan_control { /* The file pages on the current node are dangerously low */ unsigned int file_is_tiny:1; +#ifdef CONFIG_LRU_GEN + /* help kswapd make better choices among multiple memcgs */ + unsigned int memcgs_need_aging:1; + unsigned long last_reclaimed; +#endif + /* Allocation order */ s8 order; @@ -930,9 +940,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; - mem_cgroup_swapout(page, swap); + + /* get a shadow entry before mem_cgroup_swapout() clears page_memcg() */ if (reclaimed && !mapping_exiting(mapping)) shadow = workingset_eviction(page, target_memcg); + mem_cgroup_swapout(page, swap); __delete_from_swap_cache(page, swap, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); @@ -1020,24 +1032,11 @@ static enum page_references page_check_references(struct page *page, { int referenced_ptes, referenced_page; unsigned long vm_flags; - bool should_protect = false; - bool trylock_fail = false; - int ret = 0; - trace_android_vh_page_should_be_protected(page, &should_protect); - if (unlikely(should_protect)) - return PAGEREF_ACTIVATE; - - trace_android_vh_page_trylock_set(page); - trace_android_vh_check_page_look_around_ref(page, &ret); - if (ret) - return ret; referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, &vm_flags); referenced_page = TestClearPageReferenced(page); - trace_android_vh_page_trylock_get_result(page, &trylock_fail); - if (trylock_fail) - return PAGEREF_KEEP; + /* * Mlock lost the isolation race with us. Let try_to_unmap() * move the page to the unevictable list. 
@@ -1128,7 +1127,6 @@ static unsigned int shrink_page_list(struct list_head *page_list, LIST_HEAD(free_pages); unsigned int nr_reclaimed = 0; unsigned int pgactivate = 0; - bool page_trylock_result; memset(stat, 0, sizeof(*stat)); cond_resched(); @@ -1161,6 +1159,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (!sc->may_unmap && page_mapped(page)) goto keep_locked; + /* page_update_gen() tried to promote this page? */ + if (lru_gen_enabled() && !ignore_references && + page_mapped(page) && PageReferenced(page)) + goto keep_locked; + may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); @@ -1353,8 +1356,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (unlikely(PageTransHuge(page))) flags |= TTU_SPLIT_HUGE_PMD; - if (!ignore_references) - trace_android_vh_page_trylock_set(page); + if (!try_to_unmap(page, flags)) { stat->nr_unmap_fail += nr_pages; if (!was_swapbacked && PageSwapBacked(page)) @@ -1465,7 +1467,6 @@ static unsigned int shrink_page_list(struct list_head *page_list, * increment nr_reclaimed here (and * leave it off the LRU). */ - trace_android_vh_page_trylock_clear(page); nr_reclaimed++; continue; } @@ -1499,7 +1500,6 @@ static unsigned int shrink_page_list(struct list_head *page_list, * Is there need to periodically free_page_list? It would * appear not as the counts should be low */ - trace_android_vh_page_trylock_clear(page); if (unlikely(PageTransHuge(page))) destroy_compound_page(page); else @@ -1528,18 +1528,6 @@ static unsigned int shrink_page_list(struct list_head *page_list, count_memcg_page_event(page, PGACTIVATE); } keep_locked: - /* - * The page with trylock-bit will be added ret_pages and - * handled in trace_android_vh_handle_failed_page_trylock. - * In the progress[unlock_page, handled], the page carried - * with trylock-bit will cause some error-issues in other - * scene, so clear trylock-bit here. - * trace_android_vh_page_trylock_get_result will clear - * trylock-bit and return if page tyrlock failed in - * reclaim-process. Here we just want to clear trylock-bit - * so that ignore page_trylock_result. - */ - trace_android_vh_page_trylock_get_result(page, &page_trylock_result); unlock_page(page); keep: list_add(&page->lru, &ret_pages); @@ -1693,6 +1681,25 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec, } +#ifdef CONFIG_CMA +/* + * It is waste of effort to scan and reclaim CMA pages if it is not available + * for current allocation context. Kswapd can not be enrolled as it can not + * distinguish this scenario by using sc->gfp_mask = GFP_KERNEL + */ +static bool skip_cma(struct page *page, struct scan_control *sc) +{ + return !current_is_kswapd() && + gfp_migratetype(sc->gfp_mask) != MIGRATE_MOVABLE && + get_pageblock_migratetype(page) == MIGRATE_CMA; +} +#else +static bool skip_cma(struct page *page, struct scan_control *sc) +{ + return false; +} +#endif + /** * pgdat->lru_lock is heavily contended. 
Some of the functions that * shrink the lists perform better by taking out a batch of pages @@ -1739,7 +1746,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, nr_pages = compound_nr(page); total_scan += nr_pages; - if (page_zonenum(page) > sc->reclaim_idx) { + if (page_zonenum(page) > sc->reclaim_idx || + skip_cma(page, sc)) { list_move(&page->lru, &pages_skipped); nr_skipped[page_zonenum(page)] += nr_pages; continue; @@ -1760,7 +1768,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, case 0: nr_taken += nr_pages; nr_zone_taken[page_zonenum(page)] += nr_pages; - trace_android_vh_del_page_from_lrulist(page, false, lru); list_move(&page->lru, dst); break; @@ -1840,10 +1847,9 @@ int isolate_lru_page(struct page *page) spin_lock_irq(&pgdat->lru_lock); lruvec = mem_cgroup_page_lruvec(page, pgdat); if (PageLRU(page)) { - int lru = page_lru(page); get_page(page); ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, lru); + del_page_from_lru_list(page, lruvec); ret = 0; } spin_unlock_irq(&pgdat->lru_lock); @@ -1915,13 +1921,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, int nr_pages, nr_moved = 0; LIST_HEAD(pages_to_free); struct page *page; - enum lru_list lru; while (!list_empty(list)) { page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); + list_del(&page->lru); if (unlikely(!page_evictable(page))) { - list_del(&page->lru); spin_unlock_irq(&pgdat->lru_lock); putback_lru_page(page); spin_lock_irq(&pgdat->lru_lock); @@ -1930,17 +1935,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, lruvec = mem_cgroup_page_lruvec(page, pgdat); SetPageLRU(page); - lru = page_lru(page); - - nr_pages = thp_nr_pages(page); - update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); - list_move(&page->lru, &lruvec->lists[lru]); - trace_android_vh_add_page_to_lrulist(page, false, lru); + add_page_to_lru_list(page, lruvec); if (put_page_testzero(page)) { - __ClearPageLRU(page); - __ClearPageActive(page); - del_page_from_lru_list(page, lruvec, lru); + del_page_from_lru_list(page, lruvec); + __clear_page_lru_flags(page); if (unlikely(PageCompound(page))) { spin_unlock_irq(&pgdat->lru_lock); @@ -1949,6 +1948,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, } else list_add(&page->lru, &pages_to_free); } else { + nr_pages = thp_nr_pages(page); nr_moved += nr_pages; if (PageActive(page)) workingset_age_nonresident(lruvec, nr_pages); @@ -2027,7 +2027,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false); - trace_android_vh_handle_failed_page_trylock(&page_list); spin_lock_irq(&pgdat->lru_lock); @@ -2040,6 +2039,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); + spin_unlock_irq(&pgdat->lru_lock); mem_cgroup_uncharge_list(&page_list); @@ -2090,7 +2090,6 @@ static void shrink_active_list(unsigned long nr_to_scan, int file = is_file_lru(lru); struct pglist_data *pgdat = lruvec_pgdat(lruvec); bool bypass = false; - bool should_protect = false; lru_add_drain(); @@ -2125,17 +2124,10 @@ static void shrink_active_list(unsigned long nr_to_scan, } } - trace_android_vh_page_should_be_protected(page, &should_protect); - if (unlikely(should_protect)) { - nr_rotated += thp_nr_pages(page); - 
list_add(&page->lru, &l_active); - continue; - } - trace_android_vh_page_referenced_check_bypass(page, nr_to_scan, lru, &bypass); if (bypass) goto skip_page_referenced; - trace_android_vh_page_trylock_set(page); + /* Referenced or rmap lock contention: rotate */ if (page_referenced(page, 0, sc->target_mem_cgroup, &vm_flags) != 0) { @@ -2149,13 +2141,11 @@ static void shrink_active_list(unsigned long nr_to_scan, * so we ignore them here. */ if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) { - trace_android_vh_page_trylock_clear(page); nr_rotated += thp_nr_pages(page); list_add(&page->lru, &l_active); continue; } } - trace_android_vh_page_trylock_clear(page); skip_page_referenced: ClearPageActive(page); /* we are de-activating */ SetPageWorkingset(page); @@ -2237,7 +2227,6 @@ unsigned long reclaim_pages(struct list_head *page_list) return nr_reclaimed; } -EXPORT_SYMBOL_GPL(reclaim_pages); static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc) @@ -2315,6 +2304,106 @@ enum scan_balance { SCAN_FILE, }; +static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) +{ + unsigned long file; + struct lruvec *target_lruvec; + + if (lru_gen_enabled()) + return; + + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + /* + * Determine the scan balance between anon and file LRUs. + */ + spin_lock_irq(&pgdat->lru_lock); + sc->anon_cost = target_lruvec->anon_cost; + sc->file_cost = target_lruvec->file_cost; + spin_unlock_irq(&pgdat->lru_lock); + + /* + * Target desirable inactive:active list ratios for the anon + * and file LRU lists. + */ + if (!sc->force_deactivate) { + unsigned long refaults; + + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_ANON); + if (refaults != target_lruvec->refaults[0] || + inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) + sc->may_deactivate |= DEACTIVATE_ANON; + else + sc->may_deactivate &= ~DEACTIVATE_ANON; + + /* + * When refaults are being observed, it means a new + * workingset is being established. Deactivate to get + * rid of any stale active pages quickly. + */ + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_FILE); + if (refaults != target_lruvec->refaults[1] || + inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) + sc->may_deactivate |= DEACTIVATE_FILE; + else + sc->may_deactivate &= ~DEACTIVATE_FILE; + } else + sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; + + /* + * If we have plenty of inactive file pages that aren't + * thrashing, try to reclaim those first before touching + * anonymous pages. + */ + file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); + if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) + sc->cache_trim_mode = 1; + else + sc->cache_trim_mode = 0; + + /* + * Prevent the reclaimer from falling into the cache trap: as + * cache pages start out inactive, every cache fault will tip + * the scan balance towards the file LRU. And as the file LRU + * shrinks, so does the window for rotation from references. + * This means we have a runaway feedback loop where a tiny + * thrashing file LRU becomes infinitely more attractive than + * anon pages. Try to detect this based on file LRU size. 
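+ * E.g., roughly: when free plus file pages no longer cover the node's
+ * high watermarks while inactive anon is still sizeable, file_is_tiny
+ * is set below and anon is scanned even though file pages remain.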
+ */ + if (!cgroup_reclaim(sc)) { + unsigned long total_high_wmark = 0; + unsigned long free, anon; + int z; + + free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); + file = node_page_state(pgdat, NR_ACTIVE_FILE) + + node_page_state(pgdat, NR_INACTIVE_FILE); + + for (z = 0; z < MAX_NR_ZONES; z++) { + struct zone *zone = &pgdat->node_zones[z]; + + if (!managed_zone(zone)) + continue; + + total_high_wmark += high_wmark_pages(zone); + } + + /* + * Consider anon: if that's low too, this isn't a + * runaway file reclaim problem, but rather just + * extreme pressure. Reclaim as per usual then. + */ + anon = node_page_state(pgdat, NR_INACTIVE_ANON); + + sc->file_is_tiny = + file + free <= total_high_wmark && + !(sc->may_deactivate & DEACTIVATE_ANON) && + anon >> sc->priority; + } +} + /* * Determine how aggressively the anon and file LRU lists should be * scanned. The relative value of each set of LRU lists is determined @@ -2527,127 +2616,2890 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, } } -static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -{ - unsigned long nr[NR_LRU_LISTS]; - unsigned long targets[NR_LRU_LISTS]; - unsigned long nr_to_scan; - enum lru_list lru; - unsigned long nr_reclaimed = 0; - unsigned long nr_to_reclaim = sc->nr_to_reclaim; - bool proportional_reclaim; - struct blk_plug plug; +#ifdef CONFIG_LRU_GEN - get_scan_count(lruvec, sc, nr); +#ifdef CONFIG_LRU_GEN_ENABLED +DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); +#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap]) +#else +DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); +#define get_cap(cap) static_branch_unlikely(&lru_gen_caps[cap]) +#endif - /* Record the original scan target for proportional adjustments later */ - memcpy(targets, nr, sizeof(nr)); +/****************************************************************************** + * shorthand helpers + ******************************************************************************/ - /* - * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal - * event that can occur when there is little memory pressure e.g. - * multiple streaming readers/writers. Hence, we do not abort scanning - * when the requested number of pages are reclaimed when scanning at - * DEF_PRIORITY on the assumption that the fact we are direct - * reclaiming implies that kswapd is not keeping up and it is best to - * do a batch of work at once. For memcg reclaim one check is made to - * abort proportional reclaim if either the file or anon lru has already - * dropped to zero at the first pass. 
- */ - proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() && - sc->priority == DEF_PRIORITY); +#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) - blk_start_plug(&plug); - while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || - nr[LRU_INACTIVE_FILE]) { - unsigned long nr_anon, nr_file, percentage; - unsigned long nr_scanned; +#define DEFINE_MAX_SEQ(lruvec) \ + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) - for_each_evictable_lru(lru) { - if (nr[lru]) { - nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); - nr[lru] -= nr_to_scan; +#define DEFINE_MIN_SEQ(lruvec) \ + unsigned long min_seq[ANON_AND_FILE] = { \ + READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ + READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ + } - nr_reclaimed += shrink_list(lru, nr_to_scan, - lruvec, sc); - } - } +#define for_each_gen_type_zone(gen, type, zone) \ + for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ + for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ + for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) - cond_resched(); +static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) +{ + struct pglist_data *pgdat = NODE_DATA(nid); - if (nr_reclaimed < nr_to_reclaim || proportional_reclaim) - continue; +#ifdef CONFIG_MEMCG + if (memcg) { + struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec; - /* - * For kswapd and memcg, reclaim at least the number of pages - * requested. Ensure that the anon and file LRUs are scanned - * proportionally what was requested by get_scan_count(). We - * stop reclaiming one LRU and reduce the amount scanning - * proportional to the original scan target. - */ - nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; - nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; + /* for hotadd_new_pgdat() */ + if (!lruvec->pgdat) + lruvec->pgdat = pgdat; - /* - * It's just vindictive to attack the larger once the smaller - * has gone to zero. And given the way we stop scanning the - * smaller below, this makes sure that we only make one nudge - * towards proportionality once we've got nr_to_reclaim. - */ - if (!nr_file || !nr_anon) - break; + return lruvec; + } +#endif + VM_WARN_ON_ONCE(!mem_cgroup_disabled()); - if (nr_file > nr_anon) { - unsigned long scan_target = targets[LRU_INACTIVE_ANON] + - targets[LRU_ACTIVE_ANON] + 1; - lru = LRU_BASE; - percentage = nr_anon * 100 / scan_target; - } else { - unsigned long scan_target = targets[LRU_INACTIVE_FILE] + - targets[LRU_ACTIVE_FILE] + 1; - lru = LRU_FILE; - percentage = nr_file * 100 / scan_target; - } + return pgdat ? &pgdat->__lruvec : NULL; +} - /* Stop scanning the smaller of the LRU */ - nr[lru] = 0; - nr[lru + LRU_ACTIVE] = 0; +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) +{ + struct mem_cgroup *memcg = lruvec_memcg(lruvec); - /* - * Recalculate the other LRU scan count based on its original - * scan target and the percentage scanning already complete - */ - lru = (lru == LRU_FILE) ? 
LRU_BASE : LRU_FILE; - nr_scanned = targets[lru] - nr[lru]; - nr[lru] = targets[lru] * (100 - percentage) / 100; - nr[lru] -= min(nr[lru], nr_scanned); + if (mem_cgroup_get_nr_swap_pages(memcg) <= 0) + return 0; - lru += LRU_ACTIVE; - nr_scanned = targets[lru] - nr[lru]; - nr[lru] = targets[lru] * (100 - percentage) / 100; - nr[lru] -= min(nr[lru], nr_scanned); - } - blk_finish_plug(&plug); - sc->nr_reclaimed += nr_reclaimed; + return mem_cgroup_swappiness(memcg); +} - /* - * Even if we did not try to evict anon pages at all, we want to - * rebalance the anon lru active/inactive ratio. - */ - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) - shrink_active_list(SWAP_CLUSTER_MAX, lruvec, - sc, LRU_ACTIVE_ANON); +static int get_nr_gens(struct lruvec *lruvec, int type) +{ + return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; } -/* Use reclaim/compaction for costly allocs or under memory pressure */ -static bool in_reclaim_compaction(struct scan_control *sc) +static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) { - if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && - (sc->order > PAGE_ALLOC_COSTLY_ORDER || - sc->priority < DEF_PRIORITY - 2)) - return true; + /* see the comment on lru_gen_struct */ + return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && + get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && + get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; +} - return false; +/****************************************************************************** + * mm_struct list + ******************************************************************************/ + +static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg) +{ + static struct lru_gen_mm_list mm_list = { + .fifo = LIST_HEAD_INIT(mm_list.fifo), + .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock), + }; + +#ifdef CONFIG_MEMCG + if (memcg) + return &memcg->mm_list; +#endif + VM_WARN_ON_ONCE(!mem_cgroup_disabled()); + + return &mm_list; +} + +void lru_gen_add_mm(struct mm_struct *mm) +{ + int nid; + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + + VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list)); +#ifdef CONFIG_MEMCG + VM_WARN_ON_ONCE(mm->lru_gen.memcg); + mm->lru_gen.memcg = memcg; +#endif + spin_lock(&mm_list->lock); + + for_each_node_state(nid, N_MEMORY) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + /* the first addition since the last iteration */ + if (lruvec->mm_state.tail == &mm_list->fifo) + lruvec->mm_state.tail = &mm->lru_gen.list; + } + + list_add_tail(&mm->lru_gen.list, &mm_list->fifo); + + spin_unlock(&mm_list->lock); +} + +void lru_gen_del_mm(struct mm_struct *mm) +{ + int nid; + struct lru_gen_mm_list *mm_list; + struct mem_cgroup *memcg = NULL; + + if (list_empty(&mm->lru_gen.list)) + return; + +#ifdef CONFIG_MEMCG + memcg = mm->lru_gen.memcg; +#endif + mm_list = get_mm_list(memcg); + + spin_lock(&mm_list->lock); + + for_each_node(nid) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + /* where the current iteration continues after */ + if (lruvec->mm_state.head == &mm->lru_gen.list) + lruvec->mm_state.head = lruvec->mm_state.head->prev; + + /* where the last iteration ended before */ + if (lruvec->mm_state.tail == &mm->lru_gen.list) + lruvec->mm_state.tail = lruvec->mm_state.tail->next; + } + + list_del_init(&mm->lru_gen.list); + + spin_unlock(&mm_list->lock); + +#ifdef CONFIG_MEMCG + mem_cgroup_put(mm->lru_gen.memcg); 
+ mm->lru_gen.memcg = NULL; +#endif +} + +#ifdef CONFIG_MEMCG +void lru_gen_migrate_mm(struct mm_struct *mm) +{ + struct mem_cgroup *memcg; + struct task_struct *task = rcu_dereference_protected(mm->owner, true); + + VM_WARN_ON_ONCE(task->mm != mm); + lockdep_assert_held(&task->alloc_lock); + + /* for mm_update_next_owner() */ + if (mem_cgroup_disabled()) + return; + + /* migration can happen before addition */ + if (!mm->lru_gen.memcg) + return; + + rcu_read_lock(); + memcg = mem_cgroup_from_task(task); + rcu_read_unlock(); + if (memcg == mm->lru_gen.memcg) + return; + + VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list)); + + lru_gen_del_mm(mm); + lru_gen_add_mm(mm); +} +#endif + +/* + * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when + * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of + * bits in a bitmap, k is the number of hash functions and n is the number of + * inserted items. + * + * Page table walkers use one of the two filters to reduce their search space. + * To get rid of non-leaf entries that no longer have enough leaf entries, the + * aging uses the double-buffering technique to flip to the other filter each + * time it produces a new generation. For non-leaf entries that have enough + * leaf entries, the aging carries them over to the next generation in + * walk_pmd_range(); the eviction also report them when walking the rmap + * in lru_gen_look_around(). + * + * For future optimizations: + * 1. It's not necessary to keep both filters all the time. The spare one can be + * freed after the RCU grace period and reallocated if needed again. + * 2. And when reallocating, it's worth scaling its size according to the number + * of inserted entries in the other filter, to reduce the memory overhead on + * small systems and false positives on large systems. + * 3. Jenkins' hash function is an alternative to Knuth's. 
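+ *
+ * (The false positive rates above follow from the standard estimate
+ * (1 - e^(-k*n/m))^k: with m = 1<<15 and k = 2, it yields ~0.21 for
+ * n = 10,000 and ~0.50 for n = 20,000.)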
+ */ +#define BLOOM_FILTER_SHIFT 15 + +static inline int filter_gen_from_seq(unsigned long seq) +{ + return seq % NR_BLOOM_FILTERS; +} + +static void get_item_key(void *item, int *key) +{ + u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2); + + BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32)); + + key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1); + key[1] = hash >> BLOOM_FILTER_SHIFT; +} + +static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq) +{ + unsigned long *filter; + int gen = filter_gen_from_seq(seq); + + filter = lruvec->mm_state.filters[gen]; + if (filter) { + bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT)); + return; + } + + filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); + WRITE_ONCE(lruvec->mm_state.filters[gen], filter); +} + +static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item) +{ + int key[2]; + unsigned long *filter; + int gen = filter_gen_from_seq(seq); + + filter = READ_ONCE(lruvec->mm_state.filters[gen]); + if (!filter) + return; + + get_item_key(item, key); + + if (!test_bit(key[0], filter)) + set_bit(key[0], filter); + if (!test_bit(key[1], filter)) + set_bit(key[1], filter); +} + +static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item) +{ + int key[2]; + unsigned long *filter; + int gen = filter_gen_from_seq(seq); + + filter = READ_ONCE(lruvec->mm_state.filters[gen]); + if (!filter) + return true; + + get_item_key(item, key); + + return test_bit(key[0], filter) && test_bit(key[1], filter); +} + +static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last) +{ + int i; + int hist; + + lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock); + + if (walk) { + hist = lru_hist_from_seq(walk->max_seq); + + for (i = 0; i < NR_MM_STATS; i++) { + WRITE_ONCE(lruvec->mm_state.stats[hist][i], + lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]); + walk->mm_stats[i] = 0; + } + } + + if (NR_HIST_GENS > 1 && last) { + hist = lru_hist_from_seq(lruvec->mm_state.seq + 1); + + for (i = 0; i < NR_MM_STATS; i++) + WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0); + } +} + +static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk) +{ + int type; + unsigned long size = 0; + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + int key = pgdat->node_id; + + if (!walk->full_scan && !node_isset(key, mm->lru_gen.nodes)) + return true; + + node_clear(key, mm->lru_gen.nodes); + + for (type = !walk->can_swap; type < ANON_AND_FILE; type++) { + size += type ? get_mm_counter(mm, MM_FILEPAGES) : + get_mm_counter(mm, MM_ANONPAGES) + + get_mm_counter(mm, MM_SHMEMPAGES); + } + + if (size < MIN_LRU_BATCH) + return true; + + return !mmget_not_zero(mm); +} + +static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, + struct mm_struct **iter) +{ + bool first = false; + bool last = false; + struct mm_struct *mm = NULL; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + struct lru_gen_mm_state *mm_state = &lruvec->mm_state; + + /* + * mm_state->seq is incremented after each iteration of mm_list. There + * are three interesting cases for this page table walker: + * 1. It tries to start a new iteration with a stale max_seq: there is + * nothing left to do. + * 2. It started the next iteration: it needs to reset the Bloom filter + * so that a fresh set of PTE tables can be recorded. + * 3. 
It ended the current iteration: it needs to reset the mm stats
+ * counters and tell its caller to increment max_seq.
+ */
+ spin_lock(&mm_list->lock);
+
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->max_seq);
+
+ if (walk->max_seq <= mm_state->seq)
+ goto done;
+
+ if (!mm_state->head)
+ mm_state->head = &mm_list->fifo;
+
+ if (mm_state->head == &mm_list->fifo)
+ first = true;
+
+ do {
+ mm_state->head = mm_state->head->next;
+ if (mm_state->head == &mm_list->fifo) {
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+ last = true;
+ break;
+ }
+
+ /* force scan for those added after the last iteration */
+ if (!mm_state->tail || mm_state->tail == mm_state->head) {
+ mm_state->tail = mm_state->head->next;
+ walk->full_scan = true;
+ }
+
+ mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+ if (should_skip_mm(mm, walk))
+ mm = NULL;
+ } while (!mm);
+done:
+ if (*iter || last)
+ reset_mm_stats(lruvec, walk, last);
+
+ spin_unlock(&mm_list->lock);
+
+ if (mm && first)
+ reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+ if (*iter)
+ mmput_async(*iter);
+
+ *iter = mm;
+
+ return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+ bool success = false;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+ spin_lock(&mm_list->lock);
+
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+
+ if (max_seq > mm_state->seq) {
+ mm_state->head = NULL;
+ mm_state->tail = NULL;
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+ reset_mm_stats(lruvec, NULL, true);
+ success = true;
+ }
+
+ spin_unlock(&mm_list->lock);
+
+ return success;
+}
+
+/******************************************************************************
+ * refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on a Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+protected) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is to
+ * turn off protection when SP>PV or turn on protection when SP<PV.
+ */
+struct ctrl_pos {
+ unsigned long refaulted;
+ unsigned long total;
+ int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+ struct ctrl_pos *pos)
+{
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+ pos->refaulted = lrugen->avg_refaulted[type][tier] +
+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+ pos->total = lrugen->avg_total[type][tier] +
+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ if (tier)
+ pos->total += lrugen->protected[hist][type][tier - 1];
+ pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+ int hist, tier;
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
+ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+ unsigned long seq = carryover ?
lrugen->min_seq[type] : lrugen->max_seq + 1; + + lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); + + if (!carryover && !clear) + return; + + hist = lru_hist_from_seq(seq); + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + if (carryover) { + unsigned long sum; + + sum = lrugen->avg_refaulted[type][tier] + + atomic_long_read(&lrugen->refaulted[hist][type][tier]); + WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2); + + sum = lrugen->avg_total[type][tier] + + atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + sum += lrugen->protected[hist][type][tier - 1]; + WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2); + } + + if (clear) { + atomic_long_set(&lrugen->refaulted[hist][type][tier], 0); + atomic_long_set(&lrugen->evicted[hist][type][tier], 0); + if (tier) + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0); + } + } +} + +static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) +{ + /* + * Return true if the PV has a limited number of refaults or a lower + * refaulted/total than the SP. + */ + return pv->refaulted < MIN_LRU_BATCH || + pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= + (sp->refaulted + 1) * pv->total * pv->gain; +} + +/****************************************************************************** + * the aging + ******************************************************************************/ + +/* promote pages accessed through page tables */ +static int page_update_gen(struct page *page, int gen) +{ + unsigned long new_flags, old_flags; + + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(!rcu_read_lock_held()); + + do { + old_flags = READ_ONCE(page->flags); + + /* lru_gen_del_page() has isolated this page? */ + if (!(old_flags & LRU_GEN_MASK)) { + /* for shrink_page_list() */ + new_flags = old_flags | BIT(PG_referenced); + continue; + } + + new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); + new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; +} + +/* protect pages accessed multiple times through file descriptors */ +static int page_inc_gen(struct lruvec *lruvec, struct page *page, bool reclaiming) +{ + int type = page_is_file_lru(page); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); + unsigned long new_flags, old_flags; + + do { + old_flags = READ_ONCE(page->flags); + + VM_WARN_ON_ONCE_PAGE(!(old_flags & LRU_GEN_MASK), page); + + new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + /* page_update_gen() has promoted this page? 
*/ + if (new_gen >= 0 && new_gen != old_gen) + return new_gen; + + new_gen = (old_gen + 1) % MAX_NR_GENS; + + new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); + new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; + /* for end_page_writeback() */ + if (reclaiming) + new_flags |= BIT(PG_reclaim); + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + lru_gen_update_size(lruvec, page, old_gen, new_gen); + + return new_gen; +} + +static void update_batch_size(struct lru_gen_mm_walk *walk, struct page *page, + int old_gen, int new_gen) +{ + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + int delta = thp_nr_pages(page); + + VM_WARN_ON_ONCE(old_gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(new_gen >= MAX_NR_GENS); + + walk->batched++; + + walk->nr_pages[old_gen][type][zone] -= delta; + walk->nr_pages[new_gen][type][zone] += delta; +} + +static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk) +{ + int gen, type, zone; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + walk->batched = 0; + + for_each_gen_type_zone(gen, type, zone) { + enum lru_list lru = type * LRU_INACTIVE_FILE; + int delta = walk->nr_pages[gen][type][zone]; + + if (!delta) + continue; + + walk->nr_pages[gen][type][zone] = 0; + WRITE_ONCE(lrugen->nr_pages[gen][type][zone], + lrugen->nr_pages[gen][type][zone] + delta); + + if (lru_gen_is_active(lruvec, gen)) + lru += LRU_ACTIVE; + __update_lru_size(lruvec, lru, zone, delta); + } +} + +static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args) +{ + struct address_space *mapping; + struct vm_area_struct *vma = args->vma; + struct lru_gen_mm_walk *walk = args->private; + + if (!vma_is_accessible(vma)) + return true; + + if (is_vm_hugetlb_page(vma)) + return true; + + if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) + return true; + + if (vma == get_gate_vma(vma->vm_mm)) + return true; + + if (vma_is_anonymous(vma)) + return !walk->can_swap; + + if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping)) + return true; + + mapping = vma->vm_file->f_mapping; + if (mapping_unevictable(mapping)) + return true; + + if (shmem_mapping(mapping)) + return !walk->can_swap; + + /* to exclude special mappings like dax, etc. */ + return !mapping->a_ops->readpage; +} + +/* + * Some userspace memory allocators map many single-page VMAs. Instead of + * returning back to the PGD table for each of such VMAs, finish an entire PMD + * table to reduce zigzags and improve cache performance. 
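+ *
+ * E.g., when called from walk_pte_range() (PMD_MASK, PAGE_SIZE), the
+ * range returned below never crosses the coverage of the current PMD
+ * table; when called from walk_pmd_range() (PUD_MASK, PMD_SIZE), it
+ * never crosses the coverage of the current PUD entry.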
+ */ +static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk *args, + unsigned long *vm_start, unsigned long *vm_end) +{ + unsigned long start = round_up(*vm_end, size); + unsigned long end = (start | ~mask) + 1; + + VM_WARN_ON_ONCE(mask & size); + VM_WARN_ON_ONCE((start & mask) != (*vm_start & mask)); + + while (args->vma) { + if (start >= args->vma->vm_end) { + args->vma = args->vma->vm_next; + continue; + } + + if (end && end <= args->vma->vm_start) + return false; + + if (should_skip_vma(args->vma->vm_start, args->vma->vm_end, args)) { + args->vma = args->vma->vm_next; + continue; + } + + *vm_start = max(start, args->vma->vm_start); + *vm_end = min(end - 1, args->vma->vm_end - 1) + 1; + + return true; + } + + return false; +} + +static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) +{ + unsigned long pfn = pte_pfn(pte); + + VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); + + if (!pte_present(pte) || is_zero_pfn(pfn)) + return -1; + + if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) + return -1; + + if (WARN_ON_ONCE(!pfn_valid(pfn))) + return -1; + + return pfn; +} + +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr) +{ + unsigned long pfn = pmd_pfn(pmd); + + VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); + + if (!pmd_present(pmd) || is_huge_zero_pmd(pmd)) + return -1; + + if (WARN_ON_ONCE(pmd_devmap(pmd))) + return -1; + + if (WARN_ON_ONCE(!pfn_valid(pfn))) + return -1; + + return pfn; +} +#endif + +static struct page *get_pfn_page(unsigned long pfn, struct mem_cgroup *memcg, + struct pglist_data *pgdat, bool can_swap) +{ + struct page *page; + + /* try to avoid unnecessary memory loads */ + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + return NULL; + + page = compound_head(pfn_to_page(pfn)); + if (page_to_nid(page) != pgdat->node_id) + return NULL; + + if (page_memcg_rcu(page) != memcg) + return NULL; + + /* file VMAs can contain anon pages from COW */ + if (!page_is_file_lru(page) && !can_swap) + return NULL; + + return page; +} + +static bool suitable_to_scan(int total, int young) +{ + int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8); + + /* suitable if the average number of young PTEs per cacheline is >=1 */ + return young * n >= total; +} + +static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, + struct mm_walk *args) +{ + int i; + pte_t *pte; + spinlock_t *ptl; + unsigned long addr; + int total = 0; + int young = 0; + struct lru_gen_mm_walk *walk = args->private; + struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec); + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + int old_gen, new_gen = lru_gen_from_seq(walk->max_seq); + + VM_WARN_ON_ONCE(pmd_leaf(*pmd)); + + ptl = pte_lockptr(args->mm, pmd); + if (!spin_trylock(ptl)) + return false; + + arch_enter_lazy_mmu_mode(); + + pte = pte_offset_map(pmd, start & PMD_MASK); +restart: + for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { + unsigned long pfn; + struct page *page; + + total++; + walk->mm_stats[MM_LEAF_TOTAL]++; + + pfn = get_pte_pfn(pte[i], args->vma, addr); + if (pfn == -1) + continue; + + if (!pte_young(pte[i])) { + walk->mm_stats[MM_LEAF_OLD]++; + continue; + } + + page = get_pfn_page(pfn, memcg, pgdat, walk->can_swap); + if (!page) + continue; + + if (!ptep_test_and_clear_young(args->vma, addr, pte + 
i)) + VM_WARN_ON_ONCE(true); + + young++; + walk->mm_stats[MM_LEAF_YOUNG]++; + + if (pte_dirty(pte[i]) && !PageDirty(page) && + !(PageAnon(page) && PageSwapBacked(page) && + !PageSwapCache(page))) + set_page_dirty(page); + + old_gen = page_update_gen(page, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + update_batch_size(walk, page, old_gen, new_gen); + } + + if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end)) + goto restart; + + pte_unmap(pte); + + arch_leave_lazy_mmu_mode(); + spin_unlock(ptl); + + return suitable_to_scan(total, young); +} + +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma, + struct mm_walk *args, unsigned long *bitmap, unsigned long *start) +{ + int i; + pmd_t *pmd; + spinlock_t *ptl; + struct lru_gen_mm_walk *walk = args->private; + struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec); + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + int old_gen, new_gen = lru_gen_from_seq(walk->max_seq); + + VM_WARN_ON_ONCE(pud_leaf(*pud)); + + /* try to batch at most 1+MIN_LRU_BATCH+1 entries */ + if (*start == -1) { + *start = next; + return; + } + + i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start); + if (i && i <= MIN_LRU_BATCH) { + __set_bit(i - 1, bitmap); + return; + } + + pmd = pmd_offset(pud, *start); + + ptl = pmd_lockptr(args->mm, pmd); + if (!spin_trylock(ptl)) + goto done; + + arch_enter_lazy_mmu_mode(); + + do { + unsigned long pfn; + struct page *page; + unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start; + + pfn = get_pmd_pfn(pmd[i], vma, addr); + if (pfn == -1) + goto next; + + if (!pmd_trans_huge(pmd[i])) { + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && + get_cap(LRU_GEN_NONLEAF_YOUNG)) + pmdp_test_and_clear_young(vma, addr, pmd + i); + goto next; + } + + page = get_pfn_page(pfn, memcg, pgdat, walk->can_swap); + if (!page) + goto next; + + if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + goto next; + + walk->mm_stats[MM_LEAF_YOUNG]++; + + if (pmd_dirty(pmd[i]) && !PageDirty(page) && + !(PageAnon(page) && PageSwapBacked(page) && + !PageSwapCache(page))) + set_page_dirty(page); + + old_gen = page_update_gen(page, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + update_batch_size(walk, page, old_gen, new_gen); +next: + i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1; + } while (i <= MIN_LRU_BATCH); + + arch_leave_lazy_mmu_mode(); + spin_unlock(ptl); +done: + *start = -1; + bitmap_zero(bitmap, MIN_LRU_BATCH); +} +#else +static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma, + struct mm_walk *args, unsigned long *bitmap, unsigned long *start) +{ +} +#endif + +static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, + struct mm_walk *args) +{ + int i; + pmd_t *pmd; + unsigned long next; + unsigned long addr; + struct vm_area_struct *vma; + unsigned long pos = -1; + struct lru_gen_mm_walk *walk = args->private; + unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; + + VM_WARN_ON_ONCE(pud_leaf(*pud)); + + /* + * Finish an entire PMD in two passes: the first only reaches to PTE + * tables to avoid taking the PMD lock; the second, if necessary, takes + * the PMD lock to clear the accessed bit in PMD entries. 
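+ *
+ * E.g., the second pass is carried out by walk_pmd_range_locked()
+ * below: candidate entries are first accumulated in a bitmap (at most
+ * 1+MIN_LRU_BATCH+1 of them) and their accessed bits are then cleared
+ * under a single acquisition of the PMD lock.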
+ */ + pmd = pmd_offset(pud, start & PUD_MASK); +restart: + /* walk_pte_range() may call get_next_vma() */ + vma = args->vma; + for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) { + pmd_t val = pmd_read_atomic(pmd + i); + + /* for pmd_read_atomic() */ + barrier(); + + next = pmd_addr_end(addr, end); + + if (!pmd_present(val) || is_huge_zero_pmd(val)) { + walk->mm_stats[MM_LEAF_TOTAL]++; + continue; + } + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (pmd_trans_huge(val)) { + unsigned long pfn = pmd_pfn(val); + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + + walk->mm_stats[MM_LEAF_TOTAL]++; + + if (!pmd_young(val)) { + walk->mm_stats[MM_LEAF_OLD]++; + continue; + } + + /* try to avoid unnecessary memory loads */ + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + continue; + + walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); + continue; + } +#endif + walk->mm_stats[MM_NONLEAF_TOTAL]++; + +#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG + if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { + if (!pmd_young(val)) + continue; + + walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos); + } +#endif + if (!walk->full_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i)) + continue; + + walk->mm_stats[MM_NONLEAF_FOUND]++; + + if (!walk_pte_range(&val, addr, next, args)) + continue; + + walk->mm_stats[MM_NONLEAF_ADDED]++; + + /* carry over to the next generation */ + update_bloom_filter(walk->lruvec, walk->max_seq + 1, pmd + i); + } + + walk_pmd_range_locked(pud, -1, vma, args, bitmap, &pos); + + if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end)) + goto restart; +} + +static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end, + struct mm_walk *args) +{ + int i; + pud_t *pud; + unsigned long addr; + unsigned long next; + struct lru_gen_mm_walk *walk = args->private; + + VM_WARN_ON_ONCE(p4d_leaf(*p4d)); + + pud = pud_offset(p4d, start & P4D_MASK); +restart: + for (i = pud_index(start), addr = start; addr != end; i++, addr = next) { + pud_t val = READ_ONCE(pud[i]); + + next = pud_addr_end(addr, end); + + if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val))) + continue; + + walk_pmd_range(&val, addr, next, args); + + if (need_resched() || walk->batched >= MAX_LRU_BATCH) { + end = (addr | ~PUD_MASK) + 1; + goto done; + } + } + + if (i < PTRS_PER_PUD && get_next_vma(P4D_MASK, PUD_SIZE, args, &start, &end)) + goto restart; + + end = round_up(end, P4D_SIZE); +done: + if (!end || !args->vma) + return 1; + + walk->next_addr = max(end, args->vma->vm_start); + + return -EAGAIN; +} + +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) +{ + static const struct mm_walk_ops mm_walk_ops = { + .test_walk = should_skip_vma, + .p4d_entry = walk_pud_range, + }; + + int err; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + + walk->next_addr = FIRST_USER_ADDRESS; + + do { + DEFINE_MAX_SEQ(lruvec); + + err = -EBUSY; + + /* another thread might have called inc_max_seq() */ + if (walk->max_seq != max_seq) + break; + + /* page_update_gen() requires stable page_memcg() */ + if (!mem_cgroup_trylock_pages(memcg)) + break; + + /* the caller might be holding the lock for write */ + if (mmap_read_trylock(mm)) { + err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk); + + mmap_read_unlock(mm); + } + + mem_cgroup_unlock_pages(); + + if (walk->batched) { + spin_lock_irq(&pgdat->lru_lock); + reset_batch_size(lruvec, walk); 
+ spin_unlock_irq(&pgdat->lru_lock); + } + + cond_resched(); + } while (err == -EAGAIN); +} + +static struct lru_gen_mm_walk *set_mm_walk(struct pglist_data *pgdat) +{ + struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk; + + if (pgdat && current_is_kswapd()) { + VM_WARN_ON_ONCE(walk); + + walk = &pgdat->mm_walk; + } else if (!pgdat && !walk) { + VM_WARN_ON_ONCE(current_is_kswapd()); + + walk = kzalloc(sizeof(*walk), __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); + } + + current->reclaim_state->mm_walk = walk; + + return walk; +} + +static void clear_mm_walk(void) +{ + struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk; + + VM_WARN_ON_ONCE(walk && memchr_inv(walk->nr_pages, 0, sizeof(walk->nr_pages))); + VM_WARN_ON_ONCE(walk && memchr_inv(walk->mm_stats, 0, sizeof(walk->mm_stats))); + + current->reclaim_state->mm_walk = NULL; + + if (!current_is_kswapd()) + kfree(walk); +} + +static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) +{ + int zone; + int remaining = MAX_LRU_BATCH; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); + + if (type == LRU_GEN_ANON && !can_swap) + goto done; + + /* prevent cold/hot inversion if full_scan is true */ + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + struct list_head *head = &lrugen->lists[old_gen][type][zone]; + + while (!list_empty(head)) { + struct page *page = lru_to_page(head); + + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page); + VM_WARN_ON_ONCE_PAGE(PageActive(page), page); + VM_WARN_ON_ONCE_PAGE(page_is_file_lru(page) != type, page); + VM_WARN_ON_ONCE_PAGE(page_zonenum(page) != zone, page); + + new_gen = page_inc_gen(lruvec, page, false); + list_move_tail(&page->lru, &lrugen->lists[new_gen][type][zone]); + + if (!--remaining) + return false; + } + } +done: + reset_ctrl_pos(lruvec, type, true); + WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); + + return true; +} + +static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) +{ + int gen, type, zone; + bool success = false; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + DEFINE_MIN_SEQ(lruvec); + + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); + + /* find the oldest populated generation */ + for (type = !can_swap; type < ANON_AND_FILE; type++) { + while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { + gen = lru_gen_from_seq(min_seq[type]); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + if (!list_empty(&lrugen->lists[gen][type][zone])) + goto next; + } + + min_seq[type]++; + } +next: + ; + } + + /* see the comment on lru_gen_struct */ + if (can_swap) { + min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); + min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); + } + + for (type = !can_swap; type < ANON_AND_FILE; type++) { + if (min_seq[type] == lrugen->min_seq[type]) + continue; + + reset_ctrl_pos(lruvec, type, true); + WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); + success = true; + } + + return success; +} + +static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool full_scan) +{ + int prev, next; + int type, zone; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + struct pglist_data *pgdat = lruvec_pgdat(lruvec); +restart: + spin_lock_irq(&pgdat->lru_lock); + + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); + + for (type = ANON_AND_FILE - 1; type >= 0; type--) { + if (get_nr_gens(lruvec, type) != MAX_NR_GENS) + continue; + + VM_WARN_ON_ONCE(!full_scan && (type == LRU_GEN_FILE || can_swap)); + + 
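+ /* make room by retiring this type's oldest generation first */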
if (inc_min_seq(lruvec, type, can_swap)) + continue; + + spin_unlock_irq(&pgdat->lru_lock); + cond_resched(); + goto restart; + } + + /* + * Update the active/inactive LRU sizes for compatibility. Both sides of + * the current max_seq need to be covered, since max_seq+1 can overlap + * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do + * overlap, cold/hot inversion happens. + */ + prev = lru_gen_from_seq(lrugen->max_seq - 1); + next = lru_gen_from_seq(lrugen->max_seq + 1); + + for (type = 0; type < ANON_AND_FILE; type++) { + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + enum lru_list lru = type * LRU_INACTIVE_FILE; + long delta = lrugen->nr_pages[prev][type][zone] - + lrugen->nr_pages[next][type][zone]; + + if (!delta) + continue; + + __update_lru_size(lruvec, lru, zone, delta); + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta); + } + } + + for (type = 0; type < ANON_AND_FILE; type++) + reset_ctrl_pos(lruvec, type, false); + + WRITE_ONCE(lrugen->timestamps[next], jiffies); + /* make sure preceding modifications appear */ + smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); + + spin_unlock_irq(&pgdat->lru_lock); +} + +static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, + struct scan_control *sc, bool can_swap, bool full_scan) +{ + bool success; + struct lru_gen_mm_walk *walk; + struct mm_struct *mm = NULL; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq)); + + /* see the comment in iterate_mm_list() */ + if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) { + success = false; + goto done; + } + + /* + * If the hardware doesn't automatically set the accessed bit, fallback + * to lru_gen_look_around(), which only clears the accessed bit in a + * handful of PTEs. Spreading the work out over a period of time usually + * is less efficient, but it avoids bursty page faults. + */ + if (!full_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { + success = iterate_mm_list_nowalk(lruvec, max_seq); + goto done; + } + + walk = set_mm_walk(NULL); + if (!walk) { + success = iterate_mm_list_nowalk(lruvec, max_seq); + goto done; + } + + walk->lruvec = lruvec; + walk->max_seq = max_seq; + walk->can_swap = can_swap; + walk->full_scan = full_scan; + + do { + success = iterate_mm_list(lruvec, walk, &mm); + if (mm) + walk_mm(lruvec, mm, walk); + } while (mm); +done: + if (success) + inc_max_seq(lruvec, can_swap, full_scan); + + return success; +} + +static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq, + struct scan_control *sc, bool can_swap, unsigned long *nr_to_scan) +{ + int gen, type, zone; + unsigned long old = 0; + unsigned long young = 0; + unsigned long total = 0; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + for (type = !can_swap; type < ANON_AND_FILE; type++) { + unsigned long seq; + + for (seq = min_seq[type]; seq <= max_seq; seq++) { + unsigned long size = 0; + + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += max_t(long, READ_ONCE(lrugen->nr_pages[gen][type][zone]), + 0); + + total += size; + if (seq == max_seq) + young += size; + else if (seq + MIN_NR_GENS == max_seq) + old += size; + } + } + + /* try to scrape all its memory if this memcg was deleted */ + *nr_to_scan = mem_cgroup_online(memcg) ? 
(total >> sc->priority) : total; + + /* + * The aging tries to be lazy to reduce the overhead, while the eviction + * stalls when the number of generations reaches MIN_NR_GENS. Hence, the + * ideal number of generations is MIN_NR_GENS+1. + */ + if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) + return true; + if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) + return false; + + /* + * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1) + * of the total number of pages for each generation. A reasonable range + * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The + * aging cares about the upper bound of hot pages, while the eviction + * cares about the lower bound of cold pages. + */ + if (young * MIN_NR_GENS > total) + return true; + if (old * (MIN_NR_GENS + 2) < total) + return true; + + return false; +} + +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long min_ttl) +{ + bool need_aging; + unsigned long nr_to_scan; + int swappiness = get_swappiness(lruvec, sc); + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + VM_WARN_ON_ONCE(sc->memcg_low_reclaim); + + mem_cgroup_calculate_protection(NULL, memcg); + + if (mem_cgroup_below_min(memcg)) + return false; + + need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan); + + if (min_ttl) { + int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); + + if (time_is_after_jiffies(birth + min_ttl)) + return false; + + /* the size is likely too small to be helpful */ + if (!nr_to_scan && sc->priority != DEF_PRIORITY) + return false; + } + + if (need_aging) + try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false); + + return true; +} + +/* to protect the working set of the last N jiffies */ +static unsigned long lru_gen_min_ttl __read_mostly; + +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) +{ + struct mem_cgroup *memcg; + bool success = false; + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); + + VM_WARN_ON_ONCE(!current_is_kswapd()); + + sc->last_reclaimed = sc->nr_reclaimed; + + /* + * To reduce the chance of going into the aging path, which can be + * costly, optimistically skip it if the flag below was cleared in the + * eviction path. This improves the overall performance when multiple + * memcgs are available. + */ + if (!sc->memcgs_need_aging) { + sc->memcgs_need_aging = true; + return; + } + + set_mm_walk(pgdat); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + + if (age_lruvec(lruvec, sc, min_ttl)) + success = true; + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + clear_mm_walk(); + + /* check the order to exclude compaction-induced reclaim */ + if (success || !min_ttl || sc->order) + return; + + /* + * The main goal is to OOM kill if every generation from all memcgs is + * younger than min_ttl. However, another possibility is all memcgs are + * either below min or empty. + */ + if (mutex_trylock(&oom_lock)) { + struct oom_control oc = { + .gfp_mask = sc->gfp_mask, + }; + + out_of_memory(&oc); + + mutex_unlock(&oom_lock); + } +} + +/* + * This function exploits spatial locality when shrink_page_list() walks the + * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. 
If + * the scan was done cacheline efficiently, it adds the PMD entry pointing to + * the PTE table to the Bloom filter. This forms a feedback loop between the + * eviction and the aging. + */ +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +{ + int i; + pte_t *pte; + unsigned long start; + unsigned long end; + unsigned long addr; + struct lru_gen_mm_walk *walk; + int young = 0; + unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; + struct page *page = pvmw->page; + bool can_swap = !page_is_file_lru(page); + struct mem_cgroup *memcg = page_memcg(page); + struct pglist_data *pgdat = page_pgdat(page); + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + DEFINE_MAX_SEQ(lruvec); + int old_gen, new_gen = lru_gen_from_seq(max_seq); + + lockdep_assert_held(pvmw->ptl); + VM_WARN_ON_ONCE_PAGE(PageLRU(page), page); + + if (spin_is_contended(pvmw->ptl)) + return; + + /* avoid taking the LRU lock under the PTL when possible */ + walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL; + + start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); + end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1; + + if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { + if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2) + end = start + MIN_LRU_BATCH * PAGE_SIZE; + else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2) + start = end - MIN_LRU_BATCH * PAGE_SIZE; + else { + start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2; + end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2; + } + } + + pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; + + rcu_read_lock(); + arch_enter_lazy_mmu_mode(); + + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { + unsigned long pfn; + + pfn = get_pte_pfn(pte[i], pvmw->vma, addr); + if (pfn == -1) + continue; + + if (!pte_young(pte[i])) + continue; + + page = get_pfn_page(pfn, memcg, pgdat, can_swap); + if (!page) + continue; + + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) + VM_WARN_ON_ONCE(true); + + young++; + + if (pte_dirty(pte[i]) && !PageDirty(page) && + !(PageAnon(page) && PageSwapBacked(page) && + !PageSwapCache(page))) + set_page_dirty(page); + + old_gen = page_lru_gen(page); + if (old_gen < 0) + SetPageReferenced(page); + else if (old_gen != new_gen) + __set_bit(i, bitmap); + } + + arch_leave_lazy_mmu_mode(); + rcu_read_unlock(); + + /* feedback from rmap walkers to page table walkers */ + if (suitable_to_scan(i, young)) + update_bloom_filter(lruvec, max_seq, pvmw->pmd); + + if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { + for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { + page = pte_page(pte[i]); + activate_page(page); + } + return; + } + + /* page_update_gen() requires stable page_memcg() */ + if (!mem_cgroup_trylock_pages(memcg)) + return; + + if (!walk) { + spin_lock_irq(&pgdat->lru_lock); + new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); + } + + for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { + page = compound_head(pte_page(pte[i])); + if (page_memcg_rcu(page) != memcg) + continue; + + old_gen = page_update_gen(page, new_gen); + if (old_gen < 0 || old_gen == new_gen) + continue; + + if (walk) + update_batch_size(walk, page, old_gen, new_gen); + else + lru_gen_update_size(lruvec, page, old_gen, new_gen); + } + + if (!walk) + spin_unlock_irq(&pgdat->lru_lock); + + mem_cgroup_unlock_pages(); +} + +/****************************************************************************** + * the eviction + 
******************************************************************************/ + +static bool sort_page(struct lruvec *lruvec, struct page *page, struct scan_control *sc, + int tier_idx) +{ + bool success; + int gen = page_lru_gen(page); + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + int delta = thp_nr_pages(page); + int refs = page_lru_refs(page); + int tier = lru_tier_from_refs(refs); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE_PAGE(gen >= MAX_NR_GENS, page); + + /* unevictable */ + if (!page_evictable(page)) { + success = lru_gen_del_page(lruvec, page, true); + VM_WARN_ON_ONCE_PAGE(!success, page); + SetPageUnevictable(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGCULLED, delta); + return true; + } + + /* dirty lazyfree */ + if (type == LRU_GEN_FILE && PageAnon(page) && PageDirty(page)) { + success = lru_gen_del_page(lruvec, page, true); + VM_WARN_ON_ONCE_PAGE(!success, page); + SetPageSwapBacked(page); + add_page_to_lru_list_tail(page, lruvec); + return true; + } + + /* promoted */ + if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { + list_move(&page->lru, &lrugen->lists[gen][type][zone]); + return true; + } + + /* protected */ + if (tier > tier_idx) { + int hist = lru_hist_from_seq(lrugen->min_seq[type]); + + gen = page_inc_gen(lruvec, page, false); + list_move_tail(&page->lru, &lrugen->lists[gen][type][zone]); + + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], + lrugen->protected[hist][type][tier - 1] + delta); + return true; + } + + /* ineligible */ + if (zone > sc->reclaim_idx || skip_cma(page, sc)) { + gen = page_inc_gen(lruvec, page, false); + list_move_tail(&page->lru, &lrugen->lists[gen][type][zone]); + return true; + } + + /* waiting for writeback */ + if (PageLocked(page) || PageWriteback(page) || + (type == LRU_GEN_FILE && PageDirty(page))) { + gen = page_inc_gen(lruvec, page, true); + list_move(&page->lru, &lrugen->lists[gen][type][zone]); + return true; + } + + return false; +} + +static bool isolate_page(struct lruvec *lruvec, struct page *page, struct scan_control *sc) +{ + bool success; + + /* unmapping inhibited */ + if (!sc->may_unmap && page_mapped(page)) + return false; + + /* swapping inhibited */ + if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && + (PageDirty(page) || + (PageAnon(page) && !PageSwapCache(page)))) + return false; + + /* raced with release_pages() */ + if (!get_page_unless_zero(page)) + return false; + + ClearPageLRU(page); + + /* see the comment on MAX_NR_TIERS */ + if (!PageReferenced(page)) + set_mask_bits(&page->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0); + + /* for shrink_page_list() */ + ClearPageReclaim(page); + ClearPageReferenced(page); + + success = lru_gen_del_page(lruvec, page, true); + VM_WARN_ON_ONCE_PAGE(!success, page); + + return true; +} + +static int scan_pages(struct lruvec *lruvec, struct scan_control *sc, + int type, int tier, struct list_head *list) +{ + int i; + int gen; + enum vm_event_item item; + int sorted = 0; + int scanned = 0; + int isolated = 0; + int remaining = MAX_LRU_BATCH; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + VM_WARN_ON_ONCE(!list_empty(list)); + + if (get_nr_gens(lruvec, type) == MIN_NR_GENS) + return 0; + + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + for (i = MAX_NR_ZONES; i > 0; i--) { + LIST_HEAD(moved); + int skipped = 0; + int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES; + struct list_head *head = &lrugen->lists[gen][type][zone]; + + 
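+ /* sort, isolate or skip pages from this zone's oldest generation, within the batch limits */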
while (!list_empty(head)) { + struct page *page = lru_to_page(head); + int delta = thp_nr_pages(page); + + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page); + VM_WARN_ON_ONCE_PAGE(PageActive(page), page); + VM_WARN_ON_ONCE_PAGE(page_is_file_lru(page) != type, page); + VM_WARN_ON_ONCE_PAGE(page_zonenum(page) != zone, page); + + scanned += delta; + + if (sort_page(lruvec, page, sc, tier)) + sorted += delta; + else if (isolate_page(lruvec, page, sc)) { + list_add(&page->lru, list); + isolated += delta; + } else { + list_move(&page->lru, &moved); + skipped += delta; + } + + if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) + break; + } + + if (skipped) { + list_splice(&moved, head); + __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); + } + + if (!remaining || isolated >= MIN_LRU_BATCH) + break; + } + + item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; + if (!cgroup_reclaim(sc)) { + __count_vm_events(item, isolated); + __count_vm_events(PGREFILL, sorted); + } + __count_memcg_events(memcg, item, isolated); + __count_memcg_events(memcg, PGREFILL, sorted); + __count_vm_events(PGSCAN_ANON + type, isolated); + + /* + * There might not be eligible pages due to reclaim_idx, may_unmap and + * may_writepage. Check the remaining to prevent livelock if it's not + * making progress. + */ + return isolated || !remaining ? scanned : 0; +} + +static int get_tier_idx(struct lruvec *lruvec, int type) +{ + int tier; + struct ctrl_pos sp, pv; + + /* + * To leave a margin for fluctuations, use a larger gain factor (1:2). + * This value is chosen because any other tier would have at least twice + * as many refaults as the first tier. + */ + read_ctrl_pos(lruvec, type, 0, 1, &sp); + for (tier = 1; tier < MAX_NR_TIERS; tier++) { + read_ctrl_pos(lruvec, type, tier, 2, &pv); + if (!positive_ctrl_err(&sp, &pv)) + break; + } + + return tier - 1; +} + +static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) +{ + int type, tier; + struct ctrl_pos sp, pv; + int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; + + /* + * Compare the first tier of anon with that of file to determine which + * type to scan. Also need to compare other tiers of the selected type + * with the first tier of the other type to determine the last tier (of + * the selected type) to evict. + */ + read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); + read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); + type = positive_ctrl_err(&sp, &pv); + + read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); + for (tier = 1; tier < MAX_NR_TIERS; tier++) { + read_ctrl_pos(lruvec, type, tier, gain[type], &pv); + if (!positive_ctrl_err(&sp, &pv)) + break; + } + + *tier_idx = tier - 1; + + return type; +} + +static int isolate_pages(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + int *type_scanned, struct list_head *list) +{ + int i; + int type; + int scanned; + int tier = -1; + DEFINE_MIN_SEQ(lruvec); + + /* + * Try to make the obvious choice first. When anon and file are both + * available from the same generation, interpret swappiness 1 as file + * first and 200 as anon first. 
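+ *
+ * E.g., with a nonzero swappiness, anon is chosen outright whenever its
+ * oldest generation lags behind file's (min_seq[LRU_GEN_ANON] <
+ * min_seq[LRU_GEN_FILE]); the swappiness extremes and
+ * get_type_to_scan() only come into play otherwise.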
+ */ + if (!swappiness) + type = LRU_GEN_FILE; + else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) + type = LRU_GEN_ANON; + else if (swappiness == 1) + type = LRU_GEN_FILE; + else if (swappiness == 200) + type = LRU_GEN_ANON; + else + type = get_type_to_scan(lruvec, swappiness, &tier); + + for (i = !swappiness; i < ANON_AND_FILE; i++) { + if (tier < 0) + tier = get_tier_idx(lruvec, type); + + scanned = scan_pages(lruvec, sc, type, tier, list); + if (scanned) + break; + + type = !type; + tier = -1; + } + + *type_scanned = type; + + return scanned; +} + +static int evict_pages(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + bool *need_swapping) +{ + int type; + int scanned; + int reclaimed; + LIST_HEAD(list); + LIST_HEAD(clean); + struct page *page; + struct page *next; + enum vm_event_item item; + struct reclaim_stat stat; + struct lru_gen_mm_walk *walk; + bool skip_retry = false; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + + spin_lock_irq(&pgdat->lru_lock); + + scanned = isolate_pages(lruvec, sc, swappiness, &type, &list); + + scanned += try_to_inc_min_seq(lruvec, swappiness); + + if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS) + scanned = 0; + + spin_unlock_irq(&pgdat->lru_lock); + + if (list_empty(&list)) + return scanned; +retry: + reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); + sc->nr_reclaimed += reclaimed; + + list_for_each_entry_safe_reverse(page, next, &list, lru) { + if (!page_evictable(page)) { + list_del(&page->lru); + putback_lru_page(page); + continue; + } + + if (PageReclaim(page) && + (PageDirty(page) || PageWriteback(page))) { + /* restore LRU_REFS_FLAGS cleared by isolate_page() */ + if (PageWorkingset(page)) + SetPageReferenced(page); + continue; + } + + if (skip_retry || PageActive(page) || PageReferenced(page) || + page_mapped(page) || PageLocked(page) || + PageDirty(page) || PageWriteback(page)) { + /* don't add rejected pages to the oldest generation */ + set_mask_bits(&page->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, + BIT(PG_active)); + continue; + } + + /* retry pages that may have missed rotate_reclaimable_page() */ + list_move(&page->lru, &clean); + sc->nr_scanned -= thp_nr_pages(page); + } + + spin_lock_irq(&pgdat->lru_lock); + + move_pages_to_lru(lruvec, &list); + + walk = current->reclaim_state->mm_walk; + if (walk && walk->batched) + reset_batch_size(lruvec, walk); + + item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; + if (!cgroup_reclaim(sc)) + __count_vm_events(item, reclaimed); + __count_memcg_events(memcg, item, reclaimed); + __count_vm_events(PGSTEAL_ANON + type, reclaimed); + + spin_unlock_irq(&pgdat->lru_lock); + + mem_cgroup_uncharge_list(&list); + free_unref_page_list(&list); + + INIT_LIST_HEAD(&list); + list_splice_init(&clean, &list); + + if (!list_empty(&list)) { + skip_retry = true; + goto retry; + } + + if (need_swapping && type == LRU_GEN_ANON) + *need_swapping = true; + + return scanned; +} + +/* + * For future optimizations: + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg + * reclaim. 
+ */ +static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, + bool can_swap, bool *need_aging) +{ + unsigned long nr_to_scan; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (mem_cgroup_below_min(memcg) || + (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) + return 0; + + *need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan); + if (!*need_aging) + return nr_to_scan; + + /* skip the aging path at the default priority */ + if (sc->priority == DEF_PRIORITY) + goto done; + + /* leave the work to lru_gen_age_node() */ + if (current_is_kswapd()) + return 0; + + if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false)) + return nr_to_scan; +done: + return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0; +} + +static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq, + struct scan_control *sc, bool need_swapping) +{ + int i; + DEFINE_MAX_SEQ(lruvec); + + if (!current_is_kswapd()) { + /* age each memcg at most once to ensure fairness */ + if (max_seq - seq > 1) + return true; + + /* over-swapping can increase allocation latency */ + if (sc->nr_reclaimed >= sc->nr_to_reclaim && need_swapping) + return true; + + /* give this thread a chance to exit and free its memory */ + if (fatal_signal_pending(current)) { + sc->nr_reclaimed += MIN_LRU_BATCH; + return true; + } + + if (cgroup_reclaim(sc)) + return false; + } else if (sc->nr_reclaimed - sc->last_reclaimed < sc->nr_to_reclaim) + return false; + + /* keep scanning at low priorities to ensure fairness */ + if (sc->priority > DEF_PRIORITY - 2) + return false; + + /* + * A minimum amount of work was done under global memory pressure. For + * kswapd, it may be overshooting. For direct reclaim, the allocation + * may succeed if all suitable zones are somewhat safe. In either case, + * it's better to stop now, and restart later if necessary. + */ + for (i = 0; i <= sc->reclaim_idx; i++) { + unsigned long wmark; + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i; + + if (!managed_zone(zone)) + continue; + + wmark = current_is_kswapd() ? 
high_wmark_pages(zone) : low_wmark_pages(zone); + if (wmark > zone_page_state(zone, NR_FREE_PAGES)) + return false; + } + + sc->nr_reclaimed += MIN_LRU_BATCH; + + return true; +} + +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ + struct blk_plug plug; + bool need_aging = false; + bool need_swapping = false; + unsigned long scanned = 0; + unsigned long reclaimed = sc->nr_reclaimed; + DEFINE_MAX_SEQ(lruvec); + + lru_add_drain(); + + blk_start_plug(&plug); + + set_mm_walk(lruvec_pgdat(lruvec)); + + while (true) { + int delta; + int swappiness; + unsigned long nr_to_scan; + + if (sc->may_swap) + swappiness = get_swappiness(lruvec, sc); + else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) + swappiness = 1; + else + swappiness = 0; + + nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, &need_aging); + if (!nr_to_scan) + goto done; + + delta = evict_pages(lruvec, sc, swappiness, &need_swapping); + if (!delta) + goto done; + + scanned += delta; + if (scanned >= nr_to_scan) + break; + + if (should_abort_scan(lruvec, max_seq, sc, need_swapping)) + break; + + cond_resched(); + } + + /* see the comment in lru_gen_age_node() */ + if (sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH && !need_aging) + sc->memcgs_need_aging = false; +done: + clear_mm_walk(); + + blk_finish_plug(&plug); +} + +/****************************************************************************** + * state change + ******************************************************************************/ + +static bool __maybe_unused state_is_valid(struct lruvec *lruvec) +{ + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + if (lrugen->enabled) { + enum lru_list lru; + + for_each_evictable_lru(lru) { + if (!list_empty(&lruvec->lists[lru])) + return false; + } + } else { + int gen, type, zone; + + for_each_gen_type_zone(gen, type, zone) { + if (!list_empty(&lrugen->lists[gen][type][zone])) + return false; + } + } + + return true; +} + +static bool fill_evictable(struct lruvec *lruvec) +{ + enum lru_list lru; + int remaining = MAX_LRU_BATCH; + + for_each_evictable_lru(lru) { + int type = is_file_lru(lru); + bool active = is_active_lru(lru); + struct list_head *head = &lruvec->lists[lru]; + + while (!list_empty(head)) { + bool success; + struct page *page = lru_to_page(head); + + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page); + VM_WARN_ON_ONCE_PAGE(PageActive(page) != active, page); + VM_WARN_ON_ONCE_PAGE(page_is_file_lru(page) != type, page); + VM_WARN_ON_ONCE_PAGE(page_lru_gen(page) != -1, page); + + del_page_from_lru_list(page, lruvec); + success = lru_gen_add_page(lruvec, page, false); + VM_WARN_ON_ONCE(!success); + + if (!--remaining) + return false; + } + } + + return true; +} + +static bool drain_evictable(struct lruvec *lruvec) +{ + int gen, type, zone; + int remaining = MAX_LRU_BATCH; + + for_each_gen_type_zone(gen, type, zone) { + struct list_head *head = &lruvec->lrugen.lists[gen][type][zone]; + + while (!list_empty(head)) { + bool success; + struct page *page = lru_to_page(head); + + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page); + VM_WARN_ON_ONCE_PAGE(PageActive(page), page); + VM_WARN_ON_ONCE_PAGE(page_is_file_lru(page) != type, page); + VM_WARN_ON_ONCE_PAGE(page_zonenum(page) != zone, page); + + success = lru_gen_del_page(lruvec, page, false); + VM_WARN_ON_ONCE(!success); + add_page_to_lru_list(page, lruvec); + + if (!--remaining) + return false; + } + } + + return true; +} + +static void lru_gen_change_state(bool enabled) +{ + static DEFINE_MUTEX(state_mutex); + + struct 
mem_cgroup *memcg; + + cgroup_lock(); + cpus_read_lock(); + get_online_mems(); + mutex_lock(&state_mutex); + + if (enabled == lru_gen_enabled()) + goto unlock; + + if (enabled) + static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); + else + static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node(nid) { + struct pglist_data *pgdat = NODE_DATA(nid); + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + spin_lock_irq(&pgdat->lru_lock); + + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); + VM_WARN_ON_ONCE(!state_is_valid(lruvec)); + + lruvec->lrugen.enabled = enabled; + + while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) { + spin_unlock_irq(&pgdat->lru_lock); + cond_resched(); + spin_lock_irq(&pgdat->lru_lock); + } + + spin_unlock_irq(&pgdat->lru_lock); + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); +unlock: + mutex_unlock(&state_mutex); + put_online_mems(); + cpus_read_unlock(); + cgroup_unlock(); +} + +/****************************************************************************** + * sysfs interface + ******************************************************************************/ + +static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); +} + +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int msecs; + + if (kstrtouint(buf, 0, &msecs)) + return -EINVAL; + + WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); + + return len; +} + +static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( + min_ttl_ms, 0644, show_min_ttl, store_min_ttl +); + +static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + unsigned int caps = 0; + + if (get_cap(LRU_GEN_CORE)) + caps |= BIT(LRU_GEN_CORE); + + if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK)) + caps |= BIT(LRU_GEN_MM_WALK); + + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG)) + caps |= BIT(LRU_GEN_NONLEAF_YOUNG); + + return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); +} + +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + int i; + unsigned int caps; + + if (tolower(*buf) == 'n') + caps = 0; + else if (tolower(*buf) == 'y') + caps = -1; + else if (kstrtouint(buf, 0, &caps)) + return -EINVAL; + + for (i = 0; i < NR_LRU_GEN_CAPS; i++) { + bool enabled = caps & BIT(i); + + if (i == LRU_GEN_CORE) + lru_gen_change_state(enabled); + else if (enabled) + static_branch_enable(&lru_gen_caps[i]); + else + static_branch_disable(&lru_gen_caps[i]); + } + + return len; +} + +static struct kobj_attribute lru_gen_enabled_attr = __ATTR( + enabled, 0644, show_enabled, store_enabled +); + +static struct attribute *lru_gen_attrs[] = { + &lru_gen_min_ttl_attr.attr, + &lru_gen_enabled_attr.attr, + NULL +}; + +static struct attribute_group lru_gen_attr_group = { + .name = "lru_gen", + .attrs = lru_gen_attrs, +}; + +/****************************************************************************** + * debugfs interface + ******************************************************************************/ + +static void 
*lru_gen_seq_start(struct seq_file *m, loff_t *pos) +{ + struct mem_cgroup *memcg; + loff_t nr_to_skip = *pos; + + m->private = kvmalloc(PATH_MAX, GFP_KERNEL); + if (!m->private) + return ERR_PTR(-ENOMEM); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node_state(nid, N_MEMORY) { + if (!nr_to_skip--) + return get_lruvec(memcg, nid); + } + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return NULL; +} + +static void lru_gen_seq_stop(struct seq_file *m, void *v) +{ + if (!IS_ERR_OR_NULL(v)) + mem_cgroup_iter_break(NULL, lruvec_memcg(v)); + + kvfree(m->private); + m->private = NULL; +} + +static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + int nid = lruvec_pgdat(v)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(v); + + ++*pos; + + nid = next_memory_node(nid); + if (nid == MAX_NUMNODES) { + memcg = mem_cgroup_iter(NULL, memcg, NULL); + if (!memcg) + return NULL; + + nid = first_memory_node; + } + + return get_lruvec(memcg, nid); +} + +static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, + unsigned long max_seq, unsigned long *min_seq, + unsigned long seq) +{ + int i; + int type, tier; + int hist = lru_hist_from_seq(seq); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + seq_printf(m, " %10d", tier); + for (type = 0; type < ANON_AND_FILE; type++) { + const char *s = " "; + unsigned long n[3] = {}; + + if (seq == max_seq) { + s = "RT "; + n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]); + n[1] = READ_ONCE(lrugen->avg_total[type][tier]); + } else if (seq == min_seq[type] || NR_HIST_GENS > 1) { + s = "rep"; + n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]); + n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]); + } + + for (i = 0; i < 3; i++) + seq_printf(m, " %10lu%c", n[i], s[i]); + } + seq_putc(m, '\n'); + } + + seq_puts(m, " "); + for (i = 0; i < NR_MM_STATS; i++) { + const char *s = " "; + unsigned long n = 0; + + if (seq == max_seq && NR_HIST_GENS == 1) { + s = "LOYNFA"; + n = READ_ONCE(lruvec->mm_state.stats[hist][i]); + } else if (seq != max_seq && NR_HIST_GENS > 1) { + s = "loynfa"; + n = READ_ONCE(lruvec->mm_state.stats[hist][i]); + } + + seq_printf(m, " %10lu%c", n, s[i]); + } + seq_putc(m, '\n'); +} + +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static int lru_gen_seq_show(struct seq_file *m, void *v) +{ + unsigned long seq; + bool full = !debugfs_real_fops(m->file)->write; + struct lruvec *lruvec = v; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int nid = lruvec_pgdat(lruvec)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (nid == first_memory_node) { + const char *path = memcg ? 
m->private : ""; + +#ifdef CONFIG_MEMCG + if (memcg) + cgroup_path(memcg->css.cgroup, m->private, PATH_MAX); +#endif + seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path); + } + + seq_printf(m, " node %5d\n", nid); + + if (!full) + seq = min_seq[LRU_GEN_ANON]; + else if (max_seq >= MAX_NR_GENS) + seq = max_seq - MAX_NR_GENS + 1; + else + seq = 0; + + for (; seq <= max_seq; seq++) { + int type, zone; + int gen = lru_gen_from_seq(seq); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); + + seq_printf(m, " %10lu %10u", seq, jiffies_to_msecs(jiffies - birth)); + + for (type = 0; type < ANON_AND_FILE; type++) { + unsigned long size = 0; + char mark = full && seq < min_seq[type] ? 'x' : ' '; + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += max_t(long, READ_ONCE(lrugen->nr_pages[gen][type][zone]), + 0); + + seq_printf(m, " %10lu%c", size, mark); + } + + seq_putc(m, '\n'); + + if (full) + lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq); + } + + return 0; +} + +static const struct seq_operations lru_gen_seq_ops = { + .start = lru_gen_seq_start, + .stop = lru_gen_seq_stop, + .next = lru_gen_seq_next, + .show = lru_gen_seq_show, +}; + +static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc, + bool can_swap, bool full_scan) +{ + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (seq < max_seq) + return 0; + + if (seq > max_seq) + return -EINVAL; + + if (!full_scan && min_seq[!can_swap] + MAX_NR_GENS - 1 <= max_seq) + return -ERANGE; + + try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, full_scan); + + return 0; +} + +static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc, + int swappiness, unsigned long nr_to_reclaim) +{ + DEFINE_MAX_SEQ(lruvec); + + if (seq + MIN_NR_GENS > max_seq) + return -EINVAL; + + sc->nr_reclaimed = 0; + + while (!signal_pending(current)) { + DEFINE_MIN_SEQ(lruvec); + + if (seq < min_seq[!swappiness]) + return 0; + + if (sc->nr_reclaimed >= nr_to_reclaim) + return 0; + + if (!evict_pages(lruvec, sc, swappiness, NULL)) + return 0; + + cond_resched(); + } + + return -EINTR; +} + +static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, + struct scan_control *sc, int swappiness, unsigned long opt) +{ + struct lruvec *lruvec; + int err = -EINVAL; + struct mem_cgroup *memcg = NULL; + + if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY)) + return -EINVAL; + + if (!mem_cgroup_disabled()) { + rcu_read_lock(); + memcg = mem_cgroup_from_id(memcg_id); +#ifdef CONFIG_MEMCG + if (memcg && !css_tryget(&memcg->css)) + memcg = NULL; +#endif + rcu_read_unlock(); + + if (!memcg) + return -EINVAL; + } + + if (memcg_id != mem_cgroup_id(memcg)) + goto done; + + lruvec = get_lruvec(memcg, nid); + + if (swappiness < 0) + swappiness = get_swappiness(lruvec, sc); + else if (swappiness > 200) + goto done; + + switch (cmd) { + case '+': + err = run_aging(lruvec, seq, sc, swappiness, opt); + break; + case '-': + err = run_eviction(lruvec, seq, sc, swappiness, opt); + break; + } +done: + mem_cgroup_put(memcg); + + return err; +} + +/* see Documentation/admin-guide/mm/multigen_lru.rst for details */ +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, + size_t len, loff_t *pos) +{ + void *buf; + char *cur, *next; + unsigned int flags; + struct blk_plug plug; + int err = -EINVAL; + struct scan_control sc = { + .may_writepage = true, + .may_unmap = true, + .may_swap = true, + .reclaim_idx = MAX_NR_ZONES - 1, + .gfp_mask = GFP_KERNEL, + }; + + 
buf = kvmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + if (copy_from_user(buf, src, len)) { + kvfree(buf); + return -EFAULT; + } + + set_task_reclaim_state(current, &sc.reclaim_state); + flags = memalloc_noreclaim_save(); + blk_start_plug(&plug); + if (!set_mm_walk(NULL)) { + err = -ENOMEM; + goto done; + } + + next = buf; + next[len] = '\0'; + + while ((cur = strsep(&next, ",;\n"))) { + int n; + int end; + char cmd; + unsigned int memcg_id; + unsigned int nid; + unsigned long seq; + unsigned int swappiness = -1; + unsigned long opt = -1; + + cur = skip_spaces(cur); + if (!*cur) + continue; + + n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid, + &seq, &end, &swappiness, &end, &opt, &end); + if (n < 4 || cur[end]) { + err = -EINVAL; + break; + } + + err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt); + if (err) + break; + } +done: + clear_mm_walk(); + blk_finish_plug(&plug); + memalloc_noreclaim_restore(flags); + set_task_reclaim_state(current, NULL); + + kvfree(buf); + + return err ? : len; +} + +static int lru_gen_seq_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &lru_gen_seq_ops); +} + +static const struct file_operations lru_gen_rw_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .write = lru_gen_seq_write, + .llseek = seq_lseek, + .release = seq_release, +}; + +static const struct file_operations lru_gen_ro_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +/****************************************************************************** + * initialization + ******************************************************************************/ + +void lru_gen_init_lruvec(struct lruvec *lruvec) +{ + int i; + int gen, type, zone; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + lrugen->max_seq = MIN_NR_GENS + 1; + lrugen->enabled = lru_gen_enabled(); + + for (i = 0; i <= MIN_NR_GENS + 1; i++) + lrugen->timestamps[i] = jiffies; + + for_each_gen_type_zone(gen, type, zone) + INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); + + lruvec->mm_state.seq = MIN_NR_GENS; +} + +#ifdef CONFIG_MEMCG +void lru_gen_init_memcg(struct mem_cgroup *memcg) +{ + INIT_LIST_HEAD(&memcg->mm_list.fifo); + spin_lock_init(&memcg->mm_list.lock); +} + +void lru_gen_exit_memcg(struct mem_cgroup *memcg) +{ + int i; + int nid; + + for_each_node(nid) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, + sizeof(lruvec->lrugen.nr_pages))); + + for (i = 0; i < NR_BLOOM_FILTERS; i++) { + bitmap_free(lruvec->mm_state.filters[i]); + lruvec->mm_state.filters[i] = NULL; + } + } +} +#endif + +static int __init init_lru_gen(void) +{ + BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); + BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); + + if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) + pr_err("lru_gen: failed to create sysfs group\n"); + + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); + debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops); + + return 0; +}; +late_initcall(init_lru_gen); + +#else /* !CONFIG_LRU_GEN */ + +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) +{ +} + +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ +} + +#endif /* CONFIG_LRU_GEN */ + +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ + unsigned long nr[NR_LRU_LISTS]; + unsigned long targets[NR_LRU_LISTS]; + unsigned long 
nr_to_scan; + enum lru_list lru; + unsigned long nr_reclaimed = 0; + unsigned long nr_to_reclaim = sc->nr_to_reclaim; + bool proportional_reclaim; + struct blk_plug plug; + + if (lru_gen_enabled()) { + lru_gen_shrink_lruvec(lruvec, sc); + return; + } + + get_scan_count(lruvec, sc, nr); + + /* Record the original scan target for proportional adjustments later */ + memcpy(targets, nr, sizeof(nr)); + + /* + * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal + * event that can occur when there is little memory pressure e.g. + * multiple streaming readers/writers. Hence, we do not abort scanning + * when the requested number of pages are reclaimed when scanning at + * DEF_PRIORITY on the assumption that the fact we are direct + * reclaiming implies that kswapd is not keeping up and it is best to + * do a batch of work at once. For memcg reclaim one check is made to + * abort proportional reclaim if either the file or anon lru has already + * dropped to zero at the first pass. + */ + proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() && + sc->priority == DEF_PRIORITY); + + blk_start_plug(&plug); + while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || + nr[LRU_INACTIVE_FILE]) { + unsigned long nr_anon, nr_file, percentage; + unsigned long nr_scanned; + + for_each_evictable_lru(lru) { + if (nr[lru]) { + nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); + nr[lru] -= nr_to_scan; + + nr_reclaimed += shrink_list(lru, nr_to_scan, + lruvec, sc); + } + } + + cond_resched(); + + if (nr_reclaimed < nr_to_reclaim || proportional_reclaim) + continue; + + /* + * For kswapd and memcg, reclaim at least the number of pages + * requested. Ensure that the anon and file LRUs are scanned + * proportionally what was requested by get_scan_count(). We + * stop reclaiming one LRU and reduce the amount scanning + * proportional to the original scan target. + */ + nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; + nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; + + /* + * It's just vindictive to attack the larger once the smaller + * has gone to zero. And given the way we stop scanning the + * smaller below, this makes sure that we only make one nudge + * towards proportionality once we've got nr_to_reclaim. + */ + if (!nr_file || !nr_anon) + break; + + if (nr_file > nr_anon) { + unsigned long scan_target = targets[LRU_INACTIVE_ANON] + + targets[LRU_ACTIVE_ANON] + 1; + lru = LRU_BASE; + percentage = nr_anon * 100 / scan_target; + } else { + unsigned long scan_target = targets[LRU_INACTIVE_FILE] + + targets[LRU_ACTIVE_FILE] + 1; + lru = LRU_FILE; + percentage = nr_file * 100 / scan_target; + } + + /* Stop scanning the smaller of the LRU */ + nr[lru] = 0; + nr[lru + LRU_ACTIVE] = 0; + + /* + * Recalculate the other LRU scan count based on its original + * scan target and the percentage scanning already complete + */ + lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; + nr_scanned = targets[lru] - nr[lru]; + nr[lru] = targets[lru] * (100 - percentage) / 100; + nr[lru] -= min(nr[lru], nr_scanned); + + lru += LRU_ACTIVE; + nr_scanned = targets[lru] - nr[lru]; + nr[lru] = targets[lru] * (100 - percentage) / 100; + nr[lru] -= min(nr[lru], nr_scanned); + } + blk_finish_plug(&plug); + sc->nr_reclaimed += nr_reclaimed; + + /* + * Even if we did not try to evict anon pages at all, we want to + * rebalance the anon lru active/inactive ratio. 
+ */ + if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) + shrink_active_list(SWAP_CLUSTER_MAX, lruvec, + sc, LRU_ACTIVE_ANON); +} + +/* Use reclaim/compaction for costly allocs or under memory pressure */ +static bool in_reclaim_compaction(struct scan_control *sc) +{ + if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && + (sc->order > PAGE_ALLOC_COSTLY_ORDER || + sc->priority < DEF_PRIORITY - 2)) + return true; + + return false; } /* @@ -2778,7 +5630,6 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) unsigned long nr_reclaimed, nr_scanned; struct lruvec *target_lruvec; bool reclaimable = false; - unsigned long file; target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); @@ -2788,93 +5639,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) nr_reclaimed = sc->nr_reclaimed; nr_scanned = sc->nr_scanned; - /* - * Determine the scan balance between anon and file LRUs. - */ - spin_lock_irq(&pgdat->lru_lock); - sc->anon_cost = target_lruvec->anon_cost; - sc->file_cost = target_lruvec->file_cost; - spin_unlock_irq(&pgdat->lru_lock); - - /* - * Target desirable inactive:active list ratios for the anon - * and file LRU lists. - */ - if (!sc->force_deactivate) { - unsigned long refaults; - - refaults = lruvec_page_state(target_lruvec, - WORKINGSET_ACTIVATE_ANON); - if (refaults != target_lruvec->refaults[0] || - inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) - sc->may_deactivate |= DEACTIVATE_ANON; - else - sc->may_deactivate &= ~DEACTIVATE_ANON; - - /* - * When refaults are being observed, it means a new - * workingset is being established. Deactivate to get - * rid of any stale active pages quickly. - */ - refaults = lruvec_page_state(target_lruvec, - WORKINGSET_ACTIVATE_FILE); - if (refaults != target_lruvec->refaults[1] || - inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) - sc->may_deactivate |= DEACTIVATE_FILE; - else - sc->may_deactivate &= ~DEACTIVATE_FILE; - } else - sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; - - /* - * If we have plenty of inactive file pages that aren't - * thrashing, try to reclaim those first before touching - * anonymous pages. - */ - file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); - if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) - sc->cache_trim_mode = 1; - else - sc->cache_trim_mode = 0; - - /* - * Prevent the reclaimer from falling into the cache trap: as - * cache pages start out inactive, every cache fault will tip - * the scan balance towards the file LRU. And as the file LRU - * shrinks, so does the window for rotation from references. - * This means we have a runaway feedback loop where a tiny - * thrashing file LRU becomes infinitely more attractive than - * anon pages. Try to detect this based on file LRU size. - */ - if (!cgroup_reclaim(sc)) { - unsigned long total_high_wmark = 0; - unsigned long free, anon; - int z; - - free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); - file = node_page_state(pgdat, NR_ACTIVE_FILE) + - node_page_state(pgdat, NR_INACTIVE_FILE); - - for (z = 0; z < MAX_NR_ZONES; z++) { - struct zone *zone = &pgdat->node_zones[z]; - if (!managed_zone(zone)) - continue; - - total_high_wmark += high_wmark_pages(zone); - } - - /* - * Consider anon: if that's low too, this isn't a - * runaway file reclaim problem, but rather just - * extreme pressure. Reclaim as per usual then. 
- */ - anon = node_page_state(pgdat, NR_INACTIVE_ANON); - - sc->file_is_tiny = - file + free <= total_high_wmark && - !(sc->may_deactivate & DEACTIVATE_ANON) && - anon >> sc->priority; - } + prepare_scan_count(pgdat, sc); shrink_node_memcgs(pgdat, sc); @@ -3094,6 +5859,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) struct lruvec *target_lruvec; unsigned long refaults; + if (lru_gen_enabled()) + return; + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); target_lruvec->refaults[0] = refaults; @@ -3464,12 +6232,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, EXPORT_SYMBOL_GPL(try_to_free_mem_cgroup_pages); #endif -static void age_active_anon(struct pglist_data *pgdat, - struct scan_control *sc) +static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) { struct mem_cgroup *memcg; struct lruvec *lruvec; + if (lru_gen_enabled()) { + lru_gen_age_node(pgdat, sc); + return; + } + if (!total_swap_pages) return; @@ -3753,12 +6525,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) sc.may_swap = !nr_boost_reclaim; /* - * Do some background aging of the anon list, to give - * pages a chance to be referenced before reclaiming. All - * pages are rotated regardless of classzone as this is - * about consistent aging. + * Do some background aging, to give pages a chance to be + * referenced before reclaiming. All pages are rotated + * regardless of classzone as this is about consistent aging. */ - age_active_anon(pgdat, &sc); + kswapd_age_node(pgdat, &sc); /* * If we're getting trouble reclaiming, start doing writepage @@ -4454,12 +7225,9 @@ void check_move_unevictable_pages(struct pagevec *pvec) continue; if (page_evictable(page)) { - enum lru_list lru = page_lru_base_type(page); - - VM_BUG_ON_PAGE(PageActive(page), page); + del_page_from_lru_list(page, lruvec); ClearPageUnevictable(page); - del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE); - add_page_to_lru_list(page, lruvec, lru); + add_page_to_lru_list(page, lruvec); pgrescued += nr_pages; } } diff --git a/mm/workingset.c b/mm/workingset.c index 975a4d2dd02e..c4a6bc03aeff 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -185,7 +185,6 @@ static unsigned int bucket_order __read_mostly; static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, bool workingset) { - eviction >>= bucket_order; eviction &= EVICTION_MASK; eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction = (eviction << NODES_SHIFT) | pgdat->node_id; @@ -210,10 +209,109 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, *memcgidp = memcgid; *pgdat = NODE_DATA(nid); - *evictionp = entry << bucket_order; + *evictionp = entry; *workingsetp = workingset; } +#ifdef CONFIG_LRU_GEN + +static void *lru_gen_eviction(struct page *page) +{ + int hist; + unsigned long token; + unsigned long min_seq; + struct lruvec *lruvec; + struct lru_gen_struct *lrugen; + int type = page_is_file_lru(page); + int delta = thp_nr_pages(page); + int refs = page_lru_refs(page); + int tier = lru_tier_from_refs(refs); + struct mem_cgroup *memcg = page_memcg(page); + struct pglist_data *pgdat = page_pgdat(page); + + BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); + + lruvec = mem_cgroup_lruvec(memcg, pgdat); + lrugen = &lruvec->lrugen; + min_seq = READ_ONCE(lrugen->min_seq[type]); + token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 
0); + + hist = lru_hist_from_seq(min_seq); + atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); + + return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs); +} + +static void lru_gen_refault(struct page *page, void *shadow) +{ + int hist, tier, refs; + int memcg_id; + bool workingset; + unsigned long token; + unsigned long min_seq; + struct lruvec *lruvec; + struct lru_gen_struct *lrugen; + struct mem_cgroup *memcg; + struct pglist_data *pgdat; + int type = page_is_file_lru(page); + int delta = thp_nr_pages(page); + + unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); + + if (pgdat != page_pgdat(page)) + return; + + rcu_read_lock(); + + memcg = page_memcg_rcu(page); + if (memcg_id != mem_cgroup_id(memcg)) + goto unlock; + + lruvec = mem_cgroup_lruvec(memcg, pgdat); + lrugen = &lruvec->lrugen; + + mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); + + min_seq = READ_ONCE(lrugen->min_seq[type]); + if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH))) + goto unlock; + + hist = lru_hist_from_seq(min_seq); + /* see the comment in page_lru_refs() */ + refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset; + tier = lru_tier_from_refs(refs); + + atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]); + mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); + + /* + * Count the following two cases as stalls: + * 1. For pages accessed through page tables, hotter pages pushed out + * hot pages which refaulted immediately. + * 2. For pages accessed multiple times through file descriptors, + * numbers of accesses might have been out of the range. + */ + if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { + SetPageWorkingset(page); + mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); + } +unlock: + rcu_read_unlock(); +} + +#else /* !CONFIG_LRU_GEN */ + +static void *lru_gen_eviction(struct page *page) +{ + return NULL; +} + +static void lru_gen_refault(struct page *page, void *shadow) +{ +} + +#endif /* CONFIG_LRU_GEN */ + /** * workingset_age_nonresident - age non-resident entries as LRU ages * @lruvec: the lruvec that was aged @@ -257,6 +355,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) struct lruvec *lruvec; int memcgid; + if (lru_gen_enabled()) + return lru_gen_eviction(page); + /* Page is fully exclusive and pins page->mem_cgroup */ VM_BUG_ON_PAGE(PageLRU(page), page); VM_BUG_ON_PAGE(page_count(page), page); @@ -267,6 +368,7 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); eviction = atomic_long_read(&lruvec->nonresident_age); + eviction >>= bucket_order; return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); } @@ -294,7 +396,13 @@ void workingset_refault(struct page *page, void *shadow) bool workingset; int memcgid; + if (lru_gen_enabled()) { + lru_gen_refault(page, shadow); + return; + } + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); + eviction <<= bucket_order; rcu_read_lock(); /*