linux/page-cache/README.md (113 additions, 0 deletions)
@@ -294,3 +294,116 @@ $ vmtouch /var/tmp/file1.db
```
## 4. Page cache eviction and page reclaim
### 4.1. Theory

Like any other cache, the Linux Page Cache continuously tracks which pages were used most recently and decides which pages should be evicted and which should be kept in the cache.

The primary way to control and tune the Page Cache is the cgroup subsystem. You can divide the server's memory into several smaller caches (cgroups) and thus control and protect applications and services. In addition, the cgroup memory and IO controllers provide many statistics that are useful for tuning your software and understanding the internals of the cache.

The Linux Page Cache is tightly coupled with Linux memory management, cgroups and the virtual file system (VFS). Its core building blocks are per-cgroup pairs of active and inactive lists:

- The first pair is for anonymous memory (for instance, memory allocated with `malloc()` or a non-file-backed `mmap()`).
- The second pair is for Page Cache file memory (all file operations, including `read()`, `write()`, `mmap()` accesses, etc.).
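
One way to look at the size of these four lists is the `memory.stat` file of a cgroup. The sketch below assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup` (older kernels may not expose `memory.stat` for the root cgroup). The helper names `parse_memory_stat()` and `lru_sizes()` are hypothetical, and the sample values are made up for illustration.

```python
from pathlib import Path

# The four LRU lists described above, as named in cgroup v2 memory.stat.
LRU_KEYS = ("inactive_anon", "active_anon", "inactive_file", "active_file")


def parse_memory_stat(text: str) -> dict:
    """Parse the 'key value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def lru_sizes(cgroup_dir: str = "/sys/fs/cgroup") -> dict:
    """Return the size in bytes of each LRU list for the given cgroup directory.

    The default path is an assumption (cgroup v2 mounted at /sys/fs/cgroup).
    """
    text = (Path(cgroup_dir) / "memory.stat").read_text()
    stats = parse_memory_stat(text)
    return {key: stats.get(key, 0) for key in LRU_KEYS}


if __name__ == "__main__":
    # Illustrative values only, not real measurements.
    sample = (
        "inactive_anon 4096\n"
        "active_anon 8192\n"
        "inactive_file 12288\n"
        "active_file 16384\n"
    )
    print(parse_memory_stat(sample))
```

On a live system, `lru_sizes("/sys/fs/cgroup/<your-cgroup>")` would return the same four keys for that cgroup, provided the path exists.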

The least recently used (LRU) algorithm:

- These two lists form a double clock data structure.
- Linux should choose pages that have not been used recently (inactive), on the assumption that pages that have not been used recently are unlikely to be used again in the near future.
- Both the active and inactive lists are organized as FIFO queues.

For example, a user process has just read some data from disk. This triggered the kernel to load the data into the cache. It was the first time the kernel had to access the file, so it added a page `h` to the head of the inactive list.

Now, a new file operation on the page `h` promotes the page to the active LRU list by putting it at the head. This also ousts the page `1` to the head of the inactive LRU list and shifts all other members.

Note, however, that the real process of page promotion and demotion is much more complicated.
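
The promotion and demotion steps above can be sketched as a toy model. This is a deliberately simplified illustration, not the kernel's implementation: the class name `TwoListLRU` and the fixed list sizes are assumptions, and it ignores referenced flags, shadow entries and NUMA.

```python
from collections import deque


class TwoListLRU:
    """Toy model of one active/inactive list pair (index 0 is the head)."""

    def __init__(self, active_size: int, inactive_size: int):
        self.active_size = active_size
        self.inactive_size = inactive_size
        self.active = deque()
        self.inactive = deque()

    def access(self, page):
        """Record an access to `page`; return the evicted page, if any."""
        if page in self.active:
            # Already active: this toy model leaves the FIFO order untouched.
            return None
        if page in self.inactive:
            # Second access: promote to the head of the active list.
            self.inactive.remove(page)
            self.active.appendleft(page)
            if len(self.active) > self.active_size:
                # Demote the tail of the active list to the inactive head.
                self.inactive.appendleft(self.active.pop())
            return None
        # First access: the page enters at the head of the inactive list.
        self.inactive.appendleft(page)
        if len(self.inactive) > self.inactive_size:
            # The inactive tail is the eviction candidate.
            return self.inactive.pop()
        return None


if __name__ == "__main__":
    lru = TwoListLRU(active_size=3, inactive_size=3)
    for page in ("1", "1", "2", "2", "3", "3"):
        lru.access(page)          # two accesses each: pages end up active
    lru.access("h")               # first access: head of the inactive list
    print(list(lru.active), list(lru.inactive))  # ['3', '2', '1'] ['h']
    lru.access("h")               # second access: promoted, '1' is demoted
    print(list(lru.active), list(lru.inactive))  # ['h', '3', '2'] ['1']
```

The final state mirrors the example in the text: `h` sits at the head of the active list and page `1` has moved to the head of the inactive list.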

First of all, if a system has NUMA hardware nodes (`man 8 numastat`), the kernel keeps a separate set of LRU lists for each node. The reason is that the kernel tries to keep memory information local to each NUMA node in order to reduce lock contention.

In addition, the Linux Page Cache has special shadow-entry and referenced-flag logic for promoting, demoting and re-promoting pages.

Shadow entries help to mitigate the memory thrashing problem. This issue happens when a program's working set size is close to or greater than the available memory (either the cgroup limit or the system RAM).
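
To see why a working set slightly larger than the cache is so harmful, here is a small simulation of thrashing with a plain LRU cache. It illustrates the problem only, not the kernel's shadow-entry mechanism; the function name `lru_hit_ratio()` and the page counts are assumptions chosen for the example.

```python
from collections import OrderedDict


def lru_hit_ratio(cache_pages: int, working_set_pages: int, rounds: int = 10) -> float:
    """Scan the working set sequentially `rounds` times through a plain LRU cache.

    Returns the fraction of accesses that were served from the cache.
    """
    cache = OrderedDict()  # least recently used page is first
    hits = 0
    accesses = 0
    for _ in range(rounds):
        for page in range(working_set_pages):
            accesses += 1
            if page in cache:
                hits += 1
                cache.move_to_end(page)        # mark as most recently used
            else:
                cache[page] = True
                if len(cache) > cache_pages:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / accesses


if __name__ == "__main__":
    # Working set fits: only the first pass misses.
    print(lru_hit_ratio(cache_pages=100, working_set_pages=100))  # 0.9
    # Working set is one page too large: every access misses.
    print(lru_hit_ratio(cache_pages=100, working_set_pages=101))  # 0.0
```

With a cyclic scan, each page is evicted just before it is needed again, so one extra page in the working set drops the hit ratio from 90% to zero in this model.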
### 4.2. Manual pages eviction with `POSIX_FADV_DONTNEED`
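
A minimal sketch of manual eviction, assuming Python 3 on Linux, where `os.posix_fadvise()` is available. The helper name `drop_from_page_cache()` is hypothetical. `POSIX_FADV_DONTNEED` is only a hint, and dirty pages are not dropped, which is why the sketch calls `fsync()` first.

```python
import os
import tempfile


def drop_from_page_cache(path: str) -> None:
    """Ask the kernel to drop the cached pages of `path` (advisory only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages are unaffected by DONTNEED, so write them back first.
        os.fsync(fd)
        # offset=0, len=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demo on a throwaway temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 8192)
        name = tmp.name
    drop_from_page_cache(name)
    os.unlink(name)
```

Whether the pages actually left the cache can be checked afterwards with a tool such as `vmtouch`.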
### 4.4. Page cache, `vm.swappiness` and modern kernels

Without swap, the Page Cache is the first and only candidate for memory eviction and reclaim. But if the system has swap, the kernel has one more option: it can swap out anonymous (not file-backed) pages. To control which inactive LRU list to prefer for scanning, the kernel provides the `vm.swappiness` sysctl knob.

```shell
$ sudo sysctl -a | grep swap
# Range is 0..200; a higher value means the kernel swaps more eagerly.
# A value of 100 means the kernel treats anonymous and Page Cache pages equally for reclaim.
vm.swappiness = 60
```
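
The same knob can be read from `/proc/sys/vm/swappiness`. The sketch below also shows a simplified view of how the value splits reclaim pressure between the two list types, assuming a kernel where the range is 0..200. The `swappiness : 200 - swappiness` ratio is only the starting weight; the kernel also factors in reclaim cost and other conditions. The function names are hypothetical.

```python
import os

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path: str = SWAPPINESS_PATH) -> int:
    """Read the current vm.swappiness value."""
    with open(path) as f:
        return int(f.read().strip())


def reclaim_weights(swappiness: int) -> tuple:
    """Return (anon_weight, file_weight) for a swappiness in 0..200.

    Simplified model: the two weights are equal at 100.
    """
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be in 0..200")
    return swappiness, 200 - swappiness


if __name__ == "__main__":
    print(reclaim_weights(60))    # (60, 140): file pages are preferred
    print(reclaim_weights(100))   # (100, 100): both types treated equally
    if os.path.exists(SWAPPINESS_PATH):
        print("current vm.swappiness:", read_swappiness())
```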
### 4.5. Understanding memory reclaim process with `/proc/PID/pagemap`
There is a `/proc/PID/pagemap` file that contains the page table information of the process with the given PID. The page table is, roughly speaking, an internal kernel map between page frames (real physical memory pages in RAM) and the virtual pages of the process. Each process in a Linux system has its own virtual address space, which is completely independent from other processes and from physical memory addresses.
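
As a rough illustration of that map, the sketch below reads the 64-bit pagemap entry for one virtual page of the current process and decodes a few documented bits. It assumes Linux; the helper names are hypothetical. Without extra privileges, recent kernels report the page frame number as zero.

```python
import ctypes
import mmap
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
ENTRY_SIZE = 8  # one 64-bit entry per virtual page


def decode_pagemap_entry(entry: int) -> dict:
    """Decode selected bits of a pagemap entry."""
    return {
        "present": bool((entry >> 63) & 1),              # page is in RAM
        "swapped": bool((entry >> 62) & 1),              # page is in swap
        "file_or_shared_anon": bool((entry >> 61) & 1),  # file page or shared anon
        "pfn": entry & ((1 << 55) - 1),                  # bits 0-54, PFN if present
    }


def read_pagemap_entry(vaddr: int, pid: str = "self") -> int:
    """Read the raw pagemap entry that covers virtual address `vaddr`."""
    offset = (vaddr // PAGE_SIZE) * ENTRY_SIZE
    fd = os.open(f"/proc/{pid}/pagemap", os.O_RDONLY)
    try:
        data = os.pread(fd, ENTRY_SIZE, offset)
    finally:
        os.close(fd)
    if len(data) != ENTRY_SIZE:
        raise OSError("short read from pagemap")
    return struct.unpack("=Q", data)[0]


if __name__ == "__main__" and os.path.exists("/proc/self/pagemap"):
    buf = mmap.mmap(-1, PAGE_SIZE)  # one anonymous page
    buf[0] = 1                      # touch it so that it is actually backed
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    try:
        print(decode_pagemap_entry(read_pagemap_entry(addr)))
    except OSError as exc:
        print(f"pagemap is not readable here: {exc}")
```

The offset arithmetic is the key point: the entry for a virtual address lives at `(address / page size) * 8` bytes into the file.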