Commit 738793c

update but still wip
1 parent f3895d8 commit 738793c

2 files changed, +118 -0 lines changed

linux/page-cache/README.md

Lines changed: 113 additions & 0 deletions
@@ -294,3 +294,116 @@ $ vmtouch /var/tmp/file1.db
```

## 4. Page cache eviction and page reclaim

### 4.1. Theory

Like any other cache, the Linux Page Cache continuously tracks which pages were used most recently and decides which pages should be evicted and which should be kept in the cache.

The primary way to control and tune the Page Cache is the cgroup subsystem. You can divide the server’s memory into several smaller caches (cgroups) and thus control and protect applications and services. In addition, the cgroup memory and IO controllers provide a lot of statistics that are useful for tuning your software and understanding the internals of the cache.

The Linux Page Cache is tightly coupled with Linux memory management, cgroups and the virtual file system (VFS). Its core building block is a per-cgroup pair of active and inactive lists:

- The first pair is for anonymous memory (for instance, memory allocated with `malloc()` or a non-file-backed `mmap()`).
- The second pair is for Page Cache file memory (all file operations, including `read()`, `write()`, `mmap()` accesses, etc.).

Eviction follows a least recently used (LRU) algorithm:

- The two lists of a pair form a double clock data structure.
- Linux should choose pages that have not been used recently (inactive), on the assumption that pages that have not been used recently are unlikely to be used frequently in the near future.
- Both the active and inactive lists adopt the form of FIFO for their entries.

![](https://biriukov.dev/docs/page-cache/images/lru.png)

For example, a user process has just read some data from disk. This action triggered the kernel to load the data into the cache. It was the first time the kernel had to access the file, so it added a page `h` to the head of the inactive list:

![](https://biriukov.dev/docs/page-cache/images/eviction-1.png)

Some time has passed, and the system loads 2 more pages: `i` and `j`.

![](https://biriukov.dev/docs/page-cache/images/eviction-2.png)

Now, a new file operation on the page `h` promotes the page to the active LRU list by putting it at the head. This action also ousts the page `1` to the head of the inactive LRU list and shifts all other members:

![](https://biriukov.dev/docs/page-cache/images/eviction-3.png)

As time passes, page `h` loses its head position in the active LRU list.

![](https://biriukov.dev/docs/page-cache/images/eviction-4.png)

But a new access to `h`’s position in the file returns `h` to the head of the active LRU list.

![](https://biriukov.dev/docs/page-cache/images/eviction-5.png)
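
The walkthrough above can be replayed with a toy model. This is a minimal sketch of the simplified picture only, not of the kernel's implementation: the class name `TwoListLRU`, the list capacities, and the rule that a re-accessed active page jumps straight to the head are assumptions made for illustration.

```python
from collections import deque


class TwoListLRU:
    """Toy active/inactive LRU pair. Index 0 is the head, the last item is the tail."""

    def __init__(self, active_size, inactive_size):
        self.active = deque()
        self.inactive = deque()
        self.active_size = active_size
        self.inactive_size = inactive_size

    def access(self, page):
        """Record an access and return the list of pages evicted by it."""
        if page in self.active:
            # Re-access of an active page: move it back to the head.
            self.active.remove(page)
            self.active.appendleft(page)
        elif page in self.inactive:
            # Second access: promote the page to the head of the active list.
            self.inactive.remove(page)
            self.active.appendleft(page)
            if len(self.active) > self.active_size:
                # The active tail is demoted to the head of the inactive list.
                self.inactive.appendleft(self.active.pop())
        else:
            # First access: the page enters at the head of the inactive list.
            self.inactive.appendleft(page)

        # Pages that fall off the inactive tail are evicted from the cache.
        evicted = []
        while len(self.inactive) > self.inactive_size:
            evicted.append(self.inactive.pop())
        return evicted
```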

It’s worth mentioning, though, that the real process of page promotion and demotion is much more complicated and sophisticated.

First of all, if a system has NUMA hardware nodes (`man 8 numastat`), it has more LRU lists, because each node gets its own set (a two-node system has twice as many). The reason is that the kernel tries to keep memory information local to the NUMA nodes in order to reduce lock contention.

In addition, the Linux Page Cache has special shadow and referenced flag logic for the promotion, demotion and re-promotion of pages.

Shadow entries help to mitigate the memory thrashing problem. This issue happens when the programs’ working set size is close to or greater than the real memory size (for example, the cgroup limit or the system RAM size).
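
The effect of shadow entries can be observed through the `workingset_*` counters in `/proc/vmstat`. The following is a minimal sketch, assuming a Linux system. The exact counter names vary between kernel versions (for example `workingset_refault` versus `workingset_refault_file`), so the code filters by prefix instead of naming them. The function names are illustrative.

```python
def parse_workingset_counters(vmstat_text):
    """Extract the workingset_* counters from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        # Each line has the form "<name> <value>".
        name, _, value = line.partition(" ")
        value = value.strip()
        if name.startswith("workingset_") and value.isdigit():
            counters[name] = int(value)
    return counters


def read_workingset_counters(path="/proc/vmstat"):
    """Read the counters from the running system (Linux only)."""
    with open(path) as f:
        return parse_workingset_counters(f.read())
```

Sampling `read_workingset_counters()` twice and comparing the values gives a rough idea of whether recently evicted pages are being faulted back in.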

### 4.2. Manual pages eviction with `POSIX_FADV_DONTNEED`

```shell
$ vmtouch /var/tmp/file1.db -e
Files: 1
Directories: 0
Evicted Pages: 32768 (128M)
Elapsed: 7.2e-05 seconds
$ vmtouch /var/tmp/file1.db
Files: 1
Directories: 0
Resident Pages: 0/32768 0/128M 0%
Elapsed: 0.000526 seconds
```

```python
import os

with open("/var/tmp/file1.db", "rb") as f:
    fd = f.fileno()
    # Advise the kernel that the whole file (offset 0, length = file size)
    # is no longer needed, so its pages can be dropped from Page Cache.
    os.posix_fadvise(fd, 0, os.fstat(fd).st_size, os.POSIX_FADV_DONTNEED)
```

```shell
# Read the entire test file into Page cache
$ dd if=/var/tmp/file1.db of=/dev/null
262144+0 records in
262144+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0,186082 s, 721 MB/s

$ python3 evict_full_file.py
$ vmtouch /var/tmp/file1.db
Files: 1
Directories: 0
Resident Pages: 0/32768 0/128M 0%
Elapsed: 0.000278 seconds
```
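
`POSIX_FADV_DONTNEED` also accepts a byte range, so only part of a file can be evicted. According to the `posix_fadvise(2)` man page, requests that cover partial pages are ignored and dirty pages are not freed, so the sketch below shrinks the range to whole pages and calls `fsync()` first. It assumes a Linux system, and the helper names `page_aligned_range` and `evict_range` are hypothetical.

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def page_aligned_range(offset, length, page_size=PAGE_SIZE):
    """Shrink [offset, offset + length) to the whole pages it fully covers."""
    start = -(-offset // page_size) * page_size        # round the start up
    end = ((offset + length) // page_size) * page_size  # round the end down
    return start, max(0, end - start)


def evict_range(path, offset, length):
    """Ask the kernel to drop the cached pages of a byte range; returns bytes advised."""
    start, aligned_length = page_aligned_range(offset, length)
    if aligned_length == 0:
        return 0
    with open(path, "rb") as f:
        fd = f.fileno()
        # Write back dirty pages first; otherwise they would stay cached.
        os.fsync(fd)
        os.posix_fadvise(fd, start, aligned_length, os.POSIX_FADV_DONTNEED)
    return aligned_length
```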

### 4.3. Make your memory unevictable

The kernel provides several syscalls for this: `mlock()`, `mlock2()` (\*) and `mlockall()`. As with `mincore()`, you must map the file first.

You will likely need to increase the locked-memory limit:

```shell
$ ulimit -l

$ grep unevic /sys/fs/cgroup/user.slice/user-1000.slice/session-c2.scope/memory.stat
unevictable 189382656
```
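
The same limit can be inspected from inside a process. This is a small sketch using Python's standard `resource` module. `RLIMIT_MEMLOCK` is reported in bytes, while the shell's `ulimit -l` prints KiB, so the hypothetical helper `format_limit` converts the value for comparison.

```python
import resource


def format_limit(value):
    """Render an rlimit value the way `ulimit -l` would (KiB or unlimited)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)
```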

### 4.4. Page cache, `vm.swappiness` and modern kernels

Page Cache should be the first and the only option for memory eviction and reclaim. But if the system has swap, the kernel has one more option: it can swap out anonymous (not file-backed) pages. To control which inactive LRU list is preferred for scanning, the kernel has the `vm.swappiness` sysctl knob.

```shell
$ sudo sysctl -a | grep swap
# Range is 0..200; higher means more swappy.
# A value of 100 means that the kernel considers anonymous and Page cache pages equally in terms of reclamation.
vm.swappiness = 60
```
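
As a rough illustration of what the knob expresses, the sketch below turns a swappiness value into relative starting weights for the two inactive lists. It assumes the 0..200 range shown above, where anonymous pages are weighted by `swappiness` and file pages by `200 - swappiness`. This is a simplification that ignores the further adjustments the kernel applies during reclaim, and the function names are illustrative.

```python
def base_scan_weights(swappiness):
    """Return (anon_weight, file_weight) for a swappiness value in 0..200."""
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    return swappiness, 200 - swappiness


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the current system-wide value (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

With the default of 60 shown above, `base_scan_weights(60)` favours scanning the Page Cache list over the anonymous one.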

### 4.5. Understanding memory reclaim process with `/proc/pid/pagemap`

There is a `/proc/PID/pagemap` file that contains the page table information of the process with that PID. The page table, basically speaking, is an internal kernel map between page frames (real physical memory pages stored in RAM) and the virtual pages of the process. Each process in a Linux system has its own virtual memory address space, which is completely independent from other processes and from physical memory addresses.
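
To make the mapping concrete, here is a minimal sketch of reading one entry from the file. It relies on the layout described in the kernel's pagemap documentation: one 64-bit entry per virtual page, with bit 63 meaning present, bit 62 swapped, bit 61 file-mapped or shared anonymous, and bits 0-54 the page frame number. The function names are hypothetical, and unprivileged readers get a zeroed frame number.

```python
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
ENTRY_SIZE = 8  # each pagemap entry is one 64-bit value


def decode_pagemap_entry(entry):
    """Split a raw 64-bit pagemap entry into its main fields."""
    present = bool((entry >> 63) & 1)
    return {
        "present": present,
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared_anon": bool((entry >> 61) & 1),
        # Bits 0-54 hold the page frame number only when the page is present.
        "pfn": entry & ((1 << 55) - 1) if present else None,
    }


def read_pagemap_entry(pid, vaddr):
    """Read and decode the entry that covers virtual address `vaddr` of `pid`."""
    # The entry index is the virtual page number.
    offset = (vaddr // PAGE_SIZE) * ENTRY_SIZE
    fd = os.open(f"/proc/{pid}/pagemap", os.O_RDONLY)
    try:
        data = os.pread(fd, ENTRY_SIZE, offset)
    finally:
        os.close(fd)
    if len(data) != ENTRY_SIZE:
        raise ValueError("short read from pagemap")
    (entry,) = struct.unpack("=Q", data)  # native byte order, 8 bytes
    return decode_pagemap_entry(entry)
```

Passing `"self"` as `pid` inspects the calling process.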

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
import os

with open("/var/tmp/file1.db", "br") as f:
    fd = f.fileno()
    os.posix_fadvise(fd, 0, os.fstat(fd).st_size, os.POSIX_FADV_DONTNEED)
