proposals/2025-01-02_direct-io.md

# Direct I/O

* **Owners:**
  * [@machine424](https://github.com/machine424)

* **Implementation Status:** Partially implemented

* **Related Issues and PRs:**
  * [Prometheus PR #15365](https://github.com/prometheus/prometheus/pull/15365)

* **Other Docs or Links:**
  * [Slack Discussion](https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726674665380109)

> This effort aims to experiment with direct I/O to determine whether it can enhance user
> experience and provide performance improvements.

## Why

Probably due to a lack of [understanding of good taste](https://yarchive.net/comp/linux/o_direct.html).

The motivation behind this effort is to address the confusion surrounding page cache behavior,
particularly in environments where various memory-related statistics and metrics are available; for
instance, see [this issue](https://github.com/kubernetes/kubernetes/issues/43916) regarding
containerized environments.

This initiative was prompted by [@bboreham](https://github.com/bboreham)
(refer to [the Slack thread](https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726674665380109)).
The effort aims to prevent misunderstandings and concerns regarding increased page cache usage, which
might lead users or admins to make poor decisions, such as allocating more memory than necessary.
Moreover, bypassing the cache when it is not required can help eliminate the overhead of data copying
and reduce additional kernel overhead from reclaiming the cache. Thus, any performance improvements
achieved through direct I/O are welcome.

### Pitfalls of the Current Solution

In addition to the issues mentioned above, the page cache generated during writes is sometimes not
used at all. For instance, once chunks are written during compaction, they are opened via
`mmap`, which may render the page cache produced during writing redundant (on `OpenBSD`, for
example) and/or useless.

## Goals

* Reduce concerns and misconceptions about page cache overhead caused by writes.
* Establish a foundation and gain familiarity with direct I/O; even if it proves unsuccessful in
  this context, the knowledge can be applied elsewhere.

### Audience

## Non-Goals

* Switch all reads/writes to direct I/O.
* Eliminate the page cache, even if doing so results in significantly increased CPU or disk I/O usage.
* Remove I/O buffering in Prometheus code, as direct I/O does not imply the absence of user-space
  buffering.

## How

During development, this effort will be gated behind a feature flag (likely `use-uncached-io`).
Enabling this flag will activate direct I/O for now, or any [other mechanism later](#alternatives),
where appropriate to address the concerns outlined in the [Why](#why) section.

Due to the alignment requirements of direct I/O
(see [open's man page](https://man7.org/linux/man-pages/man2/open.2.html) for Linux, for example), the
existing `bufio.Writer` cannot be used for writes. An alignment-aware writer is therefore required.

The most suitable open-source direct I/O writer currently available is
[brk0v/directio](https://github.com/brk0v/directio). However, it alternates between direct and buffered
I/O, which is not ideal, and it does not support files that are not initially aligned
(for example, segments with headers). It also lacks some other features, as mentioned below.

To accelerate the feedback loop, an in-tree direct I/O writer will be put together. Once it becomes
stable, it can be moved out of the tree, contributed to the above repository, or hosted within the
Prometheus or Prometheus-community organization.

The direct I/O writer will conform to the following `BufWriter` interface (which is already satisfied
by `bufio.Writer`):

```go
type BufWriter interface {
	// Write writes data to the buffer and returns the number of bytes written.
	// It may trigger an implicit flush if necessary.
	Write([]byte) (int, error)

	// Flush flushes any buffered data to the file.
	// The writer may not be usable after a call to Flush().
	Flush() error

	// Reset discards any unflushed buffered data, clears any errors, and
	// resets the writer to use the specified file.
	Reset(f *os.File) error
}
```

Additional utilities will also need to be developed to:

* Automatically discover alignment requirements (when supported):

```go
// directIORqmts holds the alignment requirements for direct I/O.
// All fields are in bytes.
type directIORqmts struct {
	// The required alignment for memory buffer addresses.
	memoryAlign int
	// The required alignment for I/O segment lengths and file offsets.
	offsetAlign int
}

func fetchDirectIORqmts(fd uintptr) (*directIORqmts, error)
```

* Allocate aligned memory buffers:

```go
func alignedBlock(size int, alignmentRqmts *directIORqmts) []byte
```

* Enable direct I/O on a file descriptor:

```go
func enableDirectIO(fd uintptr) error
```

Direct I/O support and alignment requirements differ across OSes and filesystems. To keep the scope
manageable, the initial focus will be on adding support for Linux (ideally versions `>6.1`). Build
tags will enforce this restriction. Attempting to enable the feature flag on an unsupported OS will
result in an error.

Support for additional operating systems and filesystems will be added subsequently.


Another key consideration is ensuring that all tests, especially for TSDB, are also run with direct
I/O enabled. A build tag will be added to force the direct I/O writer for testing purposes, allowing
for commands such as:

```shell
go test --tags=forcedirectio ./tsdb/
```

For the direct I/O writer itself and its utilities, different unit tests will be added.

However, micro-benchmarking the writer may be misleading, as its performance is highly dependent
on usage patterns. While some micro-benchmarks will be provided to ensure consistency with
`bufio.Writer` (without sacrificing performance), higher-level benchmarks that simulate real
Prometheus processes (such as compaction, rather than writing a `10GB` buffer, which is unlikely to
be encountered in Prometheus) will receive greater emphasis. These benchmarks may also help
fine-tune parameters like the buffer size.

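To illustrate the kind of buffer-size tuning involved, a standard Go benchmark over chunk-sized writes might look like the sketch below. The workload and names are hypothetical (a `bufio.Writer` to `io.Discard` stands in for the direct I/O writer and a real file); `testing.Benchmark` is used so the sketch runs outside `go test`:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"testing"
)

// benchBufferedWrites returns a benchmark that pushes b.N chunk-sized
// (512 B) records through a bufio.Writer with the given buffer size.
func benchBufferedWrites(bufSize int) func(b *testing.B) {
	chunk := make([]byte, 512)
	return func(b *testing.B) {
		w := bufio.NewWriterSize(io.Discard, bufSize)
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if _, err := w.Write(chunk); err != nil {
				b.Fatal(err)
			}
		}
		if err := w.Flush(); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// Compare a few candidate buffer sizes.
	for _, size := range []int{4 << 10, 64 << 10} {
		r := testing.Benchmark(benchBufferedWrites(size))
		fmt.Printf("buffer=%dKiB: %d ns/op\n", size>>10, r.NsPerOp())
	}
}
```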

For a look at the outcomes, the screenshot below shows a reduction of `20-50%` in page cache usage,
as measured by the `container_memory_cache` metric while running Prombench on
[Prometheus PR #15365](https://github.com/prometheus/prometheus/pull/15365). In this test, the
direct I/O writer was used for writing chunks during compaction. No significant regressions or
adverse effects on CPU or disk I/O were observed.

![container_memory_cache](../assets/2025-01-02_direct-io/container_memory_cache.png)

## Alternatives

- An alternative approach is to focus on enhancing user understanding of page cache behavior within
  Prometheus, helping them better interpret and adapt to it, without making any changes.

- [@dgl](https://github.com/dgl) has proposed using page cache hints to help with unpredictable
  memory reclaim:

  ```
  Using posix_fadvise with POSIX_FADV_DONTNEED could be an option if it doesn't make sense
  to totally avoid the cache (there's also Linux specific memory management options like
  MADV_COLD, which could help for some of the container concerns to optimise which memory
  is reclaimed, although it wouldn't likely have as much user visible impact on the page cache).
  ```

- The [work](https://lore.kernel.org/linux-fsdevel/20241111234842.2024180-1-axboe@kernel.dk/T/#cluster-upstream-ci)
  on introducing the `RWF_UNCACHED` flag for uncached buffered I/O has resumed. Provided no concessions
  or major changes are made solely to implement direct I/O support, direct I/O integration can always
  be challenged once `RWF_UNCACHED` is released and proven effective.

## Action Plan

* [ ] Implement the direct I/O writer and its utilities and use it for writing chunks
  during compaction (behind `use-uncached-io`): <https://github.com/prometheus/prometheus/pull/15365>
* [ ] Add more tests; identify and/or add relevant benchmarks and metrics (maybe add them to the
  Prombench dashboard if needed).
* [ ] Identify more use cases for direct I/O.
* [ ] Add support for more OSes.
