# Direct I/O

* **Owners:**
  * [@machine424](https://github.com/machine424)

* **Implementation Status:** Partially implemented

* **Related Issues and PRs:**
  * [Prometheus PR #15365](https://github.com/prometheus/prometheus/pull/15365)

* **Other Docs or Links:**
  * [Slack Discussion](https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726674665380109)

> This effort aims to experiment with direct I/O to determine whether it can enhance user
> experience and provide performance improvements.

## Why

Probably due to a lack of [understanding of good taste](https://yarchive.net/comp/linux/o_direct.html).

The motivation behind this effort is to address the confusion surrounding page cache behavior,
particularly in environments where various memory-related statistics and metrics are available; for
instance, see [this issue](https://github.com/kubernetes/kubernetes/issues/43916) regarding
containerized environments.

This initiative was prompted by [@bboreham](https://github.com/bboreham)
(refer to [the Slack thread](https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726674665380109)).
The effort aims to prevent misunderstandings and concerns about increased page cache usage, which
might lead users or admins to make poor decisions, such as allocating more memory than necessary.
Moreover, bypassing the cache when it is not needed eliminates the overhead of copying data into it
and reduces the kernel overhead of reclaiming it later. Any performance improvements achieved
through direct I/O are therefore welcome.
### Pitfalls of the Current Solution

In addition to the issues mentioned above, the page cache generated during writes is sometimes not
used at all. For instance, once chunks are written during compaction, they are opened via
`mmap`, which renders the page cache produced during writing redundant.

## Goals

* Reduce concerns and misconceptions about page cache overhead caused by writes.
* Establish a foundation and gain familiarity with direct I/O; even if it proves unsuccessful in
  this context, the knowledge can be applied elsewhere.

### Audience

## Non-Goals

* Switch all reads/writes to direct I/O.
* Eliminate the page cache, even if it results in significantly increased CPU or disk I/O usage.
* Remove I/O buffering in Prometheus code, as direct I/O does not imply the absence of user-space
  buffering.

## How

During development, this effort will be gated behind a feature flag (likely `use-uncached-io`).
Enabling this flag will activate direct I/O (or, later, any [other mechanism](#alternatives)) where
appropriate, to address the concerns outlined in the [Why](#why) section.

Due to the alignment requirements of direct I/O
(see [open's man page](https://man7.org/linux/man-pages/man2/open.2.html) for Linux, for example), the
existing `bufio.Writer` cannot be used for writes. An alignment-aware writer is therefore required.

The most suitable open-source direct I/O writer currently available is
[brk0v/directio](https://github.com/brk0v/directio). However, it alternates between direct and buffered
I/O, which is not ideal, and it does not support files that are not initially aligned
(for example, segments with headers). It also lacks some other features, as mentioned below.

To accelerate the feedback loop, an in-tree direct I/O writer will be put together. Once it becomes
stable, it can be moved out of the tree, contributed to the above repository, or hosted within the
Prometheus or Prometheus-community organization.

The direct I/O writer will conform to the following `BufWriter` interface (which `bufio.Writer`
already satisfies, modulo a thin adapter around its `Reset(io.Writer)` method):

```go
type BufWriter interface {
	// Write writes data to the buffer and returns the number of bytes written.
	// It may trigger an implicit flush if necessary.
	Write([]byte) (int, error)

	// Flush writes any buffered data to the file.
	// The writer may not be usable after a call to Flush().
	Flush() error

	// Reset discards any unflushed buffered data and clears any errors.
	// It resets the writer to write to the specified file.
	Reset(f *os.File) error
}
```
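
To illustrate the intended call pattern, here is a sketch of how compaction-style code might drive
any `BufWriter` implementation; `writeChunks` and the `[][]byte` chunk representation are
illustrative only, not part of the proposal:

```go
// writeChunks drives a BufWriter, independent of whether it is backed by
// bufio or by direct I/O.
func writeChunks(w BufWriter, chunks [][]byte) error {
	for _, c := range chunks {
		if _, err := w.Write(c); err != nil {
			return err
		}
	}
	// Flush once at the end; a direct I/O implementation may need to pad
	// the final partial block to the offset alignment before writing it.
	return w.Flush()
}
```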

Additional utilities will also need to be developed to:

* Automatically discover alignment requirements (when supported):

```go
// directIORqmts holds the alignment requirements for direct I/O.
// All fields are in bytes.
type directIORqmts struct {
	// The required alignment for memory buffer addresses.
	memoryAlign int
	// The required alignment for I/O segment lengths and file offsets.
	offsetAlign int
}

func fetchDirectIORqmts(fd uintptr) (*directIORqmts, error)
```
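
On Linux `>=6.1`, these requirements can be queried via `statx(2)` with `STATX_DIOALIGN`. A minimal
sketch of how `fetchDirectIORqmts` could be implemented with `golang.org/x/sys/unix` follows; it is
an assumption about the eventual implementation, not the code from the PR:

```go
//go:build linux

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func fetchDirectIORqmts(fd uintptr) (*directIORqmts, error) {
	var stx unix.Statx_t
	// With AT_EMPTY_PATH and an empty path, statx operates on fd itself.
	if err := unix.Statx(int(fd), "", unix.AT_EMPTY_PATH, unix.STATX_DIOALIGN, &stx); err != nil {
		return nil, fmt.Errorf("statx on fd %d: %w", fd, err)
	}
	// Kernels older than 6.1 (or filesystems without direct I/O support)
	// leave these fields at zero.
	if stx.Dio_mem_align == 0 || stx.Dio_offset_align == 0 {
		return nil, fmt.Errorf("direct I/O alignment not reported for fd %d", fd)
	}
	return &directIORqmts{
		memoryAlign: int(stx.Dio_mem_align),
		offsetAlign: int(stx.Dio_offset_align),
	}, nil
}
```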

* Allocate aligned memory buffers:

```go
func alignedBlock(size int, alignmentRqmts *directIORqmts) []byte
```
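
Since Go's allocator offers no alignment control, one plausible implementation (a sketch, not
necessarily the PR's code) over-allocates by one alignment unit and slices at the first suitably
aligned offset:

```go
import "unsafe"

// alignedBlock returns a buffer of length size whose first byte sits at an
// address that is a multiple of alignmentRqmts.memoryAlign.
func alignedBlock(size int, alignmentRqmts *directIORqmts) []byte {
	align := alignmentRqmts.memoryAlign
	// Over-allocate so that an aligned offset is guaranteed to exist.
	buf := make([]byte, size+align)
	shift := 0
	if rem := int(uintptr(unsafe.Pointer(&buf[0])) % uintptr(align)); rem != 0 {
		shift = align - rem
	}
	return buf[shift : shift+size]
}
```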

* Enable direct I/O on a file descriptor:

```go
func enableDirectIO(fd uintptr) error
```
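
On Linux this can be done without reopening the file, by OR-ing `O_DIRECT` into the descriptor's
flags via `fcntl(2)`; a sketch under that assumption:

```go
import "golang.org/x/sys/unix"

// enableDirectIO turns on O_DIRECT for an already-open file descriptor.
func enableDirectIO(fd uintptr) error {
	flags, err := unix.FcntlInt(fd, unix.F_GETFL, 0)
	if err != nil {
		return err
	}
	_, err = unix.FcntlInt(fd, unix.F_SETFL, flags|unix.O_DIRECT)
	return err
}
```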

Direct I/O support and alignment requirements differ across OSes and filesystems. To keep the scope
manageable, the initial focus will be on adding support for Linux (ideally versions `>6.1`). Build
tags will enforce this restriction. Attempting to enable the feature flag on an unsupported OS will
result in an error.
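
The usual Go pattern for this is a pair of build-tag-guarded files; for example (the package and
constructor names here are hypothetical):

```go
//go:build !linux

package fileutil

import (
	"errors"
	"os"
)

// newDirectIOWriter is the hypothetical constructor of the direct I/O
// writer. On unsupported platforms it compiles but always fails, so
// enabling the feature flag surfaces a clear error at startup.
func newDirectIOWriter(f *os.File, bufSize int) (BufWriter, error) {
	return nil, errors.New("direct I/O is not supported on this OS")
}
```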

Support for additional operating systems and filesystems will be added subsequently.

Another key consideration is ensuring that all tests, especially for TSDB, are also run with direct
I/O enabled. A build tag will be added to force the direct I/O writer for testing purposes, allowing
for commands such as:

```shell
go test --tags=forcedirectio ./tsdb/
```

Dedicated unit tests will be added for the direct I/O writer itself and its utilities.

However, micro-benchmarking the writer may be misleading, as its performance is highly dependent
on usage patterns. While some micro-benchmarks will be provided to ensure consistency with
`bufio.Writer` (without sacrificing performance), greater emphasis will be placed on higher-level
benchmarks that simulate real Prometheus workloads (such as compaction, rather than writing a
`10GB` buffer, which is unlikely to occur in Prometheus). These benchmarks may also help
fine-tune parameters such as the buffer size.
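
As a rough illustration, the consistency micro-benchmark could follow the usual `testing.B` shape
(imports elided; `newDirectIOWriter` is again the hypothetical constructor):

```go
func BenchmarkWriterSmallWrites(b *testing.B) {
	f, err := os.CreateTemp(b.TempDir(), "chunks")
	if err != nil {
		b.Fatal(err)
	}
	w, err := newDirectIOWriter(f, 4<<20) // hypothetical constructor, 4MiB buffer
	if err != nil {
		b.Fatal(err)
	}
	// Compaction issues many small writes, not one huge buffer.
	payload := make([]byte, 512)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := w.Write(payload); err != nil {
			b.Fatal(err)
		}
	}
	if err := w.Flush(); err != nil {
		b.Fatal(err)
	}
}
```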

As a preview of the outcomes, the screenshot below shows a `20-50%` reduction in page cache usage,
as measured by the `container_memory_cache` metric while running Prombench on
[Prometheus PR #15365](https://github.com/prometheus/prometheus/pull/15365). In this test, the
direct I/O writer was used for writing chunks during compaction. No significant regressions or
adverse effects on CPU or disk I/O were observed.

![container_memory_cache](../assets/2025-01-02_direct-io/container_memory_cache.png)

## Alternatives

An alternative approach is to make no code changes and instead focus on improving users'
understanding of page cache behavior within Prometheus, helping them interpret and adapt to it.

## Action Plan

* [ ] Implement the direct I/O writer and its utilities and use it for writing chunks
  during compaction (behind `use-uncached-io`): <https://github.com/prometheus/prometheus/pull/15365>
* [ ] Add more tests and identify or add relevant benchmarks and metrics (adding them to the
  Prombench dashboard if needed).
* [ ] Identify more use cases for direct I/O.
* [ ] Add support for more OSes.

0 commit comments

Comments
 (0)