| 1 | +# Direct I/O |
| 2 | + |
| 3 | +* **Owners:** |
| 4 | + * [@machine424](https://github.com/machine424) |
| 5 | + |
| 6 | +* **Implementation Status:** Partially implemented |
| 7 | + |
| 8 | +* **Related Issues and PRs:** |
| 9 | + * [Prometheus PR #15365](https://github.com/prometheus/prometheus/pull/15365) |
| 10 | + |
| 11 | +* **Other Docs or Links:** |
| 12 | + * [Slack Discussion](https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726674665380109) |
| 13 | + |
| 14 | +> This effort aims to experiment with direct I/O to determine whether it can enhance user |
| 15 | +experience and provide performance improvements. |
| 16 | + |
| 17 | +## Why |
| 18 | + |
| 19 | +Probably due to a lack of [understanding of good taste](https://yarchive.net/comp/linux/o_direct.html). |
| 20 | + |
| 21 | +The motivation behind this effort is to address the confusion surrounding page cache behavior, |
| 22 | +particularly in environments where various memory-related statistics and metrics are available; for |
| 23 | +instance, see [this issue](https://github.com/kubernetes/kubernetes/issues/43916) regarding |
| 24 | +containerized environments. |
| 25 | + |
| 26 | +This initiative was prompted by [@bboreham](https://github.com/bboreham) |
| 27 | +(refer to [the Slack thread](https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726674665380109)). |
| 28 | +The effort aims to prevent misunderstandings and concerns regarding increased page cache usage, which |
| 29 | +might lead users or admins to make poor decisions, such as allocating more memory than necessary. |
| 30 | +Moreover, bypassing the cache when it is not required can help eliminate the overhead of data copying |
| 31 | +and reduce additional kernel overhead from reclaiming the cache. Thus, any performance improvements |
| 32 | +achieved through direct I/O are welcome. |
| 33 | + |
| 34 | +### Pitfalls of the Current Solution |
| 35 | + |
| 36 | +In addition to the issues mentioned above, sometimes the page cache generated during writes is not |
| 37 | +used at all. For instance, once chunks are written during compaction, they are opened via |
| 38 | +`mmap`, which renders the page cache produced during writing redundant and useless. |
| 39 | + |
| 40 | +## Goals |
| 41 | + |
| 42 | +* Reduce concerns and misconceptions about page cache overhead caused by writes. |
| 43 | +* Establish a foundation and gain familiarity with direct I/O; even if it proves unsuccessful in |
| 44 | +this context, the knowledge can be applied elsewhere. |
| 45 | + |
| 46 | +### Audience |
| 47 | + |
| 48 | +## Non-Goals |
| 49 | + |
| 50 | +* Switch all reads/writes to direct I/O. |
| 51 | +* Eliminate the page cache, even if it results in significantly increased CPU or disk I/O usage. |
| 52 | +* Remove I/O buffering in Prometheus code, as direct I/O does not imply the absence of user-space |
| 53 | +buffering. |
| 54 | + |
| 55 | +## How |
| 56 | + |
| 57 | +During development, this effort will be gated behind a feature flag (likely `use-uncached-io`). |
| 58 | +Enabling this flag will activate direct I/O for now, or any [other mechanism](#alternatives) later, |
| 59 | +where appropriate, to address the concerns outlined in the [Why](#why) section. |
| 60 | + |
| 61 | +Due to the alignment requirements of direct I/O |
| 62 | +(see [open's man page](https://man7.org/linux/man-pages/man2/open.2.html) for Linux, for example), the |
| 63 | +existing `bufio.Writer` cannot be used for writes. An alignment-aware writer is therefore required. |
| 64 | + |
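To make the alignment constraint concrete, the following is a minimal sketch of the arithmetic an alignment-aware writer has to perform before each write. The helper names and the 4096-byte alignment are assumptions chosen for illustration; they are not taken from the implementation, and the real requirement depends on the device and filesystem.

```go
package main

import "fmt"

// alignDown rounds n down to the nearest multiple of align.
// align is assumed to be a power of two (for example 512 or 4096).
func alignDown(n, align int) int { return n &^ (align - 1) }

// alignUp rounds n up to the nearest multiple of align (a power of two).
func alignUp(n, align int) int { return (n + align - 1) &^ (align - 1) }

// isAligned reports whether n is a multiple of align (a power of two).
func isAligned(n, align int) bool { return n&(align-1) == 0 }

func main() {
	const offsetAlign = 4096 // assumed value, for illustration only
	pending := 10000         // bytes currently held in the user-space buffer

	// Only a whole number of aligned blocks can be submitted with direct I/O;
	// the remainder has to stay buffered until more data arrives or until the
	// final flush decides how to handle the unaligned tail.
	flushable := alignDown(pending, offsetAlign)
	fmt.Println("flushable:", flushable, "tail:", pending-flushable)
	fmt.Println("padded size:", alignUp(pending, offsetAlign))
	fmt.Println("offset 8192 aligned:", isAligned(8192, offsetAlign))
}
```

A plain `bufio.Writer` does none of this bookkeeping: it flushes whatever length happens to be buffered, from a buffer whose address is not guaranteed to be aligned.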
| 65 | +The most suitable open-source direct I/O writer currently available is |
| 66 | +[brk0v/directio](https://github.com/brk0v/directio). However, it alternates between direct and buffered |
| 67 | +I/O, which is not ideal, and it does not support files that are not initially aligned |
| 68 | +(for example, segments with headers). It also lacks some other features, as mentioned below. |
| 69 | + |
| 70 | +To accelerate the feedback loop, an in-tree direct I/O writer will be put together. Once it becomes |
| 71 | +stable, it can be moved out of the tree, contributed to the above repository, or hosted within the |
| 72 | +Prometheus or Prometheus-community organization. |
| 73 | + |
| 74 | +The direct I/O writer will conform to the following `BufWriter` interface (`bufio.Writer` already |
| 75 | +provides `Write` and `Flush`; its `Reset` has a different signature, so it needs a thin wrapper): |
| 76 | + |
| 77 | +```go |
| 78 | +type BufWriter interface { |
| 79 | +	// Writes data to the buffer and returns the number of bytes written. |
| 80 | +	// May trigger an implicit flush if necessary. |
| 81 | +	Write([]byte) (int, error) |
| 82 | + |
| 83 | +	// Flushes any buffered data to the file. |
| 84 | +	// The writer may not be usable after a call to Flush(). |
| 85 | +	Flush() error |
| 86 | + |
| 87 | +	// Discards any unflushed buffered data and clears any errors. |
| 88 | +	// Resets the writer to use the specified file. |
| 89 | +	Reset(f *os.File) error |
| 90 | +} |
| 91 | +``` |
| 92 | + |
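As an illustration of how the buffered (non-direct) path could sit behind the same interface, here is a hedged sketch of a thin adapter around `bufio.Writer`. In the standard library, `(*bufio.Writer).Reset` takes an `io.Writer` and returns nothing, so the adapter supplies a `Reset(*os.File) error` method itself. The names `bufferedWriter` and `writeAndMeasure` are hypothetical and exist only for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// BufWriter mirrors the interface from the proposal.
type BufWriter interface {
	Write([]byte) (int, error)
	Flush() error
	Reset(f *os.File) error
}

// bufferedWriter is a hypothetical adapter: Write and Flush are promoted
// from the embedded *bufio.Writer, and Reset is redefined to match BufWriter.
type bufferedWriter struct {
	*bufio.Writer
}

// Reset discards unflushed data and points the writer at f.
func (w *bufferedWriter) Reset(f *os.File) error {
	w.Writer.Reset(f)
	return nil
}

// Compile-time check that the adapter satisfies the interface.
var _ BufWriter = (*bufferedWriter)(nil)

// writeAndMeasure writes data to a temporary file through a BufWriter
// and returns the resulting file size.
func writeAndMeasure(data []byte) (int64, error) {
	f, err := os.CreateTemp("", "bufwriter-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	var w BufWriter = &bufferedWriter{Writer: bufio.NewWriter(f)}
	if _, err := w.Write(data); err != nil {
		return 0, err
	}
	if err := w.Flush(); err != nil {
		return 0, err
	}
	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := writeAndMeasure([]byte("chunk data"))
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes on disk:", size)
}
```

With both writers behind `BufWriter`, call sites such as chunk writing can choose an implementation based on the feature flag without further changes.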
| 93 | +Additional utilities will also need to be developed to: |
| 94 | + |
| 95 | +* Automatically discover alignment requirements (when supported): |
| 96 | + |
| 97 | + ```go |
| 98 | + // directIORqmts holds the alignment requirements for direct I/O. |
| 99 | + // All fields are in bytes. |
| 100 | + type directIORqmts struct { |
| 101 | + 	// The required alignment for memory buffer addresses. |
| 102 | + 	memoryAlign int |
| 103 | + 	// The required alignment for I/O segment lengths and file offsets. |
| 104 | + 	offsetAlign int |
| 105 | + } |
| 106 | + |
| 107 | + func fetchDirectIORqmts(fd uintptr) (*directIORqmts, error) |
| 108 | + ``` |
| 109 | + |
| 110 | +* Allocate aligned memory buffers: |
| 111 | + |
| 112 | + ```go |
| 113 | + func alignedBlock(size int, alignmentRqmts *directIORqmts) []byte |
| 114 | + ``` |
| 115 | + |
| 116 | +* Enable direct I/O on a file descriptor: |
| 117 | + |
| 118 | + ```go |
| 119 | + func enableDirectIO(fd uintptr) error |
| 120 | + ``` |
| 121 | + |
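Of the utilities listed above, the aligned allocation is the only one that can be sketched portably. The version below is one possible approach, not the implementation from the PR: it over-allocates by `memoryAlign` bytes and slices at the first aligned address. It assumes that the Go runtime does not move heap allocations, which holds for the current `gc` toolchain. The helper `addrAligned` and the 512-byte values are hypothetical. `fetchDirectIORqmts` and `enableDirectIO` are OS-specific (on Linux, discovery would presumably rely on something like `statx` with `STATX_DIOALIGN`, available since kernel 6.1) and are deliberately left out.

```go
package main

import (
	"fmt"
	"unsafe"
)

// directIORqmts mirrors the struct from the proposal. All fields are in bytes.
type directIORqmts struct {
	memoryAlign int
	offsetAlign int
}

// alignedBlock returns a slice of length size whose first byte is located at
// an address that is a multiple of rqmts.memoryAlign.
func alignedBlock(size int, rqmts *directIORqmts) []byte {
	align := rqmts.memoryAlign
	if align <= 1 {
		return make([]byte, size)
	}
	// Over-allocate so that an aligned start address always fits.
	raw := make([]byte, size+align)
	shift := 0
	if rem := int(uintptr(unsafe.Pointer(&raw[0])) % uintptr(align)); rem != 0 {
		shift = align - rem
	}
	// The full slice expression caps the capacity, so an append cannot
	// silently reuse the padding bytes.
	return raw[shift : shift+size : shift+size]
}

// addrAligned reports whether the first byte of b sits on an align boundary.
func addrAligned(b []byte, align int) bool {
	if len(b) == 0 || align <= 1 {
		return true
	}
	return uintptr(unsafe.Pointer(&b[0]))%uintptr(align) == 0
}

func main() {
	// Assumed requirements, for illustration only.
	rqmts := &directIORqmts{memoryAlign: 512, offsetAlign: 512}
	buf := alignedBlock(4096, rqmts)
	fmt.Println(len(buf), addrAligned(buf, rqmts.memoryAlign))
}
```

The cost of this approach is at most `memoryAlign` wasted bytes per buffer, which is negligible for a long-lived write buffer that is allocated once and reused.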
| 122 | +Direct I/O support and alignment requirements differ across OSes and filesystems. To keep the scope |
| 123 | +manageable, the initial focus will be on adding support for Linux (ideally versions `>6.1`). Build |
| 124 | +tags will enforce this restriction. Attempting to enable the feature flag on an unsupported OS will |
| 125 | +result in an error. |
| 126 | + |
| 127 | +Support for additional operating systems and filesystems will be added subsequently. |
| 128 | + |
| 129 | +Another key consideration is ensuring that all tests, especially for TSDB, are also run with direct |
| 130 | +I/O enabled. A build tag will be added to force the direct I/O writer for testing purposes, allowing |
| 131 | +for commands such as: |
| 132 | + |
| 133 | +```shell |
| 134 | +go test --tags=forcedirectio ./tsdb/ |
| 135 | +``` |
| 136 | + |
| 137 | +Dedicated unit tests will be added for the direct I/O writer itself and its utilities. |
| 138 | + |
| 139 | +However, micro-benchmarking the writer may be misleading, as its performance is highly dependent |
| 140 | +on usage patterns. While some micro-benchmarks will be provided to ensure consistency with |
| 141 | +`bufio.Writer` (without sacrificing performance), higher-level benchmarks that simulate real |
| 142 | +Prometheus processes (such as compaction, rather than writing a `10GB` buffer, which is unlikely to |
| 143 | +be encountered in Prometheus) will receive greater emphasis. These benchmarks may also help |
| 144 | +fine-tune parameters like the buffer size. |
| 145 | + |
| 146 | +As a preview of the results, the screenshot below shows a reduction of `20-50%` in page cache usage, |
| 147 | +as measured by the `container_memory_cache` metric while running Prombench on |
| 148 | +[Prometheus PR #15365](https://github.com/prometheus/prometheus/pull/15365). In this test, the |
| 149 | +direct I/O writer was used for writing chunks during compaction. No significant regressions or |
| 150 | +adverse effects on CPU or disk I/O were observed. |
| 151 | + |
| 152 | + |
| 153 | + |
| 154 | +## Alternatives |
| 155 | + |
| 156 | +An alternative approach is to focus on enhancing user understanding of page cache behavior within |
| 157 | +Prometheus, helping them better interpret and adapt to it, without making any changes. |
| 158 | + |
| 159 | +## Action Plan |
| 160 | + |
| 161 | +* [ ] Implement the direct I/O writer and its utils, and use it for writing chunks |
| 162 | +during compaction (behind `use-uncached-io`): <https://github.com/prometheus/prometheus/pull/15365> |
| 163 | +* [ ] Add more tests, identify and/or add relevant benchmarks and metrics (maybe add them to the |
| 164 | +prombench dashboard if needed). |
| 165 | +* [ ] Identify more use cases for direct I/O. |
| 166 | +* [ ] Add support for more OSes. |