12 changes: 12 additions & 0 deletions doc/CONFIGURATION.md
@@ -295,6 +295,18 @@ At mount time, Mountpoint automatically selects appropriate defaults to provide
* By default, Mountpoint can serve up to 16 concurrent file or directory operations, and automatically scales up to reach this limit. If your application makes more than this many concurrent reads and writes (including to the same or different files), you can improve performance by increasing this limit with the `--max-threads` command-line argument. Higher values of this flag might cause Mountpoint to use more of your instance's resources.
* When reading or writing files to S3, Mountpoint divides them into parts and uses parallel requests to improve throughput. You can change the part size Mountpoint uses for these parallel requests using the `--read-part-size` and `--write-part-size` command-line arguments, providing a maximum number of bytes per part for reading or writing respectively. For Mountpoint v1.7.2 or earlier, use `--part-size` instead. The default value for these arguments is 8 MiB (8,388,608 bytes), which in our testing is the largest value that achieves maximum throughput. Larger values can reduce the number of billed requests Mountpoint makes, but also reduce the throughput of object reads and writes to S3.

### Maximum number of files open for write

Mountpoint enforces a cap on the number of files that may be open for write at the same time, to prevent out-of-memory crashes. The cap is computed at startup from the configured memory target and write part size:
Contributor

Can we say "to control memory usage" instead of oom crashes?

Contributor

We should start mentioning what's the default for memory target (and thus max writes).

Contributor Author

Done. Added a mention of the default value.


```
max_concurrent_writes = (memory_target − additional_mem_reserved) / write_part_size
```

`memory_target` is set with `--memory-target` and `write_part_size` is set with `--write-part-size` (or with `--part-size`). `additional_mem_reserved` is `max(128 MiB, memory_target / 8)` and is held back from data buffers for Mountpoint's own overhead. The minimum supported `memory_target` is 512 MiB, which allows 48 concurrent writers at the default 8 MiB write part size.

Once the cap is reached, `open()` calls for write return `ENOMEM` ("Cannot allocate memory") until an existing write handle is closed. To raise the cap, increase `--memory-target` or decrease `--write-part-size`.
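
As a rough illustration of the formula above, the following Rust sketch reproduces the arithmetic. The function names are hypothetical and are not taken from Mountpoint's source. The sketch assumes the division rounds down and that `write_part_size` is non-zero.

```rust
const MIB: u64 = 1024 * 1024;

/// Memory held back for Mountpoint's own overhead: max(128 MiB, memory_target / 8).
/// (Hypothetical helper name, used only for this illustration.)
fn additional_mem_reserved(memory_target: u64) -> u64 {
    (128 * MIB).max(memory_target / 8)
}

/// Number of files that may be open for write at the same time.
/// Assumes integer (floor) division and a non-zero `write_part_size`.
fn max_concurrent_writes(memory_target: u64, write_part_size: u64) -> u64 {
    // Budget left for data buffers after the reserved overhead is subtracted.
    let budget = memory_target.saturating_sub(additional_mem_reserved(memory_target));
    budget / write_part_size
}
```

Under these assumptions, `max_concurrent_writes(512 * MIB, 8 * MIB)` evaluates to 48, matching the figure quoted above, and doubling the write part size to 16 MiB halves the cap to 24.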

### Maximum object size

In its default configuration, there is no maximum on the size of objects Mountpoint can read. However, Mountpoint uses [multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) when writing new objects, and multipart upload allows a maximum of 10,000 parts for an object. This means Mountpoint can only upload objects up to 80,000 MiB (78.1 GiB) in size. If your application tries to write objects larger than this limit, writes will fail with an out of space error.
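
The 80,000 MiB figure follows directly from the part limit. A minimal sketch of the arithmetic, assuming the upload limit is simply the 10,000-part maximum multiplied by the write part size (the function name is hypothetical):

```rust
const MIB: u64 = 1024 * 1024;

/// Multipart upload allows at most 10,000 parts per object.
const MAX_PARTS: u64 = 10_000;

/// Largest object, in bytes, that fits in `MAX_PARTS` parts of `write_part_size` bytes each.
/// (Hypothetical helper, for illustration only.)
fn max_upload_size(write_part_size: u64) -> u64 {
    MAX_PARTS * write_part_size
}
```

With the default 8 MiB part size, `max_upload_size(8 * MIB)` is 80,000 MiB, or about 78.1 GiB.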
1 change: 1 addition & 0 deletions doc/METRICS.md
@@ -86,6 +86,7 @@ Mountpoint emits the following metrics:

| Metric | Type | Dimensions | Description |
|--------|------|------------|-------------|
| `fs.write_handle_limit_exceeded` | Counter | | Number of `open()` calls for write rejected because the [concurrent-writers cap](CONFIGURATION.md#maximum-number-of-files-open-for-write) was reached |
Contributor

Before documenting the new metric, we should add it to metrics::defs. Or we can leave it for a later review of all new metrics.

Contributor Author

Didn't know about the defs file.

Removed from docs for now - we can review all new metrics later as you've suggested.

| `fuse.io_size` | Histogram | `fuse_request` (read, write) | Bytes transferred per FUSE request |
| `fuse.request_errors` | Counter | `fuse_request` (read, write, etc.) | Number of FUSE request errors |
| `fuse.request_latency` | Histogram | `fuse_request` (read, write, etc.) | Time to process a FUSE request |
2 changes: 2 additions & 0 deletions doc/SEMANTICS.md
@@ -89,6 +89,8 @@ Your application should not write to the same object from multiple instances at

By default, Mountpoint ensures that new file uploads to a single key are atomic. As soon as an upload completes, other clients are able to see the new key and the entire content of the object. If the `--incremental-upload` flag is set, however, Mountpoint may issue multiple separate uploads during file writes to append data to the object. After each upload, the appended object in your S3 bucket will be visible to other clients.

Mountpoint enforces a cap on the number of files that may be open for write at the same time, derived from `--memory-target` and `--write-part-size`. When the cap is reached, `open()` for write returns `ENOMEM` until an existing write handle is closed. See [CONFIGURATION.md](CONFIGURATION.md#maximum-number-of-files-open-for-write) for more details.
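
An application that may hit this cap can treat `ENOMEM` from `open()` as a transient condition and retry after closing or waiting on other writers. The sketch below is one possible approach, not part of Mountpoint: `retry_on_enomem` is a hypothetical helper, and it relies on the Rust standard library reporting `ENOMEM` as `std::io::ErrorKind::OutOfMemory` on Unix.

```rust
use std::io::{self, ErrorKind};
use std::thread;
use std::time::Duration;

/// Calls `op` until it succeeds, fails with an error other than ENOMEM,
/// or has been tried `max_attempts` times (it is always tried at least once).
/// Sleeps for `delay` between attempts. Hypothetical helper for illustration.
fn retry_on_enomem<T>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt = 1;
    loop {
        match op() {
            // ENOMEM surfaces as ErrorKind::OutOfMemory; wait and try again.
            Err(e) if e.kind() == ErrorKind::OutOfMemory && attempt < max_attempts => {
                attempt += 1;
                thread::sleep(delay);
            }
            // Success, a different error, or attempts exhausted: hand the result back.
            other => return other,
        }
    }
}
```

A caller might wrap the open call, for example `retry_on_enomem(5, Duration::from_millis(200), || std::fs::File::create(&path))`, where `path` points into the mounted bucket. Retrying only helps if some other write handle is eventually closed, so a bounded attempt count is a sensible default.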

### Optional metadata and object content caching

Mountpoint also offers optional metadata and object content caching.
146 changes: 145 additions & 1 deletion mountpoint-s3-fs/src/fs.rs
@@ -22,6 +22,7 @@ use crate::prefetch::{Prefetcher, PrefetcherBuilder};
use crate::sync::atomic::{AtomicU64, Ordering};
use crate::sync::{Arc, AsyncMutex, AsyncRwLock};
use crate::upload::{Uploader, UploaderConfig};
use crate::write_handle_limiter::WriteHandleLimiter;

mod config;
pub use config::{CacheConfig, S3FilesystemConfig};
@@ -55,6 +56,7 @@ where
metablock: Arc<dyn Metablock>,
prefetcher: Prefetcher<Client>,
uploader: Uploader<Client>,
write_handle_limiter: Arc<WriteHandleLimiter>,
next_handle: AtomicU64,
file_handles: AsyncRwLock<HashMap<u64, Arc<FileHandle<Client>>>>,
}
@@ -150,6 +152,11 @@ where
trace!(?config, "new filesystem");

let pool = pool.clone();
let write_handle_limiter = Arc::new(WriteHandleLimiter::new(
pool.mem_limit(),
pool.data_buffer_budget(),
client.write_part_size(),
));
let prefetcher = prefetch_builder.build(runtime.clone(), pool.clone(), config.prefetcher_config);
let uploader = Uploader::new(
client.clone(),
@@ -167,6 +174,7 @@ where
metablock: Arc::new(metablock),
prefetcher,
uploader,
write_handle_limiter,
next_handle: AtomicU64::new(1),
file_handles: AsyncRwLock::new(HashMap::new()),
}
@@ -349,13 +357,18 @@ where

let fh = self.next_handle(); // TODO: can we delay obtaining the next handle until we know we are creating a new file handle?
let write_mode = self.config.write_mode();
let new_handle = self.metablock.open_handle(ino, fh, &write_mode, flags).await?;
let mut new_handle = self
.metablock
.open_handle(ino, fh, &write_mode, flags, Some(&self.write_handle_limiter))
.await?;
let write_slot = new_handle.write_slot.take();
Contributor

Why take() and mut new_handle? We don't actually want to mutate anything.

If this was just to "fix" lifetimes, consider instead unpacking or copying the data you will need later.

Contributor Author

Makes sense, done.

let state = FileHandleState::new(&new_handle, flags, self).await?;
let handle = FileHandle {
ino,
location: new_handle.lookup.try_into_s3_location()?,
open_pid: pid,
state: AsyncMutex::new(state),
write_slot,
};
debug!(fh, ino, "new {:?} file handle created", new_handle.mode);
self.file_handles.write().await.insert(fh, Arc::new(handle));
@@ -803,6 +816,7 @@ mod tests {
.bucket(bucket.to_string())
.enable_backpressure(true)
.initial_read_window_size(1024 * 1024)
.part_size(1024 * 1024)
.build(),
);
// Create "dir1" in the client to avoid creating it locally
@@ -1090,4 +1104,134 @@ mod tests {
);
S3Filesystem::new(client, prefetcher_builder, pool, runtime, superblock, fs_config)
}

/// Verifies that the limiter rejects opens for write past the configured cap with `ENOMEM`,
/// and that releasing a handle re-opens a slot. Uses a deliberately tight `mem_limit` so the
/// derived cap is small enough to exhaust quickly.
///
/// The MockClient `part_size` is also the value `client.write_part_size()` returns. With
/// `mem_limit = 256 MiB`, `part_size = 32 MiB`, `additional_mem_reserved = max(128, 32) = 128 MiB`,
/// the formula gives `(256 - 128) / 32 = 4` concurrent writers.
#[tokio::test]
async fn test_open_for_write_returns_enomem_when_cap_exhausted() {
let test_name = "test_open_for_write_returns_enomem_when_cap_exhausted";
let bucket = Bucket::new("bucket").unwrap();
let client = MockClient::config()
.bucket(bucket.to_string())
.enable_backpressure(true)
.initial_read_window_size(1024 * 1024)
.part_size(32 * 1024 * 1024)
Contributor

Could we define this and other values as constants at the top of this function? And move the calculation described in the rustdoc there.

Contributor Author

Done

.build();
client.add_object(
&format!("dir1/{}1.txt", test_name),
MockObject::constant(0xa1, 15, ETag::for_tests()),
);

let runtime = Runtime::new(ThreadPool::builder().pool_size(2).create().unwrap());
let pool = PagedPool::new_with_candidate_sizes([32 * 1024 * 1024], 256 * 1024 * 1024);
let prefetcher_builder = Prefetcher::default_builder(client.clone());
let fs_config = S3FilesystemConfig {
allow_overwrite: true,
..Default::default()
};
let superblock = Superblock::new(
client.clone(),
S3Path::new(bucket, Default::default()),
SuperblockConfig {
cache_config: fs_config.cache_config.clone(),
s3_personality: fs_config.s3_personality,
},
);
let fs = S3Filesystem::new(client, prefetcher_builder, pool, runtime, superblock, fs_config);

// Sanity-check that we computed exactly 4 writer slots given the test's tuning.
let cap = fs.write_handle_limiter.max_concurrent_writes();
assert_eq!(cap, 4);

// Resolve the directory inode for mknod calls below.
let dir_entry = fs.lookup(FUSE_ROOT_INODE, "dir1".as_ref()).await.unwrap();
let read_dir_ino = dir_entry.attr.ino;

// Create more files than the write cap, then prove the cap holds.
let mut files = Vec::new();
for i in 0..(cap + 1) {
let dentry = fs
.mknod(
read_dir_ino,
format!("file{i}.bin").as_ref(),
libc::S_IFREG | libc::S_IRWXU,
0,
0,
)
.await
.unwrap();
files.push(dentry);
}

// Open up to the cap: all should succeed.
let mut open_handles = Vec::new();
for dentry in files.iter().take(cap) {
let opened = fs
.open(dentry.attr.ino, OpenFlags::O_WRONLY, 0)
.await
.expect("open within cap should succeed");
open_handles.push(opened);
}

// The next open exceeds the cap → ENOMEM with the expected message.
let err = fs
.open(files[cap].attr.ino, OpenFlags::O_WRONLY, 0)
.await
.expect_err("opening past the cap should return ENOMEM");
assert_eq!(err.errno, libc::ENOMEM);
let msg = format!("{err}");
assert!(
msg.contains("cannot open file for write"),
"unexpected error message: {msg}"
);
assert!(
msg.contains(&cap.to_string()),
"error message should reference cap of {cap}: {msg}"
);

// Re-opening the rejected file *before* freeing a slot still returns ENOMEM (no inode
// state was mutated by the rejected open, and the cap is still full).
let err = fs
.open(files[cap].attr.ino, OpenFlags::O_WRONLY, 0)
.await
.expect_err("re-opening the rejected file while cap is full should still return ENOMEM");
assert_eq!(err.errno, libc::ENOMEM);

// Locks in the fail-fast check order: when the cap is exhausted AND the target file
// already has an active writer (open_handles[0] is still live for files[0]), the user
// sees ENOMEM rather than EPERM. The cheap lock-free limiter check runs before the
// inode-locked conflict check, so cap exhaustion wins. See the commit message for the
// ordering rationale; flipping this order is a deliberate design change.
let err = fs
.open(files[0].attr.ino, OpenFlags::O_WRONLY, 0)
.await
.expect_err("opening an already-writing file at cap should return an error");
assert_eq!(
err.errno,
libc::ENOMEM,
"limiter check should fire before inode-conflict check (got errno {})",
err.errno
);

// Closing one of the open handles releases a slot.
fs.flush(files[0].attr.ino, open_handles[0].fh, 0, 0)
.await
.expect("flush should succeed");
fs.release(files[0].attr.ino, open_handles[0].fh, 0, None, true)
.await
.expect("release should succeed");

// The rejected file can now be opened cleanly. This validates that the ENOMEM rejection
// didn't leave the inode in `LocalOpenForWriting` — the metablock acquires the slot
// before mutating any state, so a rejection is fully reversible.
let _opened_retry = fs
.open(files[cap].attr.ino, OpenFlags::O_WRONLY, 0)
.await
.expect("retrying the previously-rejected file should succeed after a slot is freed");
}
}
1 change: 1 addition & 0 deletions mountpoint-s3-fs/src/fs/error.rs
@@ -189,6 +189,7 @@ impl ToErrno for InodeError {
InodeError::OutOfOrderReadDir { .. } => libc::EBADF,
InodeError::NoSuchDirHandle { .. } => libc::EINVAL,
InodeError::FlexibleRetrievalObjectNotAccessible(_) => libc::EACCES,
InodeError::WriteHandleLimitExceeded(_) => libc::ENOMEM,
}
}
}
6 changes: 6 additions & 0 deletions mountpoint-s3-fs/src/fs/handles.rs
@@ -10,6 +10,7 @@ use crate::object::ObjectId;
use crate::prefetch::PrefetchGetObject;
use crate::sync::{Arc, AsyncMutex};
use crate::upload::{AppendUploadRequest, UploadRequest};
use crate::write_handle_limiter::WriteHandleSlot;

use super::{Error, InodeNo, OpenFlags, S3Filesystem, ToErrno};

@@ -23,6 +24,11 @@ where
pub state: AsyncMutex<FileHandleState<Client>>,
/// Process that created the handle
pub open_pid: u32,
/// Slot reserved on the [`WriteHandleLimiter`](crate::write_handle_limiter::WriteHandleLimiter) for this handle.
/// `Some` for write handles, `None` for read handles. Released automatically when the `FileHandle`
/// is dropped. Held purely for that `Drop` side effect, so the field is never read directly.
#[expect(dead_code, reason = "held for its Drop side effect")]
pub(super) write_slot: Option<WriteHandleSlot>,
Contributor

Shouldn't this be in FileHandleState::Write?

Contributor

Also, prefer _write_slot to the dead code expect.

Contributor Author

Good suggestion - done.

}

impl<Client> FileHandle<Client>
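
The `write_handle_limiter` module itself is not shown in this diff, so the following is only a sketch of how a slot that is "held for its `Drop` side effect" could work. All names here (`SlotLimiter`, `Slot`, `Handle`, `try_acquire`) are hypothetical stand-ins, not the PR's actual types. It uses the `_write_slot` naming suggested in the review thread above instead of a `dead_code` expectation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the limiter: counts live slots against a fixed cap.
struct SlotLimiter {
    max: usize,
    in_use: AtomicUsize,
}

/// Hypothetical stand-in for the slot: gives its place back when dropped.
struct Slot {
    limiter: Arc<SlotLimiter>,
}

impl SlotLimiter {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { max, in_use: AtomicUsize::new(0) })
    }

    /// Lock-free acquire: returns `None` once `max` slots are live.
    fn try_acquire(self: &Arc<Self>) -> Option<Slot> {
        self.in_use
            // Increment only while below the cap; otherwise leave the counter untouched.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| (n < self.max).then_some(n + 1))
            .ok()
            .map(|_| Slot { limiter: Arc::clone(self) })
    }
}

impl Drop for Slot {
    fn drop(&mut self) {
        // Releasing the slot is the only thing the slot ever does.
        self.limiter.in_use.fetch_sub(1, Ordering::AcqRel);
    }
}

/// A handle that never reads the slot; the leading underscore documents that
/// the field exists only so the slot is released when the handle is dropped.
struct Handle {
    _write_slot: Option<Slot>,
}
```

Because the release happens in `Drop`, every path that discards a handle (normal release, error unwinding, early return) frees the slot without any explicit bookkeeping by the caller.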
1 change: 1 addition & 0 deletions mountpoint-s3-fs/src/lib.rs
@@ -21,6 +21,7 @@ pub mod s3;
mod superblock;
mod sync;
pub mod upload;
pub mod write_handle_limiter;
Contributor

Consider moving under memory or fs.

Contributor Author

Done


pub use async_util::Runtime;
pub use config::MountpointConfig;
2 changes: 2 additions & 0 deletions mountpoint-s3-fs/src/manifest/metablock.rs
@@ -14,6 +14,7 @@ use crate::metablock::{
use crate::s3::S3Path;
use crate::sync::atomic::{AtomicU64, Ordering};
use crate::sync::{Arc, Mutex, RwLock};
use crate::write_handle_limiter::WriteHandleLimiter;

use super::core::{Manifest, ManifestDirIter, ManifestError};

@@ -187,6 +188,7 @@ impl Metablock for ManifestMetablock {
_fh: u64,
_write_mode: &WriteMode,
flags: OpenFlags,
_write_handle_limiter: Option<&Arc<WriteHandleLimiter>>,
) -> Result<NewHandle, InodeError> {
let lookup = self.getattr(ino, false).await?;
if flags.contains(OpenFlags::O_WRONLY) {
13 changes: 13 additions & 0 deletions mountpoint-s3-fs/src/memory/limiter.rs
@@ -123,6 +123,19 @@ impl MemoryLimiter {
}
}

/// The configured memory limit in bytes. Note this is the total memory target including
/// non-buffer overhead, not the budget available for data buffers — see [`Self::data_buffer_budget`].
pub fn mem_limit(&self) -> u64 {
self.mem_limit
}

/// The static memory budget available for data buffers, i.e. `mem_limit - additional_mem_reserved`.
/// This is the upper bound on buffer-backed allocations and is used by
/// [`crate::write_handle_limiter::WriteHandleLimiter`] to derive its cap.
pub fn data_buffer_budget(&self) -> u64 {
self.mem_limit.saturating_sub(self.additional_mem_reserved)
}

/// Reserve the memory for future uses. Always succeeds, even if it means going beyond
/// the configured memory limit.
pub fn reserve(&self, cursor_id: CursorId, area: BufferArea, size: u64) {
10 changes: 10 additions & 0 deletions mountpoint-s3-fs/src/memory/pool.rs
@@ -199,6 +199,16 @@ impl PagedPool {

// ─── Delegation methods for MemoryLimiter ───────────────────────────────────

/// The configured memory limit in bytes.
pub fn mem_limit(&self) -> u64 {
self.inner.limiter.mem_limit()
}

/// The static memory budget available for data buffers, i.e. `mem_limit - additional_mem_reserved`.
pub fn data_buffer_budget(&self) -> u64 {
self.inner.limiter.data_buffer_budget()
}

/// Reserve memory for future uses. Always succeeds (unconditional).
pub fn reserve(&self, cursor_id: CursorId, area: BufferArea, size: u64) {
self.inner.limiter.reserve(cursor_id, area, size);
10 changes: 10 additions & 0 deletions mountpoint-s3-fs/src/metablock.rs
@@ -22,6 +22,8 @@ pub use pending_upload::PendingUploadHook;
pub use stat::{InodeKind, InodeNo, InodeStat};

use crate::fs::OpenFlags;
use crate::sync::Arc;
use crate::write_handle_limiter::WriteHandleLimiter;

pub const ROOT_INODE_NO: InodeNo = crate::fs::FUSE_ROOT_INODE;

@@ -63,6 +65,7 @@ pub trait Metablock: Send + Sync {
fh: u64,
write_mode: &WriteMode,
flags: OpenFlags,
write_handle_limiter: Option<&Arc<WriteHandleLimiter>>,
) -> Result<NewHandle, InodeError>;

/// Increase the size of a file open for writing.
@@ -226,20 +229,27 @@ pub enum ReadWriteMode {
pub struct NewHandle {
pub lookup: Lookup,
pub mode: ReadWriteMode,
/// Write-handle slot reserved for this handle when the open resolves to write mode.
/// `Some` if the metablock layer reserved a slot during `open_handle`, `None` otherwise
/// (read mode, or no limiter configured). The caller must transfer ownership into its own
/// `FileHandle` so the slot is released when the file handle is dropped.
pub write_slot: Option<crate::write_handle_limiter::WriteHandleSlot>,
}

impl NewHandle {
pub fn read(lookup: Lookup) -> Self {
Self {
lookup,
mode: ReadWriteMode::Read,
write_slot: None,
}
}

pub fn write(lookup: Lookup) -> Self {
Self {
lookup,
mode: ReadWriteMode::Write,
write_slot: None,
}
}
}
11 changes: 11 additions & 0 deletions mountpoint-s3-fs/src/metablock/error.rs
@@ -9,6 +9,7 @@ use crate::manifest::ManifestError;
use crate::metablock::S3Location;
use crate::sync::Arc;
use crate::upload::UploadError;
use crate::write_handle_limiter::WriteHandleLimitError;

use super::InodeNo;

@@ -82,6 +83,16 @@ pub enum InodeError {
NoSuchDirHandle { fh: u64 },
#[error("objects in flexible retrieval storage classes are not accessible")]
FlexibleRetrievalObjectNotAccessible(InodeErrorInfo),
#[error(
"cannot open file for write: exceeded max allowed concurrent write file handles of {max} \
based on memory target {mem_limit_mib}MiB (part size is {write_part_size_mib}MiB). \
Increase --memory-target or decrease --write-part-size to allow for more concurrent writes, \
or close existing file handles open for write and retry the open() operation.",
max = .0.max,
mem_limit_mib = .0.mem_limit_mib,
write_part_size_mib = .0.write_part_size_mib,
)]
WriteHandleLimitExceeded(WriteHandleLimitError),
}

impl InodeError {