Commit 97513ab

feat: enable THP for guest memory
This commit adds THP for the guest memory, with a new value for the huge_pages option.

Signed-off-by: Marco Marangoni <mamarang@amazon.com>
1 parent e04e55f commit 97513ab

17 files changed

Lines changed: 244 additions & 53 deletions

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@ and this project adheres to

 ### Added

+- [#6003](https://github.com/firecracker-microvm/firecracker/pull/6003): Added a
+  new option `Transparent` for the `huge_pages` setting. If set, Firecracker
+  will use transparent huge pages for the guest memory via
+  `madvise(MADV_HUGEPAGE)`. Guest memory must be a multiple of 2MB when using
+  this option.
+
 ### Changed

 ### Deprecated

docs/hugepages.md

Lines changed: 32 additions & 19 deletions
@@ -1,14 +1,37 @@
 # Backing Guest Memory by Huge Pages

-Firecracker supports backing the guest memory of a VM by 2MB hugetlbfs pages.
-This can be enabled by setting the `huge_pages` field of `PUT` or `PATCH`
-requests to the `/machine-config` endpoint to `2M`.
-
-Backing guest memory by huge pages can bring performance improvements for
-specific workloads, due to less TLB contention and less overhead during
-virtual->physical address resolution. It can also help reduce the number of
-KVM_EXITS required to rebuild extended page tables post snapshot restore, as
-well as improve boot times (by up to 50% as measured by Firecracker's
+Firecracker supports three modes for the `huge_pages` field of `PUT` or `PATCH`
+requests to the `/machine-config` endpoint:
+
+- `None` (default): Uses regular 4K pages with no huge page behavior.
+- `Transparent`: Uses `madvise(MADV_HUGEPAGE)` to request transparent huge pages
+  for guest memory. Guest memory size must be a multiple of 2MB.
+- `2M`: Backs guest memory by 2MB hugetlbfs pages.
+
+## Transparent Huge Pages (THP)
+
+Setting `huge_pages` to `Transparent` enables transparent huge pages for guest
+memory via `madvise(MADV_HUGEPAGE)`. This allows the kernel to opportunistically
+back guest memory with 2MB pages without requiring a pre-allocated hugetlbfs
+pool.
+
+Limitations:
+
+- THP is only effective for anonymous memory (non-memfd). When vhost-user-blk
+  devices are in use, guest memory is memfd-backed and THP will not be applied.
+- THP does not integrate with UFFD; no transparent huge pages will be allocated
+  during userfault-handling while resuming from a snapshot.
+
+Please refer to the [Linux Documentation][thp_docs] for more information.
+
+## Hugetlbfs (2M)
+
+Setting `huge_pages` to `2M` backs guest memory by 2MB hugetlbfs pages. This can
+bring performance improvements for specific workloads, due to less TLB
+contention and less overhead during virtual->physical address resolution. It can
+also help reduce the number of KVM_EXITS required to rebuild extended page
+tables post snapshot restore, as well as improve boot times (by up to 50% as
+measured by Firecracker's
 [boot time performance tests](../tests/integration_tests/performance/test_boottime.py))

 Using hugetlbfs requires the host running Firecracker to have a pre-allocated
@@ -43,15 +66,5 @@ the device is unable to reclaim the hugepage backing of the guest and drop RSS.
 However, the balloon can still be inflated and used to restrict memory usage in
 the guest.

-## FAQ
-
-### Why does Firecracker not offer a transparent huge pages (THP) setting?
-
-Firecracker's guest memory can be memfd based. Linux (as of 6.1) does not offer
-a way to dynamically enable THP for such memory regions. Additionally, UFFD does
-not integrate with THP (no transparent huge pages will be allocated during
-userfaulting). Please refer to the [Linux Documentation][thp_docs] for more
-information.
-
 [hugetlbfs_docs]: https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
 [thp_docs]: https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html#hugepages-in-tmpfs-shmem
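
Reviewer note: the THP mode documented above amounts to an ordinary anonymous mapping followed by a `madvise(MADV_HUGEPAGE)` hint. The following is a minimal sketch of that mechanism, not Firecracker's code. It assumes 64-bit Linux, declares the C functions directly instead of using the `libc` crate, and hardcodes the constant values from the Linux headers. The function name `map_with_thp_hint` is hypothetical. As in the commit, a failed `madvise` is reported but does not abort.

```rust
use std::ffi::c_void;
use std::io;
use std::os::raw::c_int;

// Assumption: values taken from the Linux x86_64/aarch64 headers.
// Real code would use the constants exported by the `libc` crate.
const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const MADV_HUGEPAGE: c_int = 14;

unsafe extern "C" {
    fn mmap(
        addr: *mut c_void,
        len: usize,
        prot: c_int,
        flags: c_int,
        fd: c_int,
        offset: i64,
    ) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: c_int) -> c_int;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

/// Hypothetical helper: maps `len` bytes of anonymous memory and asks the
/// kernel to back it with transparent huge pages where possible.
///
/// Returns `Ok(true)` if the hint was accepted, `Ok(false)` if `madvise`
/// failed (treated as non-fatal), and `Err` if the mapping itself failed.
pub fn map_with_thp_hint(len: usize) -> io::Result<bool> {
    // SAFETY: anonymous private mapping with no file descriptor; the kernel
    // chooses the address.
    let addr = unsafe {
        mmap(
            std::ptr::null_mut(),
            len,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    // mmap signals failure with MAP_FAILED, i.e. (void *) -1.
    if addr as usize == usize::MAX {
        return Err(io::Error::last_os_error());
    }

    // SAFETY: the range [addr, addr + len) was just mapped above.
    let accepted = unsafe { madvise(addr, len, MADV_HUGEPAGE) } == 0;
    if !accepted {
        // The hint is best-effort: log the failure and keep the mapping usable.
        eprintln!("madvise(MADV_HUGEPAGE) failed: {}", io::Error::last_os_error());
    }

    // SAFETY: unmapping exactly the range mapped above; it is not used afterwards.
    unsafe {
        munmap(addr, len);
    }
    Ok(accepted)
}
```

Calling `map_with_thp_hint(4 * 1024 * 1024)` requests a 4 MiB region. Whether the kernel actually uses 2MB pages depends on the host THP settings, which this sketch does not inspect.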

src/firecracker/src/api_server/request/machine_configuration.rs

Lines changed: 1 addition & 0 deletions
@@ -104,6 +104,7 @@ mod tests {

         let huge_pages_cases = [
             ("None", HugePageConfig::None),
+            ("Transparent", HugePageConfig::Transparent),
             ("2M", HugePageConfig::Hugetlbfs2M),
         ];

src/firecracker/swagger/firecracker.yaml

Lines changed: 6 additions & 1 deletion
@@ -1442,8 +1442,13 @@ definitions:
         type: string
         enum:
           - None
+          - Transparent
           - 2M
-        description: Which huge pages configuration (if any) should be used to back guest memory.
+        default: None
+        description: >-
+          Which huge pages configuration should be used to back guest memory.
+          "None" uses regular 4K pages. "Transparent" enables THP via
+          madvise(MADV_HUGEPAGE). "2M" uses explicit hugetlbfs 2MB pages.

   MemoryBackend:
     type: object

src/vmm/src/devices/virtio/vhost_user.rs

Lines changed: 1 addition & 0 deletions
@@ -487,6 +487,7 @@ pub(crate) mod tests {
                 libc::MAP_PRIVATE,
                 Some(file),
                 false,
+                libc::MADV_HUGEPAGE,
             )
             .unwrap()
             .into_iter()

src/vmm/src/persist.rs

Lines changed: 10 additions & 3 deletions
@@ -449,8 +449,13 @@ pub fn restore_from_snapshot(
                 .into());
             }
             (
-                guest_memory_from_file(mem_backend_path, mem_state, track_dirty_pages)
-                    .map_err(RestoreFromSnapshotGuestMemoryError::File)?,
+                guest_memory_from_file(
+                    mem_backend_path,
+                    mem_state,
+                    track_dirty_pages,
+                    vm_resources.machine_config.huge_pages,
+                )
+                .map_err(RestoreFromSnapshotGuestMemoryError::File)?,
                 None,
             )
         }
@@ -512,9 +517,11 @@ fn guest_memory_from_file(
     mem_file_path: &Path,
     mem_state: &GuestMemoryState,
     track_dirty_pages: bool,
+    huge_pages: HugePageConfig,
 ) -> Result<Vec<GuestRegionMmap>, GuestMemoryFromFileError> {
     let mem_file = File::open(mem_file_path)?;
-    let guest_mem = memory::snapshot_file(mem_file, mem_state.regions(), track_dirty_pages)?;
+    let guest_mem =
+        memory::snapshot_file(mem_file, mem_state.regions(), track_dirty_pages, huge_pages)?;
     Ok(guest_mem)
 }

src/vmm/src/resources.rs

Lines changed: 21 additions & 0 deletions
@@ -580,6 +580,7 @@ mod tests {
     use crate::vmm_config::RateLimiterConfig;
     use crate::vmm_config::boot_source::{BootConfig, BootSource, BootSourceConfig};
     use crate::vmm_config::drive::{BlockBuilder, BlockDeviceConfig};
+    use crate::vmm_config::machine_config::HugePageConfig::{Hugetlbfs2M, Transparent};
     use crate::vmm_config::machine_config::{HugePageConfig, MachineConfig, MachineConfigError};
     use crate::vmm_config::net::{NetBuilder, NetworkInterfaceConfig};
     use crate::vmm_config::vsock::tests::default_config;
@@ -1476,6 +1477,26 @@ mod tests {
             Err(MachineConfigError::InvalidMemorySize)
         );

+        // Odd memory size - not supported by THP/Hugetlbfs
+        aux_vm_config.mem_size_mib = Some(1025);
+        aux_vm_config.huge_pages = Some(Transparent);
+        assert_eq!(
+            vm_resources.update_machine_config(&aux_vm_config),
+            Err(MachineConfigError::InvalidMemorySize)
+        );
+        aux_vm_config.huge_pages = Some(Hugetlbfs2M);
+        assert_eq!(
+            vm_resources.update_machine_config(&aux_vm_config),
+            Err(MachineConfigError::InvalidMemorySize)
+        );
+        // Odd size supported by HugePageConfig::None
+        aux_vm_config.huge_pages = Some(HugePageConfig::None);
+        vm_resources.update_machine_config(&aux_vm_config).unwrap();
+        assert_eq!(
+            MachineConfigUpdate::from(vm_resources.machine_config.clone()),
+            aux_vm_config
+        );
+
         // Incompatible mem_size_mib with balloon size.
         vm_resources.machine_config.mem_size_mib = 128;
         vm_resources

src/vmm/src/vmm_config/machine_config.rs

Lines changed: 19 additions & 4 deletions
@@ -34,9 +34,11 @@ pub enum MachineConfigError {
 /// Describes the possible (huge)page configurations for a microVM's memory.
 #[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
 pub enum HugePageConfig {
-    /// Do not use hugepages, e.g. back guest memory by 4K
+    /// Back guest memory by 4K pages, no hugepage behavior
     #[default]
     None,
+    /// Use madvise(MADV_HUGEPAGE) for transparent huge pages
+    Transparent,
     /// Back guest memory by 2MB hugetlbfs pages
     #[serde(rename = "2M")]
     Hugetlbfs2M,
@@ -49,6 +51,10 @@ impl HugePageConfig {
         let divisor = match self {
             // Any integer memory size expressed in MiB will be a multiple of 4096KiB.
             HugePageConfig::None => 1,
+            // Note: THP technically supports memory not 2MB aligned, however that would mean
+            // some pages at the tail would be forced to be 4k size. To avoid performance/fragmentation surprises,
+            // having a memory multiple of 2MB is wiser.
+            HugePageConfig::Transparent => 2,
             HugePageConfig::Hugetlbfs2M => 2,
         };

@@ -59,11 +65,20 @@ impl HugePageConfig {
     /// create a mapping backed by huge pages as described by this [`HugePageConfig`].
     pub fn mmap_flags(&self) -> libc::c_int {
         match self {
-            HugePageConfig::None => 0,
+            HugePageConfig::None | HugePageConfig::Transparent => 0,
             HugePageConfig::Hugetlbfs2M => libc::MAP_HUGETLB | libc::MAP_HUGE_2MB,
         }
     }

+    /// Returns the flags required to pass to [libc::madvise], after allocating anonymous guest memory.
+    /// Note: returning [libc::MADV_NORMAL] might skip the call to `madvise` entirely.
+    pub fn madvise_flags(&self) -> libc::c_int {
+        match self {
+            HugePageConfig::Transparent => libc::MADV_HUGEPAGE,
+            HugePageConfig::None | HugePageConfig::Hugetlbfs2M => libc::MADV_NORMAL,
+        }
+    }
+
     /// Returns `true` iff this [`HugePageConfig`] describes a hugetlbfs-based configuration.
     pub fn is_hugetlbfs(&self) -> bool {
         matches!(self, HugePageConfig::Hugetlbfs2M)
@@ -72,7 +87,7 @@ impl HugePageConfig {
     /// Gets the page size in bytes of this [`HugePageConfig`].
     pub fn page_size(&self) -> usize {
         match self {
-            HugePageConfig::None => 4096,
+            HugePageConfig::None | HugePageConfig::Transparent => 4096,
             HugePageConfig::Hugetlbfs2M => 2 * 1024 * 1024,
         }
     }
@@ -81,7 +96,7 @@ impl HugePageConfig {
 impl From<HugePageConfig> for Option<memfd::HugetlbSize> {
     fn from(value: HugePageConfig) -> Self {
         match value {
-            HugePageConfig::None => None,
+            HugePageConfig::None | HugePageConfig::Transparent => None,
             HugePageConfig::Hugetlbfs2M => Some(memfd::HugetlbSize::Huge2MB),
         }
     }
 }
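
Reviewer note: the size rule introduced in this file can be illustrated in isolation. The sketch below is a simplified stand-in, not the code from the commit: the enum `HugePages` and the method name `is_valid_mem_size` are assumptions made for illustration, and only the divisor values (1 for `None`, 2 for `Transparent` and `Hugetlbfs2M`) are taken from the diff. The check itself is assumed to be a plain remainder test on the size in MiB.

```rust
/// Simplified stand-in for the `HugePageConfig` enum shown in the diff.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HugePages {
    /// Regular 4K pages.
    None,
    /// Transparent huge pages requested via madvise(MADV_HUGEPAGE).
    Transparent,
    /// Explicit 2MB hugetlbfs pages.
    Hugetlbfs2M,
}

impl HugePages {
    /// Hypothetical method name. Returns `true` if a guest memory size given
    /// in MiB is acceptable for this configuration.
    pub fn is_valid_mem_size(self, mem_size_mib: usize) -> bool {
        // Divisors as in the diff: any whole number of MiB works for 4K
        // pages, while both huge page modes require a multiple of 2 MiB.
        let divisor = match self {
            HugePages::None => 1,
            HugePages::Transparent | HugePages::Hugetlbfs2M => 2,
        };
        mem_size_mib % divisor == 0
    }
}
```

This mirrors the new unit test in `src/vmm/src/resources.rs`: a size of 1025 MiB is rejected for `Transparent` and `Hugetlbfs2M` but accepted for `None`.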

src/vmm/src/vstate/memory.rs

Lines changed: 42 additions & 9 deletions
@@ -6,6 +6,7 @@
 // found in the THIRD-PARTY file.

 use std::fs::File;
+use std::io;
 use std::io::SeekFrom;
 use std::ops::Deref;
 use std::sync::{Arc, Mutex};
@@ -22,12 +23,12 @@ pub use vm_memory::{
 };
 use vm_memory::{GuestMemoryError, GuestMemoryRegionBytes, VolatileSlice, WriteVolatile};

-use crate::DirtyBitmap;
 use crate::arch::host_page_size;
 use crate::logger::error;
 use crate::utils::u64_to_usize;
 use crate::vmm_config::machine_config::HugePageConfig;
 use crate::vstate::vm::{KvmVm, VmError};
+use crate::{DirtyBitmap, warn_unrestricted};

 /// Type of GuestRegionMmap.
 pub type GuestRegionMmap = vm_memory::GuestRegionMmap<Option<AtomicBitmap>>;
@@ -528,6 +529,7 @@ pub fn create(
     mmap_flags: libc::c_int,
     file: Option<File>,
     track_dirty_pages: bool,
+    madvise_flags: libc::c_int,
 ) -> Result<Vec<GuestRegionMmap>, MemoryError> {
     let mut offset = 0;
     let file = file.map(Arc::new);
@@ -559,6 +561,18 @@ pub fn create(
                 start,
             )
             .ok_or(MemoryError::VmMemoryError)
+            .inspect(|region| {
+                if madvise_flags != libc::MADV_NORMAL {
+                    // SAFETY: The referenced memory was just mapped.
+                    let ret = unsafe {
+                        libc::madvise(region.as_ptr().cast(), region.size(), madvise_flags)
+                    };
+                    if ret != 0 {
+                        let e = io::Error::last_os_error();
+                        warn_unrestricted!("Madvise call failed for guest memory: {e}");
+                    }
+                }
+            })
         })
         .collect::<Result<Vec<_>, _>>()
 }
@@ -577,6 +591,7 @@ pub fn memfd_backed(
         libc::MAP_SHARED | huge_pages.mmap_flags(),
         Some(memfd_file),
         track_dirty_pages,
+        huge_pages.madvise_flags(),
     )
 }

@@ -591,6 +606,7 @@ pub fn anonymous(
         libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | huge_pages.mmap_flags(),
         None,
         track_dirty_pages,
+        huge_pages.madvise_flags(),
     )
 }

@@ -600,6 +616,7 @@ pub fn snapshot_file(
     file: File,
     regions: impl Iterator<Item = (GuestAddress, usize)>,
     track_dirty_pages: bool,
+    huge_pages: HugePageConfig,
 ) -> Result<Vec<GuestRegionMmap>, MemoryError> {
     let regions: Vec<_> = regions.collect();
     let memory_size = regions
@@ -619,6 +636,7 @@ pub fn snapshot_file(
         libc::MAP_PRIVATE,
         Some(file),
         track_dirty_pages,
+        huge_pages.madvise_flags(),
     )
 }

@@ -951,8 +969,13 @@ mod tests {
         file.write_all(&vec![0x42u8; page_size]).unwrap();

         let regions = vec![(GuestAddress(0), page_size)];
-        let guest_regions =
-            snapshot_file(file, regions.into_iter(), dirty_page_tracking).unwrap();
+        let guest_regions = snapshot_file(
+            file,
+            regions.into_iter(),
+            dirty_page_tracking,
+            HugePageConfig::None,
+        )
+        .unwrap();
         assert_eq!(guest_regions.len(), 1);
         guest_regions.iter().for_each(|region| {
             assert_eq!(region.bitmap().is_some(), dirty_page_tracking);
@@ -973,7 +996,8 @@ mod tests {
             (GuestAddress(0x10000), page_size),
             (GuestAddress(0x20000), page_size),
         ];
-        let guest_regions = snapshot_file(file, regions.into_iter(), false).unwrap();
+        let guest_regions =
+            snapshot_file(file, regions.into_iter(), false, HugePageConfig::None).unwrap();
         assert_eq!(guest_regions.len(), 3);
     }

@@ -985,7 +1009,7 @@ mod tests {
         file.write_all(&vec![0x42u8; page_size]).unwrap();

         let regions = vec![(GuestAddress(0), 2 * page_size)];
-        let result = snapshot_file(file, regions.into_iter(), false);
+        let result = snapshot_file(file, regions.into_iter(), false, HugePageConfig::None);
         assert!(matches!(result.unwrap_err(), MemoryError::OffsetTooLarge));
     }

@@ -1175,8 +1199,15 @@ mod tests {
         let mut memory_file = TempFile::new().unwrap().into_file();
         guest_memory.dump(&mut memory_file).unwrap();

-        let restored_guest_memory =
-            into_region_ext(snapshot_file(memory_file, memory_state.regions(), false).unwrap());
+        let restored_guest_memory = into_region_ext(
+            snapshot_file(
+                memory_file,
+                memory_state.regions(),
+                false,
+                HugePageConfig::None,
+            )
+            .unwrap(),
+        );

         // Check that the region contents are the same.
         let mut restored_region = vec![0u8; page_size * 2];
@@ -1240,8 +1271,9 @@ mod tests {
             .unwrap();

         // We can restore from this because this is the first dirty dump.
-        let restored_guest_memory =
-            into_region_ext(snapshot_file(file, memory_state.regions(), false).unwrap());
+        let restored_guest_memory = into_region_ext(
+            snapshot_file(file, memory_state.regions(), false, HugePageConfig::None).unwrap(),
+        );

         // Check that the region contents are the same.
         let mut restored_region = vec![0u8; region_size];
@@ -1465,6 +1497,7 @@ mod tests {
                 memory_file,
                 std::iter::once((GuestAddress(0), 2 * page_size)),
                 false,
+                HugePageConfig::None,
             )
             .unwrap(),
         );
