Fix NVMe raw_instance_storage device enumeration for all instance families #196
… validation of disk mappings
Update test to expect /dev/nvme2n1 (instance storage) instead of /dev/nvme0n1 (root device). This aligns with the corrected NVMe device enumeration logic:
- nvme0n1: root EBS
- nvme1n1: ephemeral EBS (when configured)
- nvme2n1+: instance storage devices
Device paths for NVMe raw ephemeral disks now start at nvme0n1 instead of nvme2n1, as the agent performs runtime discovery and the hints are informational only.
After discussing this in the community meeting, some people noted that hardcoding the NVMe numbering is dangerous, since the EBS volumes are not guaranteed to appear under the same device names every time.
I did not entirely understand what the issue is, so maybe you can just ignore my comment. I am just wondering whether this might already have been fixed by cloudfoundry/bosh-linux-stemcell-builder#462? At least from the issue description, it looks like the exact same issue that I investigated some time ago.
You might be right here. We have not had a bosh director release since you reverted the PR cloudfoundry/bosh-agent#391. I guess that is still the root cause of this issue? @neddp @fmoehler
Hi @fmoehler, Thank you for pointing out PR cloudfoundry/bosh-linux-stemcell-builder#462! That PR fixes EBS volume identification (volumes with AWS metadata). This PR addresses instance storage discovery (see #155 for reference), which cannot use the same approach because instance storage volumes have no AWS metadata. The two PRs are complementary.
@coderabbitai review
✅ Actions performed
Review triggered.
📝 Walkthrough
Refactored NVMe instance storage device mapping to use virtualization-type-aware device name generation instead of hardcoded instance family checks. Introduced new
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bosh_aws_cpi/lib/cloud/aws/block_device_manager.rb`:
- Around line 177-188: In raw_ephemeral_device_name, replace the magic numbers
99 and 97 used to compute disk letters with character literals (e.g., 'c'.ord
and 'a'.ord) so the intent is explicit; update the branches that build
"/dev/sd#{(99 + index).chr}" and "/dev/xvdb#{(97 + index).chr}" to compute the
base ordinal from 'c' and 'a' respectively using `@virtualization_type` and index,
leaving the nvme and error branches unchanged.
- Around line 6-9: Update the NVME_INSTANCE_FAMILIES constant in
block_device_manager.rb to include the missing Nitro-based families by adding
the following identifiers to the array: c7g c8a c8gb c8gn c8i c8id c8i-flex m7g
m8a m8azn m8gb m8gn m8i m8id m8i-flex r7g r8a r8gb r8gn r8i r8id r8i-flex i4g
i7i i7ie i8g i8ge g7e p6-b200 p6-b300 trn2 trn2u; alternatively, implement a
runtime NVMe detection fallback in the code paths that use
NVME_INSTANCE_FAMILIES (e.g., methods referencing NVME_INSTANCE_FAMILIES in
block_device_manager.rb) so unknown families on Nitro are detected by checking
/dev/nvme* presence instead of relying solely on the static list.
📒 Files selected for processing (2)
- src/bosh_aws_cpi/lib/cloud/aws/block_device_manager.rb
- src/bosh_aws_cpi/spec/unit/block_device_manager_spec.rb
  # Instance families that use NVMe device naming (/dev/nvme*).
  # This includes Nitro-based instances and some Xen-based instances with NVMe storage (e.g., i3 family).
  # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#ec2-nitro-instances
- NVME_INSTANCE_FAMILIES = %w[a1 c5 c5a c5ad c5d c5n c6a c6g c6gd c6gn c6i c6id c6in c7i c7a d3 d3en g4dn g4ad g5 g6 g6e i3en i4i inf1 m5 m5a m5ad m5d m5dn m5n m5zn m6a m6g m6gd m6i m6id m6idn m6in m7i m7a m7i-flex p3dn p4d p5 r5 r5a r5ad r5b r5d r5dn r5n r6a r6g r6gd r6i r6in r6id r6idn r7i r7a r7iz t3 t3a t4g z1d x2iezn].freeze
+ NVME_INSTANCE_FAMILIES = %w[a1 c5 c5a c5ad c5d c5n c6a c6g c6gd c6gn c6i c6id c6in c7i c7a d3 d3en g4dn g4ad g5 g6 g6e i3 i3en i4i inf1 m5 m5a m5ad m5d m5dn m5n m5zn m6a m6g m6gd m6i m6id m6idn m6in m7i m7a m7i-flex p3dn p4d p5 r5 r5a r5ad r5b r5d r5dn r5n r6a r6g r6gd r6i r6in r6id r6idn r7i r7a r7iz t3 t3a t4g z1d x2iezn].freeze
🧩 Analysis chain
🌐 Web query:
AWS EC2 Nitro instance types 2025
💡 Result:
AWS EC2 Nitro instance types encompass all current-generation instances built on the AWS Nitro System, which powers the majority of modern EC2 offerings as of 2026. Xen-based instances are legacy and not Nitro-based. The official AWS documentation lists Nitro-based instances across categories. Nitro versions range from v2 to v6, with newer 2025 announcements on v6 (e.g., M8gn, M8gb, C8a, R8i, I7i, I8ge). Key lists from docs: Virtualized Nitro instances (Nitro v6 highlighted as latest): - General Purpose: M8a, M8azn, M8gb, M8gn, M8i, M8id, M8i-flex; earlier like M5*, M6*, M7*, T3/T3a/T4g - Compute Optimized: C8a, C8gb, C8gn, C8i, C8id, C8i-flex; earlier C5*, C6*, C7* - Memory Optimized: R8a, R8gb, R8gn, R8i, R8id, R8i-flex, X8aedz, X8i; earlier R5*, R6*, R7*, U*, X* - Storage Optimized: I8ge; earlier I3en, I4*, I7* - Accelerated: G7e, P6-B200, P6-B300; many others like G4*, G5*, G6*, Inf*, P4*, P5*, Trn* - HPC: Hpc8a; earlier Hpc6*, Hpc7* Bare metal Nitro instances mirror virtualized but dedicated hardware, e.g., M8i.metal, C8i.metal, etc. 2025 Nitro highlights: M8gn/M8gb (Dec 2025, Graviton4, up to 600Gbps net), C8a (Dec 2025, AMD EPYC), R8i/R8i-flex (Sep 2025, Intel Xeon6), I7i (Apr 2025, storage opt), I8ge (Aug 2025). Full current Nitro-based from UserGuide: M5/M5a/M5ad/M5d/M5dn/M5n/M5zn, M6a/M6g/M6gd/M6i/M6id/M6idn/M6in, M7a/M7g/M7gd/M7i/M7i-flex/M8*, T3/T3a/T4g (general); similar for other categories up to *8 series; storage D3/D3en/I3en/I4g/I4i/I7i/I7ie/I8g/I8ge; accel DL1/DL2q/G4ad/G4dn/G5/G5g/G6/G6e/G6f/Gr6/Gr6f/G7e/Inf1/Inf2/P4d/P4de/P5/P5e/P5en/P6-B200/P6-B300/P6e-GB200/Trn1/Trn1n/Trn2/Trn2u/VT1; HPC Hpc6a/Hpc6id/Hpc7a/Hpc7g/Hpc8a. All post-2017 launches are Nitro-based. For complete specs/sizes/regions, see AWS docs.
Citations:
- 1: https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-nitro-instances.html
- 2: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
- 3: https://docs.aws.amazon.com/ec2/latest/instancetypes/instance-types.html
- 4: https://aws.amazon.com/ec2/instance-types/i7i/
- 5: https://aws.amazon.com/ec2/instance-types/r8i
- 6: https://aws.amazon.com/about-aws/whats-new/2025/12/generally-available-amazon-ec2-m8gn-m8gb-instances/
- 7: https://aws.amazon.com/about-aws/whats-new/2025/12/compute-optimized-amazon-ec2-c8a-instances/
- 8: https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-ec2-i7i-high-performance-storage-optimized-instances/
- 9: https://docs.aws.amazon.com/ec2/latest/instancetypes/doc-history.html
- 10: https://aws.amazon.com/about-aws/whats-new/2025/11/memory-optimized-amazon-ec2-r8a-instances/
Update NVME_INSTANCE_FAMILIES to include current Nitro instance families.
The list is missing multiple instance families released in 2024-2025. According to AWS documentation, missing Nitro-based families include:
- Compute: c7g, c8a, c8gb, c8gn, c8i, c8id, c8i-flex
- General Purpose: m7g, m8a, m8azn, m8gb, m8gn, m8i, m8id, m8i-flex
- Memory: r7g, r8a, r8gb, r8gn, r8i, r8id, r8i-flex
- Storage: i4g, i7i, i7ie, i8g, i8ge
- Accelerated: g7e, p6-b200, p6-b300
- Training: trn2, trn2u
Without these families, the CPI will generate incorrect device paths on new instance types. Consider either adding these families to the list, or implementing a safer fallback mechanism for unknown instance families on Nitro (e.g., detect NVMe device naming at runtime).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/bosh_aws_cpi/lib/cloud/aws/block_device_manager.rb` around lines 6 - 9,
Update the NVME_INSTANCE_FAMILIES constant in block_device_manager.rb to include
the missing Nitro-based families by adding the following identifiers to the
array: c7g c8a c8gb c8gn c8i c8id c8i-flex m7g m8a m8azn m8gb m8gn m8i m8id
m8i-flex r7g r8a r8gb r8gn r8i r8id r8i-flex i4g i7i i7ie i8g i8ge g7e p6-b200
p6-b300 trn2 trn2u; alternatively, implement a runtime NVMe detection fallback
in the code paths that use NVME_INSTANCE_FAMILIES (e.g., methods referencing
NVME_INSTANCE_FAMILIES in block_device_manager.rb) so unknown families on Nitro
are detected by checking /dev/nvme* presence instead of relying solely on the
static list.
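The fallback suggested above could look roughly like the following. This is only a sketch under stated assumptions: `nvme_naming?`, its keyword arguments, and the abbreviated family list are hypothetical names for illustration and are not part of the CPI. The `/dev/nvme*` check is only meaningful when it runs on the instance itself, which is why the detection result can also be passed in explicitly.

```ruby
# Hypothetical helper, not taken from block_device_manager.rb.
# known_families is abbreviated here; the real constant is much longer.
def nvme_naming?(instance_type, known_families: %w[c5 i3 i3en m5 r5 t3], nvme_present: nil)
  # "m7i-flex.large" -> "m7i-flex"
  family = instance_type.to_s.split('.').first
  return true if known_families.include?(family)

  # Unknown family: fall back to a runtime check. When the caller has not
  # supplied a result, look for NVMe device nodes on the local machine.
  nvme_present = !Dir.glob('/dev/nvme*').empty? if nvme_present.nil?
  nvme_present
end
```

With this shape, a family missing from the static list is still treated as NVMe when NVMe device nodes are actually present, so a stale list degrades more gracefully.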
  def raw_ephemeral_device_name(index, requires_nvme)
    if requires_nvme
      # Simple sequential hints - agent will discover actual devices via EBS symlink exclusion
      "/dev/nvme#{index}n1"
    elsif @virtualization_type == 'paravirtual'
      "/dev/sd#{(99 + index).chr}" # 99 is 'c'.ord - starts at sdc, sdd, sde...
    elsif @virtualization_type == 'hvm'
      "/dev/xvdb#{(97 + index).chr}" # 97 is 'a'.ord - starts at xvdba, xvdbb...
    else
      raise Bosh::Clouds::CloudError, "unknown virtualization type #{@virtualization_type}"
    end
  end
🧹 Nitpick | 🔵 Trivial
Consider using character literals instead of magic numbers for clarity.
While the comments explain the values, using 'c'.ord and 'a'.ord directly would be more self-documenting and reduce the risk of comment drift.
♻️ Proposed refactor
  def raw_ephemeral_device_name(index, requires_nvme)
    if requires_nvme
      # Simple sequential hints - agent will discover actual devices via EBS symlink exclusion
      "/dev/nvme#{index}n1"
    elsif @virtualization_type == 'paravirtual'
-     "/dev/sd#{(99 + index).chr}" # 99 is 'c'.ord - starts at sdc, sdd, sde...
+     "/dev/sd#{('c'.ord + index).chr}" # starts at sdc, sdd, sde...
    elsif @virtualization_type == 'hvm'
-     "/dev/xvdb#{(97 + index).chr}" # 97 is 'a'.ord - starts at xvdba, xvdbb...
+     "/dev/xvdb#{('a'.ord + index).chr}" # starts at xvdba, xvdbb...
    else
      raise Bosh::Clouds::CloudError, "unknown virtualization type #{@virtualization_type}"
    end
  end
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/bosh_aws_cpi/lib/cloud/aws/block_device_manager.rb` around lines 177 -
188, In raw_ephemeral_device_name, replace the magic numbers 99 and 97 used to
compute disk letters with character literals (e.g., 'c'.ord and 'a'.ord) so the
intent is explicit; update the branches that build "/dev/sd#{(99 + index).chr}"
and "/dev/xvdb#{(97 + index).chr}" to compute the base ordinal from 'c' and 'a'
respectively using `@virtualization_type` and index, leaving the nvme and error
branches unchanged.
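For reference, the device-name mapping discussed in this comment can be tried out in isolation. The following is a minimal standalone sketch, not the CPI method itself: it takes the virtualization type as a third argument instead of reading `@virtualization_type`, and it raises `ArgumentError` rather than `Bosh::Clouds::CloudError` so that it runs without the CPI loaded. Both changes are assumptions made for illustration.

```ruby
# Standalone variant of raw_ephemeral_device_name for experimentation.
# The extra virtualization_type parameter and ArgumentError are illustrative
# substitutions, not the real method signature.
def raw_ephemeral_device_name(index, requires_nvme, virtualization_type)
  if requires_nvme
    # Sequential hint only; the agent is expected to discover the real device.
    "/dev/nvme#{index}n1"
  elsif virtualization_type == 'paravirtual'
    "/dev/sd#{('c'.ord + index).chr}"   # sdc, sdd, sde...
  elsif virtualization_type == 'hvm'
    "/dev/xvdb#{('a'.ord + index).chr}" # xvdba, xvdbb...
  else
    raise ArgumentError, "unknown virtualization type #{virtualization_type}"
  end
end

puts raw_ephemeral_device_name(0, true, 'hvm')          # /dev/nvme0n1
puts raw_ephemeral_device_name(1, false, 'paravirtual') # /dev/sdd
puts raw_ephemeral_device_name(0, false, 'hvm')         # /dev/xvdba
```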
The raw_instance_storage feature was causing VM boot failures and timeouts on NVMe-based instances (i3, i3en, i4i, c6id, m6id, r6id, etc.) when attempting to use AWS instance storage as raw ephemeral disks.
Root Cause
NVMe device enumeration order is non-deterministic on AWS Nitro instances. The kernel discovers NVMe devices based on PCIe enumeration order, which varies between boots and instance types. This means:
- /dev/nvme0n1 might be the root EBS volume OR instance storage
- /dev/nvme1n1 might be instance storage OR the root EBS volume
The previous implementation made a critical incorrect assumption:
- /dev/nvme0n1 and /dev/nvme1n1 were always instance storage on i3/i3en instances
Additionally, the CPI only handled i3/i3en instance families correctly, causing issues with newer NVMe instance types (i4i, c6id, m6id, r6id, etc.).
Solution
Implemented agent-side runtime discovery using AWS-maintained EBS volume symlinks:
How It Works
CPI side (simplified):
- /dev/nvme0n1, /dev/nvme1n1, etc.
Agent side (new discovery logic):
- /dev/nvme*n1
- /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol*
Potentially fixes #155
This must be merged together with the Agent changes - cloudfoundry/bosh-agent#396
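The agent-side idea described above (list the NVMe namespaces, then exclude any that an EBS by-id symlink points to) can be sketched as follows. This is an illustration only, written in Ruby for consistency with the CPI code in this PR; the actual agent change lives in cloudfoundry/bosh-agent#396 and may differ. The function name `instance_storage_devices` and its keyword arguments are hypothetical, and the globs are parameters so the logic can be exercised against a temporary directory.

```ruby
# Hypothetical sketch of "discover instance storage by excluding EBS symlink targets".
def instance_storage_devices(dev_glob: '/dev/nvme*n1',
                             ebs_symlink_glob: '/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol*')
  # Resolve every EBS symlink to the device node it points at,
  # skipping dangling links (File.exist? follows symlinks).
  ebs_devices = Dir.glob(ebs_symlink_glob)
                   .select { |link| File.exist?(link) }
                   .map { |link| File.realpath(link) }

  # Whatever NVMe namespace is not an EBS volume is treated as instance storage.
  Dir.glob(dev_glob)
     .map { |dev| File.realpath(dev) }
     .reject { |dev| ebs_devices.include?(dev) }
     .sort_by { |dev| File.basename(dev)[/\Anvme(\d+)n1\z/, 1].to_i }
end
```

Sorting by the numeric controller index rather than lexically keeps nvme10n1 after nvme2n1, so the result order stays stable on instances with many devices.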