Skip to content

Retrieve default disk type from Azure API #716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

s4heid
Copy link
Contributor

@s4heid s4heid commented Mar 28, 2025

Summary

Fixes #715.

Checklist:

Please check each of the boxes below for which you have completed the corresponding task:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • All unit tests pass locally (after my changes)
  • Rubocop reports zero errors (after my changes)

Please include below the summary portions of the output from the following 2 scripts:

pushd src/bosh_azure_cpi
  bundle install
  bundle exec rake spec:unit
  bundle exec rake rubocop
popd

NOTE: Please see how to setup dev environment and run unit tests in docs/development.md.

Unit Test output:

$ bundle exec rspec spec/unit
......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
...........................................................................................................................................................................................................................................................

Finished in 4.95 seconds (files took 2 seconds to load)
1105 examples, 0 failures

Coverage report generated for RSpec to /workspaces/bosh-azure-cpi-release/src/bosh_azure_cpi/coverage.
Line Coverage: 49.45% (20239 / 40927)

Rubocop output:

(TODO: Paste rubocop script summary output here)

Changelog

  • Use the Azure API to decided whether to use Premium or Standard storage for the default disk type
  • Use the Azure API to determine suitable instance types for the calculation of the vm cloud properties
  • NOTICE: This change might impact your deployment if your deployment manifest does not specifically set a disk type and your are using an instance type that is not matched by this regular expression.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the way the Azure CPI retrieves default disk types and maps VM instance types by leveraging the Azure API. Key changes include:

  • Updating test suite calls to pass a location parameter instead of an available VM sizes object.
  • Refactoring method signatures and logic in vm_manager, instance type mapper, and disk manager to use the Azure API.
  • Enhancing caching and error logging in fetching VM SKUs and determining premium storage support.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated no comments.

Show a summary per file
File Description
spec/unit/vm_manager/create/stemcell_spec.rb Added double for disk_manager2 and stubbing default storage account type.
spec/unit/vm_manager/create/get_root_disk_type_spec.rb Updated method signature to include a location parameter in tests.
spec/unit/vm_manager/create/availability_set_spec.rb Updated stubbing of supports_premium_storage? calls with location.
spec/unit/instance_type_mapper_spec.rb Refactored test cases for filtering, ordering, caching and handling restrictions for VM SKUs.
spec/unit/disk_manager2_spec.rb Added tests for premium storage support and default storage account type.
lib/cloud/azure/vms/vm_manager_disk.rb Updated _get_root_disk_type to accept location and source storage type accordingly.
lib/cloud/azure/vms/vm_manager.rb Updated calls to premium storage checks and adjusted logo for diagnostics/stemcell info.
lib/cloud/azure/vms/instance_type_mapper.rb Refactored map method to fetch VM SKUs by location, apply caching and sort VM sizes.
lib/cloud/azure/utils/helpers.rb Removed legacy premium support helper methods.
lib/cloud/azure/restapi/azure_client.rb Updated maximum fault domain retrieval to use list_resource_skus API.
lib/cloud/azure/disk/disk_manager2.rb Added methods for determining default storage account type and premium support using caching.
lib/cloud/azure/cloud.rb Updated instance type mapping and disk creation to pass location appropriately.
Comments suppressed due to low confidence (1)

src/bosh_azure_cpi/lib/cloud/azure/vms/instance_type_mapper.rb:84

  • [nitpick] The regex used in get_series_score may not fully cover naming patterns with additional characters (e.g., 'Standard_D2s_v5'). Consider refining the regex (e.g., to /standard_#{series}[\D]*(\d+)/i) to ensure it correctly matches varied VM series formats.
def get_series_score(vm_name)

@github-project-automation github-project-automation bot moved this from Pending Review | Discussion to Pending Merge | Prioritized in Foundational Infrastructure Working Group Apr 7, 2025
@jpalermo
Copy link
Contributor

This seems like the best solution to the problem.

The Resource SKU endpoint is a pretty large amount of JSON to be grabbing. I saw some in memory caching done, but the CPI is re-executed for every command it runs, so that only helps with multiple calls within a single CPI invocation.

We could cache the json on disk, but just blindly adding complicated caching seems like a bad idea without some idea of the cost of this improvement. Do you have numbers for how long a create_vm takes before and after this change? Doesn't need to be comprehensive or detailed, I'm just wondering ballpark how much extra time VM/disk creation is going to take.

@s4heid
Copy link
Contributor Author

s4heid commented Apr 17, 2025

@jpalermo, the file is not small, and you have a very valid point with this. I don't have the numbers yet, but I will follow up on this and provide some data. I like the idea of file-based caching; I also considered this, but couldn't see it as a pattern in the CPI.
Where would be a good location to cache this file?

@aramprice aramprice requested review from a team and fmoehler and removed request for a team April 17, 2025 16:15
@jpalermo
Copy link
Contributor

/var/vcap/sys/run/JOB_NAME is normally the spot for data like that. You'll want to be careful though since when running a bosh create-env, to bootstrap a director, that path likely won't exist. Probably best to just not cache in that case.

@s4heid
Copy link
Contributor Author

s4heid commented Apr 23, 2025

With the previous implementation, the call to list_resource_skus took several seconds for each invocation, so I believe the file-based caching approach is justified. Now, with file caching, an eviction policy of 24 hours is applied, and the file is directly cached in list_resource_skus. The write operation is protected with a lock for concurrent access.

During testing, I noticed the response body is always logged when using the get_resource_by_id method, which makes the CPI logs very large (and unreadable with text wrapping enabled) when listing resource skus. Therefore, I added truncation for very large response bodies (>10000 characters). @jpalermo Do you have any concerns with this or a better suggestion?

@beyhan beyhan requested a review from ystros April 24, 2025 14:43
@beyhan
Copy link
Member

beyhan commented Apr 24, 2025

@ystros will be great if you could review this pr. Main concern is the last comment.

@beyhan beyhan moved this from Pending Merge | Prioritized to Pending Review | Discussion in Foundational Infrastructure Working Group Apr 24, 2025
@aramprice
Copy link
Member

Roughly noting some concerns that came up in the WG meeting:

  • what happens when many requests happen in parallel (e.g. 32x create-vm)
    • many requests w/o cache
    • many attempts to write cache file
  • how will an operator know whether the cache is being hit?
  • what is the operator process for clearing the cache?
    • could the whole response be logged at least once (cache-miss)?
    • could subsequent cache-hit requests print out the cache file location?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Pending Review | Discussion
Development

Successfully merging this pull request may close these issues.

create_disk is unable to determine the correct default storage type for Azure's latest VM types
5 participants