docs: add CloudFormation timing analysis and deferred-IAM plan#79
docs: add CloudFormation timing analysis and deferred-IAM plan#79cdoan1 wants to merge 1 commit into
Conversation
…ion plan Add comprehensive documentation analyzing stack creation timing and proposing the deferred-IAM optimization to reduce provisioning time from ~16-24 min to ~5-8 min. - docs/cloudformation-timing.md: Timing analysis of VPC/IAM/OIDC stack creation, wait points, and proposed timing observability features - docs/deferred-iam-implementation.md: Complete implementation plan for deferred-IAM flow including code examples, testing strategy, and 4-week rollout plan Key insight: Creating IAM stack AFTER OIDC provider (instead of before) eliminates the 10-15 min IAM eventual consistency delay caused by trust policy updates. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cdoan1 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughTwo new documentation files are added. ChangesDeferred-IAM and CloudFormation Timing Documentation
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (2 errors, 1 warning)
✅ Passed checks (8 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/deferred-iam-implementation.md (1)
286-333: ⚡ Quick winClarify timing instrumentation scope for all CloudFormation stack operations.
Line 332 states "Similar changes for
UpdateStackandDeleteStack," but the document only shows implementation details forCreateStack(lines 289–329). For consistency and completeness:
- UpdateStack should capture
Duration,StartedAt,CompletedAtthe same way- DeleteStack should also capture timing if the
StackOutputstruct is returned- Verify that all three operations return the extended
StackOutputstruct or if timing is optional for someThis is important because the deferred-IAM flow will call
UpdateStackon the IAM stack (line 259), and timing instrumentation should be consistent across all stack operations.Consider adding explicit examples or a code block for
UpdateStacktiming in the same manner asCreateStack, rather than deferring to a generic "similar changes" comment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deferred-iam-implementation.md` around lines 286 - 333, The document currently shows detailed implementation for the CreateStack method with timing instrumentation but only references "similar changes" for UpdateStack and DeleteStack without concrete examples. Add explicit code blocks for the UpdateStack and DeleteStack methods in the diff that mirror the timing pattern shown in CreateStack, capturing startTime before the operation begins and endTime after the waiter completes, then assigning Duration, StartedAt, and CompletedAt to the returned StackOutput struct. This ensures clarity about how timing instrumentation is consistently applied across all three CloudFormation stack operations and confirms that all three methods return the extended StackOutput struct with timing information, which is critical for the deferred-IAM flow that calls UpdateStack on the IAM stack.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/deferred-iam-implementation.md`:
- Around line 214-217: The createOIDCWithoutIAMUpdate function is currently a
placeholder that calls clusteroidc.CreateOIDC without actually using the
SkipIAMUpdate flag, so the IAM update will still run. Modify the function to set
the SkipIAMUpdate field to true on the CreateOIDCRequest before passing it to
clusteroidc.CreateOIDC. This ensures that when the request reaches CreateOIDC,
the flag will be checked and the IAM update step will be skipped as intended for
the deferred-IAM flow.
- Around line 688-724: The test function TestProvisionInfrastructure_DeferredIAM
is missing critical imports and helper function definitions required for
compilation. Add the necessary imports at the top of the test file including the
testing package, context package, testify/require, testify/assert, and any AWS
SDK packages for cloudformation and config. For the getLocalStackConfig helper
function that is called within TestProvisionInfrastructure_DeferredIAM, either
define it directly in the test file or import it from the appropriate test
utility package (such as test/localstack/setup.go) and ensure it is properly
exported with a capital letter if it needs to be imported from another package.
- Around line 226-269: The documentation for the `SkipIAMUpdate` flag in the
CreateOIDCRequest struct needs clarification on its purpose and behavior.
Currently, the code shows the flag allows callers to skip the IAM trust policy
update, but the documentation should explicitly explain whether this flag is
meant to optimize the deferred-IAM flow by avoiding unnecessary CloudFormation
API calls when the IAM stack doesn't exist, or whether it simply exposes the
existing implicit error-handling behavior as an explicit control. Add a section
that clearly distinguishes between the current behavior (always attempting the
update with internal error handling) and the proposed behavior (explicit
skipping via the flag), including the backward compatibility implications for
callers that depend on each approach.
- Around line 131-139: The documentation references function names that do not
exist in the actual codebase. Replace the function call
`createOIDCWithoutIAMUpdate()` with the correct exported function
`clusteroidc.CreateOIDC()` to match the actual service API. Additionally, the
function `extractOIDCIssuerDomain()` shown in the documentation has no
implementation in the codebase and should either be removed from the pseudocode
or replaced with the actual implementation if one exists, ensuring all code
examples in the documentation accurately reflect the real API signatures and
available functions.
---
Nitpick comments:
In `@docs/deferred-iam-implementation.md`:
- Around line 286-333: The document currently shows detailed implementation for
the CreateStack method with timing instrumentation but only references "similar
changes" for UpdateStack and DeleteStack without concrete examples. Add explicit
code blocks for the UpdateStack and DeleteStack methods in the diff that mirror
the timing pattern shown in CreateStack, capturing startTime before the
operation begins and endTime after the waiter completes, then assigning
Duration, StartedAt, and CompletedAt to the returned StackOutput struct. This
ensures clarity about how timing instrumentation is consistently applied across
all three CloudFormation stack operations and confirms that all three methods
return the extended StackOutput struct with timing information, which is
critical for the deferred-IAM flow that calls UpdateStack on the IAM stack.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift-online/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1c0cd5ee-954c-4579-a9f5-c8d4bfffddd0
📒 Files selected for processing (2)
docs/cloudformation-timing.mddocs/deferred-iam-implementation.md
| vpcResp, err := clustervpc.CreateVPC(ctx, &clustervpc.CreateVPCRequest{ | ||
| ClusterName: req.ClusterName, | ||
| VpcCidr: req.VpcCidr, | ||
| PublicSubnetCidrs: req.PublicSubnetCidrs, | ||
| PrivateSubnetCidrs: req.PrivateSubnetCidrs, | ||
| AvailabilityZones: req.AvailabilityZones, | ||
| SingleNatGateway: req.SingleNatGateway, | ||
| AWSConfig: req.AWSConfig, | ||
| }) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify function signatures
echo "=== Checking CreateVPC signature ==="
rg 'type CreateVPCRequest struct' internal/services/clustervpc/
echo -e "\n=== Checking CreateOIDC signature ==="
rg 'type CreateOIDCRequest struct' internal/services/clusteroidc/
echo -e "\n=== Checking CreateIAM signature ==="
rg 'type CreateIAMRequest struct' internal/services/clusteriam/
echo -e "\n=== Checking return types ==="
rg 'type CreateVPCResponse struct' internal/services/clustervpc/
rg 'type CreateOIDCResponse struct' internal/services/clusteroidc/Repository: openshift-online/rosa-regional-platform-cli
Length of output: 593
🏁 Script executed:
#!/bin/bash
# Get full struct definitions
echo "=== CreateVPCRequest struct ==="
rg -A 15 'type CreateVPCRequest struct' internal/services/clustervpc/service.go
echo -e "\n=== CreateOIDCRequest struct ==="
rg -A 15 'type CreateOIDCRequest struct' internal/services/clusteroidc/service.go
echo -e "\n=== CreateIAMRequest struct ==="
rg -A 15 'type CreateIAMRequest struct' internal/services/clusteriam/service.go
echo -e "\n=== CreateVPC function signature ==="
rg -B 2 -A 10 'func.*CreateVPC.*ctx.*CreateVPCRequest' internal/services/clustervpc/service.go
echo -e "\n=== CreateIAM function signature ==="
rg -B 2 -A 10 'func.*CreateIAM.*ctx.*CreateIAMRequest' internal/services/clusteriam/service.go
echo -e "\n=== Response structs ==="
rg -A 10 'type CreateVPCResponse struct' internal/services/clustervpc/service.go
rg -A 10 'type CreateOIDCResponse struct' internal/services/clusteroidc/service.go
rg -A 10 'type CreateIAMResponse struct' internal/services/clusteriam/service.goRepository: openshift-online/rosa-regional-platform-cli
Length of output: 2869
🏁 Script executed:
#!/bin/bash
# Search for createOIDCWithoutIAMUpdate function
echo "=== Searching for createOIDCWithoutIAMUpdate function ==="
rg 'func.*createOIDCWithoutIAMUpdate' --type go
echo -e "\n=== Checking clusteroidc package for OIDC functions ==="
rg 'func.*Create' internal/services/clusteroidc/service.go | head -20
echo -e "\n=== Looking at docs file for ProvisionInfrastructure function ==="
rg -B 5 -A 30 'func.*ProvisionInfrastructure' docs/deferred-iam-implementation.md
echo -e "\n=== Checking if functions are exported (capital first letter) ==="
rg '^func (CreateVPC|CreateIAM|CreateOIDC)' internal/services/clustervpc/service.go internal/services/clusteriam/service.go internal/services/clusteroidc/service.goRepository: openshift-online/rosa-regional-platform-cli
Length of output: 3397
🏁 Script executed:
#!/bin/bash
echo "=== Reading lines 131-192 from docs file ==="
sed -n '131,192p' docs/deferred-iam-implementation.mdRepository: openshift-online/rosa-regional-platform-cli
Length of output: 2422
🏁 Script executed:
#!/bin/bash
echo "=== Searching entire codebase for createOIDCWithoutIAMUpdate ==="
rg 'createOIDCWithoutIAMUpdate' --type go
echo -e "\n=== Searching entire codebase for CreateOIDCWithoutIAMUpdate (capitalized) ==="
rg 'CreateOIDCWithoutIAMUpdate' --type go
echo -e "\n=== Searching docs directory for all function definitions ==="
rg 'func .*\(' docs/deferred-iam-implementation.md | grep -i oidc
echo -e "\n=== Check if docs file is intended as pseudocode or actual implementation ==="
head -50 docs/deferred-iam-implementation.md | grep -i 'pseudocode\|example\|implementation\|mock'Repository: openshift-online/rosa-regional-platform-cli
Length of output: 752
Fix function references to match actual service API.
The review raised valid concerns about function signature verification. Verification found:
✓ clustervpc.CreateVPC() is exported with correct parameter struct
✓ clusteriam.CreateIAM() is exported with correct parameter struct
✗ Line 158: createOIDCWithoutIAMUpdate() does not exist in the codebase — it should be clusteroidc.CreateOIDC() instead
All request struct fields match the parameters shown. However, the docs pseudocode references createOIDCWithoutIAMUpdate (lowercase, unexported wrapper) which is not implemented. The actual service only provides clusteroidc.CreateOIDC().
Additionally, extractOIDCIssuerDomain() on line 171 is shown in the docs but has no implementation in the codebase.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/deferred-iam-implementation.md` around lines 131 - 139, The
documentation references function names that do not exist in the actual
codebase. Replace the function call `createOIDCWithoutIAMUpdate()` with the
correct exported function `clusteroidc.CreateOIDC()` to match the actual service
API. Additionally, the function `extractOIDCIssuerDomain()` shown in the
documentation has no implementation in the codebase and should either be removed
from the pseudocode or replaced with the actual implementation if one exists,
ensuring all code examples in the documentation accurately reflect the real API
signatures and available functions.
| func createOIDCWithoutIAMUpdate(ctx context.Context, req *clusteroidc.CreateOIDCRequest) (*clusteroidc.CreateOIDCResponse, error) { | ||
| // TODO: Need to refactor clusteroidc.CreateOIDC to make IAM update optional | ||
| // For now, this is a placeholder showing the intent | ||
| return clusteroidc.CreateOIDC(ctx, req) |
There was a problem hiding this comment.
Placeholder implementation of createOIDCWithoutIAMUpdate won't actually skip IAM update.
Lines 214–217 define a wrapper function that calls clusteroidc.CreateOIDC(ctx, req) without using the SkipIAMUpdate flag introduced in the refactored CreateOIDCRequest (lines 231–237). This placeholder will not achieve the deferred-IAM flow because the IAM update will still run.
The correct implementation should:
- Set
SkipIAMUpdate: trueon the request, or - Call a variant of
CreateOIDCthat explicitly skips the update, or - Directly call the lower-level CloudFormation stack creation without the update step.
This needs to be resolved before the code is implemented to ensure the deferred-IAM flow actually eliminates the 10–15 minute eventual consistency delay.
💡 Suggested approach
If SkipIAMUpdate is passed through CreateOIDCRequest (as shown in lines 231–237), the wrapper should explicitly set it:
func createOIDCWithoutIAMUpdate(ctx context.Context, req *clusteroidc.CreateOIDCRequest) (*clusteroidc.CreateOIDCResponse, error) {
- // TODO: Need to refactor clusteroidc.CreateOIDC to make IAM update optional
- // For now, this is a placeholder showing the intent
- return clusteroidc.CreateOIDC(ctx, req)
+ req.SkipIAMUpdate = true
+ return clusteroidc.CreateOIDC(ctx, req)
}Alternatively, if clusteroidc.CreateOIDC will be fully refactored to check the flag internally (lines 256–262), ensure the flag is set by the caller in the ProvisionInfrastructure function at line 158.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/deferred-iam-implementation.md` around lines 214 - 217, The
createOIDCWithoutIAMUpdate function is currently a placeholder that calls
clusteroidc.CreateOIDC without actually using the SkipIAMUpdate flag, so the IAM
update will still run. Modify the function to set the SkipIAMUpdate field to
true on the CreateOIDCRequest before passing it to clusteroidc.CreateOIDC. This
ensures that when the request reaches CreateOIDC, the flag will be checked and
the IAM update step will be skipped as intended for the deferred-IAM flow.
| ### 2. Refactor `internal/services/clusteroidc/service.go` | ||
|
|
||
| Make IAM trust policy update optional: | ||
|
|
||
| ```go | ||
| type CreateOIDCRequest struct { | ||
| ClusterName string | ||
| OIDCIssuerURL string | ||
| OIDCThumbprint string | ||
| AWSConfig aws.Config | ||
| SkipIAMUpdate bool // NEW: skip IAM trust policy update | ||
| } | ||
|
|
||
| func CreateOIDC(ctx context.Context, req *CreateOIDCRequest) (*CreateOIDCResponse, error) { | ||
| // ... existing thumbprint and template logic ... | ||
|
|
||
| // Create (or update) the OIDC provider stack | ||
| output, err := cfnClient.CreateStack(ctx, cfParams) | ||
| if err != nil { | ||
| var alreadyExists *cloudformation.StackAlreadyExistsError | ||
| if errors.As(err, &alreadyExists) { | ||
| output, err = updateOIDCStack(ctx, cfnClient, req, oidcStackName, templateBody, thumbprint) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| } else { | ||
| return nil, fmt.Errorf("failed to create OIDC stack: %w", err) | ||
| } | ||
| } | ||
|
|
||
| // NEW: Only update IAM stack if not skipped | ||
| if !req.SkipIAMUpdate { | ||
| iamStackName := fmt.Sprintf("rosa-%s-iam", req.ClusterName) | ||
| if err := updateIAMTrustPolicies(ctx, cfnClient, req.ClusterName, iamStackName, oidcIssuerDomain); err != nil { | ||
| return nil, fmt.Errorf("OIDC provider created but failed to update IAM trust policies: %w", err) | ||
| } | ||
| } | ||
|
|
||
| return &CreateOIDCResponse{ | ||
| StackID: output.StackID, | ||
| Outputs: output.Outputs, | ||
| }, nil | ||
| } | ||
| ``` |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how clusteroidc.CreateOIDC currently handles missing IAM stack
rg -A 10 'StackNotFoundError' internal/services/clusteroidc/service.goRepository: openshift-online/rosa-regional-platform-cli
Length of output: 553
🏁 Script executed:
#!/bin/bash
# Get the full CreateOIDC function to understand the complete flow
rg -B 5 -A 40 'func CreateOIDC' internal/services/clusteroidc/service.goRepository: openshift-online/rosa-regional-platform-cli
Length of output: 1664
🏁 Script executed:
#!/bin/bash
# Get the IAM stack update logic in CreateOIDC
rg -B 2 -A 15 'updateIAMTrustPolicies\|iamStackName :=' internal/services/clusteroidc/service.go | head -60Repository: openshift-online/rosa-regional-platform-cli
Length of output: 69
🏁 Script executed:
#!/bin/bash
# Search for IAM-related logic after OIDC stack creation
rg -A 50 'func CreateOIDC' internal/services/clusteroidc/service.go | grep -A 20 'iamStackName\|IAM'Repository: openshift-online/rosa-regional-platform-cli
Length of output: 806
🏁 Script executed:
#!/bin/bash
# Get the complete CreateOIDC function - read the file directly
wc -l internal/services/clusteroidc/service.goRepository: openshift-online/rosa-regional-platform-cli
Length of output: 132
🏁 Script executed:
#!/bin/bash
# Read the entire service.go file to understand the complete flow
cat -n internal/services/clusteroidc/service.goRepository: openshift-online/rosa-regional-platform-cli
Length of output: 7087
Clarify whether SkipIAMUpdate flag adds explicit control or merely formalizes error handling.
The current code (lines 94–98 and 130–163) always calls updateIAMTrustPolicies, which internally catches StackNotFoundError and silently skips the update. The proposed SkipIAMUpdate flag would allow callers to avoid the UpdateStack API call entirely when the IAM stack doesn't exist.
This is a valid enhancement for the deferred-IAM flow—it prevents unnecessary CloudFormation API calls when you know the IAM stack hasn't been created yet. However, the distinction matters for backward compatibility:
- Current: Always attempts IAM update; skipping happens implicitly via error handling
- Proposed: Skipping would be explicit and avoid the API call
The documentation should clarify: Is the flag meant to optimize the deferred-IAM flow (avoid unnecessary API calls), or just expose the existing implicit behavior as an explicit control? The two approaches have different implications for callers and error handling.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/deferred-iam-implementation.md` around lines 226 - 269, The
documentation for the `SkipIAMUpdate` flag in the CreateOIDCRequest struct needs
clarification on its purpose and behavior. Currently, the code shows the flag
allows callers to skip the IAM trust policy update, but the documentation should
explicitly explain whether this flag is meant to optimize the deferred-IAM flow
by avoiding unnecessary CloudFormation API calls when the IAM stack doesn't
exist, or whether it simply exposes the existing implicit error-handling
behavior as an explicit control. Add a section that clearly distinguishes
between the current behavior (always attempting the update with internal error
handling) and the proposed behavior (explicit skipping via the flag), including
the backward compatibility implications for callers that depend on each
approach.
| func TestProvisionInfrastructure_DeferredIAM(t *testing.T) { | ||
| // Setup LocalStack | ||
| ctx := context.Background() | ||
| cfg := getLocalStackConfig(t) | ||
|
|
||
| req := &provision.ProvisionRequest{ | ||
| ClusterName: "test-cluster", | ||
| Region: "us-east-1", | ||
| OIDCIssuerURL: "https://example.com/test-cluster", | ||
| // ... other params | ||
| AWSConfig: cfg, | ||
| } | ||
|
|
||
| resp, err := provision.ProvisionInfrastructure(ctx, req) | ||
| require.NoError(t, err) | ||
|
|
||
| // Verify stacks exist | ||
| cfnClient := cloudformation.NewClient(cfg) | ||
| vpcStack, err := cfnClient.DescribeStack(ctx, "rosa-test-cluster-vpc") | ||
| require.NoError(t, err) | ||
|
|
||
| oidcStack, err := cfnClient.DescribeStack(ctx, "rosa-test-cluster-oidc") | ||
| require.NoError(t, err) | ||
|
|
||
| iamStack, err := cfnClient.DescribeStack(ctx, "rosa-test-cluster-iam") | ||
| require.NoError(t, err) | ||
|
|
||
| // Verify IAM stack was created AFTER OIDC (check creation timestamps) | ||
| assert.True(t, iamStack.CreationTime.After(*oidcStack.CreationTime)) | ||
|
|
||
| // Verify timing data | ||
| assert.Greater(t, resp.VPC.Duration, time.Duration(0)) | ||
| assert.Greater(t, resp.OIDC.Duration, time.Duration(0)) | ||
| assert.Greater(t, resp.IAM.Duration, time.Duration(0)) | ||
| assert.Greater(t, resp.TotalDuration, time.Duration(0)) | ||
| } | ||
| ``` |
There was a problem hiding this comment.
Complete test example with missing imports and function definitions.
The test function TestProvisionInfrastructure_DeferredIAM (lines 688–724) uses:
getLocalStackConfig(t)— Not imported or definederrors.As()— Requiresimport "errors", not shownrequire.NoError()— Requires testify/require importassert.True(),assert.Greater()— Requires testify/assert import
For the documentation example to be complete and correct, add the necessary imports and ensure getLocalStackConfig is defined or reference an existing helper in the test suite.
🔧 Proposed fix: Add imports and helper reference
func TestProvisionInfrastructure_DeferredIAM(t *testing.T) {
+ // Load LocalStack configuration from test suite helper
ctx := context.Background()
cfg := getLocalStackConfig(t) // From test/localstack/setup.go
req := &provision.ProvisionRequest{Ensure test/localstack/setup.go (or relevant test helper) exports GetLocalStackConfig function, or document where this should come from.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/deferred-iam-implementation.md` around lines 688 - 724, The test
function TestProvisionInfrastructure_DeferredIAM is missing critical imports and
helper function definitions required for compilation. Add the necessary imports
at the top of the test file including the testing package, context package,
testify/require, testify/assert, and any AWS SDK packages for cloudformation and
config. For the getLocalStackConfig helper function that is called within
TestProvisionInfrastructure_DeferredIAM, either define it directly in the test
file or import it from the appropriate test utility package (such as
test/localstack/setup.go) and ensure it is properly exported with a capital
letter if it needs to be imported from another package.
Add comprehensive documentation analyzing stack creation timing and proposing the deferred-IAM optimization to reduce provisioning time from ~16-24 min to ~5-8 min.
Key insight: Creating IAM stack AFTER OIDC provider (instead of before) eliminates the 10-15 min IAM eventual consistency delay caused by trust policy updates.
Description
Type of Change
Testing
make test)Checklist
Summary by CodeRabbit