Skip to content

v.1.0.0 release: feat: add optional analytics environment (SageMaker Studio + EMR + Co…#32

Merged
Jmevorach merged 50 commits into
mainfrom
spec/analytics-environment
May 6, 2026
Merged

v.1.0.0 release: feat: add optional analytics environment (SageMaker Studio + EMR + Co…#32
Jmevorach merged 50 commits into
mainfrom
spec/analytics-environment

Conversation

@Jmevorach
Copy link
Copy Markdown
Contributor

…gnito)

Adds a feature-toggled ML and analytics environment to GCO. Off by default; enabled via gco analytics enable which flips analytics_environment.enabled=true in cdk.json.

Highlights:

  • New GCOAnalyticsStack deploys SageMaker Studio (IAM auth, VpcOnly), EMR Serverless (emr-7.13.0), Cognito user pool (SRP auth), per-user Studio EFS, a Studio-only S3 bucket, and a presigned-URL Lambda.
  • API Gateway is extended with /studio/* routes via an AnalyticsApiConfig mutator parameter — no second API Gateway. /api/v1/* and /inference/* continue to use IAM authorization.
  • New always-on Cluster_Shared_Bucket + Cluster_Shared_KMS_Key in GCOGlobalStack. Every regional cluster's job-pod role gets unconditional RW on it via the gco-cluster-shared-bucket ConfigMap (applied to gco-jobs, gco-system, gco-inference). SageMaker role also gets RW when analytics is on.
  • New gco analytics CLI group: enable, disable, status, users add/list/remove, studio login, doctor, iterate. Stdlib- only SRP implementation in cli/analytics_user_mgmt.py.
  • New lifecycle script scripts/test_analytics_lifecycle.py with pure state-detection + next-step + remediation helpers plus an effectful main driver for the deploy → test → destroy → verify-clean cycle.
  • Five property-based invariants validated via Hypothesis: regional job-pod role only references the cluster-shared bucket (never the Studio-only bucket), the SageMaker-role grant on the cluster-shared bucket is biconditional with the analytics toggle, synth roundtrip recovers the original toggle pair, the Studio domain uses stock images (no custom CustomImages / ECR repo / Dockerfiles), and the cluster-shared-bucket ConfigMap is present on every regional cluster regardless of the toggle.
  • Three example manifests wire the shared-bucket ConfigMap via envFrom: cluster-shared-bucket-upload, analytics-s3-upload, analytics-database-export (with an optional-Aurora guard).
  • Docs: new docs/ANALYTICS.md, new docs/CLUSTER_SHARED_BUCKET.md, docs/CLI.md analytics section, README feature bullet.
  • diagrams/generate.py extended with an analytics-stack target and all diagrams regenerated; the analytics diagram is embedded in the analytics guide.
  • cdk-nag matrix extended with analytics-enabled and analytics-enabled-hyperpod configs — 8 configs pass with zero unsuppressed findings.
  • CI coverage gate raised from 85% to 90%; observed coverage is ~92.23%.

Key design decisions:

  • Cluster_Shared_Bucket is owned by GCOGlobalStack, not by the analytics stack, so cluster jobs can use it regardless of the analytics toggle. Analytics-side changes stay gated.
  • All removable resources (EFS, Cognito, KMS, S3) use RemovalPolicy.DESTROY so the deploy/destroy iteration loop stays fast. Operators opt into retain via per-resource cdk.json keys.
  • No custom SageMaker image, no ECR repo, no Dockerfiles — notebooks use the stock SageMaker Distribution image and users pip install any extras into their EFS-mounted home folder.

Testing:

  • pytest tests/ --cov=gco --cov=cli -n auto → 4030 passed; coverage 92.23%. Pytest-xdist parallelism occasionally races on CDK asset staging — all racing tests pass when re-run serially.
  • python scripts/test_cdk_synthesis.py --verbose → 26/26 configs.
  • pytest tests/test_nag_compliance.py → 8/8 configs green.

Summary

Type of change

  • feat: New feature (non-breaking)
  • fix: Bug fix (non-breaking)
  • docs: Documentation only
  • refactor: Code refactor (no behavior change)
  • perf: Performance improvement
  • test: Test-only change
  • ci: CI / tooling change
  • chore: Maintenance (dep bumps, etc.)
  • breaking: Breaking change (major version bump)

Testing

  • pytest tests/ passes locally
  • cdk synth succeeds (if CDK code changed)
  • New tests added for new behavior
  • Ran the change against a real AWS account (describe below)

Checklist

  • Documentation updated (README, docs/, inline docstrings) as needed
  • No secrets, credentials, or customer data committed
  • requirements-lock.txt regenerated if pyproject.toml changed
  • Changes align with the architecture described in docs/ARCHITECTURE.md

Related issues

@Jmevorach Jmevorach changed the title WIP: feat: add optional analytics environment (SageMaker Studio + EMR + Co… WIP: v.1.0.0 release: feat: add optional analytics environment (SageMaker Studio + EMR + Co… May 4, 2026
Comment thread cli/analytics_user_mgmt.py Dismissed
Comment thread tests/test_analytics_stack.py Fixed
Comment thread tests/test_analytics_stack.py Fixed
Comment thread tests/test_analytics_stack.py Fixed
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch 2 times, most recently from f3af3c3 to 3f06a22 Compare May 4, 2026 16:11
Comment thread cli/analytics_user_mgmt.py Fixed
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch 2 times, most recently from fe21528 to df7db66 Compare May 4, 2026 16:45
Comment thread scripts/test_analytics_lifecycle.py Fixed
…gnito)

Adds a feature-toggled ML and analytics environment to GCO. Off by
default; enabled via `gco analytics enable` which flips
`analytics_environment.enabled=true` in `cdk.json`.

Highlights:

- New `GCOAnalyticsStack` deploys SageMaker Studio (IAM auth, VpcOnly),
  EMR Serverless (emr-7.13.0), Cognito user pool (SRP auth), per-user
  Studio EFS, a Studio-only S3 bucket, and a presigned-URL Lambda.
- API Gateway is extended with `/studio/*` routes via an
  `AnalyticsApiConfig` mutator parameter — no second API Gateway.
  `/api/v1/*` and `/inference/*` continue to use IAM authorization.
- New always-on `Cluster_Shared_Bucket` + `Cluster_Shared_KMS_Key` in
  `GCOGlobalStack`. Every regional cluster's job-pod role gets
  unconditional RW on it via the `gco-cluster-shared-bucket` ConfigMap
  (applied to `gco-jobs`, `gco-system`, `gco-inference`). SageMaker
  role also gets RW when analytics is on.
- New `gco analytics` CLI group: `enable`, `disable`, `status`,
  `users add/list/remove`, `studio login`, `doctor`, `iterate`. Stdlib-
  only SRP implementation in `cli/analytics_user_mgmt.py`.
- New lifecycle script `scripts/test_analytics_lifecycle.py` with pure
  state-detection + next-step + remediation helpers plus an effectful
  `main` driver for the deploy → test → destroy → verify-clean cycle.
- Five property-based invariants validated via Hypothesis: regional
  job-pod role only references the cluster-shared bucket (never the
  Studio-only bucket), the SageMaker-role grant on the cluster-shared
  bucket is biconditional with the analytics toggle, synth roundtrip
  recovers the original toggle pair, the Studio domain uses stock
  images (no custom `CustomImages` / ECR repo / Dockerfiles), and the
  cluster-shared-bucket ConfigMap is present on every regional cluster
  regardless of the toggle.
- Three example manifests wire the shared-bucket ConfigMap via
  `envFrom`: cluster-shared-bucket-upload, analytics-s3-upload,
  analytics-database-export (with an optional-Aurora guard).
- Docs: new `docs/ANALYTICS.md`, new `docs/CLUSTER_SHARED_BUCKET.md`,
  `docs/CLI.md` analytics section, README feature bullet.
- `diagrams/generate.py` extended with an analytics-stack target and
  all diagrams regenerated; the analytics diagram is embedded in the
  analytics guide.
- cdk-nag matrix extended with `analytics-enabled` and
  `analytics-enabled-hyperpod` configs — 8 configs pass with zero
  unsuppressed findings.
- CI coverage gate raised from 85% to 90%; observed coverage is
  ~92.23%.

Key design decisions:

- `Cluster_Shared_Bucket` is owned by `GCOGlobalStack`, not by the
  analytics stack, so cluster jobs can use it regardless of the
  analytics toggle. Analytics-side changes stay gated.
- All removable resources (EFS, Cognito, KMS, S3) use
  `RemovalPolicy.DESTROY` so the deploy/destroy iteration loop stays
  fast. Operators opt into retain via per-resource cdk.json keys.
- No custom SageMaker image, no ECR repo, no Dockerfiles — notebooks
  use the stock SageMaker Distribution image and users `pip install`
  any extras into their EFS-mounted home folder.

Testing:

- pytest tests/ --cov=gco --cov=cli -n auto → 4030 passed; coverage
  92.23%. Pytest-xdist parallelism occasionally races on CDK asset
  staging — all racing tests pass when re-run serially.
- python scripts/test_cdk_synthesis.py --verbose → 26/26 configs.
- pytest tests/test_nag_compliance.py → 8/8 configs green.
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch from df7db66 to b4bb7b1 Compare May 4, 2026 17:02
Jmevorach added 9 commits May 4, 2026 15:04
Matches the == pin style used by every other direct dependency in
pyproject.toml and the urllib3 pin already used in each Lambda
requirements.txt. requirements-lock.txt already resolves to 2.6.3,
so no lock change is needed.
Comment thread cli/analytics_user_mgmt.py Fixed
Comment thread cli/analytics_user_mgmt.py Fixed
… admin auth

- Remove scripts/test_analytics_lifecycle.{py,sh} and their tests
- Remove the 'gco analytics iterate' CLI command (thin wrapper)
- Replace hand-rolled SRP crypto (~200 lines) with boto3
  admin_initiate_auth (ADMIN_USER_PASSWORD_AUTH flow)
- Remove pycognito dependency (not needed with admin auth)
- Remove CodeQL py/weak-sensitive-data-hashing suppression
- Enable admin_user_password auth flow on Cognito client
- Fix em-dash chars in IAM role/SG descriptions (IAM rejects U+2014)
- Add EFS mount target dependency on SageMaker domain (fixes race)
- Delete test_analytics_default_image.py (overly paranoid)
- Update docs/README.md with new ANALYTICS.md and CLUSTER_SHARED_BUCKET.md
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch from 44f9f57 to 4b58564 Compare May 5, 2026 16:01
Jmevorach added 7 commits May 5, 2026 12:46
- API Gateway: add resource policy statement allowing Cognito-authorized
  requests on /studio/* (fixes 403 for non-IAM principals)
- Lambda IAM: add elasticfilesystem:TagResource to presigned-URL Lambda
  role (required when creating access points with tags)
- Analytics stack: add admin_user_password auth flow to Cognito client
- Aurora export job: use AWS_REGION env var (matches working pgvector
  example) and add AWS_ROLE_ARN from ConfigMap
- Add SageMaker Studio UI actions (DescribeDomain, ListSpaces, ListApps,
  CreateApp, CreatePresignedDomainUrl, etc.) to the execution role so
  Studio renders without permission errors
- Add EMR Serverless actions (ListApplications, StartJobRun,
  AccessLivyEndpoints, etc.) so Studio's Data panel can discover and
  connect to the EMR Serverless Spark application
- Add EFS resource policy with DescribeMountTargets to unblock user
  profile provisioning (fixes the 'PermissionError: DescribeMountTargets'
  failure during CreateUserProfile)
- Remove last remaining task reference in section header
- Add nag suppressions for the new wildcard ARN patterns
- Add lambda/analytics-cleanup/ with proper error handling for stack
  deletion (deletes user profiles, EFS access points, SageMaker SGs)
- Switch analytics VPC from PRIVATE_ISOLATED to PRIVATE_WITH_EGRESS
  (adds NAT gateway so notebooks can pip install and git clone)
- Add 4 new analytics config variants to test matrix (efs-retain,
  cognito-retain, custom-domain-prefix, all-retain)
- Add nag suppressions for cleanup Lambda IAM policy
- Fix cross-region aggregator test race condition (URL-based routing)
- Update tests for new VPC posture (IGW + NAT present)
- Add SageMaker Studio UI + EMR Serverless + KMS CreateGrant + AddTags
  permissions to execution role
- Update docs/ANALYTICS.md and lambda/analytics-cleanup/README.md
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch from dc67785 to 637d3a1 Compare May 5, 2026 18:39
Jmevorach added 4 commits May 5, 2026 18:34
SageMaker VpcOnly mode creates a default security group that only allows
NFS traffic (ports 988, 1018-1023, 2049). This blocks all internet
access from notebooks, preventing pip install, git clone, etc.

Add a security group with allow_all_outbound=True to the domain's
DefaultUserSettings.SecurityGroups so notebook traffic can reach the
internet through the NAT gateway.
- Add 5 Studio screenshots: landing screen, JupyterLab app, JupyterLab
  landing page, git clone in terminal, EMR Serverless data panel
- Integrate into ANALYTICS.md sections (a), (d), and (e) inline with
  the relevant instructions
- Update images/README.md with a new SageMaker Studio Screenshots table
Add a callout at the top of the guide explaining that us-east-1 and
us-east-2 are example regions and users should substitute their own
deployment_regions values. Add inline reminders in sections (e) and (f)
where copy-paste is most likely.
During stack deletion, the presigned-URL Lambda could recreate user
profiles (via in-flight login requests) after the cleanup Lambda had
already deleted them. Adding a dependency ensures CloudFormation
deletes the presigned-URL Lambda first, then fires the cleanup,
preventing the race condition.
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch from b020501 to 51e1e32 Compare May 6, 2026 11:12
Jmevorach added 8 commits May 6, 2026 07:15
The SageMaker-managed EFS wasn't being found because:
1. describe_file_systems wasn't paginated (missed if >10 file systems)
2. Only checked CreationToken matching domain ID

Now also checks the ManagedByAmazonSageMakerResource tag which
SageMaker sets on its internally-created EFS file systems.
- Add --exclusively flag when destroying a specific stack so CDK doesn't
  cascade to dependents (e.g. destroying gco-analytics no longer also
  destroys gco-api-gateway)
- Add cleanup Lambda role to EFS resource policy so it can call
  DescribeAccessPoints/DescribeFileSystems during stack deletion
  (fixes AccessDeniedException that left orphaned SageMaker-managed EFS)
SageMaker's inbound/outbound NFS security groups cross-reference each
other, creating a circular dependency that prevents deletion. Now the
cleanup Lambda revokes all ingress/egress rules first (breaking the
cross-references), then deletes the security groups in a second pass.
DescribeAccessPoints, DescribeFileSystems, and DescribeMountTargets are
control-plane APIs that cannot be scoped to a specific EFS file system
in a resource policy. EFS rejects the policy with 'action cannot be
applied to a specific resource'. The IAM role policy (Resource:*) on
both the execution role and cleanup Lambda role handles authorization
for these actions instead.
EFS enforces the resource policy intersection model for
DescribeMountTargets when any resource policy exists on the file system.
SageMaker auto-creates a resource policy (ClientRootAccess + ClientWrite
with AccessedViaMountTarget condition) which blocks DescribeMountTargets.

Adding DescribeMountTargets explicitly (without the condition) to our
CDK-managed resource policy ensures SageMaker can validate the EFS
during user profile provisioning.
Tighten the EFS resource policy from AnyPrincipal to only the
SageMaker execution role and sagemaker.amazonaws.com service principal.
Move the policy addition to _create_execution_role_and_grants (after
the role exists) since _create_studio_efs runs before the role is
created.
…ck imports

CloudFormation blocks deletion of stacks with consumed exports. The
analytics stack exports the Cognito pool ARN which gco-api-gateway
imports for /studio/* routes. Before destroying analytics, we now
temporarily disable the toggle and redeploy gco-api-gateway to remove
the routes and drop the import, then re-enable and destroy.
- Add ec2:RevokeSecurityGroupIngress/Egress to cleanup Lambda IAM
  (needed to break SG cross-references before deletion)
- Use AccountRootPrincipal in EFS resource policy (scoped to same
  account only) with all EFS actions the cleanup Lambda needs
  (DescribeAccessPoints, DeleteAccessPoint, DescribeMountTargets,
  DeleteMountTarget, DeleteFileSystem)
- This ensures the cleanup Lambda can delete access points and the
  SageMaker-managed EFS during stack destruction
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch from 5c93c60 to 9a1f439 Compare May 6, 2026 12:24
- TestStackManagerDestroyExclusively: verifies --exclusively flag is
  passed, API gateway dependency removal is called for analytics stacks,
  and non-analytics stacks skip both
- TestStackManagerDeployAnalyticsAutoApiGateway: verifies deploying
  gco-analytics triggers a follow-up gco-api-gateway deploy, and
  non-analytics stacks don't trigger it
@Jmevorach Jmevorach force-pushed the spec/analytics-environment branch from 30cbb03 to 4a9bb5c Compare May 6, 2026 12:38
Jmevorach added 2 commits May 6, 2026 08:41
CloudFormation was deleting the cleanup Lambda's inline policy
(DefaultPolicy) before the custom resource Lambda finished executing,
causing AccessDeniedException on EFS and EC2 calls.

Switch to a customer-managed IAM policy and add an explicit dependency
from the custom resource to the policy. This ensures CloudFormation
keeps the policy alive until after the cleanup Lambda completes.
During deletion, dependency ordering is reversed: the custom resource
fires first (Lambda runs with full permissions), then the policy is
deleted.
@Jmevorach Jmevorach changed the title WIP: v.1.0.0 release: feat: add optional analytics environment (SageMaker Studio + EMR + Co… v.1.0.0 release: feat: add optional analytics environment (SageMaker Studio + EMR + Co… May 6, 2026
Jmevorach added 9 commits May 6, 2026 09:02
…ilter

The EFS resource policy intersection model blocks DescribeAccessPoints
(can't be added to resource policies) and DescribeFileSystems (without
a filter). Fix:

1. Remove _delete_access_points call — CloudFormation deletes access
   points on the CDK-managed EFS automatically when the EFS is removed.
2. Use DescribeFileSystems(CreationToken=domain_id) to find the
   SageMaker-managed EFS directly — this targets a specific FS that
   has no resource policy, avoiding the intersection model.
3. Add DescribeFileSystems to the CDK-managed EFS resource policy
   (may or may not be accepted by EFS — if rejected at deploy, we'll
   remove it).
…stems

The EFS resource policy (auto-added by SageMaker + our CDK additions)
triggers the intersection authorization model, blocking
DescribeFileSystems even with IAM Resource:* permissions.

Fix: the cleanup Lambda now calls DeleteFileSystemPolicy on the
CDK-managed EFS as its first EFS operation. This removes the resource
policy, disabling the intersection model, so subsequent
DescribeFileSystems(CreationToken=domain_id) calls succeed.

Added efs:DeleteFileSystemPolicy to the cleanup Lambda's managed policy.
Replace _delete_access_points mock with _delete_efs_resource_policy
mock in TestHandler tests since the main handler flow no longer calls
_delete_access_points directly (CloudFormation handles CDK-managed EFS
access points).
The destroy flow's _remove_api_gateway_analytics_dependency calls
self.deploy() which conflicts with the parent CDK process's cdk.out
lock. Using a temporary output directory avoids the 'Other CLIs are
currently reading from cdk.out' error.
The EFS resource policy intersection model blocks DeleteFileSystemPolicy
itself unless it's explicitly allowed in the policy. Adding it allows
the cleanup Lambda to remove the policy, which then unblocks
DescribeFileSystems for finding the SageMaker-managed EFS.
Instead of calling DescribeFileSystems (blocked by EFS resource policy
intersection model), the cleanup Lambda now calls
sagemaker:DescribeDomain to get HomeEfsFileSystemId directly. The
domain still exists when the custom resource fires, so this call
succeeds reliably.

Added sagemaker:DescribeDomain to the cleanup Lambda's managed policy.
Updated tests to mock the new DescribeDomain-based approach.
The efs: shorthand works for normal IAM evaluation but fails when the
EFS resource policy intersection model is active. Using the full
elasticfilesystem: prefix resolves the AccessDeniedException on
DescribeMountTargets, DeleteFileSystemPolicy, and other EFS actions.

Also added _get_sagemaker_home_efs_id helper to delete the resource
policy on the SageMaker-managed EFS before calling DescribeMountTargets.
@Jmevorach Jmevorach merged commit c54bf35 into main May 6, 2026
48 checks passed
@Jmevorach Jmevorach deleted the spec/analytics-environment branch May 6, 2026 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants