v.1.0.0 release: feat: add optional analytics environment (SageMaker Studio + EMR + Co…#32
Merged
Merged
Conversation
f3af3c3 to
3f06a22
Compare
fe21528 to
df7db66
Compare
…gnito) Adds a feature-toggled ML and analytics environment to GCO. Off by default; enabled via `gco analytics enable` which flips `analytics_environment.enabled=true` in `cdk.json`. Highlights: - New `GCOAnalyticsStack` deploys SageMaker Studio (IAM auth, VpcOnly), EMR Serverless (emr-7.13.0), Cognito user pool (SRP auth), per-user Studio EFS, a Studio-only S3 bucket, and a presigned-URL Lambda. - API Gateway is extended with `/studio/*` routes via an `AnalyticsApiConfig` mutator parameter — no second API Gateway. `/api/v1/*` and `/inference/*` continue to use IAM authorization. - New always-on `Cluster_Shared_Bucket` + `Cluster_Shared_KMS_Key` in `GCOGlobalStack`. Every regional cluster's job-pod role gets unconditional RW on it via the `gco-cluster-shared-bucket` ConfigMap (applied to `gco-jobs`, `gco-system`, `gco-inference`). SageMaker role also gets RW when analytics is on. - New `gco analytics` CLI group: `enable`, `disable`, `status`, `users add/list/remove`, `studio login`, `doctor`, `iterate`. Stdlib- only SRP implementation in `cli/analytics_user_mgmt.py`. - New lifecycle script `scripts/test_analytics_lifecycle.py` with pure state-detection + next-step + remediation helpers plus an effectful `main` driver for the deploy → test → destroy → verify-clean cycle. - Five property-based invariants validated via Hypothesis: regional job-pod role only references the cluster-shared bucket (never the Studio-only bucket), the SageMaker-role grant on the cluster-shared bucket is biconditional with the analytics toggle, synth roundtrip recovers the original toggle pair, the Studio domain uses stock images (no custom `CustomImages` / ECR repo / Dockerfiles), and the cluster-shared-bucket ConfigMap is present on every regional cluster regardless of the toggle. - Three example manifests wire the shared-bucket ConfigMap via `envFrom`: cluster-shared-bucket-upload, analytics-s3-upload, analytics-database-export (with an optional-Aurora guard). - Docs: new `docs/ANALYTICS.md`, new `docs/CLUSTER_SHARED_BUCKET.md`, `docs/CLI.md` analytics section, README feature bullet. - `diagrams/generate.py` extended with an analytics-stack target and all diagrams regenerated; the analytics diagram is embedded in the analytics guide. - cdk-nag matrix extended with `analytics-enabled` and `analytics-enabled-hyperpod` configs — 8 configs pass with zero unsuppressed findings. - CI coverage gate raised from 85% to 90%; observed coverage is ~92.23%. Key design decisions: - `Cluster_Shared_Bucket` is owned by `GCOGlobalStack`, not by the analytics stack, so cluster jobs can use it regardless of the analytics toggle. Analytics-side changes stay gated. - All removable resources (EFS, Cognito, KMS, S3) use `RemovalPolicy.DESTROY` so the deploy/destroy iteration loop stays fast. Operators opt into retain via per-resource cdk.json keys. - No custom SageMaker image, no ECR repo, no Dockerfiles — notebooks use the stock SageMaker Distribution image and users `pip install` any extras into their EFS-mounted home folder. Testing: - pytest tests/ --cov=gco --cov=cli -n auto → 4030 passed; coverage 92.23%. Pytest-xdist parallelism occasionally races on CDK asset staging — all racing tests pass when re-run serially. - python scripts/test_cdk_synthesis.py --verbose → 26/26 configs. - pytest tests/test_nag_compliance.py → 8/8 configs green.
df7db66 to
b4bb7b1
Compare
Matches the == pin style used by every other direct dependency in pyproject.toml and the urllib3 pin already used in each Lambda requirements.txt. requirements-lock.txt already resolves to 2.6.3, so no lock change is needed.
… admin auth
- Remove scripts/test_analytics_lifecycle.{py,sh} and their tests
- Remove the 'gco analytics iterate' CLI command (thin wrapper)
- Replace hand-rolled SRP crypto (~200 lines) with boto3
admin_initiate_auth (ADMIN_USER_PASSWORD_AUTH flow)
- Remove pycognito dependency (not needed with admin auth)
- Remove CodeQL py/weak-sensitive-data-hashing suppression
- Enable admin_user_password auth flow on Cognito client
- Fix em-dash chars in IAM role/SG descriptions (IAM rejects U+2014)
- Add EFS mount target dependency on SageMaker domain (fixes race)
- Delete test_analytics_default_image.py (overly paranoid)
- Update docs/README.md with new ANALYTICS.md and CLUSTER_SHARED_BUCKET.md
44f9f57 to
4b58564
Compare
- API Gateway: add resource policy statement allowing Cognito-authorized requests on /studio/* (fixes 403 for non-IAM principals) - Lambda IAM: add elasticfilesystem:TagResource to presigned-URL Lambda role (required when creating access points with tags) - Analytics stack: add admin_user_password auth flow to Cognito client - Aurora export job: use AWS_REGION env var (matches working pgvector example) and add AWS_ROLE_ARN from ConfigMap
- Add SageMaker Studio UI actions (DescribeDomain, ListSpaces, ListApps, CreateApp, CreatePresignedDomainUrl, etc.) to the execution role so Studio renders without permission errors - Add EMR Serverless actions (ListApplications, StartJobRun, AccessLivyEndpoints, etc.) so Studio's Data panel can discover and connect to the EMR Serverless Spark application - Add EFS resource policy with DescribeMountTargets to unblock user profile provisioning (fixes the 'PermissionError: DescribeMountTargets' failure during CreateUserProfile) - Remove last remaining task reference in section header - Add nag suppressions for the new wildcard ARN patterns
- Add lambda/analytics-cleanup/ with proper error handling for stack deletion (deletes user profiles, EFS access points, SageMaker SGs) - Switch analytics VPC from PRIVATE_ISOLATED to PRIVATE_WITH_EGRESS (adds NAT gateway so notebooks can pip install and git clone) - Add 4 new analytics config variants to test matrix (efs-retain, cognito-retain, custom-domain-prefix, all-retain) - Add nag suppressions for cleanup Lambda IAM policy - Fix cross-region aggregator test race condition (URL-based routing) - Update tests for new VPC posture (IGW + NAT present) - Add SageMaker Studio UI + EMR Serverless + KMS CreateGrant + AddTags permissions to execution role - Update docs/ANALYTICS.md and lambda/analytics-cleanup/README.md
dc67785 to
637d3a1
Compare
SageMaker VpcOnly mode creates a default security group that only allows NFS traffic (ports 988, 1018-1023, 2049). This blocks all internet access from notebooks, preventing pip install, git clone, etc. Add a security group with allow_all_outbound=True to the domain's DefaultUserSettings.SecurityGroups so notebook traffic can reach the internet through the NAT gateway.
- Add 5 Studio screenshots: landing screen, JupyterLab app, JupyterLab landing page, git clone in terminal, EMR Serverless data panel - Integrate into ANALYTICS.md sections (a), (d), and (e) inline with the relevant instructions - Update images/README.md with a new SageMaker Studio Screenshots table
Add a callout at the top of the guide explaining that us-east-1 and us-east-2 are example regions and users should substitute their own deployment_regions values. Add inline reminders in sections (e) and (f) where copy-paste is most likely.
During stack deletion, the presigned-URL Lambda could recreate user profiles (via in-flight login requests) after the cleanup Lambda had already deleted them. Adding a dependency ensures CloudFormation deletes the presigned-URL Lambda first, then fires the cleanup, preventing the race condition.
b020501 to
51e1e32
Compare
The SageMaker-managed EFS wasn't being found because: 1. describe_file_systems wasn't paginated (missed if >10 file systems) 2. Only checked CreationToken matching domain ID Now also checks the ManagedByAmazonSageMakerResource tag which SageMaker sets on its internally-created EFS file systems.
- Add --exclusively flag when destroying a specific stack so CDK doesn't cascade to dependents (e.g. destroying gco-analytics no longer also destroys gco-api-gateway) - Add cleanup Lambda role to EFS resource policy so it can call DescribeAccessPoints/DescribeFileSystems during stack deletion (fixes AccessDeniedException that left orphaned SageMaker-managed EFS)
SageMaker's inbound/outbound NFS security groups cross-reference each other, creating a circular dependency that prevents deletion. Now the cleanup Lambda revokes all ingress/egress rules first (breaking the cross-references), then deletes the security groups in a second pass.
DescribeAccessPoints, DescribeFileSystems, and DescribeMountTargets are control-plane APIs that cannot be scoped to a specific EFS file system in a resource policy. EFS rejects the policy with 'action cannot be applied to a specific resource'. The IAM role policy (Resource:*) on both the execution role and cleanup Lambda role handles authorization for these actions instead.
EFS enforces the resource policy intersection model for DescribeMountTargets when any resource policy exists on the file system. SageMaker auto-creates a resource policy (ClientRootAccess + ClientWrite with AccessedViaMountTarget condition) which blocks DescribeMountTargets. Adding DescribeMountTargets explicitly (without the condition) to our CDK-managed resource policy ensures SageMaker can validate the EFS during user profile provisioning.
Tighten the EFS resource policy from AnyPrincipal to only the SageMaker execution role and sagemaker.amazonaws.com service principal. Move the policy addition to _create_execution_role_and_grants (after the role exists) since _create_studio_efs runs before the role is created.
…ck imports CloudFormation blocks deletion of stacks with consumed exports. The analytics stack exports the Cognito pool ARN which gco-api-gateway imports for /studio/* routes. Before destroying analytics, we now temporarily disable the toggle and redeploy gco-api-gateway to remove the routes and drop the import, then re-enable and destroy.
- Add ec2:RevokeSecurityGroupIngress/Egress to cleanup Lambda IAM (needed to break SG cross-references before deletion) - Use AccountRootPrincipal in EFS resource policy (scoped to same account only) with all EFS actions the cleanup Lambda needs (DescribeAccessPoints, DeleteAccessPoint, DescribeMountTargets, DeleteMountTarget, DeleteFileSystem) - This ensures the cleanup Lambda can delete access points and the SageMaker-managed EFS during stack destruction
5c93c60 to
9a1f439
Compare
- TestStackManagerDestroyExclusively: verifies --exclusively flag is passed, API gateway dependency removal is called for analytics stacks, and non-analytics stacks skip both - TestStackManagerDeployAnalyticsAutoApiGateway: verifies deploying gco-analytics triggers a follow-up gco-api-gateway deploy, and non-analytics stacks don't trigger it
30cbb03 to
4a9bb5c
Compare
CloudFormation was deleting the cleanup Lambda's inline policy (DefaultPolicy) before the custom resource Lambda finished executing, causing AccessDeniedException on EFS and EC2 calls. Switch to a customer-managed IAM policy and add an explicit dependency from the custom resource to the policy. This ensures CloudFormation keeps the policy alive until after the cleanup Lambda completes. During deletion, dependency ordering is reversed: the custom resource fires first (Lambda runs with full permissions), then the policy is deleted.
…ilter The EFS resource policy intersection model blocks DescribeAccessPoints (can't be added to resource policies) and DescribeFileSystems (without a filter). Fix: 1. Remove _delete_access_points call — CloudFormation deletes access points on the CDK-managed EFS automatically when the EFS is removed. 2. Use DescribeFileSystems(CreationToken=domain_id) to find the SageMaker-managed EFS directly — this targets a specific FS that has no resource policy, avoiding the intersection model. 3. Add DescribeFileSystems to the CDK-managed EFS resource policy (may or may not be accepted by EFS — if rejected at deploy, we'll remove it).
…stems The EFS resource policy (auto-added by SageMaker + our CDK additions) triggers the intersection authorization model, blocking DescribeFileSystems even with IAM Resource:* permissions. Fix: the cleanup Lambda now calls DeleteFileSystemPolicy on the CDK-managed EFS as its first EFS operation. This removes the resource policy, disabling the intersection model, so subsequent DescribeFileSystems(CreationToken=domain_id) calls succeed. Added efs:DeleteFileSystemPolicy to the cleanup Lambda's managed policy.
Replace _delete_access_points mock with _delete_efs_resource_policy mock in TestHandler tests since the main handler flow no longer calls _delete_access_points directly (CloudFormation handles CDK-managed EFS access points).
The destroy flow's _remove_api_gateway_analytics_dependency calls self.deploy() which conflicts with the parent CDK process's cdk.out lock. Using a temporary output directory avoids the 'Other CLIs are currently reading from cdk.out' error.
The EFS resource policy intersection model blocks DeleteFileSystemPolicy itself unless it's explicitly allowed in the policy. Adding it allows the cleanup Lambda to remove the policy, which then unblocks DescribeFileSystems for finding the SageMaker-managed EFS.
Instead of calling DescribeFileSystems (blocked by EFS resource policy intersection model), the cleanup Lambda now calls sagemaker:DescribeDomain to get HomeEfsFileSystemId directly. The domain still exists when the custom resource fires, so this call succeeds reliably. Added sagemaker:DescribeDomain to the cleanup Lambda's managed policy. Updated tests to mock the new DescribeDomain-based approach.
The efs: shorthand works for normal IAM evaluation but fails when the EFS resource policy intersection model is active. Using the full elasticfilesystem: prefix resolves the AccessDeniedException on DescribeMountTargets, DeleteFileSystemPolicy, and other EFS actions. Also added _get_sagemaker_home_efs_id helper to delete the resource policy on the SageMaker-managed EFS before calling DescribeMountTargets.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…gnito)
Adds a feature-toggled ML and analytics environment to GCO. Off by default; enabled via
gco analytics enablewhich flipsanalytics_environment.enabled=trueincdk.json.Highlights:
GCOAnalyticsStackdeploys SageMaker Studio (IAM auth, VpcOnly), EMR Serverless (emr-7.13.0), Cognito user pool (SRP auth), per-user Studio EFS, a Studio-only S3 bucket, and a presigned-URL Lambda./studio/*routes via anAnalyticsApiConfigmutator parameter — no second API Gateway./api/v1/*and/inference/*continue to use IAM authorization.Cluster_Shared_Bucket+Cluster_Shared_KMS_KeyinGCOGlobalStack. Every regional cluster's job-pod role gets unconditional RW on it via thegco-cluster-shared-bucketConfigMap (applied togco-jobs,gco-system,gco-inference). SageMaker role also gets RW when analytics is on.gco analyticsCLI group:enable,disable,status,users add/list/remove,studio login,doctor,iterate. Stdlib- only SRP implementation incli/analytics_user_mgmt.py.scripts/test_analytics_lifecycle.pywith pure state-detection + next-step + remediation helpers plus an effectfulmaindriver for the deploy → test → destroy → verify-clean cycle.CustomImages/ ECR repo / Dockerfiles), and the cluster-shared-bucket ConfigMap is present on every regional cluster regardless of the toggle.envFrom: cluster-shared-bucket-upload, analytics-s3-upload, analytics-database-export (with an optional-Aurora guard).docs/ANALYTICS.md, newdocs/CLUSTER_SHARED_BUCKET.md,docs/CLI.mdanalytics section, README feature bullet.diagrams/generate.pyextended with an analytics-stack target and all diagrams regenerated; the analytics diagram is embedded in the analytics guide.analytics-enabledandanalytics-enabled-hyperpodconfigs — 8 configs pass with zero unsuppressed findings.Key design decisions:
Cluster_Shared_Bucketis owned byGCOGlobalStack, not by the analytics stack, so cluster jobs can use it regardless of the analytics toggle. Analytics-side changes stay gated.RemovalPolicy.DESTROYso the deploy/destroy iteration loop stays fast. Operators opt into retain via per-resource cdk.json keys.pip installany extras into their EFS-mounted home folder.Testing:
Summary
Type of change
feat:New feature (non-breaking)fix:Bug fix (non-breaking)docs:Documentation onlyrefactor:Code refactor (no behavior change)perf:Performance improvementtest:Test-only changeci:CI / tooling changechore:Maintenance (dep bumps, etc.)breaking:Breaking change (major version bump)Testing
pytest tests/passes locallycdk synthsucceeds (if CDK code changed)Checklist
docs/, inline docstrings) as neededrequirements-lock.txtregenerated ifpyproject.tomlchangeddocs/ARCHITECTURE.mdRelated issues