51 changes: 1 addition & 50 deletions CLAUDE.md
@@ -254,13 +254,6 @@ deploy/regional/
kms.tf - KMS key for ECS Exec encryption
oidc.tf - OIDC identity provider for Keycloak
lambda-create-investigation.tf - Lambda function for investigation creation
examples/ - Manual lifecycle management scripts
create_investigation.sh - Create access point + task definition
launch_task.sh - Launch Fargate task
join_task.sh - Connect via ECS Exec
stop_task.sh - Stop task (triggers S3 sync)
close_investigation.sh - Cleanup access point + task definition
build-task-def.jq - jq script for task definition transformation
get-vpc-info.sh - Helper to discover VPC/subnets
.gitignore - Excludes tfvars, state, .terraform/
```
@@ -417,11 +410,7 @@ Terraform configuration for Keycloak realm and OIDC clients on ROSA cluster:
- `PREREQUISITES.md` - Setup requirements
- `QUICKSTART.md` - Deployment guide

### Investigation Creation Workflows

Two approaches are available for creating investigations:

#### Lambda-Based (Recommended)
### Investigation Creation Workflow

**Workflow**: OIDC authentication → Lambda authorization → Automatic role + task creation

@@ -445,48 +434,10 @@ tools/create-investigation-lambda.sh <cluster-id> <investigation-id> [oc-version
- Token caching (4 minutes, reduces browser popups)
- Automatic role management (no manual IAM operations)

#### Manual Lifecycle Scripts

**Workflow**: Direct AWS API calls → Manual task management

Located in `deploy/regional/examples/`:

1. **create_investigation.sh** `<cluster-id> <investigation-id> [oc-version]`
- Creates EFS access point with tags
- Clones base task definition
- Adds environment variables
- Registers new task definition family
- Returns task family name + access point ID

2. **launch_task.sh** `<task-family>`
- Launches Fargate task with ECS Exec enabled
- Waits for RUNNING state
- Returns task ID

3. **join_task.sh** `<task-id>`
- Connects via ECS Exec as sre user

4. **stop_task.sh** `<task-id> [reason]`
- Sends SIGTERM to task
- Entrypoint syncs to auto-generated S3 path
- Shows expected S3 location

5. **close_investigation.sh** `<task-family> <access-point-id>`
- Checks for running tasks
- Deregisters all task definition revisions
- Deletes EFS access point (prompts for confirmation)

**Use Cases**:
- Testing infrastructure without OIDC
- Debugging task definition issues
- Administrative operations
- CI/CD automation with service roles

### Key Files for Deployment

- `deploy/regional/terraform.tfvars.example` - Template configuration
- `deploy/regional/README.md` - Complete deployment documentation
- `deploy/regional/examples/*.sh` - Manual lifecycle scripts
- `lambda/create-investigation/handler.py` - Lambda authorization logic
- `tools/sre-auth/` - OIDC authentication scripts
- `tools/create-investigation-lambda.sh` - End-to-end creation wrapper
28 changes: 1 addition & 27 deletions README.md
@@ -47,7 +47,6 @@ rosa-boundary/
├── deploy/
│ ├── regional/ # Terraform: ECS, EFS, S3, Lambda, OIDC
│ │ ├── *.tf # Infrastructure definitions
│ │ ├── examples/ # Manual lifecycle scripts
│ │ └── README.md # Deployment guide
│ └── keycloak/ # Terraform: Keycloak realm and clients
├── lambda/
@@ -334,32 +333,7 @@ cd ../tools/
- Automatic role creation on first use
- Token caching (4 minutes) to avoid repeated browser authentication

See [`tools/sre-auth/README.md`](tools/sre-auth/README.md) for authentication details and [`docs/LAMBDA_AUTH_SUMMARY.md`](docs/LAMBDA_AUTH_SUMMARY.md) for architecture.

#### Manual Lifecycle Scripts

Lower-level scripts for manual investigation management:

```bash
cd deploy/regional/examples/

# Create investigation (access point + task definition)
./create_investigation.sh <cluster-id> <investigation-id> [oc-version]

# Launch task
./launch_task.sh <task-family>

# Connect to task
./join_task.sh <task-id>

# Stop task (triggers S3 sync)
./stop_task.sh <task-id>

# Cleanup investigation
./close_investigation.sh <task-family> <access-point-id>
```

See [`deploy/regional/README.md`](deploy/regional/README.md) for complete documentation.
See [`tools/sre-auth/README.md`](tools/sre-auth/README.md) for authentication details and [`deploy/regional/README.md`](deploy/regional/README.md) for infrastructure documentation.

#### Manual Deployment

140 changes: 59 additions & 81 deletions deploy/regional/README.md
@@ -78,28 +78,26 @@ terraform apply

### 3. Investigation Lifecycle Management

The `examples/` directory contains scripts for the complete investigation lifecycle:
Use the Lambda-based OIDC workflow for investigation creation:

```bash
cd examples/
cd ../../tools/

# Create investigation workspace (EFS access point + task definition)
./create_investigation.sh rosa-prod-abc INV-12345 4.18
# Create investigation with OIDC authentication
./create-investigation-lambda.sh <cluster-id> <investigation-id> [oc-version]

# Launch a Fargate task for the investigation
./launch_task.sh rosa-boundary-dev-rosa-prod-abc-INC-12345-TIMESTAMP
# Example
./create-investigation-lambda.sh rosa-prod-abc inv-12345 4.18

# Connect to the running task
./join_task.sh <task-id>

# Stop the task (triggers S3 sync)
./stop_task.sh <task-id>

# Close investigation (cleanup task definition + access point)
./close_investigation.sh rosa-boundary-dev-rosa-prod-abc-INV-12345-TIMESTAMP fsap-xxx
# The script will:
# 1. Authenticate via Keycloak (browser popup)
# 2. Call Lambda to create investigation
# 3. Assume role with tag-based permissions
# 4. Wait for task to be running
# 5. Display connection command
```

See [Investigation Lifecycle](#investigation-lifecycle) below for detailed examples.
See [`tools/sre-auth/README.md`](../../tools/sre-auth/README.md) for authentication details and [Investigation Lifecycle](#investigation-lifecycle) below for architecture.

## Variables

@@ -138,58 +136,35 @@ See [Investigation Lifecycle](#investigation-lifecycle) below for detailed examp

The infrastructure uses a per-investigation isolation model with EFS access points and dedicated task definitions.

### Workflow Overview

```
1. create_investigation.sh → Creates EFS access point + task definition with locked OC version
2. launch_task.sh → Launches Fargate task for the investigation
3. join_task.sh → Connects to running task via ECS Exec
4. stop_task.sh → Stops task (triggers S3 sync to unique path)
5. close_investigation.sh → Cleanup: deletes task definition + access point
```

### Path Structure

- **EFS**: `/$cluster_id/$investigation_id/` → Mounted to `/home/sre` in container
- **S3**: `s3://bucket/$cluster_id/$investigation_id/$date/$task_id/`

**Example:**
- EFS: `/rosa-prod-abc/INV-12345/` (shared across all tasks for this investigation)
- S3: `s3://bucket/rosa-prod-abc/INC-12345/20251215/d0910f05.../` (unique per task)
- S3: `s3://bucket/rosa-prod-abc/INV-12345/20251215/d0910f05.../` (unique per task)
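
The path layout above can be expressed as a short shell sketch. The function names `efs_path` and `s3_path` and the literal bucket name are hypothetical, chosen for illustration only; the actual entrypoint may assemble these paths differently.

```bash
#!/usr/bin/env bash
# Illustrative helpers only; these function names are not part of the repository.

# EFS directory shared by every task in one investigation.
efs_path() {
  local cluster_id="$1" investigation_id="$2"
  printf '/%s/%s/\n' "$cluster_id" "$investigation_id"
}

# S3 prefix unique to a single task run.
s3_path() {
  local bucket="$1" cluster_id="$2" investigation_id="$3" date="$4" task_id="$5"
  printf 's3://%s/%s/%s/%s/%s/\n' \
    "$bucket" "$cluster_id" "$investigation_id" "$date" "$task_id"
}

efs_path rosa-prod-abc INV-12345
s3_path bucket rosa-prod-abc INV-12345 20251215 d0910f05
```

Keeping the date and task ID out of the EFS path is what lets several tasks share one workspace while each task still writes to its own S3 prefix.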

### Step 1: Create Investigation Workspace
### Lambda-Based Workflow

```bash
cd examples/
The recommended workflow uses OIDC authentication and Lambda for investigation creation:

# Syntax: ./create_investigation.sh <cluster-id> <investigation-id> [oc-version]
./create_investigation.sh rosa-prod-abc INV-12345 4.18
```
1. **Authenticate**: User authenticates via Keycloak (browser PKCE flow)
2. **Create Investigation**: Lambda validates token, creates IAM role, EFS access point, and task
3. **Assume Role**: Script assumes returned role with tag-based permissions
4. **Connect**: User connects to running task via ECS Exec

**Output:**
- EFS Access Point: `fsap-xxx` with path `/$cluster_id/$investigation_id/`
- Task Definition: `rosa-boundary-dev-$cluster_id-$investigation_id-TIMESTAMP`
- OC Version: Locked to specified version in task definition
See [`tools/create-investigation-lambda.sh`](../../tools/create-investigation-lambda.sh) for the end-to-end script.

**Save the output values** (task family name and access point ID) for later steps.
### Connecting to Running Tasks

### Step 2: Launch Task
Use `tools/join-investigation.sh` to connect to existing tasks:

```bash
# Use task family name from create_investigation.sh output
./launch_task.sh rosa-boundary-dev-rosa-prod-abc-INC-12345-20251215-171625
```

**Automatic configuration:**
- Mounts investigation-specific EFS path to `/home/sre`
- Sets environment variables: `CLUSTER_ID`, `INVESTIGATION_ID`, `OC_VERSION`, `S3_AUDIT_BUCKET`
- Enables ECS Exec for interactive access
cd ../../tools/

### Step 3: Connect to Task

```bash
# Use task ID from launch_task.sh output
./join_task.sh d0910f05129944e3a3b03f4b453bf18c
# Join by task ID
./join-investigation.sh <task-id>
```

**Inside the container (as sre user):**
@@ -205,36 +180,48 @@ echo $INVESTIGATION_ID # INV-12345
# All files in /home/sre persist to EFS
```

### Step 4: Stop Task (Triggers S3 Sync)

```bash
./stop_task.sh d0910f05129944e3a3b03f4b453bf18c "Investigation complete"
```

**What happens:**
1. Container receives SIGTERM signal
2. Entrypoint auto-generates S3 path: `s3://bucket/$cluster/$investigation/$date/$taskid/`
3. Syncs `/home/sre` to S3 with WORM compliance
4. Container exits gracefully
**What happens during investigation:**
1. All files in `/home/sre` persist to EFS
2. Isolated workspace per investigation
3. OC version locked to investigation specification

**S3 Path Example:**
```
s3://641875867446-rosa-boundary-dev-us-east-2/rosa-prod-abc/INC-12345/20251215/d0910f05.../
```

### Step 5: Close Investigation (Cleanup)
### Stopping Tasks

Tasks can be stopped with the AWS CLI:

```bash
# Use task family and access point ID from create_investigation.sh output
./close_investigation.sh rosa-boundary-dev-rosa-prod-abc-INV-12345-TIMESTAMP fsap-xxx
# Stop task (triggers S3 sync via entrypoint signal handler)
aws ecs stop-task \
--cluster rosa-boundary-dev \
--task <task-id> \
--reason "Investigation complete"
```

**Cleanup actions:**
1. Checks for running tasks (must stop all first)
2. Deregisters all task definition revisions
3. Deletes EFS access point (prompts for confirmation)
The container's entrypoint script traps SIGTERM and syncs `/home/sre` to S3 before exit.
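
A minimal sketch of such a signal handler is shown here, assuming the environment variables listed in this README are present. `TASK_ID` and the function name `sync_on_term` are hypothetical stand-ins; the repository's real entrypoint may derive the task ID and the destination differently.

```bash
#!/usr/bin/env bash
# Sketch of a SIGTERM handler; not the repository's actual entrypoint.
# S3_AUDIT_BUCKET, CLUSTER_ID and INVESTIGATION_ID are the variables this
# README says are set on the task; TASK_ID is a hypothetical stand-in.

sync_on_term() {
  local dest
  dest="s3://${S3_AUDIT_BUCKET}/${CLUSTER_ID}/${INVESTIGATION_ID}/$(date +%Y%m%d)/${TASK_ID}/"
  echo "SIGTERM received, syncing /home/sre to ${dest}"
  aws s3 sync /home/sre "${dest}"
  exit 0
}

# Register the handler; bash runs it when the container receives SIGTERM.
trap sync_on_term TERM
```

For the trap to fire promptly, an entrypoint written this way would typically run its main process in the background and block on `wait`, since bash defers trap handlers until the current foreground command returns. ECS sends SIGKILL once the container's stop timeout elapses (30 seconds by default), so the sync has to finish within that window.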

### Cleanup

**Note:** EFS data remains on the filesystem but is no longer accessible. S3 data is retained per WORM policy.
Investigation cleanup requires manual deletion of resources:

```bash
# 1. List and stop any running tasks
aws ecs list-tasks --cluster rosa-boundary-dev --family <task-family>
aws ecs stop-task --cluster rosa-boundary-dev --task <task-id>

# 2. Deregister task definition revisions
aws ecs list-task-definitions --family-prefix <task-family>
aws ecs deregister-task-definition --task-definition <task-definition-arn>

# 3. Delete EFS access point
aws efs delete-access-point --access-point-id <fsap-id>
```

**Note:** EFS data remains on the filesystem. S3 data is retained per WORM policy.
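
The three manual steps can be combined into one function, sketched below. The name `cleanup_investigation` and its argument order are hypothetical, and the sketch assumes the AWS CLI is configured with permissions for all three operations.

```bash
#!/usr/bin/env bash
# Sketch only: the function name and argument order are hypothetical.

cleanup_investigation() {
  local cluster="$1" family="$2" access_point_id="$3"
  local task arn

  # 1. Stop any tasks still listed for this task definition family.
  for task in $(aws ecs list-tasks --cluster "$cluster" --family "$family" \
      --query 'taskArns[]' --output text); do
    aws ecs stop-task --cluster "$cluster" --task "$task" \
      --reason "Investigation closed" >/dev/null
  done

  # 2. Deregister every revision returned for the family prefix.
  for arn in $(aws ecs list-task-definitions --family-prefix "$family" \
      --query 'taskDefinitionArns[]' --output text); do
    aws ecs deregister-task-definition --task-definition "$arn" >/dev/null
  done

  # 3. Delete the EFS access point (the data itself stays on the filesystem).
  aws efs delete-access-point --access-point-id "$access_point_id"
}
```

Usage would look like `cleanup_investigation rosa-boundary-dev <task-family> <fsap-id>`. Because `--family-prefix` matches any family that begins with the given string, pass the full family name so that revisions belonging to other investigations are not deregistered.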

---

@@ -298,7 +285,7 @@ Session logs contain complete terminal output including any credentials or sensi

## Environment Variables

### Automatically Set (by lifecycle scripts)
### Automatically Set (by Lambda or task definition)

- `CLUSTER_ID` - ROSA cluster identifier
- `INVESTIGATION_ID` - Investigation tracking ID
@@ -358,16 +345,7 @@ Used by the container at runtime:

## Connecting to Running Tasks

### Via join_task.sh (Recommended)

```bash
cd examples/
./join_task.sh <task-id>
```

Automatically connects as the `sre` user via ECS Exec.

### Manual Connection
### Direct Connection

```bash
# List running tasks
31 changes: 0 additions & 31 deletions deploy/regional/examples/build-task-def.jq

This file was deleted.