
Commit fb52149

Add fusion testing infra & first scale test.
Change-Id: I17dc3d1bef5ef97f2abbe04ea5a1af0d17ef5d2e
Reviewed-on: https://review.couchbase.org/c/TAF/+/236231
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Ashwin <ashwin.govindarajulu@couchbase.com>
1 parent ea8f700 commit fb52149

30 files changed

Lines changed: 10972 additions & 45 deletions

.factory/droids/fusion.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+---
+name: couchbase-capella-fusion-test-architect
+description: A specialized droid focused on writing comprehensive fusion tests for Couchbase Capella fusion storage. Helps developers design, structure, and implement fusion test suites that validate fusion accelerator lifecycle, EBS volume management, S3 log store operations, horizontal/vertical scaling, and AWS fault injection. Ensures test coverage, maintainability, and adherence to the established 3-layer architecture.
+model: inherit
+---
+
+You are a fusion test writing specialist for Couchbase Capella fusion storage testing within the TAF (Test Automation Framework). Your primary focus is writing tests that validate fusion accelerator behavior, scaling operations, resource lifecycle, and fault tolerance across AWS infrastructure.
+
+## Fusion Codebase Location
+
+All fusion test code lives in `pytests/aGoodDoctor/fusion/`. Key files:
+
+### Layer 1 - AWS Libraries (`awslib/`)
+- `ec2_lib.py` - EC2 instance and volume management (tag filtering, SSM commands, polling)
+- `s3_lib.py` - S3 bucket/object operations (listing, deletion, size calculation, log retrieval)
+- `secrets_manager_lib.py` - Secrets Manager credential retrieval (pattern-based discovery, JSON parsing)
+- `fis_lib.py` - AWS Fault Injection Simulator for accelerator fallback testing (compute failure simulation, architecture-aware ARM/x86 testing)
+- `cloudtrail_delete_setup.py` - CloudTrail logging setup for S3 object deletion tracking
+
+### Layer 2 - Business Logic Utilities
+- `fusion_aws_util.py` - `FusionAWSUtil` class: AWS orchestration facade wrapping EC2, S3, SecretsManager. Key methods: `list_accelerator_instances()` (16K IOPS filtering), `list_cluster_fusion_asg()`, `scan_logs_for_errors_on_cluster_instances()`
+- `fusion_monitor_util.py` - `FusionMonitorUtil` class: Cluster-level fusion observability via REST API and cbstats. Key methods: `wait_for_fusion_status()`, `get_fusion_s3_uri()`, `log_fusion_pending_bytes()`, `get_fusion_uploader_map()`, `run_cbstats_on_all_nodes()`
+- `fusion_cp_resource_monitor.py` - `FusionCPResourceMonitor` class: AWS control plane resource monitoring. Key methods: `monitor_fusion_guest_volumes()`, `monitor_cluster_accelerator_instances()`, `check_ebs_guest_vol_deletion()`, `scan_memcached_logs_for_errors()`, `parse_accelerator_logs()`, `monitor_fusion_accelerator_nodes_killed_after_rebalance()`
+
+### Layer 3 - Test Orchestration
+- `fusion_volume.py` - `VolumeTest` class: Main test class for fusion volume scaling (inherits BaseTestCase + hostedOPD). Orchestrates horizontal scaling (node add/remove), vertical scaling (disk/compute), and validation (cleanup, error scanning, log parsing)
+
+### Supporting Files
+- `download_accelerator_logs.sh` - Shell script for downloading accelerator logs from S3
+- `fusion_s3_delete_check.sh` - Shell script for S3 deletion verification
+- `architecture.md` - Canonical architecture reference with diagrams and flows
+- `README.md` - Quick start guide and test execution overview
+- `FIS-LIB-README.md` - Detailed FIS library documentation
+
+## Architecture Patterns (MUST FOLLOW)
+
+### 3-Layer Architecture
+- **Layer 1 (AWS Libraries)**: Low-level boto3 wrappers. NEVER call boto3 directly in test code.
+- **Layer 2 (Business Utilities)**: Fusion-specific logic. Monitoring, orchestration, credential management.
+- **Layer 3 (Test Orchestration)**: Test classes that coordinate using Layer 2 utilities. Assertions happen here.
+
+### Initialization Pattern
+```python
+def setUp(self):
+    self.fusion_aws_util = FusionAWSUtil(self.aws_access_key, self.aws_secret_key, region=self.aws_region)
+    self.fusion_monitor = FusionMonitorUtil(self.log, self.fusion_aws_util)
+    self.cp_monitor = FusionCPResourceMonitor(self.log, self.fusion_aws_util)
+    self.stop_run_event = threading.Event()
+```
+
+### Thread Coordination Pattern
+All long-running monitoring uses `threading.Event()` for clean lifecycle:
+```python
+# Start background monitoring
+cleanup_thread = threading.Thread(
+    target=self.cp_monitor.check_ebs_guest_vol_deletion,
+    kwargs={"tenant": tenant, "cluster": cluster, "stop_run_event": self.stop_run_event}
+)
+cleanup_thread.start()
+
+# In tearDown
+def tearDown(self):
+    self.stop_run_event.set()
+    for thread in self.background_threads:
+        thread.join()
+```
+
+### Delegation Pattern
+- Utility classes return booleans; test classes perform assertions
+- Monitoring logic belongs in Layer 2 utility classes, NOT in test classes
+- Use `FusionAWSUtil` for all AWS operations, never raw boto3
+
+## Key Constants
+- `FUSION_ACCELERATOR_IOPS = 16000` - Fusion accelerator instances use 16K IOPS volumes
+- `VBUCKET_COUNT = 128` - Fusion vBucket count
+- `DEFAULT_TIMEOUT = 1800` - Default monitoring timeout (30 minutes)
+- `EBS_CLEANUP_TIMEOUT = 1200` - EBS volume cleanup timeout (20 minutes)
+
+## Key Invariants to Validate
+1. **Accelerator Lifecycle**: Accelerator nodes appear during rebalance, get killed after completion
+2. **EBS Guest Volumes**: Created during rebalance, hydrated, cleaned up to 0 after completion
+3. **Cluster Health**: Returns to `healthy` state, no `deployment_failed`/`rebalance_failed`/`scaleFailed`
+4. **Fusion Status**: Remains `enabled` throughout operations
+5. **No CRITICAL Errors**: No CRITICAL in memcached logs, no core dumps, no hydration failures
+6. **ASG Cleanup**: Auto Scaling Groups cleaned up after rebalance
+
+## Test Execution
+```bash
+python testrunner.py -i node.ini -c conf/fusion_volume.conf \
+    -p aws_access_key=$AWS_ACCESS_KEY_ID,aws_secret_key=$AWS_SECRET_ACCESS_KEY \
+    -p region=us-east-1 -p h_scaling=True -p iterations=3
+```
+
+## Hard Constraints
+- All new test code goes in `pytests/aGoodDoctor/fusion/`
+- Follow the 3-layer architecture strictly
+- Never put monitoring logic directly in test classes
+- Use event-driven stop for all background threads
+- Use PrettyTable for structured logging (consistent with existing code)
+- Never hard-code AWS credentials or secrets
+- Proper cleanup in tearDown (stop events, thread joins, CloudTrail teardown)
+- Follow existing import patterns and class naming conventions
+
+## Reference Documentation
+- `pytests/aGoodDoctor/fusion/architecture.md` - Canonical architecture reference with runtime flows, threading model, and extensibility guidelines
+- `pytests/aGoodDoctor/fusion/README.md` - Quick start and API summaries
+- `pytests/aGoodDoctor/fusion/FIS-LIB-README.md` - FIS fallback testing details
+- `AGENTS.md` - Root TAF coding guidelines
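
Review note: the thread coordination pattern that `fusion.md` describes can be sketched in isolation. This is a minimal illustration under stated assumptions, not the framework's code. `poll_until_stopped` and `MonitorLifecycle` are hypothetical stand-ins for a Layer 2 monitor and a test class, and the sketch assumes every started thread is appended to `background_threads` so that teardown has something to join.

```python
import threading
import time


def poll_until_stopped(stop_run_event, results, interval=0.01):
    """Hypothetical stand-in for a Layer 2 monitor: poll until told to stop."""
    polls = 0
    while not stop_run_event.is_set():
        polls += 1
        # Event.wait() acts as an interruptible sleep between polls.
        stop_run_event.wait(interval)
    results["polls"] = polls
    results["clean_exit"] = True


class MonitorLifecycle:
    """Hypothetical stand-in for the setUp/tearDown halves of a test class."""

    def __init__(self):
        self.stop_run_event = threading.Event()
        self.background_threads = []
        self.results = {}

    def start(self):
        thread = threading.Thread(
            target=poll_until_stopped,
            kwargs={"stop_run_event": self.stop_run_event,
                    "results": self.results},
        )
        thread.start()
        # Register the thread so teardown can join it later.
        self.background_threads.append(thread)

    def tear_down(self, join_timeout=5):
        # Signal every monitor first, then wait for each one to finish.
        self.stop_run_event.set()
        for thread in self.background_threads:
            thread.join(timeout=join_timeout)


lifecycle = MonitorLifecycle()
lifecycle.start()
time.sleep(0.05)
lifecycle.tear_down()
```

Joining with a timeout is a design choice of this sketch: it keeps a stuck monitor from hanging teardown indefinitely, at the cost of possibly leaving that thread running.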

.factory/settings.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "enabledPlugins": {
+    "core@factory-plugins": true
+  }
+}

.gitignore

Lines changed: 1 addition & 4 deletions
@@ -38,9 +38,6 @@ build
 /bin/
 **/settings
 
-# Factory/AI tools
-**/.factory
-
 # Jython cache
 .jython_cache/
 
@@ -60,4 +57,4 @@ credentials.json
 # Node.js (for jscpd and other npm tools)
 node_modules/
 package-lock.json
-package.json
+package.json

connections/Rest_Connection.py

Lines changed: 5 additions & 2 deletions
@@ -202,10 +202,13 @@ def urllib_request(self, api, method='GET', headers=None,
                        params={}, timeout=300, verify=False):
         session = requests.Session()
         headers = headers or self.get_headers_for_content_type_json()
-        params = json.dumps(params)
+        # For GET requests, params should be a dict for query parameters
+        # For other methods, JSON-encode as request body
+        if method != "GET":
+            params = json.dumps(params)
         try:
             if method == "GET":
-                resp = session.get(api, params=params, headers=headers,
+                resp = session.get(api, params=params if params else None, headers=headers,
                                    timeout=timeout, verify=verify)
             elif method == "POST":
                 resp = session.post(api, data=params, headers=headers,
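
Review note: the branching this hunk introduces can be sketched without a network call. `build_request_kwargs` is a hypothetical helper, not part of `Rest_Connection.py`. It mirrors the new behavior: GET keeps `params` as a dict (or `None` when empty), while other methods receive a JSON-encoded body. It also uses `None` as the default instead of the mutable `{}` default in the original signature, which is a choice of this sketch only.

```python
import json
from urllib.parse import urlencode


def build_request_kwargs(method, params=None):
    """Hypothetical helper mirroring the GET / non-GET split in the hunk."""
    params = params or {}
    if method == "GET":
        # Query parameters stay a dict; an empty dict is passed as None.
        return {"params": params if params else None}
    # Every other method carries the parameters as a JSON-encoded body.
    return {"data": json.dumps(params)}


get_kwargs = build_request_kwargs("GET", {"bucket": "default", "limit": 10})
post_kwargs = build_request_kwargs("POST", {"bucket": "default"})

# A dict encodes to key=value pairs, which is the shape a query string needs.
query = urlencode(get_kwargs["params"])
```

A JSON string has no key/value structure to encode as query parameters, which is why the JSON encoding is skipped on the GET path.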

couchbase_utils/bucket_utils/bucket_ready_functions.py

Lines changed: 17 additions & 23 deletions
@@ -208,6 +208,7 @@ def _loader_dict(cluster, buckets, overRidePattern=None,
                     key_prefix = key_prefix or bucket.loadDefn.get("key_prefix", "test_docs-")
                     key_size = key_size or bucket.loadDefn.get("key_size", 20)
                     key_type = key_type or bucket.loadDefn.get("key_type", "SimpleKey")
+                    doc_size = workloads[i % len(workloads)].get("doc_size", 256)
                     model = model or bucket.loadDefn.get("model", "Hotel")
                     mockVector = mockVector or bucket.loadDefn.get("mockVector", False)
                     base64 = base64 or bucket.loadDefn.get("base64", False)
@@ -216,16 +217,20 @@ def _loader_dict(cluster, buckets, overRidePattern=None,
                         continue
                     if collection == "_default" and scope == "_default" and skip_default:
                         continue
-                    per_coll_ops = bucket.loadDefn.get("ops")//(len(bucket.scopes[scope].collections.keys()) - 1)
+                    if bucket.loadDefn.get("ops") and bucket.loadDefn.get("ops") not in [None, "None"]:
+                        per_coll_ops = bucket.loadDefn.get("ops")//(len(bucket.scopes[scope].collections.keys()) - 1)
+                    else:
+                        per_coll_ops = None
+                    JavaDocLoaderUtils.log.info(f"Loading {per_coll_ops} ops for {bucket.name+scope+collection}")
                     loader = SiriusCouchbaseLoader(
                         server_ip=cluster.master.ip, server_port=cluster.master.port,
                         username="Administrator", password="password",
                         bucket=bucket,
                         scope_name=scope, collection_name=collection,
-                        key_prefix=key_prefix, key_size=key_size, doc_size=256,
+                        key_prefix=key_prefix, key_size=key_size, doc_size=doc_size,
                         key_type=key_type, value_type=valType,
-                        create_percent=pattern["create"], read_percent=pattern["read"], update_percent=pattern["update"],
-                        delete_percent=pattern["delete"], expiry_percent=pattern["expiry"],
+                        create_percent=pattern.get("create", 0), read_percent=pattern.get("read", 0), update_percent=pattern.get("update", 0),
+                        delete_percent=pattern.get("delete", 0), expiry_percent=pattern.get("expiry", 0),
                         create_start_index=bucket.create_start , create_end_index=bucket.create_end,
                         read_start_index=bucket.read_start, read_end_index=bucket.read_end,
                         update_start_index=bucket.update_start, update_end_index=bucket.update_end,
@@ -276,17 +281,20 @@ def perform_load(cluster, buckets, wait_for_load=True,
                     if collection == "_default" and scope == "_default" and skip_default:
                         continue
                     loader = loader_map[bucket.name+scope+collection]
-                    loader.create_doc_load_task()
+                    result, json_response = loader.create_doc_load_task()
+                    if not result:
+                        JavaDocLoaderUtils.log.critical("Failed to create doc load task: %s" % json_response)
+                        return False
                     JavaDocLoaderUtils.doc_loading_tm.add_new_task(loader)
                     tasks.append(loader)
-
         if wait_for_load:
             JavaDocLoaderUtils.wait_for_doc_load_completion(cluster, tasks, wait_for_stats)
         else:
             return tasks
 
         if validate_data:
             JavaDocLoaderUtils.data_validation(cluster, skip_default=skip_default)
+        return []
 
     @staticmethod
     def load_sift_data(cluster=None, buckets=None, overRidePattern=None, skip_default=True,
@@ -318,6 +326,7 @@ def load_sift_data(cluster=None, buckets=None, overRidePattern=None, skip_defaul
                     key_prefix = bucket.loadDefn.get("key_prefix")
                     key_size = bucket.loadDefn.get("key_size")
                     key_type = bucket.loadDefn.get("key_type")
+                    doc_size = workload.get("doc_size", 256)
                     model = bucket.loadDefn.get("model")
                     mockVector = bucket.loadDefn.get("mockVector")
                     base64 = bucket.loadDefn.get("base64")
@@ -332,7 +341,7 @@ def load_sift_data(cluster=None, buckets=None, overRidePattern=None, skip_defaul
                         username="Administrator", password="password",
                         bucket=bucket,
                         scope_name=scope, collection_name=collection,
-                        key_prefix=key_prefix, key_size=key_size, doc_size=256,
+                        key_prefix=key_prefix, key_size=key_size, doc_size=doc_size,
                         key_type=key_type, value_type=valType,
                         create_percent=pattern["create"], read_percent=pattern["read"], update_percent=pattern["update"],
                         delete_percent=pattern["delete"], expiry_percent=pattern["expiry"],
@@ -380,7 +389,7 @@ def load_data(cluster, buckets=None, overRidePattern=None,
                                              create_end=override_num_items or bucket.loadDefn.get("num_items"),
                                              bucket=bucket)
 
-        JavaDocLoaderUtils.perform_load(cluster=cluster,
+        return JavaDocLoaderUtils.perform_load(cluster=cluster,
                                         buckets=buckets,
                                         overRidePattern=overRidePattern,
                                         validate_data=validate_data,
@@ -389,21 +398,6 @@ def load_data(cluster, buckets=None, overRidePattern=None,
                                         mutate=mutate,
                                         suppress_error_table=suppress_error_table,
                                         track_failures=track_failures)
-        if update:
-            for bucket in buckets:
-                JavaDocLoaderUtils.generate_docs(doc_ops=["update"],
-                                                 update_start=0,
-                                                 update_end=override_num_items or bucket.loadDefn.get("num_items"),
-                                                 bucket=bucket)
-            JavaDocLoaderUtils.perform_load(cluster=cluster,
-                                            buckets=buckets,
-                                            overRidePattern={"create": 0, "read": 0, "update": 100, "delete": 0, "expiry": 0},
-                                            validate_data=False,
-                                            wait_for_load=wait_for_load,
-                                            wait_for_stats=wait_for_stats,
-                                            mutate=mutate,
-                                            suppress_error_table=suppress_error_table,
-                                            track_failures=track_failures)
 
 class DocLoaderUtils(object):
     log = logger.get("test")
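
Review note: two defensive changes in `_loader_dict` can be sketched as standalone functions. Both names are hypothetical and do not exist in `bucket_ready_functions.py`. The sketch assumes the `- 1` in the divisor excludes the `_default` collection, which the diff does not state. It also adds a guard for a scope with a single collection, which the committed code does not have.

```python
def per_collection_ops(total_ops, collection_count):
    """Split a bucket-level ops target across the non-default collections.

    Returns None when no target is configured, matching the new else branch.
    """
    if not total_ops or total_ops == "None":
        return None
    # Assumption: the "- 1" leaves out the _default collection.
    divisor = collection_count - 1
    if divisor <= 0:
        # Guard added by this sketch only; the diff divides unconditionally.
        return None
    return total_ops // divisor


def workload_percentages(pattern):
    """Read each operation percentage, defaulting missing keys to 0."""
    keys = ("create", "read", "update", "delete", "expiry")
    return {key: pattern.get(key, 0) for key in keys}


ops = per_collection_ops(10000, 5)
percentages = workload_percentages({"create": 100})
```

With `pattern.get(key, 0)`, a partial override such as `{"create": 100}` no longer raises `KeyError` for the operations it leaves out.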
