* **Step-Zero:** Always scan `./src/main/java/couchbase` and `./src/main/java/RestServer` to understand existing SDK and REST patterns before proposing new code.
* **Component Selection:**
  - **Single Collection Workloads**: Use standard `SDKClientPool` → `SDKClient` → `Cluster` pattern
  - **Multi-Collection Workloads (100-1000 collections)**: Use `SharedClusterManager` + dynamic collection switching
  - **Massive Collection Loads (1000+ collections)**: Use `CollectionLoadBatcher` + `SharedClusterManager`
  - **High-Throughput Operations**: Leverage shared ClusterEnvironment with 500+ KV connections
* **REST API Focus:** Modifications target Spring Boot REST endpoints (RestHandlers) and TaskRequest business logic for HTTP-based document loading.
* **SDK Precision:** Default to the latest Couchbase SDK (v3.x) unless specified otherwise.
* **N1QL Mastery:** Must prioritize Indexing strategies and GSI (Global Secondary Index) awareness when writing queries.
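The component tiers above can be sketched as a small lookup. This is only an illustration: `ComponentSelector` and its `Strategy` enum are hypothetical names, not part of the loader, and the handling of 2-99 collections and of exactly 1000 collections is an assumption, since the guideline does not pin those boundaries down.

```java
/** Hypothetical helper: maps a collection count to the component tier named in the guideline. */
final class ComponentSelector {
    enum Strategy {
        SDK_CLIENT_POOL,         // SDKClientPool -> SDKClient -> Cluster
        SHARED_CLUSTER_MANAGER,  // SharedClusterManager + dynamic collection switching
        COLLECTION_LOAD_BATCHER  // CollectionLoadBatcher + SharedClusterManager
    }

    static Strategy select(int collectionCount) {
        if (collectionCount < 1) {
            throw new IllegalArgumentException("collectionCount must be >= 1");
        }
        if (collectionCount == 1) {
            return Strategy.SDK_CLIENT_POOL;
        }
        // Assumption: 2-99 collections are treated like the 100-1000 tier.
        if (collectionCount < 1000) {
            return Strategy.SHARED_CLUSTER_MANAGER;
        }
        // Assumption: exactly 1000 collections falls into the "1000+" tier.
        return Strategy.COLLECTION_LOAD_BATCHER;
    }
}
```

For example, `ComponentSelector.select(5000)` returns `COLLECTION_LOAD_BATCHER` under these assumptions.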
@@ -26,6 +36,37 @@ graph TD
- Always include error handling for `DocumentNotFoundException` and `CasMismatchException`.
* **Tone:** Technical, efficiency-focused, and precise.
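The error-handling rule above usually amounts to a bounded read-modify-write loop. The sketch below shows the control flow only, using an in-memory stand-in so it runs without a cluster: `Store`, `DocNotFound`, and `CasMismatch` are hypothetical substitutes for a Couchbase collection and the SDK's `DocumentNotFoundException` and `CasMismatchException`, and the retry budget is an assumed parameter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

final class CasRetrySketch {
    // Stand-ins for the SDK exceptions; not the real Couchbase classes.
    static final class DocNotFound extends RuntimeException {}
    static final class CasMismatch extends RuntimeException {}

    static final class Versioned {
        final String content;
        final long cas;
        Versioned(String content, long cas) { this.content = content; this.cas = cas; }
    }

    /** Hypothetical in-memory store that mimics CAS-checked replace. */
    static final class Store {
        private final Map<String, Versioned> docs = new ConcurrentHashMap<>();

        void insert(String id, String content) { docs.put(id, new Versioned(content, 1L)); }

        Versioned get(String id) {
            Versioned v = docs.get(id);
            if (v == null) throw new DocNotFound();
            return v;
        }

        synchronized void replace(String id, String content, long expectedCas) {
            Versioned current = get(id);
            if (current.cas != expectedCas) throw new CasMismatch();
            docs.put(id, new Versioned(content, expectedCas + 1));
        }
    }

    /** Returns true if the document was updated, false if it does not exist. */
    static boolean update(Store store, String id, UnaryOperator<String> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Versioned current = store.get(id);
                store.replace(id, change.apply(current.content), current.cas);
                return true;
            } catch (DocNotFound e) {
                return false; // nothing to update; the caller decides whether to insert
            } catch (CasMismatch e) {
                // A concurrent writer won; loop to re-read and try again.
            }
        }
        throw new IllegalStateException("CAS retry budget exhausted for " + id);
    }
}
```

With the real SDK, the same shape would wrap a get followed by a CAS-guarded replace; the exact calls should be taken from the existing code under `./src/main/java/couchbase`.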
- **Purpose**: Java-side batch processing for massive collection loads (thousands of collections)
- **Key Features**:
  - Fixed batch size (default: 50) with concurrent processing
  - Thread-safe batch state tracking with progress monitoring
  - Prevents worker starvation and queue overhead
  - Integration with REST API via `submitToBatch()` endpoint
- **Usage Pattern**:
  ```java
  ResponseEntity<Map<String, Object>> result =
      CollectionLoadBatcher.submitToBatch(requestBody);
  ```
- **Performance Benefits**: Sequential Python calls become batched Java operations, maximizing throughput for massive collection loads
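The batching behavior described above (fixed-size batches, concurrent work inside a batch, next batch started only after the current one completes) can be sketched with plain `java.util.concurrent`. This is not the `CollectionLoadBatcher` implementation: `BatchingSketch`, `partition`, and `runInBatches` are hypothetical names, and the loader callback stands in for whatever per-collection work the real batcher submits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class BatchingSketch {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be > 0");
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }

    /** Runs batches one after another; items inside a batch run concurrently. Returns items completed. */
    static <T> int runInBatches(List<T> items, int batchSize, int workers, Consumer<T> loader) {
        AtomicInteger completed = new AtomicInteger(); // thread-safe progress counter
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (List<T> batch : partition(items, batchSize)) {
                List<Future<?>> pending = new ArrayList<>();
                for (T item : batch) {
                    pending.add(pool.submit(() -> {
                        loader.accept(item);
                        completed.incrementAndGet();
                    }));
                }
                // Wait for the whole batch before starting the next one.
                for (Future<?> f : pending) {
                    try {
                        f.get();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting for batch", e);
                    } catch (ExecutionException e) {
                        throw new IllegalStateException("load task failed", e.getCause());
                    }
                }
            }
        } finally {
            pool.shutdown();
        }
        return completed.get();
    }
}
```

With 120 items and a batch size of 50, `partition` yields three batches of 50, 50, and 20 items.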
### Work flow of loading
sequenceDiagram
participant C as Client (REST)
@@ -55,9 +96,18 @@ sequenceDiagram
### Performance Optimization Guidelines
* **Multi-Collection Strategy**: Prefer bucket-level clients with dynamic collection switching over per-collection client instances. Workers should call `selectCollection()` dynamically per operation instead of creating dedicated clients per collection.
* **Shared Cluster Management**: Use `SharedClusterManager` for all multi-collection workloads. It provides:
  - Single Cluster instance per server connection to avoid connection exhaustion
  - Thread-safe reference counting and automatic resource cleanup
  - Environment recreation capability for long-running workloads
* **Connection Scaling**: KV connections should scale based on: `num_workers × target_collections / connection_reuse_factor`. The default of 5 connections per SDKClient may be insufficient for high-concurrency multi-collection workloads. `SharedClusterManager` defaults to 500 KV connections for large-scale loads.
* **Thread Pool Sizing**: Set `num_workers` based on concurrent task throughput needs, not total collections. Example: 60 workers efficiently handle 5000 collections with proper batching, rather than allocating 20 workers per collection.
* **Batch Processing**: For large-scale multi-collection loading (1000+ collections), use `CollectionLoadBatcher` to:
  - Process collections in batches (default: 50 per batch) to avoid client pool exhaustion
  - Prevent worker starvation and reduce queue overhead
  - Monitor batch progress and completion status
  - Automatically start the next batch after the current one completes
* **Client Pool Optimization**: SDKClientPool should cache clients at bucket level and support dynamic scope/collection switching, not create separate client instances per (scope+collection) combination.
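As a worked example of the connection scaling formula in the guidelines above, the sketch below evaluates `num_workers × target_collections / connection_reuse_factor`. `KvConnectionSizing` is a hypothetical helper, rounding up is an assumption (the guideline does not say how to round), and the reuse factor of 600 in the usage note is an illustrative value chosen only to reproduce the 500-connection default.

```java
/** Hypothetical helper that evaluates the KV connection scaling formula. */
final class KvConnectionSizing {
    static final int DEFAULT_PER_SDK_CLIENT = 5;    // default quoted in the guideline
    static final int SHARED_MANAGER_DEFAULT = 500;  // default quoted in the guideline

    static int estimate(int numWorkers, int targetCollections, int connectionReuseFactor) {
        if (numWorkers <= 0 || targetCollections <= 0 || connectionReuseFactor <= 0) {
            throw new IllegalArgumentException("all inputs must be positive");
        }
        long raw = (long) numWorkers * targetCollections; // long avoids int overflow
        // Assumption: round up so a fractional result never under-provisions.
        long needed = (raw + connectionReuseFactor - 1) / connectionReuseFactor;
        return (int) Math.min(needed, Integer.MAX_VALUE);
    }
}
```

With 60 workers and 5000 collections, `KvConnectionSizing.estimate(60, 5000, 600)` evaluates to 500, which matches the quoted shared default and is far above the 5 connections of a single SDKClient.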
Suitable for: Very large numbers of collections (1000+) where sequential Python calls would cause worker starvation and resource usage must stay controlled. Uses `SharedClusterManager` internally for connection optimization.
### Key Performance Metrics to Monitor
* **SharedClusterManager Metrics**:
  - Cluster reference count and reuse rate
  - KV connection utilization vs capacity (default: 500)
  - Environment shutdown/recreation events
  - Per-server cluster instance count
* **CollectionLoadBatcher Metrics**:
  - Active batch count and batch progress percentage
  - Collections loaded per batch vs batch size (default: 50)
  - Batch completion rate and queue depth
  - Batch processor thread pool utilization
* **Connection Pool Utilization**: Monitor KV connection count vs capacity
* **Client Pool Efficiency**: Track client reuse rate vs new client creation
* **Thread Wait Time**: Measure worker idle time waiting for tasks vs clients
* **Task Queue Depth**: Monitor pending tasks in TaskManager
* **Collection Throughput**: Track collections loaded per time unit
* **Document Success Rate**: Monitor failedMutations and retry patterns

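Two of the metrics above, client reuse rate and batch progress percentage, can be derived from simple thread-safe counters. The sketch below is one possible shape, not an existing class in the loader: `LoaderMetricsSketch` and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counter holder for client reuse rate and batch progress. */
final class LoaderMetricsSketch {
    private final AtomicLong clientsReused = new AtomicLong();
    private final AtomicLong clientsCreated = new AtomicLong();
    private final AtomicLong collectionsLoaded = new AtomicLong();
    private final int batchSize;

    LoaderMetricsSketch(int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be > 0");
        this.batchSize = batchSize;
    }

    void recordClientReuse() { clientsReused.incrementAndGet(); }
    void recordClientCreation() { clientsCreated.incrementAndGet(); }
    void recordCollectionLoaded() { collectionsLoaded.incrementAndGet(); }

    /** Fraction of client requests served by an existing client (0.0 when nothing was recorded). */
    double clientReuseRate() {
        long reused = clientsReused.get();
        long total = reused + clientsCreated.get();
        return total == 0 ? 0.0 : (double) reused / total;
    }

    /** Progress of the current batch in percent, capped at 100. */
    double batchProgressPercent() {
        long loaded = Math.min(collectionsLoaded.get(), batchSize);
        return 100.0 * loaded / batchSize;
    }
}
```

A low reuse rate alongside a high KV connection count would suggest that clients are being created per collection rather than cached at bucket level.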
### Hard Constraints Integration
* **SharedClusterManager**: Must use `SharedClusterManager.getCluster(server)` and `releaseCluster(server)` for all multi-collection operations. Never create standalone Cluster instances for large-scale workloads.
* **Environment Lifecycle**: Must follow the proper ClusterEnvironment lifecycle: use the shared environment with automatic recreation capability, and never manually manage environment shutdown/reactivation.
* **Batch Processing Threshold**: For workloads with >100 collections, use `CollectionLoadBatcher.submitToBatch()` instead of direct REST calls to prevent worker starvation.
* **Thread Safety**: `SharedClusterManager` uses synchronized methods and a volatile shutdown flag; ensure thread-safe access patterns when dealing with reference counting and environment state.
* **Error Handling**: Always handle `AuthenticationFailureException` and cluster connection errors with proper logging and retries in both `SharedClusterManager` and `CollectionLoadBatcher`.
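The acquire/release discipline and the thread-safety constraint above follow the usual reference-counting pattern. The sketch below illustrates that pattern with generic names; `RefCountedRegistry` and `FakeCluster` are hypothetical, and nothing here reflects the actual internals of `SharedClusterManager` beyond what the constraints state (synchronized methods, a volatile shutdown flag, one instance per server).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry: one shared resource per server, closed when the last user releases it. */
final class RefCountedRegistry<R extends AutoCloseable> {
    private static final class Entry<T> {
        final T resource;
        int refs;
        Entry(T resource) { this.resource = resource; }
    }

    private final Map<String, Entry<R>> entries = new HashMap<>();
    private final Function<String, R> factory;
    private volatile boolean shutdown; // readable without taking the lock

    RefCountedRegistry(Function<String, R> factory) { this.factory = factory; }

    synchronized R acquire(String server) {
        if (shutdown) throw new IllegalStateException("registry is shut down");
        Entry<R> e = entries.computeIfAbsent(server, s -> new Entry<>(factory.apply(s)));
        e.refs++;
        return e.resource;
    }

    synchronized void release(String server) {
        Entry<R> e = entries.get(server);
        if (e == null) return; // unknown or already released
        if (--e.refs == 0) {
            entries.remove(server);
            closeQuietly(e.resource);
        }
    }

    synchronized int refCount(String server) {
        Entry<R> e = entries.get(server);
        return e == null ? 0 : e.refs;
    }

    synchronized void shutdownAll() {
        shutdown = true;
        for (Entry<R> e : entries.values()) closeQuietly(e.resource);
        entries.clear();
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception ex) {
            // A real implementation would log this; the sketch ignores close failures.
        }
    }
}

/** Stand-in for a cluster connection, used only to exercise the registry. */
final class FakeCluster implements AutoCloseable {
    final String server;
    boolean closed;
    FakeCluster(String server) { this.server = server; }
    @Override public void close() { closed = true; }
}
```

Callers would pair every acquire with a release in a `finally` block, so that a failed load never leaks a reference and keeps the underlying connection open.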
AGENTS.md: 1 line changed (1 addition & 0 deletions)
@@ -11,6 +11,7 @@ This project uses specialized AI agents to maintain code quality and architectur
### Orchestration Logic
* **If** the user asks for thread, doc_key, or document generator related code → **Handoff to:** `The Architect`.
* **If** the user asks for Couchbase Sirius or REST based loader related code → **Handoff to:** `The CBRestLoader`.
* **If** the user asks for batch processing, shared cluster management, or massive collection load optimization → **Handoff to:** `The CBRestLoader` with focus on `SharedClusterManager` and `CollectionLoadBatcher`.
* **If** the user asks for Couchbase command line loader related code → **Handoff to:** `The CBCmdlineLoader`.
* **If** the user asks for Mongo related code → **Handoff to:** `The MongoCoder`.