Commit b09a4ee

hemanthsavaserehsavaserepolyzos
authored
1794 : Add graceful shutdown documentation (#1796)
* docs(maintenance): add graceful shutdown procedures documentation

  Add comprehensive documentation for graceful shutdown procedures in Fluss, covering server shutdown processes, component-specific shutdown sequences, best practices, and troubleshooting guidelines. The document provides implementation details for both Coordinator and Tablet servers, along with configuration references and monitoring recommendations.

* Changes done to remove unnecessary docs
* add some improvements

---------

Co-authored-by: Hemanth <[email protected]>
Co-authored-by: ipolyzos <[email protected]>
1 parent e51fcda commit b09a4ee

1 file changed

+121
-0
lines changed

@@ -0,0 +1,121 @@
1+
# Graceful Shutdown
2+
3+
Apache Fluss provides a **comprehensive graceful shutdown mechanism** to ensure data integrity and proper resource cleanup when stopping servers or services.
4+
5+
This guide describes the shutdown procedures, configuration options, and best practices for each Fluss component.
6+
7+
## Overview
8+
9+
Graceful shutdown in Fluss ensures that:
10+
- All ongoing operations complete safely
11+
- Resources are properly released
12+
- Data consistency is maintained
13+
- Network connections are cleanly closed
14+
- Background tasks are terminated properly
15+
16+
These guarantees prevent data corruption and ensure smooth restarts of the system.
17+
18+
## Server Shutdown
19+
20+
### Coordinator Server Shutdown
21+
22+
The **Coordinator Server** uses a multi-stage shutdown process to safely terminate all services in the correct order.
23+
#### Shutdown Process
24+
1. **Shutdown Hook Registration**: The server registers a JVM shutdown hook that triggers graceful shutdown when the JVM receives a termination signal such as `SIGTERM` (shutdown hooks do not run on `SIGKILL`)
25+
2. **Service Termination**: All services are stopped in a specific order to maintain consistency:
26+
27+
**Coordinator Server Shutdown Order:**
28+
1. Server Metric Group → Metric Registry (async)
29+
2. Auto Partition Manager → IO Executor (5s timeout)
30+
3. Coordinator Event Processor → Coordinator Channel Manager
31+
4. RPC Server (async) → Coordinator Service
32+
5. Coordinator Context → Lake Table Tiering Manager
33+
6. ZooKeeper Client → Authorizer
34+
7. Dynamic Config Manager → Lake Catalog Dynamic Loader
35+
8. RPC Client → Client Metric Group
36+
37+
3. **Resource Cleanup**: Executors, connections, and other resources are properly closed
38+
39+
```bash
40+
# Graceful shutdown via SIGTERM
41+
kill -TERM <coordinator-pid>
42+
43+
# Or using the shutdown script (if available)
44+
./bin/stop-coordinator.sh
45+
```
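
When a stop is scripted, it can help to wait for the process to exit before restarting it or reporting success. The following is a minimal sketch in plain shell, not part of Fluss: `wait_for_exit` and `stop_gracefully` are hypothetical helper names, and the 30-second default is an arbitrary choice rather than a Fluss setting.

```bash
# Hypothetical helpers for illustration; not shipped with Fluss.

# Poll until PID is gone or TIMEOUT seconds have elapsed.
# Returns 0 if the process exited, 1 on timeout.
wait_for_exit() {
  local pid="$1" timeout="${2:-30}" waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Send SIGTERM, then wait for the process to finish shutting down.
stop_gracefully() {
  local pid="$1" timeout="${2:-30}"
  # Nothing to do if the signal cannot be delivered
  # (for example, the process is already gone).
  kill -TERM "$pid" 2>/dev/null || return 0
  if wait_for_exit "$pid" "$timeout"; then
    echo "process $pid exited"
  else
    echo "process $pid still running after ${timeout}s" >&2
    return 1
  fi
}
```

For example, `stop_gracefully <coordinator-pid> 60` sends `SIGTERM` and waits up to a minute. The sketch deliberately does not escalate to `kill -KILL`, because a forced kill bypasses the shutdown hook.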
46+
47+
### Tablet Server Shutdown
48+
49+
The **Tablet Server** supports a **controlled shutdown process** designed to minimize data unavailability and ensure leadership handover before termination.
50+
51+
**Shutdown Order:**
52+
1. Tablet Server Metric Group → Metric Registry (async)
53+
2. RPC Server (async) → Tablet Service
54+
3. ZooKeeper Client → RPC Client → Client Metric Group
55+
4. Scheduler → KV Manager → Remote Log Manager
56+
5. Log Manager → Replica Manager
57+
6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
58+
59+
#### Controlled Shutdown Process
60+
61+
1. **Leadership Transfer**: The server attempts to transfer leadership of all buckets it leads to other replicas
62+
2. **Retry Logic**: If leadership transfer fails, the server retries with configurable intervals
63+
3. **Timeout Handling**: If leadership transfer still fails after the maximum number of retries, the server proceeds with an unclean shutdown
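
The retry-then-give-up pattern in the steps above can be sketched generically in shell. This is an illustration of bounded retries with a fixed interval, not the Tablet Server's actual implementation; `retry_with_interval` is a hypothetical helper, and the command it runs stands in for a leadership-transfer attempt.

```bash
# Hypothetical helper illustrating bounded retries; not Fluss code.

# Usage: retry_with_interval MAX_RETRIES INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to MAX_RETRIES times, sleeping INTERVAL_SECONDS between
# failed attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_with_interval() {
  local max_retries="$1" interval="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_retries" ]; do
    if "$@"; then
      echo "attempt $attempt succeeded"
      return 0
    fi
    echo "attempt $attempt of $max_retries failed" >&2
    # Only sleep when another attempt will follow.
    if [ "$attempt" -lt "$max_retries" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A caller would treat a non-zero return as the point where it falls back to the unclean path, for example `retry_with_interval 3 1 some_transfer_command || echo "falling back"`, where `some_transfer_command` is a placeholder.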
64+
65+
```bash
66+
# Initiate controlled shutdown
67+
kill -TERM <tablet-server-pid>
68+
```
69+
70+
#### Configuration Options
71+
72+
- **Controlled Shutdown Retries**: Number of attempts to transfer leadership (default: `3`)
72+
- **Retry Interval**: Time between retry attempts, in milliseconds (default: `1000`)
74+
75+
## Monitoring Shutdown
76+
77+
### Logging
78+
79+
Fluss logs shutdown progress at the following levels:
80+
81+
- **INFO**: Normal shutdown progress
82+
- **WARN**: Retry attempts or timeout warnings
83+
- **ERROR**: Shutdown failures or exceptions
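
A quick way to check a shutdown after the fact is to count log lines per level. The sketch below assumes that each log line contains its level as a standalone word (for example `... WARN ...`); the actual Fluss log layout and log file location depend on your logging configuration, and `count_level` and `summarize_shutdown_log` are hypothetical helper names.

```bash
# Hypothetical helpers; assume the level appears as a standalone word per line.

# Print the number of lines in LOGFILE that contain LEVEL as a whole word.
count_level() {
  local level="$1" logfile="$2"
  # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
  grep -c -w -- "$level" "$logfile" || true
}

# Print a one-line summary such as "INFO=12 WARN=1 ERROR=0".
summarize_shutdown_log() {
  local logfile="$1"
  echo "INFO=$(count_level INFO "$logfile") WARN=$(count_level WARN "$logfile") ERROR=$(count_level ERROR "$logfile")"
}
```

A non-zero `WARN` or `ERROR` count is a prompt to read those lines, not a verdict by itself, since a level name can also appear inside a message.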
84+
85+
### Metrics
86+
87+
Monitor shutdown-related metrics:
88+
89+
- Shutdown duration
90+
- Failed shutdown attempts
91+
- Resource cleanup status
92+
93+
## Troubleshooting
94+
95+
### Common Issues
96+
| Issue | Possible Causes | Recommended Actions |
97+
| -------------------- | --------------------------------------------------------------- | ------------------------------------------------------------------------------- |
98+
| **Hanging shutdown** | Blocking operations, thread pool misconfiguration, or deadlocks | Check for blocking calls without timeouts, inspect thread dumps |
99+
| **Resource leaks** | Unclosed resources or connections | Verify all `AutoCloseable` resources and file handles are closed |
100+
| **Data loss** | Unclean shutdown or failed leadership transfer | Always use controlled shutdown for Tablet Servers and verify replication factor |
101+
102+
### Debug Steps
103+
104+
1. Enable debug logging for shutdown components
105+
2. Monitor JVM thread dumps during shutdown
106+
3. Check system resource usage
107+
4. Verify network connection states
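
For step 2, taking several thread dumps a few seconds apart makes it easier to see which threads stay stuck in the same frame. The sketch below uses the JDK's `jstack <pid>` tool by default; `collect_thread_dumps` and the `DUMP_CMD` override are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper; relies on the JDK's jstack unless DUMP_CMD is set.

# Usage: collect_thread_dumps PID COUNT INTERVAL_SECONDS OUT_DIR
# Writes COUNT dumps to OUT_DIR/threads-1.txt, threads-2.txt, ...
collect_thread_dumps() {
  local pid="$1" count="$2" interval="$3" out_dir="$4" i=1
  local dump_cmd="${DUMP_CMD:-jstack}"
  mkdir -p "$out_dir"
  while [ "$i" -le "$count" ]; do
    # Keep going even if one dump fails; the error text lands in the file.
    "$dump_cmd" "$pid" > "$out_dir/threads-$i.txt" 2>&1 || true
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `collect_thread_dumps <tablet-server-pid> 3 5 /tmp/dumps` takes three dumps five seconds apart. Threads that show the same blocked stack in every file are the first candidates to inspect.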
108+
109+
## Configuration Reference
110+
111+
| Configuration | Description | Default |
112+
|---------------|-------------|---------|
113+
| `controlled.shutdown.max.retries` | Maximum retries for controlled shutdown | 3 |
114+
| `controlled.shutdown.retry.interval.ms` | Interval between retry attempts | 5000 |
115+
| `shutdown.timeout.ms` | General shutdown timeout | 30000 |
116+
117+
## See Also
118+
119+
- [Configuration](../configuration.md)
120+
- [Monitoring and Observability](../observability/monitor-metrics.md)
121+
- [Upgrading Fluss](upgrading.md)
