|
| 1 | +# Graceful Shutdown |
| 2 | + |
| 3 | +Apache Fluss provides a **comprehensive graceful shutdown mechanism** to ensure data integrity and proper resource cleanup when stopping servers or services. |
| 4 | + |
| 5 | +This guide describes the shutdown procedures, configuration options, and best practices for each Fluss component. |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +Graceful shutdown in Fluss ensures that: |
| 10 | +- All ongoing operations complete safely |
| 11 | +- Resources are properly released |
| 12 | +- Data consistency is maintained |
| 13 | +- Network connections are cleanly closed |
| 14 | +- Background tasks are terminated properly |
| 15 | + |
| 16 | +These guarantees prevent data corruption and ensure smooth restarts of the system. |
| 17 | + |
| 18 | +## Server Shutdown |
| 19 | + |
| 20 | +### Coordinator Server Shutdown |
| 21 | + |
| 22 | +The **Coordinator Server** uses a multi-stage shutdown process to safely terminate all services in the correct order. |
| 23 | +#### Shutdown Process |
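| 24 | +1. **Shutdown Hook Registration**: The server registers a JVM shutdown hook that triggers graceful shutdown when the process is asked to terminate, for example on SIGTERM. JVM shutdown hooks do not run on SIGKILL (`kill -9`), so a forced kill bypasses this process. |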
| 25 | +2. **Service Termination**: All services are stopped in a specific order to maintain consistency: |
| 26 | + |
| 27 | + **Coordinator Server Shutdown Order:** |
| 28 | + 1. Server Metric Group → Metric Registry (async) |
| 29 | + 2. Auto Partition Manager → IO Executor (5s timeout) |
| 30 | + 3. Coordinator Event Processor → Coordinator Channel Manager |
| 31 | + 4. RPC Server (async) → Coordinator Service |
| 32 | + 5. Coordinator Context → Lake Table Tiering Manager |
| 33 | + 6. ZooKeeper Client → Authorizer |
| 34 | + 7. Dynamic Config Manager → Lake Catalog Dynamic Loader |
| 35 | + 8. RPC Client → Client Metric Group |
| 36 | + |
| 37 | +3. **Resource Cleanup**: Executors, connections, and other remaining resources are closed. |
| 38 | + |
| 39 | +```bash |
| 40 | +# Graceful shutdown via SIGTERM |
| 41 | +kill -TERM <coordinator-pid> |
| 42 | + |
| 43 | +# Or using the shutdown script (if available) |
| 44 | +./bin/stop-coordinator.sh |
| 45 | +``` |
| 46 | + |
| 47 | +### Tablet Server Shutdown |
| 48 | + |
| 49 | +The **Tablet Server** supports a **controlled shutdown process** designed to minimize data unavailability and ensure leadership handover before termination. |
| 50 | + |
| 51 | +**Shutdown Order:** |
| 52 | +1. Tablet Server Metric Group → Metric Registry (async) |
| 53 | +2. RPC Server (async) → Tablet Service |
| 54 | +3. ZooKeeper Client → RPC Client → Client Metric Group |
| 55 | +4. Scheduler → KV Manager → Remote Log Manager |
| 56 | +5. Log Manager → Replica Manager |
| 57 | +6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader |
| 58 | + |
| 59 | +#### Controlled Shutdown Process |
| 60 | + |
| 61 | +1. **Leadership Transfer**: The server attempts to transfer leadership of every bucket it currently leads to another replica. |
| 62 | +2. **Retry Logic**: If a leadership transfer fails, the server retries at a configurable interval. |
| 63 | +3. **Timeout Handling**: Once the maximum number of retries is exhausted, the server proceeds with an unclean shutdown. |
| 64 | + |
| 65 | +```bash |
| 66 | +# Initiate controlled shutdown |
| 67 | +kill -TERM <tablet-server-pid> |
| 68 | +``` |
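
Because leadership transfer can be retried several times, the process may keep running for a while after it receives SIGTERM. The following is a minimal sketch of a wrapper that sends the signal and then waits for the process to exit. The function name `stop_and_wait` and its 60-second default timeout are illustrative assumptions and are not part of Fluss. The sketch deliberately never escalates to SIGKILL, since that would bypass the graceful shutdown path.

```bash
# Send SIGTERM to a process and wait for it to exit.
# Usage: stop_and_wait <pid> [timeout-seconds]
# Returns 0 if the process exited, 1 if it is still running after the
# timeout, and 2 if the signal could not be delivered.
# NOTE: stop_and_wait and its default timeout are hypothetical helpers,
# not something shipped with Fluss.
stop_and_wait() {
  local pid="$1"
  local timeout="${2:-60}"   # assumed default; size it to your retry settings
  local waited=0

  # Ask the process to shut down gracefully.
  kill -TERM "$pid" 2>/dev/null || return 2

  # Signal 0 sends nothing; it only checks whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A caller could run `stop_and_wait <tablet-server-pid> 120` and treat a return code of 1 as a hanging shutdown that warrants a thread dump before any further action.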
| 69 | + |
| 70 | +#### Configuration Options |
| 71 | + |
| 72 | +- **Controlled Shutdown Retries**: Maximum number of attempts to transfer leadership (default: `3`) |
| 73 | +- **Retry Interval**: Time between retry attempts in milliseconds (default: `1000`) |
| 74 | + |
| 75 | +## Monitoring Shutdown |
| 76 | + |
| 77 | +### Logging |
| 78 | + |
| 79 | +Fluss logs shutdown progress at the following levels: |
| 80 | + |
| 81 | +- **INFO**: Normal shutdown progress |
| 82 | +- **WARN**: Retry attempts or timeout warnings |
| 83 | +- **ERROR**: Shutdown failures or exceptions |
| 84 | + |
| 85 | +### Metrics |
| 86 | + |
| 87 | +Monitor shutdown-related metrics: |
| 88 | + |
| 89 | +- Shutdown duration |
| 90 | +- Failed shutdown attempts |
| 91 | +- Resource cleanup status |
| 92 | + |
| 93 | +## Troubleshooting |
| 94 | + |
| 95 | +### Common Issues |
| 96 | +| Issue | Possible Causes | Recommended Actions | |
| 97 | +| -------------------- | --------------------------------------------------------------- | ------------------------------------------------------------------------------- | |
| 98 | +| **Hanging shutdown** | Blocking operations, thread pool misconfiguration, or deadlocks | Check for blocking calls without timeouts, inspect thread dumps | |
| 99 | +| **Resource leaks** | Unclosed resources or connections | Verify all `AutoCloseable` resources and file handles are closed | |
| 100 | +| **Data loss** | Unclean shutdown or failed leadership transfer | Always use controlled shutdown for Tablet Servers and verify replication factor | |
| 101 | + |
| 102 | +### Debug Steps |
| 103 | + |
| 104 | +1. Enable debug logging for shutdown components |
| 105 | +2. Monitor JVM thread dumps during shutdown |
| 106 | +3. Check system resource usage |
| 107 | +4. Verify network connection states |
| 108 | + |
| 109 | +## Configuration Reference |
| 110 | + |
| 111 | +| Configuration | Description | Default | |
| 112 | +|---------------|-------------|---------| |
| 113 | +| `controlled.shutdown.max.retries` | Maximum retries for controlled shutdown | 3 | |
| 114 | | `controlled.shutdown.retry.interval.ms` | Interval between retry attempts, in milliseconds | 5000 | |
| 115 | | `shutdown.timeout.ms` | General shutdown timeout, in milliseconds | 30000 | |
| 116 | + |
| 117 | +## See Also |
| 118 | + |
| 119 | +- [Configuration](../configuration.md) |
| 120 | +- [Monitoring and Observability](../observability/monitor-metrics.md) |
| 121 | +- [Upgrading Fluss](upgrading.md) |