# Connection Pool

## Overview

The load-balancing algorithm is designed to optimize the allocation and
management of database connections in a way that maximizes Quality of Service
(QoS). This involves minimizing the overall time spent on connecting and
reconnecting (connection efficiency) while ensuring that latencies remain
similar across different streams of connections (fairness).

## Architecture

This library is split into four major components:

 1. The low-level blocks/block, connections, and metrics code. This code
    creates, destroys and transfers connections without understanding of
    policies, quotas or any sort of algorithm. We ensure that the blocks and
    metrics are reliable, and use this as a building block for our pool.
 2. The algorithm. This performs planning operations for acquisition, release
    and rebalancing of the pool. The algorithm does not perform operations, but
    rather informs the caller what it should do.
 3. The pool itself. This drives the blocks and the connector interface, and
    polls the algorithm to plan next steps during acquisition, release and
    during the timer-based planning callback.
 4. The Python integration code. This is behind an optional feature, and exposes
    a PyO3-based interface that allows a connection factory to be implemented in
    Python.
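
A rough sketch of the split between the algorithm (component 2) and the pool
(component 3), pictured as a planner that the pool polls for actions. This is
illustrative only: the `Planner` trait, the `PlanAction` enum and their methods
are assumptions, not this library's actual API.

```rust
/// What the pool should do next, as decided by the planning algorithm.
/// The algorithm only *plans*; the pool executes.
enum PlanAction {
    /// Create a brand-new connection for this block.
    Create { block: usize },
    /// Move an idle connection from `victim` to `block`.
    Steal { block: usize, victim: usize },
    /// Close an idle connection held by this block.
    Close { block: usize },
    /// Nothing to do right now.
    Wait,
}

/// Polled by the pool during acquisition, release, and the timer-based
/// planning callback; it never creates or destroys connections itself.
trait Planner {
    fn plan_acquire(&mut self, block: usize) -> PlanAction;
    fn plan_release(&mut self, block: usize) -> PlanAction;
    fn plan_rebalance(&mut self) -> Vec<PlanAction>;
}
```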

## Details

Demand for connections is measured in terms of “database time,” which is
calculated as the product of the number of connections and the average hold time
of these connections. This metric provides a basis for determining how resources
should be distributed among different database blocks to meet their needs
effectively.
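
As a concrete example of the metric, here is a minimal sketch of the “database
time” calculation, assuming demand is tracked as a connection count plus an
average hold time (the function name is an illustration):

```rust
use std::time::Duration;

/// "Database time" demanded by a block: the number of connections in
/// demand multiplied by the average time each connection is held.
fn demand(connections_in_demand: u32, avg_hold_time: Duration) -> Duration {
    avg_hold_time * connections_in_demand
}

fn main() {
    // A block asking for 8 connections held ~50ms each demands ~400ms of
    // sequential database time.
    assert_eq!(demand(8, Duration::from_millis(50)), Duration::from_millis(400));
}
```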

To maximize QoS, the algorithm aims to minimize the time spent on managing
connections and keep the latencies low and uniform across various connection
streams. This involves allocation strategies that balance the immediate needs of
different database blocks with the overall system capacity and future demand
predictions.

When a connection is acquired, the system may be in a state where the pool is
not currently constrained by demand. In such cases, connections can be allocated
greedily without complex balancing, as there are sufficient resources to meet
all demands. This allows for quick connection handling without additional
overhead.
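
A minimal sketch of this fast path, assuming the pool tracks an in-use count
against a fixed total capacity (both names are assumptions):

```rust
/// Illustrative pool state: connections in use versus total capacity.
struct Pool {
    in_use: usize,
    max_total: usize,
}

impl Pool {
    /// While total usage is below capacity, the pool is unconstrained and a
    /// connection can be granted greedily, with no balancing work at all.
    fn try_greedy_acquire(&mut self) -> bool {
        if self.in_use < self.max_total {
            self.in_use += 1;
            true
        } else {
            false // constrained: fall through to the stealing algorithm
        }
    }
}
```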

When the pool is constrained, the “stealing” algorithm aims to transfer
connections from less utilized or idle database blocks (victims) to those
experiencing high demand (hunger) to ensure efficient resource use and maintain
QoS. A victim block is chosen based on its idle state, characterized by holding
connections but having low or no immediate demand for them.
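
Victim selection might look like the following sketch, assuming per-block
counters for idle connections and current demand (the struct and field names
are assumptions):

```rust
/// Illustrative per-block counters used to pick a victim.
struct BlockStats {
    idle_connections: usize,
    demand: usize, // e.g. waiters, or estimated database time
}

/// A victim must actually hold idle connections; among candidates, the one
/// with the least immediate demand gives up a connection.
fn choose_victim(blocks: &[BlockStats]) -> Option<usize> {
    blocks
        .iter()
        .enumerate()
        .filter(|(_, b)| b.idle_connections > 0)
        .min_by_key(|(_, b)| b.demand)
        .map(|(i, _)| i)
}
```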

Upon releasing a connection, the algorithm evaluates which backend (database
block) needs the connection the most (the hungriest). This decision is based on
current demand, wait times, and historical usage patterns. By reallocating
connections to the blocks that need them most, the algorithm ensures that
resources are utilized efficiently and effectively.
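
A sketch of the “hungriest” choice, under the assumption that blocks rank first
by waiting acquires and then by estimated demand (names are illustrative):

```rust
/// Illustrative per-block hunger inputs.
struct Block {
    waiters: usize,     // acquires currently blocked on this block
    est_demand_ms: u64, // estimated sequential database time needed
}

/// Rank blocks by (waiters, estimated demand); a block with neither has no
/// claim on the released connection.
fn hungriest(blocks: &[Block]) -> Option<usize> {
    blocks
        .iter()
        .enumerate()
        .filter(|(_, b)| b.waiters > 0 || b.est_demand_ms > 0)
        .max_by_key(|(_, b)| (b.waiters, b.est_demand_ms))
        .map(|(i, _)| i)
}
```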

Unused connection capacity is eventually reclaimed to prevent wastage. The
algorithm includes mechanisms to identify and collect these idle connections,
redistributing them to blocks with higher demand or returning them to the pool
for future use. This helps maintain an optimal number of active connections,
reducing unnecessary resource consumption.
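
Reclamation can be sketched as a periodic sweep over idle connections, assuming
an idle-timeout cutoff (the cutoff and names are assumptions, not documented
configuration):

```rust
use std::time::{Duration, Instant};

/// An idle connection and when it last saw use.
struct IdleConn {
    idle_since: Instant,
}

/// Drop connections idle longer than `max_idle`, returning how many were
/// reclaimed so their capacity can be redistributed.
fn reclaim(idle: &mut Vec<IdleConn>, max_idle: Duration) -> usize {
    let before = idle.len();
    idle.retain(|c| c.idle_since.elapsed() < max_idle);
    before - idle.len()
}
```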

To avoid excessive thrashing, the algorithm ensures that connections are held
for a minimum period: the longer of the time it takes to reconnect to the
database and a configured minimum threshold. This reduces the frequency of
reallocation, preventing performance degradation due to constant connection
churn and ensuring that blocks can maintain stable and predictable access to
resources.
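
The anti-thrash rule reduces to a simple eligibility check, assuming the pool
tracks how long a connection has been held, the measured reconnect time, and a
configured floor (names are illustrative):

```rust
use std::time::Duration;

/// A connection only becomes eligible for transfer once it has been held
/// for the longer of the measured reconnect time and a configured floor,
/// so blocks are not churned constantly.
fn eligible_for_transfer(
    held_for: Duration,
    reconnect_time: Duration,
    configured_min: Duration,
) -> bool {
    held_for >= reconnect_time.max(configured_min)
}
```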

## Detailed Algorithm

The algorithm is designed to 1) maximize time spent running queries in a
database and 2) minimize latency of queries waiting for their turn to run. These
goals may be in conflict at times. We do this by minimizing the time spent
switching between databases, which is considered "dead time", as the database
is not actively performing operations.

The demand for a connection is based on estimated total sequential processing
time. We use the average time that a connection is held, multiplied by the
number of connections in demand, as a rough estimate of how much total
sequential time a certain block will demand in the future.

At a regular interval, we compute two items for each block: a quota, and a
"hunger" metric. The hunger metric may indicate that a block is "hungry"
(wanting more connections), "satisfied" (having the expected number of
connections) or "overfull" (holding more connections than it should). The
"hungry" score is determined by the estimated total sequential time needed for
a block. The "overfull" score is determined by the number of extra connections
held by this block, in combination with how old the longest-held connection is.
The quota is determined by the connection rate.
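
One way to picture the per-interval planning state is the sketch below; the
scoring inputs follow the description above, but the types and names are
illustrative:

```rust
use std::time::Duration;

/// Illustrative hunger states computed for each block at every interval.
enum Hunger {
    /// Wants more connections; scored by estimated sequential time needed.
    Hungry { needed_time: Duration },
    /// Holds the expected number of connections.
    Satisfied,
    /// Holds extra connections; scored by how many are extra and how old
    /// the longest-held connection is.
    Overfull { extra: usize, oldest_held: Duration },
}

/// Per-block planning output: a quota derived from the connection rate,
/// plus the hunger state.
struct BlockPlan {
    quota: usize,
    hunger: Hunger,
}
```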

We then use the hunger metric and quota in an attempt to rebalance the pool
proactively, ensuring that the connection capacity of each block reflects its
most recent demand profile. Blocks are sorted into a list of hungry blocks and
a list of overfull blocks, and we attempt to transfer connections from the most
overfull to the hungriest until we run out of either list. We may not be able
to perform the rebalance fully because of block activity that cannot be
interrupted.
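
The rebalance pass can be sketched as walking both sorted lists in tandem; the
`transfer` callback stands in for the pool actually moving a connection and may
fail when a block's activity cannot be interrupted (all names are assumptions):

```rust
/// Pair the hungriest block with the most overfull one, transferring a
/// connection for each pair until either list runs out.
fn rebalance(
    hungry: &mut Vec<usize>,   // block ids, hungriest first
    overfull: &mut Vec<usize>, // block ids, most overfull first
    mut transfer: impl FnMut(usize, usize) -> bool, // (from, to) -> succeeded
) {
    while !hungry.is_empty() && !overfull.is_empty() {
        let to = hungry.remove(0);
        let from = overfull.remove(0);
        // The transfer may fail if the victim's activity cannot be
        // interrupted; the pass simply moves on to the next pair.
        let _ = transfer(from, to);
    }
}
```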

If a connection is requested for a block that is hungry, it is allowed to steal
a connection from the block that is most overfull and has idle connections.
Because the "overfull" score is calculated in part from the age of the
longest-held connection, this minimizes context switching.

When a connection is released, we choose what happens based on its state. If
acquires are waiting on this block, we return the connection to the block to be
re-used immediately. If no acquires are waiting but the block is hungry, we
also return it. If the block is satisfied or overfull and we have hungry blocks
waiting, we transfer it to a hungry block that has waiters.
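
The release decision tree maps directly onto a small function; the enum and
parameter names are illustrative:

```rust
/// Illustrative outcome of releasing a connection.
enum ReleaseAction {
    /// Hand it to a waiter on this block, or keep it here for re-use.
    ReturnToBlock,
    /// Move it to a hungry block that has waiters.
    TransferTo(usize),
    /// No one needs it right now; leave it idle for later reclamation.
    Idle,
}

fn on_release(
    has_waiters: bool,
    is_hungry: bool,
    hungry_block_with_waiters: Option<usize>,
) -> ReleaseAction {
    if has_waiters || is_hungry {
        ReleaseAction::ReturnToBlock
    } else if let Some(block) = hungry_block_with_waiters {
        // Block is satisfied or overfull; someone else needs it more.
        ReleaseAction::TransferTo(block)
    } else {
        ReleaseAction::Idle
    }
}
```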

## Error Handling

The pool will attempt to provide a connection where possible, but connection
operations may not always be reliable. The error for a connection failure will
be routed through the acquire operation if the pool detects there are no other
potential sources of a connection for that acquire. Sources for a connection
may be a currently-connecting connection, a reconnecting connection, a
connection that is actively held by someone else, or a connection that is
sitting idle.

The pool does not currently retry; retry logic should be included in the
connect operation.
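
The routing rule can be sketched as a check over the potential sources listed
above; the struct and field names are assumptions for illustration:

```rust
/// Potential sources that might still satisfy a pending acquire.
struct BlockSources {
    connecting: usize,   // connections currently being established
    reconnecting: usize, // connections mid-reconnect
    held: usize,         // connections actively held by someone else
    idle: usize,         // connections sitting idle
}

/// A connect failure is surfaced through the acquire operation only when
/// no other potential source remains for that acquire.
fn route_error_to_acquirer(s: &BlockSources) -> bool {
    s.connecting + s.reconnecting + s.held + s.idle == 0
}
```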