- Database Sharding is the process of splitting a large database into smaller, more manageable pieces called shards, each stored on a separate server.
- Each shard holds a subset of the data and can operate independently, improving scalability and performance.
- Handle Large Databases: Distribute huge datasets that can’t fit on a single server.
- Improve Performance: Reduce the query load on any single database instance.
- Horizontal Scalability: Add new servers to handle increasing traffic or data volume.
- Fault Isolation: Issues in one shard do not impact the entire system.
- Mechanism: Horizontal sharding distributes the rows of a table across multiple shards based on a shard key.
- Example: Users with `user_id 1-10000` go to Shard 1, `10001-20000` go to Shard 2.
- Pros:
  - Simple and widely used.
  - Scales well for large datasets.
- Cons:
  - Complex queries that span multiple shards (cross-shard joins) can be difficult.
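The `user_id` example above can be sketched in a few lines of Python. This is only an illustration under the assumption that every shard owns a fixed block of 10,000 ids; `ROWS_PER_SHARD`, `shard_for_user`, and `partition_rows` are hypothetical names, not part of any database API.

```python
from collections import defaultdict

ROWS_PER_SHARD = 10_000  # assumption: each shard owns a block of 10,000 user ids


def shard_for_user(user_id):
    """Return the 1-based shard number that holds the row for user_id."""
    if user_id < 1:
        raise ValueError("user_id must be a positive integer")
    # ids 1-10000 map to shard 1, 10001-20000 to shard 2, and so on
    return (user_id - 1) // ROWS_PER_SHARD + 1


def partition_rows(rows):
    """Group rows (dicts with a 'user_id' key) by the shard that owns them."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for_user(row["user_id"])].append(row)
    return dict(shards)
```

Each row lands on exactly one shard, so a lookup by `user_id` touches a single server, while a query without the shard key has to visit all of them.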
- Mechanism: Vertical sharding splits tables by columns, grouping related columns into different shards.
- Example: User authentication data in Shard 1, user profile data in Shard 2.
- Pros:
  - Reduces I/O and memory footprint per shard.
  - Useful when different columns have different access patterns.
- Cons:
  - Requires joining data across shards for some queries.
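A rough sketch of the column split in the example above, with two in-memory dictionaries standing in for the two shards. The column names (`email`, `password_hash`, `display_name`, `bio`) and the helper functions are assumptions made for illustration.

```python
AUTH_COLUMNS = {"email", "password_hash"}   # assumption: columns kept on Shard 1
PROFILE_COLUMNS = {"display_name", "bio"}   # assumption: columns kept on Shard 2

auth_shard = {}      # stands in for Shard 1, keyed by user_id
profile_shard = {}   # stands in for Shard 2, keyed by user_id


def save_user(user_id, record):
    """Split one logical row by column and write each part to its shard."""
    auth_shard[user_id] = {k: v for k, v in record.items() if k in AUTH_COLUMNS}
    profile_shard[user_id] = {k: v for k, v in record.items() if k in PROFILE_COLUMNS}


def load_full_user(user_id):
    """Rebuild the full row; this needs one read per shard (an application-side join)."""
    return {**auth_shard[user_id], **profile_shard[user_id]}
```

A login check only reads `auth_shard`, which is where the I/O saving comes from; `load_full_user` shows the cost on the other side, since it has to read both shards.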
- Mechanism: Directory-based sharding uses a lookup table to determine which shard contains a specific piece of data.
- Pros:
  - Flexible, can implement custom rules.
- Cons:
  - Lookup table can become a bottleneck or single point of failure.
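The lookup table can be pictured as a small mapping from key to shard name. The class below is a minimal in-memory sketch; `ShardDirectory`, its methods, and the shard names are hypothetical, and a real deployment would presumably keep the table in a replicated store and cache it.

```python
class ShardDirectory:
    """In-memory lookup table mapping a key to the name of the shard that holds it."""

    def __init__(self, default_shard):
        self._table = {}
        self._default = default_shard

    def assign(self, key, shard):
        """Record (or change) which shard owns the key."""
        self._table[key] = shard

    def locate(self, key):
        """Return the owning shard, falling back to the default for unknown keys."""
        return self._table.get(key, self._default)


directory = ShardDirectory(default_shard="shard-1")
directory.assign("tenant-42", "shard-3")  # custom rule: pin one tenant to its own shard
```

Every read and write calls `locate` first, which is why the directory itself can turn into the bottleneck or single point of failure noted above.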
- Mechanism: Key-based (hash) sharding applies a hash function to the shard key to determine the shard.
- Example: `hash(user_id) % number_of_shards`
- Pros:
  - Even distribution of data.
  - Simple to calculate shard location.
- Cons:
  - Adding or removing shards changes `number_of_shards`, so most keys map to a different shard and their data must be moved (resharding).
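The formula above, and the resharding cost it implies, can be sketched as follows. The sketch uses SHA-256 from `hashlib` as the hash function because Python's built-in `hash()` is randomized per process for strings; that choice, and the names `shard_for` and `moved_fraction`, are assumptions for illustration.

```python
import hashlib


def shard_for(user_id, number_of_shards):
    """Map user_id to a shard index in [0, number_of_shards)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash
    return int.from_bytes(digest[:8], "big") % number_of_shards


def moved_fraction(user_ids, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    user_ids = list(user_ids)
    moved = sum(
        1 for u in user_ids if shard_for(u, old_count) != shard_for(u, new_count)
    )
    return moved / len(user_ids)
```

Calling `moved_fraction(range(1, 10001), 4, 5)` gives a feel for the cost: with a uniform hash, about four out of five keys are expected to change shards when going from 4 to 5 shards.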
- Mechanism: Range-based sharding assigns data to shards based on ranges of the shard key.
- Example: Users `A-F` in Shard 1, `G-L` in Shard 2.
- Pros:
  - Supports range queries efficiently.
- Cons:
  - Hotspots can occur if data is not uniformly distributed.
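A range lookup like the one in the example can be sketched with a sorted list of range boundaries and a binary search. The first two ranges come from the example; the `M-R` and `S-Z` ranges, and the function names, are assumed here to make the sketch complete.

```python
import bisect

# Inclusive upper bound of each range, in sorted order. "F" and "L" come from
# the example above; "R" and "Z" are assumptions for illustration.
UPPER_BOUNDS = ["F", "L", "R", "Z"]


def shard_for_name(name):
    """Return the 1-based shard for a name, based on its first letter."""
    first = name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"unsupported name: {name!r}")
    # Index of the first range whose upper bound is >= the letter
    return bisect.bisect_left(UPPER_BOUNDS, first) + 1


def shards_for_range(start, end):
    """Shards that a range query from start to end (inclusive) must touch."""
    return list(range(shard_for_name(start), shard_for_name(end) + 1))
```

Because neighbouring keys live together, a range query only visits the shards returned by `shards_for_range`; the flip side is that a popular letter range sends all of its traffic to one shard.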
- Scalability: Horizontal scaling by adding more shards.
- Performance: Queries hit smaller datasets, reducing latency.
- High Availability: Shards can be replicated individually.
- Fault Isolation: Failure in one shard doesn’t impact others.
- Cross-Shard Queries: Joins or aggregations across shards are complex.
- Rebalancing: Adding/removing shards requires moving data.
- Shard Key Selection: Poor choice leads to hotspots and uneven load.
- Backup & Restore Complexity: Each shard needs to be backed up separately.
- Application Complexity: Sharding logic often moves into the application layer.
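To make the cross-shard query problem concrete, here is a minimal scatter-gather sketch: the application asks every shard for a partial result and combines them itself. The shards are modelled as plain lists, and `SHARDS`, `shard_sum`, and `total_revenue` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: each shard is modelled as a list of order rows; a real system
# would send a query to each shard's database instead.
SHARDS = [
    [{"user_id": 1, "total": 30}, {"user_id": 2, "total": 15}],
    [{"user_id": 10001, "total": 50}],
    [],
]


def shard_sum(rows):
    """Partial aggregate computed on one shard."""
    return sum(row["total"] for row in rows)


def total_revenue(shards):
    """Scatter the partial query to every shard, then gather and combine."""
    with ThreadPoolExecutor(max_workers=max(len(shards), 1)) as pool:
        partials = list(pool.map(shard_sum, shards))
    return sum(partials)
```

Even this simple sum needs fan-out, waiting on the slowest shard, and merge logic in the application, which is the extra complexity the list above describes.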
- Choose a shard key that ensures even data distribution.
- Keep shard sizes balanced to avoid hotspots.
- Combine sharding with replication for high availability.
- Use sharding middleware (e.g., Vitess for MySQL) or database-native sharding features (e.g., MongoDB, Cassandra).
- Avoid cross-shard operations when possible to maintain performance.
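One simple way to check whether shards are balanced is to compare the largest shard against the average. The helper below is a sketch under that assumption; `skew_ratio` is a hypothetical name, and the threshold at which a ratio counts as a hotspot is left to the operator.

```python
def skew_ratio(rows_per_shard):
    """Largest shard size divided by the mean shard size (1.0 means perfectly even)."""
    counts = list(rows_per_shard)
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-empty shard")
    mean = sum(counts) / len(counts)
    return max(counts) / mean
```

Row counts of `[100, 100, 100]` give a ratio of 1.0, while a lopsided `[300, 50, 50]` gives a ratio above 2, which suggests the shard key or the range boundaries need another look.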
| Sharding Type | Data Split | Pros | Cons |
|---|---|---|---|
| Horizontal | Rows | Easy, scalable | Cross-shard joins are complex |
| Vertical | Columns | Reduces I/O per shard | Joins required across shards |
| Directory-Based | Custom lookup | Flexible, customizable | Lookup table can be a bottleneck |
| Key-Based (Hash) | Hash function | Even distribution, simple | Resharding needed when the shard count changes |
| Range-Based | Range of keys | Efficient for range queries | Hotspots if data not uniform |