
Database Sharding

Definition

Database sharding is the process of splitting a large database into smaller, more manageable pieces called shards, each typically stored on a separate server.
Each shard holds a subset of the data and can serve queries independently, which improves scalability and performance.

Why Sharding is Needed

Handle Large Databases: Distribute huge datasets that can’t fit on a single server.
Improve Performance: Reduce the query load on any single database instance.
Horizontal Scalability: Add new servers to handle increasing traffic or data volume.
Fault Isolation: Issues in one shard do not impact the entire system.

Types of Sharding

1. Horizontal Sharding (Data Partitioning by Rows)

Mechanism: Distributes rows of a table across multiple shards based on a shard key.
Example: Users with user_id 1-10000 go to Shard 1; users with user_id 10001-20000 go to Shard 2.
Pros:
- Simple and widely used.
- Scales well for large datasets.
Cons:
- Complex queries that span multiple shards (cross-shard joins) can be difficult.
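The fixed-width ranges in the example above reduce to integer arithmetic. A minimal sketch, assuming 1-based user IDs; the names ROWS_PER_SHARD and shard_for_user are hypothetical:

```python
ROWS_PER_SHARD = 10_000  # assumed width, taken from the example above


def shard_for_user(user_id: int) -> int:
    """Return the 1-based shard number that holds a 1-based user_id."""
    if user_id < 1:
        raise ValueError("user_id must be >= 1")
    # Subtract 1 so that 10000 still lands in shard 1 and 10001 in shard 2.
    return (user_id - 1) // ROWS_PER_SHARD + 1
```

For instance, shard_for_user(10000) gives 1 and shard_for_user(10001) gives 2.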

2. Vertical Sharding (Data Partitioning by Columns)

Mechanism: Splits tables by columns, grouping related columns into different shards.
Example: User authentication data in Shard 1, user profile data in Shard 2.
Pros:
- Reduces I/O and memory footprint per shard.
- Useful when different columns have different access patterns.
Cons:
- Requires joining data across shards for some queries.
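The split between authentication and profile data, and the join it forces, might look like this. This is only a sketch: two in-memory dicts stand in for the two shards, and all column and function names are made up for illustration.

```python
# Two dicts stand in for two separate shards (an assumption for illustration).
auth_shard = {}     # user_id -> authentication columns
profile_shard = {}  # user_id -> profile columns


def save_user(user_id, email, password_hash, display_name, bio):
    """Write each group of columns to its own shard, keyed by user_id."""
    auth_shard[user_id] = {"email": email, "password_hash": password_hash}
    profile_shard[user_id] = {"display_name": display_name, "bio": bio}


def load_full_user(user_id):
    """Application-side join: one lookup per shard, merged on user_id."""
    return {"user_id": user_id, **auth_shard[user_id], **profile_shard[user_id]}
```

A login check only touches auth_shard, while rendering a full profile page needs both lookups, which is the cross-shard cost listed under Cons.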

3. Directory-Based Sharding

Mechanism: Uses a lookup table to determine which shard contains a specific piece of data.
Pros:
- Flexible, can implement custom rules.
Cons:
- Lookup table can become a bottleneck or single point of failure.
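A lookup table of this kind can be sketched in a few lines. The class name ShardDirectory and the shard labels are hypothetical, and a real directory would live in a replicated store rather than a single in-process dict:

```python
class ShardDirectory:
    """Maps keys to shard names through an explicit lookup table."""

    def __init__(self, default_shard: str):
        self._lookup = {}              # key -> shard name
        self._default = default_shard  # used for keys with no explicit entry

    def assign(self, key, shard: str) -> None:
        """Pin a key to a shard; this is where custom rules are applied."""
        self._lookup[key] = shard

    def locate(self, key) -> str:
        """Return the shard for a key, falling back to the default shard."""
        return self._lookup.get(key, self._default)


directory = ShardDirectory(default_shard="shard-1")
directory.assign("tenant-42", "shard-3")  # e.g. move one large tenant by itself
```

Every read and write goes through locate first, which is why the directory itself can become the bottleneck or single point of failure.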

4. Key-Based (Hash) Sharding

Mechanism: Applies a hash function to the shard key to determine the shard.
Example: hash(user_id) % number_of_shards
Pros:
- Even distribution of data, provided the hash function is uniform and the shard key has many distinct values.
- Simple to calculate shard location.
Cons:
- Adding or removing shards changes the result of the modulo for most keys, so existing data must be rehashed and moved (resharding).
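The formula above and its resharding cost can be sketched as follows. The helper names shard_for and moved_fraction are hypothetical, and SHA-256 is an assumed choice of hash (Python's built-in hash() is randomized per process for strings, so it is a poor fit for routing):

```python
import hashlib


def shard_for(key, num_shards: int) -> int:
    """hash(key) % number_of_shards, using a hash that is stable across runs."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)
```

Going from 4 to 5 shards, a key stays put only when its hash leaves the same remainder under both moduli, so moved_fraction(range(10_000), 4, 5) should come out at roughly four in five. Consistent hashing is a common way to reduce this movement.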

5. Range-Based Sharding

Mechanism: Assigns data to shards based on ranges of the shard key.
Example: Users whose names start with A-F go to Shard 1, G-L to Shard 2.
Pros:
- Supports range queries efficiently.
Cons:
- Hotspots can occur if data is not uniformly distributed.
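Range routing is typically a binary search over the sorted range boundaries. A minimal sketch using the standard bisect module; the A-F and G-L ranges come from the example above, while the M-R and S-Z ranges, the shard labels, and the function names are assumptions added to complete the alphabet:

```python
import bisect

# Inclusive upper bound of each shard's first-letter range, in sorted order.
UPPER_BOUNDS = ["F", "L", "R", "Z"]  # M-R and S-Z are assumed
SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]


def _index_for(name: str) -> int:
    first = name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError("name must start with a letter A-Z")
    # Index of the first upper bound that is >= the first letter.
    return bisect.bisect_left(UPPER_BOUNDS, first)


def shard_for_name(name: str) -> str:
    """Return the shard that holds a single name."""
    return SHARDS[_index_for(name)]


def shards_for_range(low: str, high: str) -> list:
    """Return only the shards a range query from low to high must visit."""
    return SHARDS[_index_for(low):_index_for(high) + 1]
```

A query for names between "Dana" and "Hugo" visits only the first two shards, which is why range queries are efficient here. The same layout also shows the hotspot risk: if most names start with A-F, most traffic lands on one shard.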

Benefits of Sharding

Scalability: Horizontal scaling by adding more shards.
Performance: Queries hit smaller datasets, reducing latency.
High Availability: Shards can be replicated individually.
Fault Isolation: Failure in one shard doesn’t impact others.

Challenges in Sharding

Cross-Shard Queries: Joins or aggregations across shards are complex.
Rebalancing: Adding/removing shards requires moving data.
Shard Key Selection: Poor choice leads to hotspots and uneven load.
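Backup & Restore Complexity: Each shard needs to be backed up separately, and restoring all shards to one consistent point in time is harder than restoring a single database.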
Application Complexity: Sharding logic often moves into the application layer.
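A cross-shard aggregation usually follows a scatter-gather pattern: each shard computes a partial result, and the application merges them. The sketch below uses in-memory lists as stand-in shards with made-up sample rows; the function names are hypothetical:

```python
from collections import Counter

# Each inner list stands in for the rows held by one shard (sample data).
shards = [
    [{"country": "DE", "amount": 10}, {"country": "FR", "amount": 5}],
    [{"country": "DE", "amount": 7}],
    [{"country": "FR", "amount": 1}, {"country": "US", "amount": 4}],
]


def partial_totals(rows) -> Counter:
    """Scatter step: what a single shard would compute locally."""
    totals = Counter()
    for row in rows:
        totals[row["country"]] += row["amount"]
    return totals


def total_by_country(all_shards) -> dict:
    """Gather step: merge the per-shard partial sums in the application."""
    merged = Counter()
    for rows in all_shards:
        merged.update(partial_totals(rows))  # Counter.update adds the counts
    return dict(merged)
```

Sums and counts merge cleanly like this. Joins, sorts with limits, and averages need more care, which is part of the application complexity noted above.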

Best Practices

Choose a shard key that ensures even data distribution.
Keep shard sizes balanced to avoid hotspots.
Combine sharding with replication for high availability.
Use middleware or database-native sharding features (e.g., MongoDB, Cassandra).
Avoid cross-shard operations when possible to maintain performance.

Summary Table

Sharding Type	Data Split	Pros	Cons
Horizontal	Rows	Easy, scalable	Cross-shard joins are complex
Vertical	Columns	Reduces I/O per shard	Joins required across shards
Directory-Based	Custom lookup	Flexible, customizable	Lookup table can be a bottleneck
Key-Based (Hash)	Hash function	Even distribution, simple	Resharding needed on shard change
Range-Based	Range of keys	Efficient for range queries	Hotspots if data not uniform