Database Sharding

Definition

  • Database sharding is the process of splitting a large database into smaller, more manageable pieces called shards, each typically hosted on a separate server or database instance.
  • Each shard holds a subset of the data and serves reads and writes for that subset independently, which improves scalability and performance.

Why Sharding is Needed

  • Handle Large Databases: Distribute datasets that are too large to fit on a single server.
  • Improve Performance: Reduce the query load on any single database instance.
  • Horizontal Scalability: Add new servers to handle increasing traffic or data volume.
  • Fault Isolation: A failure in one shard affects only the data on that shard, not the entire system.

Types of Sharding

1. Horizontal Sharding (Data Partitioning by Rows)

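  • Mechanism: Distributes the rows of a table across multiple shards based on a shard key. Every shard uses the same schema. Key-based and range-based sharding (types 4 and 5) are specific strategies for mapping a shard key to a shard.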

  • Example: Users with user_id 1-10000 go to Shard 1, 10001-20000 go to Shard 2.

  • Pros:

    • Simple and widely used.
    • Scales well for large datasets.
  • Cons:

    • Queries that span multiple shards, such as cross-shard joins, are difficult to write and slow to run.
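
The user_id example above can be sketched as a small routing function. This is only an illustration, assuming fixed blocks of 10,000 IDs per shard as in the example; the names `ROWS_PER_SHARD` and `shard_for_user` are hypothetical.

```python
ROWS_PER_SHARD = 10_000  # assumed block size, taken from the example above


def shard_for_user(user_id: int) -> int:
    """Return the 1-based shard number that owns a positive user_id."""
    if user_id < 1:
        raise ValueError("user_id must be a positive integer")
    # IDs 1..10000 map to shard 1, 10001..20000 to shard 2, and so on.
    return (user_id - 1) // ROWS_PER_SHARD + 1
```

With this mapping, `shard_for_user(10000)` returns 1 and `shard_for_user(10001)` returns 2, matching the ranges in the example.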

2. Vertical Sharding (Data Partitioning by Columns)

  • Mechanism: Splits tables by columns, grouping related columns into different shards.

  • Example: User authentication data in Shard 1, user profile data in Shard 2.

  • Pros:

    • Reduces I/O and memory footprint per shard.
    • Useful when different columns have different access patterns.
  • Cons:

    • Requires joining data across shards for some queries.
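
A minimal sketch of the authentication/profile split described above, using plain dictionaries as stand-ins for the two shards. The column names and the helper functions are assumptions made for illustration, not part of any particular database.

```python
# Hypothetical column groups; each group would live on its own shard.
# user_id appears in both so the two halves can be matched up again.
AUTH_COLUMNS = {"user_id", "email", "password_hash"}
PROFILE_COLUMNS = {"user_id", "display_name", "bio"}


def split_user_row(row: dict) -> tuple[dict, dict]:
    """Split one logical user row into its auth part and its profile part."""
    auth = {k: v for k, v in row.items() if k in AUTH_COLUMNS}
    profile = {k: v for k, v in row.items() if k in PROFILE_COLUMNS}
    return auth, profile


def load_user(user_id: int, auth_shard: dict, profile_shard: dict) -> dict:
    """Reassemble a full row: the application-side 'join' across shards."""
    return {**auth_shard[user_id], **profile_shard[user_id]}
```

A login check would read only the auth shard, while rendering a profile page would call something like `load_user` and pay for two lookups.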

3. Directory-Based Sharding

  • Mechanism: Uses a lookup table to determine which shard contains a specific piece of data.

  • Pros:

    • Flexible, can implement custom rules.
  • Cons:

    • Lookup table can become a bottleneck or single point of failure.
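
The lookup table can be pictured as a small mapping from keys to shard names. The class below is a hypothetical in-memory stand-in; a real directory would need to be a replicated, highly available store, precisely because of the bottleneck risk noted above.

```python
class ShardDirectory:
    """In-memory stand-in for a shard lookup table (illustrative only)."""

    def __init__(self, default_shard: str):
        self._default = default_shard
        self._assignments: dict[str, str] = {}

    def assign(self, key: str, shard: str) -> None:
        """Pin a key to a specific shard, e.g. to isolate a large tenant."""
        self._assignments[key] = shard

    def lookup(self, key: str) -> str:
        """Return the shard for a key, falling back to the default shard."""
        return self._assignments.get(key, self._default)


directory = ShardDirectory(default_shard="shard-1")
directory.assign("tenant-42", "shard-3")  # custom rule for one key
```

Moving a key to another shard only requires updating its directory entry (after copying the data), which is where the flexibility comes from.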

4. Key-Based (Hash) Sharding

  • Mechanism: Applies a hash function to the shard key to determine the shard.

  • Example: hash(user_id) % number_of_shards

  • Pros:

    • Even distribution of data, provided the hash function is uniform and the shard key has many distinct values.
    • Simple to calculate shard location.
  • Cons:

    • Adding or removing shards changes number_of_shards, so most keys map to a different shard and their data must be moved (resharding).
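
The formula `hash(user_id) % number_of_shards` can be sketched as follows. The example uses SHA-256 from `hashlib` rather than Python's built-in `hash()`, because the built-in is randomized per process for strings; that choice, and the helper names, are assumptions of this sketch. The second function measures how many keys change shard when the shard count changes.

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys: list[str], old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when going from old_n to new_n shards."""
    moved = sum(shard_for_key(k, old_n) != shard_for_key(k, new_n) for k in keys)
    return moved / len(keys)


keys = [f"user:{i}" for i in range(2000)]
fraction = moved_fraction(keys, 4, 5)  # growing from 4 shards to 5
```

For a uniform hash, a key keeps its shard when going from 4 to 5 shards only if its hash modulo 20 is 0, 1, 2, or 3, so about four in five keys move in expectation. That is the resharding cost listed under Cons.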

5. Range-Based Sharding

  • Mechanism: Assigns data to shards based on ranges of the shard key.

  • Example: Users A-F in Shard 1, G-L in Shard 2.

  • Pros:

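    • Supports range queries efficiently, because adjacent keys live on the same shard.
  • Cons:

    • Hotspots can occur if data or traffic is not uniformly distributed across the ranges (for example, if most names fall into one range).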
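
The A-F / G-L example can be sketched with a sorted list of range boundaries and a binary search. The boundaries beyond L, the shard names, and both helper functions are assumptions added to make the example complete.

```python
import bisect

# Inclusive upper bound of each range, and the shard that owns it.
# Only A-F and G-L come from the example; M-R and S-Z are assumed.
UPPER_BOUNDS = ["F", "L", "R", "Z"]
SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]


def shard_for_name(name: str) -> str:
    """Route a name to a shard by its first letter (ASCII A-Z only)."""
    first = name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError("name must start with a letter A-Z")
    return SHARDS[bisect.bisect_left(UPPER_BOUNDS, first)]


def shards_for_range(first_letter: str, last_letter: str) -> list[str]:
    """Return only the shards a range query over [first_letter, last_letter] must visit."""
    lo = bisect.bisect_left(UPPER_BOUNDS, first_letter.upper())
    hi = bisect.bisect_left(UPPER_BOUNDS, last_letter.upper())
    return SHARDS[lo:hi + 1]
```

A query for names from A to C touches a single shard, which is why range queries are cheap here; the same layout sends every name in a popular range to one shard, which is how hotspots arise.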

Benefits of Sharding

  • Scalability: Horizontal scaling by adding more shards.
  • Performance: Queries that target a single shard scan a smaller dataset, which reduces latency.
  • High Availability: Shards can be replicated individually.
  • Fault Isolation: Failure in one shard doesn’t impact others.

Challenges in Sharding

  • Cross-Shard Queries: Joins or aggregations across shards are complex.
  • Rebalancing: Adding/removing shards requires moving data.
  • Shard Key Selection: Poor choice leads to hotspots and uneven load.
  • Backup & Restore Complexity: Each shard needs to be backed up separately, and restoring all shards to a consistent point in time is harder than restoring a single database.
  • Application Complexity: Sharding logic often moves into the application layer.
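
To illustrate why cross-shard aggregations add complexity, here is a minimal scatter-gather sketch: each shard computes a partial result and the application merges them. The shards are modeled as plain lists of rows, and the `country` column and function name are hypothetical.

```python
from collections import Counter


def count_users_by_country(shards: list[list[dict]]) -> dict[str, int]:
    """Aggregate across shards: per-shard partial counts, then a merge step."""
    total: Counter = Counter()
    for shard_rows in shards:
        # Scatter: in a real system this partial count would run on the shard itself.
        partial = Counter(row["country"] for row in shard_rows)
        # Gather: the application (or a router) merges the partial results.
        total.update(partial)
    return dict(total)
```

Even this simple count needs a query per shard plus merge logic; joins and sorted or paginated results need considerably more.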

Best Practices

  • Choose a shard key with many distinct values that spreads both data and traffic evenly.
  • Monitor shard sizes and load, and rebalance before one shard becomes a hotspot.
  • Combine sharding with replication for high availability.
  • Prefer database-native sharding features (e.g., MongoDB, Cassandra) or sharding middleware over hand-written routing logic.
  • Avoid cross-shard operations when possible to maintain performance.
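
One simple way to check whether a candidate shard key keeps shards balanced is to compare the busiest shard with the average. The metric and the function name below are assumptions of this sketch, not a standard measure.

```python
from collections import Counter


def load_skew(shard_assignments: list[int], num_shards: int) -> float:
    """Ratio of the largest shard's row count to the mean row count.

    1.0 means perfectly balanced; larger values mean one shard is overloaded.
    """
    if not shard_assignments:
        raise ValueError("need at least one assignment")
    counts = Counter(shard_assignments)
    mean = len(shard_assignments) / num_shards
    return max(counts.values()) / mean
```

Running a sample of real keys through the planned routing function and feeding the resulting shard indices to a check like this gives an early warning before data is actually migrated.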

Summary Table

| Sharding Type | Data Split | Pros | Cons |
|---|---|---|---|
| Horizontal | Rows | Easy, scalable | Cross-shard joins are complex |
| Vertical | Columns | Reduces I/O per shard | Joins required across shards |
| Directory-Based | Custom lookup | Flexible, customizable | Lookup table can be a bottleneck |
| Key-Based (Hash) | Hash function | Even distribution, simple | Resharding needed when the shard count changes |
| Range-Based | Range of keys | Efficient for range queries | Hotspots if data is not uniform |