- B-Tree vs Hash Indexes:
B-Treeis used for range queries (>,<,BETWEEN) and exact matches.Hashindexes are faster for exact matches (=) but cannot handle range queries.
- ACID Properties: Atomicity, Consistency, Isolation, Durability. Write-Ahead Logging (WAL) is the core mechanism ensuring Durability.
- Explain Analyze: The most powerful tool. Shows the query execution plan (Index Scan, Seq Scan, Hash Join). If you see
Sequential Scanon a large table, you need an index.
- Composite Indexes: Indexing multiple columns (
CREATE INDEX idx_user_status ON users(status, created_at)). Order matters! Follow the Left-Most Prefix Rule. - Covering Indexes: An index that includes all the columns needed for the query, meaning the DB doesn't have to look up the actual table rows.
- Index Selectivity: High selectivity (unique values like
email) is great. Low selectivity (boolean values likeis_active) causes the DB to skip the index entirely sometimes.
- Avoid
SELECT *. Retrieve only what you need to reduce memory and network overhead. - N+1 Problem: Querying the DB in a loop (e.g., getting users, then querying their posts one by one). Fix using
JOINor batching tools likeDataLoader. - Pagination: As discussed, switch from Offset to Cursor pagination to avoid large offset performance degradation.
- Read Uncommitted: Dirty reads allowed. Fastest but dangerous.
- Read Committed: Default in Postgres. Solves dirty reads.
- Repeatable Read: Guarantees rows read multiple times in a transaction won't change. Can still have Phantom Reads (new rows added).
- Serializable: Highest isolation. Transactions happen as if they were strictly sequential. Prone to aborts/deadlocks under high concurrency.
- Opening a DB connection is expensive (TCP handshake + DB auth). Use a pooler!
- Tools: PgBouncer (Postgres), HikariCP (Java).
- Sizing: Formula
connections = ((core_count * 2) + effective_spindle_count). Don't over-provision connections; 50 is often better than 500 for a busy DB.
- Single Table Design (DynamoDB): Pre-compute queries. Store multiple entity types in one table using overloaded Partition and Sort Keys (
PK,SK). - No joins allowed. Optimize for reads, pay on writes.
- Wide Column (Cassandra): Great for time-series data. Requires deep understanding of Partition Keys to avoid hot partitions.
- Splitting a large DB into smaller ones (Shards).
- Hash-based Sharding: Distribute data evenly based on
hash(user_id). Downside: Adding/removing shards requires massive data rehashing (unless using Consistent Hashing). - Range-based Sharding: Easy to query ranges, but prone to Hotspots (e.g., putting all new 2024 data on one shard).
- Master-Slave (Leader-Follower): Writes to Master, async reads from Slaves. Highly scalable for read-heavy workloads.
- Replication Lag: The delay between a write on Master and it being visible on the Slave. Must build logic to avoid "read-after-write" issues (e.g., user updates profile and sees old data).
- Master-Master: High availability, but conflict resolution logic is brutal.
- Changing schema on 100 Million rows takes hours and locks tables.
- Zero-Downtime Migration:
- Add new column.
- Dual writes (write to old and new).
- Backfill old data via batch jobs.
- Switch read code to new column.
- Drop old column.