Skip to content

Latest commit

 

History

History
319 lines (213 loc) · 19 KB

File metadata and controls

319 lines (213 loc) · 19 KB

System Design Mock Interview: Design Netflix System

This document summarizes the key content of a system design mock interview. I highly recommend watching the full video if you can.


AI-Powered buttons

Teach Me: 5 Years Old | Beginner | Intermediate | Advanced | (reset auto redirect)

Learn Differently: Analogy | Storytelling | Cheatsheet | Mindmap | Flashcards | Practical Projects | Code Examples | Common Mistakes

Check Understanding: Generate Quiz | Interview Me | Refactor Challenge | Assessment Rubric | Next Steps

One-Page Executive Summary (2–3 min skim)

Problem Prompt (One-liner): Design Netflix’s end-to-end system to stream video at scale with smooth UX worldwide.

Primary Scope

  • In-scope: Client apps, API/gateway, microservices, load balancing, video ingestion/transcoding, CDN (Open Connect), data stores (MySQL & Cassandra), caching (EVCache/memcached), stream/log pipeline (Chukwa → Kafka → Samza → S3/Elasticsearch), recommendations (Spark/ML), observability usage.
  • Out-of-scope: Detailed security model, billing internals beyond mention, exact API schemas.

Non-Functional Priorities: Low latency playback, high availability, global scalability, cost efficiency.

High-Level Architecture (Text)

  • Clients (web, mobile, TV/console) talk to AWS frontends via ELB (two-tier: DNS RR across zones, then instance LB).
  • API Gateway (Zuul) handles routing, auth, shaping, filters; Hystrix for isolation/circuit-breaking. [Personal note: Hystrix is archived; prefer Resilience4j in 2025 for JVM circuit breaking due to active maintenance and modular design.]
  • Microservices (REST/gRPC) for login, homepage, search, recommendations, billing, etc.
  • Data: MySQL (ACID for users/billing) with master–master + read replicas; Cassandra for viewing history/big data. [Personal note: Consider managed multi-AZ/region databases (e.g., Aurora, Vitess, or modern MySQL HA) to simplify failover and reduce operational risk.]
  • Caching: EVCache (memcached-based) clusters; SSD-backed optimization; write to all clusters, read from nearest. [Personal note: Evaluate Redis for richer data structures and built-in replication; memcached remains fine for simple key-value with low overhead.]
  • Video pipeline: Studio ingest → chunking → distributed transcoding → variants/resolutions → push to Netflix Open Connect CDN. Client selects/auto-switches via ABR. [Personal note: For codecs, modern deployments lean on AV1/HEVC; verify codec support for your device mix.]
  • Data pipeline: Chukwa logs → Kafka topics → Samza routers → sinks (S3, Elasticsearch) → dashboards (Kibana/Grafana). [Personal note: Chukwa/Samza are older choices; many stacks now standardize on Kafka + Flink/Kafka Streams; verify for your org.]
  • Open Connect: Netflix-owned CDN boxes at ISPs, consistent hashing across cluster nodes, proactive caching via popularity/location.

Top Trade-offs

  • Global cache hit ratio vs. storage cost at edges.
  • ACID guarantees (MySQL) vs. scale/throughput (Cassandra).
  • Client simplicity vs. smart client for server selection/ABR switching.
  • Operational complexity (many services) vs. agility and failure isolation.

Biggest Risks/Failure Modes

  • Hot shards/hot keys (popular titles), cache stampedes.
  • Cascading failures between microservices; timeouts not tuned.
  • Regional ISP/CDN outages, ELB misconfiguration across zones.

5-Min Review Flashcards

  • Q: What handles routing and filters at the edge? A: Zuul gateway. [Personal note: Consider Envoy/API Gateway alternatives for active ecosystems and HTTP/3.]
  • Q: What prevents cascading failures? A: Hystrix with timeouts, bulkheads, fallbacks.
  • Q: Where are user/billing vs. viewing history stored? A: MySQL vs. Cassandra.
  • Q: How is playback delivered? A: From nearest Open Connect server with ABR switching.
  • Q: What accelerates API throughput/latency? A: EVCache (memcached) + nearest-cluster reads.
  • Q: What ingests logs/events? A: Chukwa → Kafka → Samza → S3/Elasticsearch.

Ask AI: Executive Summary


Interview Tags

  • Domain/Industry: vod, streaming, analytics
  • Product Pattern: video-on-demand, recommendation, cdn, caching, search-index, queue, job-scheduler, pub-sub, notification
  • System Concerns: high-availability, low-latency, geo-replication, eventual-consistency, hot-key, backpressure, autoscaling
  • Infra/Tech: microservices, grpc, rest, kafka, mysql, cassandra, memcached, s3, elasticsearch, spark, cdn

[Personal note: Verify whether to use OpenSearch instead of Elasticsearch depending on licensing and managed options.]

Ask AI: Interview Tags


Problem Understanding

Original Prompt: Build Netflix to stream movies/TV worldwide with seamless UX; non-streaming control plane in AWS, video delivery via Open Connect.

Use Cases

  • Discover content (homepage rows, search), play video with adaptive bitrate, resume/continue watching.
  • Personalization: curated rows, artwork thumbnails per user.
  • Support operations: logging, customer support debugging.

Out of Scope: Detailed auth flows, DRM specifics, ads; only high-level mentions of billing.

APIs: Not stated in video.

Ask AI: Problem Understanding


Requirements & Constraints

Given in Video

  • Low startup time, smooth playback; serve from nearest edge; continuous best-server selection.
  • Two “clouds”: AWS control-plane; Open Connect for streaming.
  • ELB two-tier balancing; Zuul + Hystrix; EVCache; MySQL (ACID) + Cassandra (scale); Spark ML for recommendations; Chukwa/Kafka/Samza → ES/S3.

Assumptions (kept conservative)

  • Multi-region AWS deployment with Route 53; autoscaling for spikes; CDN boxes at ISPs worldwide. (Assumption)

Ask AI: Requirements and Constraints


Back-of-the-Envelope Estimation

Not stated in video—skipping numerical estimation.

Ask AI: Estimation


High-Level Architecture

  • Clients: Web (React), iOS/Android/TV; smart clients continuously probe for best Open Connect server and switch dynamically.
  • Edge/Gateway: ELB (DNS RR across zones → per-instance LB); Zuul filters (inbound/endpoint/outbound) for auth/routing/metrics/headers. [Personal note: Consider HTTP/3/QUIC at the edge for improved startup on flaky networks.]
  • Services: Microservices via HTTP/gRPC; Hystrix isolates dependencies with timeouts, bulkheads, fallbacks.
  • Storage: MySQL master–master with replicas; Cassandra for high-write/large history; EVCache for read acceleration (SSD-backed).
  • Ingest/Transcode: Split source movie into chunks; enqueue tasks; EC2 workers transcode into many formats/resolutions/bitrates; write to S3; push to Open Connect.
  • Open Connect: Proactive caching by global vs. regional popularity; consistent hashing distributes objects across cluster nodes.
  • Data/Logs: Chukwa collects events (UI activity, playback, errors) → Kafka → Samza routes to S3/Elasticsearch; dashboards for support/ops.
  • ML/Personalization: Spark trains models for row selection, sorting, artwork selection.

Ask AI: High-Level Architecture


Deep Dives by Subsystem

8.1 Client Applications

Role: Cross-platform apps (TV, mobile, web); React on web for startup speed/runtime/modularity. Scaling: Smart selection of Open Connect server; ABR adapts bitrate based on bandwidth. Failure Handling: Switches providers mid-stream if a better server is detected.

Ask AI: Subsystem - Client

8.2 API Gateway (Zuul)

Role: Dynamic routing, auth, shaping, security; filters before/after proxying; metrics/header manipulation. Mitigations: Traffic splitting for load testing or canaries; filtering bad requests by user-agent, etc. [Personal note: Modern gateways (Envoy, Kong, Spring Cloud Gateway) may offer richer HTTP/2/3 and service-mesh integration.]

Ask AI: Subsystem - API Gateway

8.3 Resilience (Hystrix)

Role: Latency/fault tolerance, isolate remote calls; timeouts; thread-pool bulkheads; error-rate based tripping; fallbacks; metrics dashboards. [Personal note: Prefer Resilience4j or service-mesh rate limiting/circuit breaking for active support and polyglot stacks.]

Ask AI: Subsystem - Resilience

8.4 Caching (EVCache/memcached)

Role: Write to all clusters; reads from nearest; boosts throughput, cuts latency/cost. SSDs used to extend cache capacity. Threats: Hot keys on trending titles; mitigate with sharding/replication and request coalescing. (Summarized from caching discussion.)

Ask AI: Subsystem - Caching

8.5 Datastores

MySQL: ACID for users/billing; master–master replication; read replicas per DC; failover by DNS (Route 53). [Personal note: DNS-based failover can be slow; consider managed HA with fast failover and read/write endpoints.] Cassandra: Distributed store for viewing history; periodic compression of older history to reduce footprint while keeping fresh data hot.

Ask AI: Subsystem - Datastores

8.6 Transcoding & Delivery

Flow: Ingest large source → split into chunks → queue tasks → EC2 workers transcode into many resolutions/bitrates (~1200 copies mentioned) → store on S3 → push to Open Connect nodes. Clients use ABR. [Personal note: Validate modern codec ladder (AV1/HEVC) and per-device profiles to optimize quality-per-bit.]

Ask AI: Subsystem - Transcoding

8.7 Data/Log Pipeline & Observability

Ingest: Chukwa collects logs/events → Kafka; Samza filters/routes to S3/Elasticsearch. Support dashboards use ES to debug user playback issues, sign-up/login errors, resource usage. [Personal note: Many stacks consolidate on Kafka + Flink/Streams; pick based on team expertise and tooling.]

Ask AI: Subsystem - Data Pipeline

8.8 Recommendations & Personalization

Models: Collaborative filtering and content-based filtering; Spark clusters for training; output drives row selection/sorting and artwork selection experiments. Artwork Personalization: Multiple thumbnails per title; show variants, track CTR/watch-time; pick winners for segments.

Ask AI: Subsystem - Recommendations

8.9 Open Connect (CDN)

Model: Netflix-owned appliances hosted by ISPs (free to partners); reduces transit bandwidth and improves QoE; HA via redundant components. Caching Strategy: Globally popular content cached everywhere; location-specific content staged based on historical viewing; cluster uses consistent hashing to place and locate objects.

Ask AI: Subsystem - Open Connect


Trade-offs & Alternatives

Topic Option A Option B Video’s Leaning Rationale (from video)
Gateway Zuul Envoy/others Zuul Mature filters/routing at Netflix at the time.
Fault Tolerance Hystrix Resilience4j/mesh Hystrix Rich timeout/bulkhead/fallback model.
OLTP MySQL Postgres/Managed MySQL ACID & existing tooling.
Big Data Store Cassandra HBase/Dynamo-like Cassandra Scale for history/log-like workloads.
Stream Proc Samza Flink/Streams Samza Routing/filtering of Kafka topics.

[Personal note: Likely outdated; consider Envoy/Resilience4j/Flink — verify for your stack.]

Ask AI: Trade-offs


Reliability, Availability, and Performance

  • Replication/Consistency: MySQL master–master with replicas for reads; Cassandra distributed for scale; EVCache replicates writes to all clusters and serves reads locally.
  • Latency Budget: Reduce client startup via nearest edge + caching; ABR avoids stalls by adapting to bandwidth.
  • Load Shedding/Degradation: Hystrix fallbacks; route splitting at gateway; cache-first reads where safe.

Ask AI: Reliability and Performance


Security & Privacy

Not stated in video (beyond generic mention of gateway security).

Ask AI: Security and Privacy


Observability

  • Metrics/Logs/Tracing: Hystrix metrics; Elasticsearch + dashboards used by support/ops to trace user issues and system errors (e.g., signups, logins).

Ask AI: Observability


Follow-up Questions (from interviewer)

Not stated in video.

Ask AI: Follow-up Questions


Candidate Questions (if modeled)

Not stated in video.

Ask AI: Candidate Questions


Key Takeaways

  • Separate control plane (AWS) from data plane (Open Connect) for scale.
  • Use smart clients + ABR to mask network variability.
  • Gateway + circuit breakers prevent cascading failures.
  • Cache aggressively with locality-aware reads to cut cost/latency.
  • Blend ACID (MySQL) with scale-out (Cassandra) appropriately.
  • Proactive CDN caching and consistent hashing improve hit ratios. [Personal note: Revisit gateway/circuit-breaker choices and the stream-processing stack to align with 2025 best practices.]

Ask AI: Key Takeaways


Glossary

  • Open Connect: Netflix-operated CDN appliances at ISPs serving video.
  • ABR (Adaptive Bitrate): Streaming technique that adjusts video quality to current bandwidth.
  • Zuul: API gateway for routing/filters and traffic shaping.
  • Hystrix: Library for latency/fault tolerance (timeouts, bulkheads, fallbacks).
  • EVCache: Netflix’s memcached-based distributed cache.
  • Chukwa/Kafka/Samza: Log/stream ingest, pub-sub, and routing stack.

Ask AI: Glossary


Study Plan from This Interview (Optional)

  • Review gateway patterns and circuit-breaking.
  • Prototype ABR streaming and CDN caching strategies.
  • Practice trade-off discussions: SQL vs. NoSQL; cache policies; edge vs. origin delivery. (Assumption)

Ask AI: Study Plan


Attribution


About the summarizer

I'm Ali Sol, a PHP Developer. Learn more: