Channel/Interviewer: Tech Dummies - Narendra Lakshmana Gowda
Duration: 00:36:51
Original Video: https://www.youtube.com/watch?v=umWABit-wbk
This document summarizes the key content of a system design mock interview. I highly recommend watching the full video if you can.
Problem Prompt (One-liner):
Design a ridesharing platform (e.g., Uber/Ola/Lyft/Grab) that matches riders to nearby drivers in real time, at global scale.
Primary Scope:
Real-time dispatch (supply–demand matching), location ingestion, ETA calculation, geo-indexing, messaging between clients and backend, data storage/analytics, logging, and disaster recovery. Out-of-scope fine details: pricing, surge strategy, payments internals, full auth design.
Non-Functional Priorities:
Always-on availability, low-latency mobile experience, horizontal scalability across regions, resiliency to data center failures.
High-Level Architecture (Text):
- Mobile apps (rider/driver) send frequent GPS updates.
- Edge: Web Application Firewall → Load Balancer (L3/L4/L7).
- Ingestion via REST → buffer/stream to Kafka.
- Real-time state fan-out to: NoSQL storage, dispatch optimization service, and other consumers.
- Dispatch ring (Node.js services) with consistent hashing + gossip for responsibility; RPC between nodes.
- Mapping/ETA service (uses map provider + S2-like cell library).
- WebSockets for bidirectional, event-driven updates to riders/drivers.
- Analytics: batch on HDFS/Hadoop; real-time via Spark/Storm; centralized logging via ELK.
- Backup data center that reconstructs trip state from driver apps’ state digests on primary DC failure.
Note: For real-time stream processing, Apache Flink or Kafka Streams are common 2025 choices due to unified batch/stream semantics and operational simplicity.
Top Trade-offs:
- Freshest driver locations vs. network/battery cost (GPS push cadence).
- Global consistency vs. availability/latency (favor eventual where safe).
- In-house ETA modeling vs. map-provider ETA APIs.
- Fine-grained sharding (S2 cells) vs. cross-shard coordination overhead.
- Simpler monolith vs. scalable microservices (operational complexity).
Biggest Risks/Failure Modes:
- Hotspots during peak demand (stadiums/airports).
- Stale driver positions leading to bad matches/ETAs.
- Message loss or consumer lag in streams.
- Data center outage mid-trip.
- Fraud: payment misuse, incentive abuse, account compromise.
5-Min Review Flashcards:
- Why WebSockets here? To push server-initiated updates (offers, accept/decline, status) with low latency.
- What’s the unit of geo-partitioning? S2-like cells (unique IDs per cell) for indexing and routing.
- How do dispatch nodes scale? Consistent hashing (ring) + gossip to learn responsibilities; RPC among nodes.
- Core ingestion path? App → WAF/LB → REST → Kafka; consumers update NoSQL + dispatch state.
- Who computes ETA? Mapping/ETA service using road network and traffic; not simple Euclidean distance.
- DC failure playbook? Fail over to backup DC and rebuild live trip state from driver apps’ “state digest.”
- Logging stack? Elasticsearch + Logstash + Kibana (ELK).
Note: Consider OpenTelemetry + vendor-neutral backends to reduce lock-in in 2025.
Domain/Industry: ridesharing, delivery, maps, analytics
Product Pattern: realtime-chat, pub-sub, notification, queue, job-scheduler
System Concerns: high-availability, low-latency, eventual-consistency, geo-replication, hot-key, backpressure, throttling, autoscaling
Infra/Tech (mentioned): microservices, rest, websocket, kafka, mysql, cassandra, mongodb, hdfs, elasticsearch, spark, storm
Note: Apache Storm is largely legacy; prefer Flink/Kafka Streams unless your org already runs Storm.
Original Prompt: Build a ride-hailing platform whose dispatch system functions as a real-time marketplace to match riders to drivers, show nearby cars, and estimate arrival times.
Use Cases:
- Rider requests a trip (vehicle class, solo/pool).
- Show nearby cars and ETAs.
- Driver location updates every few seconds.
- Offer/accept/timeout flow; assign the first accepting driver.
- Handle high-demand regions; add/remove cities and servers seamlessly.
Out of Scope (explicit/implicit): full pricing/surge details, payments internals, full auth/PII design.
APIs: Not stated in video.
Given in Video — Functional:
- Ingest high-rate GPS pings from drivers.
- Maintain latest driver state for dispatch.
- Compute ETAs using road network (turn costs, traffic).
- Run a bidirectional message channel to apps (offers, status).
- Scale to many regions/cities; easy server add/remove (gossip).
- Provide analytics and fraud detection.
Given in Video — Non-Functional:
- Highly available (no planned downtime visible to users).
- Low latency for offers/accept and map updates.
- Horizontal scalability across data centers/regions.
- Robust disaster recovery.
Assumptions (conservative):
- Rider request latency budget in low seconds end-to-end.
- Driver GPS update cadence ~4s (as stated).
- Eventual consistency acceptable for map/nearby lists; stronger consistency for trip state.
Not stated in video—skipping numerical estimation.
- Clients: Rider and driver mobile apps send/receive via WebSockets and call REST endpoints for location posting.
- Edge: Web Application Firewall (blocks bots/regions) → L3/L4/L7 Load Balancers route traffic.
- Ingestion: REST → buffer to Kafka topics (location stream). Copies to NoSQL, dispatch state, and other consumers.
- Dispatch Ring: Node.js services; consistent hashing assigns cell ranges; gossip protocol shares responsibilities; RPC calls between nodes when a rider’s search radius spans multiple owners.
Note: Uber historically used Ringpop/TChannel; in 2025 many teams standardize on gRPC over HTTP/2 for service-to-service RPC.
- Geo/ETA: Mapping service uses S2-like cell coverage to find candidate drivers, then road-network ETA (turn costs, traffic), not straight-line distance.
- Data: “Schemaless” NoSQL built on MySQL concepts, plus alternatives cited (Cassandra/MongoDB/Bigtable). High read/write availability; online operations (index, backup) without downtime.
Note: For new large-scale OLTP in 2025, consider cloud-native, horizontally scalable SQL (e.g., Spanner/CockroachDB) when strong consistency across regions is needed.
- Analytics: Batch (HDFS/Hadoop; periodic DB dumps), plus real-time (Spark/Storm) for trends and model inputs.
- Observability: Centralized logs → Elasticsearch/Logstash/Kibana dashboards.
- DR: Backup DC holds no trip data up front; on failover, driver apps upload a “state digest” to reconstruct live trips.
Role: Accept GPS updates (~every 4s), maintain latest driver state, push to Kafka and dispatch.
Scaling: Horizontally behind LB; partitions by driverId or cellId.
Consistency: Eventual for map views; stronger for assigned-trip state.
Failure: Retry with idempotent writes; drop stale pings.
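The "idempotent writes; drop stale pings" rule can be sketched as a latest-state store that ignores anything not newer than what it already holds. This is a minimal sketch, not the video's implementation: the `Ping` fields and the per-driver sequence number `seq` are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Ping:
    driver_id: str
    seq: int      # hypothetical per-driver counter incremented by the app
    lat: float
    lng: float


class LatestDriverState:
    """Keeps only the newest ping per driver."""

    def __init__(self) -> None:
        self._latest: Dict[str, Ping] = {}

    def apply(self, ping: Ping) -> bool:
        """Store the ping and return True, or return False if it is stale or a duplicate."""
        current = self._latest.get(ping.driver_id)
        if current is not None and ping.seq <= current.seq:
            # A client retry (same seq) or an out-of-order delivery (older seq).
            return False
        self._latest[ping.driver_id] = ping
        return True

    def get(self, driver_id: str) -> Optional[Ping]:
        return self._latest.get(driver_id)
```

Because a retried ping carries the same `seq`, applying it twice leaves the state unchanged, which is what makes client retries safe.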
Role: Receives ride requests (vehicle class, pooling), forwards to dispatch with rider’s cellId.
Contracts: Send search radius; receive candidate drivers sorted by ETA.
Failure: Timeouts; re-query with expanded radius.
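One possible reading of "re-query with expanded radius" is a loop that widens the search whenever a query times out or returns nothing. The function name, the `search` callback, and the radius steps below are hypothetical; the video gives no concrete values.

```python
from typing import Callable, List, Sequence


def find_candidates(
    search: Callable[[float], List[str]],
    radii_km: Sequence[float] = (1.0, 2.0, 4.0),  # assumed escalation steps
) -> List[str]:
    """Try each radius in turn and return the first non-empty candidate list."""
    for radius in radii_km:
        try:
            candidates = search(radius)
        except TimeoutError:
            continue  # treat a timed-out query like an empty result and widen
        if candidates:
            return candidates
    return []  # no supply found within the largest radius
```

Here `search` stands in for the call to the dispatch service, which would take the radius and return driver IDs sorted by ETA.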
Role: For a given rider cell, compute nearby drivers by drawing a circle, enumerating S2 cells, fetching candidates, and ranking by road ETA. Sends offers to top N via WebSockets; first accept wins.
Scaling/Partitioning: Consistent hashing over cell IDs; gossip to learn ownership; RPC to contact peer owners for overlapping search areas.
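The two steps above (enumerate the cells covering the search circle, then find which dispatch node owns each cell) can be sketched as follows. This is a toy under stated assumptions: a fixed lat/lng grid stands in for real S2 cells, the cell size and virtual-node count are made up, and the ring is built locally rather than learned through gossip.

```python
import bisect
import hashlib
import math
from typing import Dict, List, Sequence, Tuple

CELL_DEG = 0.01          # assumed grid cell size in degrees (not an S2 level)
KM_PER_DEG_LAT = 111.32  # approximate kilometres per degree of latitude
Cell = Tuple[int, int]


def cell_of(lat: float, lng: float) -> Cell:
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))


def covering_cells(lat: float, lng: float, radius_km: float) -> List[Cell]:
    """Cells overlapping the circle's bounding box (a superset of the circle).

    Not valid near the poles, where cos(lat) approaches zero.
    """
    dlat = radius_km / KM_PER_DEG_LAT
    dlng = radius_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    lo_r, lo_c = cell_of(lat - dlat, lng - dlng)
    hi_r, hi_c = cell_of(lat + dlat, lng + dlng)
    return [(r, c) for r in range(lo_r, hi_r + 1) for c in range(lo_c, hi_c + 1)]


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: Sequence[str], vnodes: int = 64) -> None:
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def owner(self, cell: Cell) -> str:
        # First ring position clockwise from the cell's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(f"{cell[0]}:{cell[1]}"))
        return self._ring[idx % len(self._ring)][1]


def owners_for_search(
    ring: HashRing, lat: float, lng: float, radius_km: float
) -> Dict[str, List[Cell]]:
    """Group the covering cells by the dispatch node responsible for them."""
    result: Dict[str, List[Cell]] = {}
    for cell in covering_cells(lat, lng, radius_km):
        result.setdefault(ring.owner(cell), []).append(cell)
    return result
```

When the result has more than one key, the search area spans several owners, which is the case where the node handling the request would call its peers over RPC. Removing a node from the ring only reassigns the cells that node owned.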
Role: Persistent, bidirectional messaging with riders/drivers for offers, accepts, cancellations, trip updates.
Why WS: Event-driven, low-overhead, server-initiated pushes.
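The "first accept wins" rule for offers pushed over WebSockets comes down to an atomic check-and-set on the trip. The sketch below leaves out the WebSocket transport entirely and shows only that assignment step; the `TripOffer` class and its method names are hypothetical, and a lock stands in for whatever atomic operation the real trip store provides.

```python
import threading
from typing import Iterable, Optional


class TripOffer:
    """Tracks one trip offered to several drivers; only the first accept succeeds."""

    def __init__(self, trip_id: str, offered_to: Iterable[str]) -> None:
        self.trip_id = trip_id
        self._offered_to = set(offered_to)
        self._assigned: Optional[str] = None
        self._lock = threading.Lock()

    def accept(self, driver_id: str) -> bool:
        """Return True only for the first valid accept; later accepts get False."""
        with self._lock:
            if driver_id not in self._offered_to:
                return False  # driver was never sent this offer
            if self._assigned is not None:
                return False  # someone else already won
            self._assigned = driver_id
            return True

    @property
    def assigned_driver(self) -> Optional[str]:
        with self._lock:
            return self._assigned
```

In a full system, the handler for each driver's accept message would call `accept` and push either a confirmation or an "offer expired" message back over that driver's connection.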
Role: S2-like cell modeling; coverage queries for a given radius; ETA via road graph (turn costs, traffic).
Note: Sometimes a finishing-soon trip near the rider beats an idle but farther driver.
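To illustrate ranking by road ETA rather than straight-line distance, here is a minimal sketch using Dijkstra's algorithm over a tiny directed road graph. The graph format, the edge weights in seconds, and the "seconds until free" field for drivers finishing a trip are all assumptions; a production ETA service would also model turn costs and live traffic.

```python
import heapq
from typing import Dict, List, Tuple

# node -> list of (neighbor, travel time in seconds); directed, so one-way streets work
RoadGraph = Dict[str, List[Tuple[str, float]]]


def seconds_to(graph: RoadGraph, rider_node: str) -> Dict[str, float]:
    """Travel time from every node that can reach rider_node, via Dijkstra on the reversed graph."""
    reverse: RoadGraph = {}
    for src, edges in graph.items():
        for dst, secs in edges:
            reverse.setdefault(dst, []).append((src, secs))
    best = {rider_node: 0.0}
    heap = [(0.0, rider_node)]
    while heap:
        secs, node = heapq.heappop(heap)
        if secs > best.get(node, float("inf")):
            continue  # outdated heap entry
        for prev, cost in reverse.get(node, []):
            total = secs + cost
            if total < best.get(prev, float("inf")):
                best[prev] = total
                heapq.heappush(heap, (total, prev))
    return best


def rank_drivers(
    graph: RoadGraph,
    rider_node: str,
    drivers: Dict[str, Tuple[str, float]],  # driver_id -> (node, seconds until free)
) -> List[Tuple[str, float]]:
    """Sort drivers by time until they could reach the rider; unreachable drivers are dropped."""
    etas = seconds_to(graph, rider_node)
    ranked = [
        (driver_id, busy_secs + etas[node])
        for driver_id, (node, busy_secs) in drivers.items()
        if node in etas
    ]
    return sorted(ranked, key=lambda pair: pair[1])
```

With this scoring, a driver who finishes a trip in 30 seconds one block away ranks ahead of an idle driver whose only route to the rider is a long detour.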
Batch: Periodic dumps to HDFS; query with Hadoop “big query tools.”
Streaming: Spark/Storm for real-time trends and detection.
Use Cases: ETA improvement, map enhancement, traffic modeling.
Note: Modern lakehouse stacks (e.g., Apache Iceberg/Delta Lake) can simplify historical + streaming unification in 2025.
Stack: ELK—forward logs to a cluster; dashboards show errors and health.
Threats: Payment fraud (stolen cards), incentive abuse (fake GPS/bookings to hit quotas), compromised accounts (phishing).
Signals: Historical trip patterns, altitude/speed traces, behavioral anomalies. Actions: warn/ban on repeated abuse.
| Topic | Option A | Option B | Leaning | Rationale |
|---|---|---|---|---|
| Messaging to apps | WebSockets | Long-polling/HTTP/2 SSE | WebSockets | Bi-directional, lower latency for pushes. |
| Geo index | S2 cells | Raw lat/lon grids | S2 cells | Unique IDs, good coverage queries, routing/sharding. |
| ETA | Provider APIs | In-house model | Provider APIs (lean) | Road graph + traffic complexity; provider maintained. |
| Storage | MySQL-based “schemaless” | Cassandra/Mongo/Bigtable | Mixed | Must stay highly available & horizontally scalable. |
| Stream proc. | Spark/Storm | Flink/Kafka Streams | Depends | The video mentions Spark/Storm; the choice varies by organization. Note: Favor Flink/Kafka Streams in 2025 for mature stateful streaming. |
- Replication/Consistency: Emphasis on availability; eventual consistency acceptable for nearby lists; stricter for trip state.
- Backpressure: Kafka absorbs bursty GPS updates; downstream consumers process at pace.
- Degradation: Expand search radius or relax matching in sparse supply.
- DR: Backup DC boots with no data; driver apps upload “state digest” so trips continue.
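The state-digest recovery path in the last bullet can be sketched as follows. The digest layout (a JSON payload plus a SHA-256 checksum), the `trip_id` and `version` fields, and both function names are hypothetical; the video does not describe the actual format. The checksum here only detects corruption and offers no protection against tampering.

```python
import hashlib
import json
from typing import Dict, Iterable


def make_digest(trip_state: dict) -> dict:
    """Server side: pack trip state into a compact digest the driver app keeps."""
    payload = json.dumps(trip_state, sort_keys=True)
    checksum = hashlib.sha256(payload.encode()).hexdigest()
    return {"payload": payload, "checksum": checksum}


def rebuild_trips(digests: Iterable[dict]) -> Dict[str, dict]:
    """Backup DC: rebuild live trips from digests uploaded by driver apps."""
    trips: Dict[str, dict] = {}
    for digest in digests:
        expected = hashlib.sha256(digest["payload"].encode()).hexdigest()
        if digest["checksum"] != expected:
            continue  # corrupted digest; skip it
        state = json.loads(digest["payload"])
        current = trips.get(state["trip_id"])
        # Keep the highest version if several digests describe the same trip.
        if current is None or state["version"] > current["version"]:
            trips[state["trip_id"]] = state
    return trips
```

The backup data center can start empty because every in-progress trip has at least one driver app holding a recent digest for it.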
- WAF blocks known-bad IPs/bots and unsupported regions.
- Account compromise detection via behavioral anomalies.
- Payment fraud monitoring.
Note: Target TLS 1.3 and strong mobile certificate pinning in 2025.
- Centralized log ingestion; dashboards for error rates and system health (Kibana).
- How do we tune GPS upload cadence for latency vs. battery?
- What are the acceptable ETA error and success-rate targets?
- Which failure modes are most common in production incidents?
- What’s the plan for stadium/airport special handling (preferred access points)?
- Treat dispatch as a real-time marketplace optimized for low latency.
- Geo-partition by S2-like cells; use consistent hashing to split compute.
- WebSockets for push-heavy workflows between servers and mobile apps.
- Use road-network ETAs; straight-line distance misleads.
- Design for DC failure: rebuild live state from clients when needed.
- Centralize logs and invest in analytics to improve ETA and catch fraud.
- S2 Cells: Hierarchical geospatial index dividing the globe into addressable cells.
- Gossip Protocol: Peer-to-peer membership/health dissemination so nodes learn each other’s responsibilities.
- Consistent Hashing: Partitioning scheme that minimizes remapping on node add/remove.
- WebSockets: Full-duplex TCP-based protocol for real-time app messaging.
- ELK: Elasticsearch + Logstash + Kibana logging/visualization stack.
- State Digest: Compact snapshot on driver app used to reconstruct live trip state during DC failover.
- Revisit consistent hashing & gossip basics.
- Implement a toy S2-cell radius search and candidate ranking.
- Build a small WebSocket demo (offer → accept → assign).
- Run a Kafka pipeline ingesting mock GPS and fanning out to two consumers.
- Source Video: https://www.youtube.com/watch?v=umWABit-wbk
- Channel: Tech Dummies - Narendra Lakshmana Gowda
- Note: This document is a summary of the linked mock interview.
I'm Ali Sol, a PHP Developer. Learn more:
- Website: https://alisol.ir
- LinkedIn: https://www.linkedin.com/in/alisolphp