System Design Mock Interview: Design Distributed Job Scheduler | Systems Design Interview Questions With Ex-Google SWE
- Channel/Interviewer: Jordan has no life
- Duration: 00:30:29
- Original Video: https://www.youtube.com/watch?v=pzDwYHRzEnk
This document summarizes the key content of a system design mock interview. I highly recommend watching the full video if you can.
Teach Me: 5 Years Old | Beginner | Intermediate | Advanced | (reset auto redirect)
Learn Differently: Analogy | Storytelling | Cheatsheet | Mindmap | Flashcards | Practical Projects | Code Examples | Common Mistakes
Check Understanding: Generate Quiz | Interview Me | Refactor Challenge | Assessment Rubric | Next Steps
-
Problem Prompt (One-liner): Build a distributed job scheduler that executes uploaded binaries on a cluster, supports cron-like schedules and DAGs, exposes per-job status, and runs millions of jobs per day.
-
Primary Scope
- In-scope: Cron scheduling; DAG scheduling; job execution on executors; status tracking; retries/timeouts; load balancing via a broker; “at-least-once” execution with best-effort single concurrency per job.
- Out-of-scope (implicit): Fine-grained auth, multi-region geo-consistency, UI, cost modeling, autoscaling details, advanced security, SLAs.
-
Non-Functional Priorities: High availability of scheduling path; low latency from “due” to “dispatched”; scalable to millions of jobs/day; operational simplicity.
-
Key Constraints & Numbers (stated): “Millions of jobs a day”; polling-period examples: ~30–60s; timeout bumping in minutes; multi-level priority queues with example timeouts (10s → 1m → 1h).
-
High-Level Architecture (Text)
- Object store (e.g., S3) holds uploaded binaries.
- Metadata stores: Cron table and DAG table define schedules and dependencies.
- Scheduling table contains runnable tasks, indexed by
(status, run_at); polled periodically. - Change Data Capture (CDC) propagates new/updated Cron/DAG entries into the scheduling table.
- Message broker holds ready-to-run tasks; executors pull when idle.
- Executors fetch binaries from object store, run, update status.
- Read replicas (eventually consistent) serve user status queries.
-
Top Trade-offs
- CDC vs 2PC for getting initial/next runs into the scheduler.
- In-memory broker vs log-based broker for work-stealing and stragglers.
- Exact-once vs at-least-once semantics; idempotent jobs vs distributed locks.
- Single-node transactional updates (per-DAG sharding) vs distributed transactions.
-
Biggest Risks/Failure Modes
- Duplicate execution due to retries, broker redelivery, node failures.
- Lock contention and index churn in scheduling table (frequent updates to
run_at). - Hot partitions (long-running tasks starving other work).
- Eventual-consistency delays on status replicas.
-
5-Min Review Flashcards
- Q: What tables back cron vs DAG? A: Separate Cron and DAG metadata tables; runnable instances live in a Scheduling table.
- Q: How do DAGs advance? A: Mark child dependency epochs; when all deps match current epoch, enqueue child.
- Q: Why CDC? A: Avoid 2PC; propagate new/next runs from metadata tables to scheduler asynchronously.
- Q: What drives fairness? A: Idle executors pull from broker; multi-level priority queues handle stragglers and “big” tasks.
- Q: How to curb double-running? A: Best-effort distributed lock (e.g., ZK) + idempotent jobs + run_at timeouts.
[Personal note: Instead of ZooKeeper, consider etcd/Consul or Redis-based leases in 2025 for simpler ops and native cloud support.] - Q: How to scale scheduler reads/writes? A: Partition the scheduling table; index on
(status, run_at); optionally read via replicas with eventual consistency.
- Product Pattern:
job-scheduler,queue - System Concerns:
high-availability,eventual-consistency,multi-tenancy,backpressure,throttling - Infra/Tech (mentioned):
kafka,rabbitmq,activemq,mysql,mongodb,s3- [Personal note: If choosing a log broker, Apache Kafka/Pulsar now support cooperative rebalancing and richer consumer groups; still validate against your executor pull model.]
- Original Prompt (paraphrase): Execute uploaded binaries on a distributed set of executors. Support cron schedules and DAG workflows. Users can check job status. System should scale (millions/day) and be fast and highly available.
- Use Cases
- Schedule one-off or recurring (cron) jobs.
- Build DAGs where downstream tasks start after parents succeed (or propagate failures).
- Query job history/status/errors.
- Out of Scope: UI/console design, authZ policy models, cost metering, multi-region failover specifics.
- APIs: Not stated in video.
Given in Video
- Run uploaded binaries from object storage on executors.
- Cron and DAG scheduling.
- “At least once” delivery with mitigations (locks/idempotency) to avoid concurrent duplicates.
- Status tracking; read replicas acceptable (eventual consistency).
- Scale to millions of jobs/day.
- Poll-driven scheduler using
(status, run_at)index; bumprun_atas job progresses to implement timeouts.
Assumptions (conservative)
- Authenticated tenants; per-tenant isolation at queue/topic or partition level.
- Soft SLO: seconds-to-minutes from due time to dispatch under load.
- Reasonable binary sizes (tens to hundreds of MB) stored in S3-like store.
Ask AI: Requirements and Constraints
Not stated in video—skipping numerical estimation.
-
Upload Path
- User uploads binary to S3; registers a Cron or DAG definition.
- Metadata lands in Cron table or DAG table.
- CDC streams these changes into a Scheduling table by creating the first (and subsequent) runnable instances.
-
Scheduling Path
- A scheduler process polls the Scheduling table every N seconds, fetching tasks with
status in (null|ready|in_progress)andrun_at <= now. - On assignment, the scheduler bumps
run_atforward to set a timeout window and pushes a job record into the broker (with a priority level based on retry count).
- A scheduler process polls the Scheduling table every N seconds, fetching tasks with
-
Execution Path
- Executors pull when idle; fetch binary from S3; execute.
- Update status/history in the Scheduling table (or separate history table).
- If DAG: update child dependency epochs; when all deps for a node match, enqueue child for that epoch.
- On completion/failure, mark terminal status to stop retries.
-
Status Path
- Users query read replicas of the Scheduling table for job status/logs.
Ask AI: High-Level Architecture
- Role: Define cron expressions; seed next runnable instance.
- Data Model (from video)
cron_jobs(id, schedule_spec, parent_job_id, ...)- On creation, also insert the next occurrence into the Scheduling table.
- Executor that completes the current run is responsible for creating the next occurrence. (Implemented via transaction/CDC.)
- Indexes:
(id), consider(next_run)in cron table if needed; primary scheduling index on(status, run_at). - Failure Handling: If executor crashes before scheduling the next, the bumped
run_attimeout causes a retry; duplicates tolerated via idempotence.
Ask AI: Subsystem - Cron Scheduling
- Role: Track dependencies and epochs so children run when all parents for the same epoch have completed.
- Data Model (from video)
dag_nodes(job_id, cron_if_root?, children[], dependency_epochs{dep_job_id: epoch_seen}, current_epoch)- Root nodes may have cron schedules; completion of roots advances dependent nodes’ epochs; when all deps align, enqueue child for that epoch.
- To avoid concurrent DAG runs, treat roots as “children” of leaves to re-trigger roots only after leaves finish.
- Scaling & Partitioning: Shard so each DAG’s metadata fits within a single database node to keep dependency updates transactional.
- Store Choice: Initially MySQL (transactions), later preferring MongoDB for JSON flexibility in modeling children/deps.
[Personal note: If strong transactional semantics across multiple nodes are needed, consider sticking with a relational DB with JSON support (e.g., Postgres JSONB) to keep transactions robust while preserving flexibility.]
Ask AI: Subsystem - DAG Scheduling
- Role: Single source of “runnable instances.”
- Key Fields:
job_id (PK),s3_url,run_at (indexed),status,retry_count,priority_level. - Behavior: Poller queries
status != (completed|failed)andrun_at <= now, pushes to broker, and bumpsrun_atto a future timeout window. - Performance
- Partition by time range + hash(job_id) to spread writes/reads.
- Read via replicas to reduce lock contention on leader.
[Personal note: Replica lag can cause extra retries; acceptable if idempotent and bounded by backoff.]
Ask AI: Subsystem - Scheduling Table & Poller
- Role: Distribute tasks to idle executors; support multi-level priority queues.
- Approach (from video):
- Prefers in-memory broker (ActiveMQ/RabbitMQ) over log-based (Kafka) to avoid partition stickiness (one consumer per partition) that can starve tasks when a large job sits ahead.
[Personal note: Modern Kafka/Pulsar patterns can mitigate head-of-line blocking (more partitions, cooperative rebalancing, fair scheduling), but a pull-from-queue model with per-message acks remains simpler for executors.] - Multi-level priority queues (e.g., L1 10s, L2 1m, L3 1h). Big jobs bubble up; strongest executors read L3..L1.
- Prefers in-memory broker (ActiveMQ/RabbitMQ) over log-based (Kafka) to avoid partition stickiness (one consumer per partition) that can starve tasks when a large job sits ahead.
Ask AI: Subsystem - Broker & Load Balancing
- Role: Pull, download binary from S3, execute, update status/history.
- Failure Handling: On crash or ack loss, broker may redeliver to another executor → duplicates possible; rely on idempotence and best-effort locking.
- Locking: Attempt distributed lock per
job_idto prevent concurrent duplicates; TTL required to avoid deadlocks if executor dies.
[Personal note: Use heartbeat-based leases (renewed before expiry) to reduce accidental lock loss; monitor lease skew in GC pauses/outages.]
- Role: Track
completed/failed/in_progress, error messages, timestamps. - Serving: Users read from read replicas; eventual consistency is OK for UX, protects the leader from heavy reads.
Ask AI: Subsystem - Job Status
| Topic | Option A | Option B | Video’s Leaning | Rationale (from video) |
|---|---|---|---|---|
| Getting runs into scheduler | CDC from Cron/DAG to Scheduler | 2PC across tables | CDC | Avoids slow 2PC on uploads/updates. |
| Broker style | Log-based (Kafka) | In-memory queue (RabbitMQ/ActiveMQ) | In-memory | Executor pull avoids partition HoL blocking. |
| Data store for DAG | MySQL (relational) | MongoDB (JSON flexibility) | MongoDB (final) | Easier to store children/dep maps as JSON. |
| Preventing duplicates | Idempotent jobs + best-effort locks | Strong global transactions | Idempotence + locks | Strong consistency harms performance; duplicates acceptable in practice. |
| Status serving | Leader reads | Read replicas | Replicas | Reduce contention; tolerate eventual consistency. |
- Replication/Consistency: Eventual consistency acceptable for status reads; strong consistency not required on read path.
- Backpressure & Throttling: Priority queues plus executor pull model naturally backpressure; retries gated by
run_atbumping. - Degradation: If replicas lag, worst case is more retries (still safe if idempotent).
- Failure Cases Considered: Broker down, executor down, ack loss, scheduler node down → all handled via retries; duplicates expected.
Ask AI: Reliability and Performance
Not stated in video.
- User-Visible: Status/history with errors.
- Internal: (Implied) metrics for queue depth, executor utilization, retry counts, and schedule latency; no specific stack named.
Not stated in video.
- What SLOs do we target for “due → started” latency under peak load?
- Are binaries trusted or sandboxed? Any language runtimes or containers required?
- Do we need per-tenant quotas/priorities?
- Should we provide per-job secrets management and environment injection?
- Separate metadata (Cron/DAG) from runnable instances (Scheduling table).
- Use CDC to flow metadata changes into runnable tasks.
- Index scheduler by
(status, run_at)and bump timeouts through the job lifecycle. - Prefer executor pull + priority queues for fair sharing and straggler handling.
- Aim for idempotent jobs; use best-effort distributed locks to avoid concurrent duplicates.
[Personal note: For new deployments, evaluate Redis Streams or cloud-managed queues (SQS, Pub/Sub) for reduced ops vs self-hosting AMQP brokers—verify fit for your workload.] - Shard DAG metadata so each DAG’s updates are transactional on one node.
- CDC (Change Data Capture): Streaming DB changes to downstream consumers.
- Epoch (in DAG): Versioned “run” of a node; children advance when all parent epochs match.
- HoL Blocking: Head-of-line blocking—slow item prevents progress behind it.
- Idempotent Job: Re-running produces the same result without side effects.
- Practice modeling DAG dependency tracking with epochs.
- Implement a toy scheduler:
(status, run_at)index + poller + in-memory queue + worker. - Add duplicate-suppression via idempotent writes and a lease-based lock.
- Source Video: https://www.youtube.com/watch?v=pzDwYHRzEnk
- Channel: Not stated in video
- Note: This document is a summary of the linked mock interview.
I'm Ali Sol, a PHP Developer. Learn more:
- Website: alisol.ir
- LinkedIn: linkedin.com/in/alisolphp