LLD Note
Why Async Jobs Exist
Some work is too slow, unreliable, or bursty for the request path. Sending emails, generating reports, calling third-party APIs, transcoding media, syncing search indexes, and processing webhooks should not keep a user request open until every downstream dependency succeeds.
The API should accept the intent, create a durable job, and return a stable job id. From that point onward, workers can process the job independently while the caller checks status, receives a callback, or observes the final result later.
- Use synchronous APIs for validation and durable acceptance.
- Use background workers for slow, retryable, and dependency-heavy work.
- Keep the job store as the source of truth, not the queue message.
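The accept-then-process split above can be sketched in a few lines. This is a minimal illustration with an in-memory dict standing in for the durable job store and a list standing in for the queue; `submit_job`, `JOB_STORE`, and `QUEUE` are hypothetical names, not a real API.

```python
import uuid

# Hypothetical in-memory stand-ins for the durable job store and the
# queue; a real system would use a database and a message broker.
JOB_STORE: dict = {}
QUEUE: list = []

def submit_job(job_type: str, payload: dict) -> str:
    """Validate synchronously, persist a durable job, then enqueue.

    The caller gets a stable job id back immediately; workers do the
    slow, dependency-heavy work later by consuming from the queue.
    """
    if not job_type:
        raise ValueError("job_type is required")  # synchronous validation
    job_id = str(uuid.uuid4())
    # Durable acceptance: the job row is written (and is the source of
    # truth) before any queue message exists.
    JOB_STORE[job_id] = {"type": job_type, "payload": payload, "state": "Created"}
    QUEUE.append(job_id)  # the queue message carries only the job id
    JOB_STORE[job_id]["state"] = "Enqueued"
    return job_id

def get_status(job_id: str) -> str:
    """Callers poll the store, not the queue, for job state."""
    return JOB_STORE[job_id]["state"]
```

Note the ordering: the store row exists before the enqueue, so a crash between the two steps leaves a Created job that a sweeper can re-enqueue, rather than a queue message with no durable record behind it.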
LLD Note
Durable Job Record and Idempotency
The job row is created before any queue message is trusted. It stores the state machine, payload reference, idempotency key, attempt count, max attempts, schedule time, lock metadata, error summary, and result reference.
Idempotency prevents duplicate API submissions and duplicate worker execution from creating duplicate side effects. A duplicate create request should return the existing job id; a duplicate worker attempt should observe prior side-effect records or use provider idempotency keys.
- Unique caller plus idempotency key prevents duplicate job creation.
- Handlers must be safe to run more than once after worker crash or lease expiry.
- Payload should be versioned so old queued jobs can still be interpreted after deploys.
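A sketch of idempotent creation, assuming the job store can enforce a unique constraint on (caller, idempotency key); here a plain dict plays that role, and `create_job`, `JOBS`, and `JOBS_BY_KEY` are illustrative names only.

```python
# Hypothetical in-memory store; in a real database, JOBS_BY_KEY would
# be a unique index on (caller, idempotency_key).
JOBS: dict = {}
JOBS_BY_KEY: dict = {}
_next_id = 0

def create_job(caller: str, idempotency_key: str, payload: dict) -> str:
    """Create a job exactly once per (caller, idempotency_key).

    A duplicate create request returns the existing job id instead of
    creating a second job and duplicate side effects.
    """
    global _next_id
    key = (caller, idempotency_key)
    existing = JOBS_BY_KEY.get(key)
    if existing is not None:
        return existing  # duplicate submission: same job id back
    _next_id += 1
    job_id = f"job-{_next_id}"
    JOBS[job_id] = {
        "payload_version": 1,  # versioned so old queued jobs stay readable
        "payload": payload,
        "state": "Created",
        "attempts": 0,
        "max_attempts": 5,
    }
    JOBS_BY_KEY[key] = job_id
    return job_id
```

In a concurrent system the lookup-then-insert must be a single atomic operation (insert with the unique constraint, catch the conflict, return the winner's id); the dict version only shows the contract.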
LLD Note
Worker Claiming, Leases, and Visibility Timeout
Workers should not assume that receiving a queue message means they own the job. Ownership is created by an atomic claim in the job store, usually by moving an eligible job into Running with lockOwner and lockExpiresAt.
The lease protects against worker crashes. If a worker dies, the lock expires and another worker can reclaim the job. Long-running jobs must heartbeat or extend the lease before it expires.
- Only jobs in Created, Enqueued, RetryScheduled, or expired Running states can be claimed.
- Ack the queue message only after the durable job state has been updated.
- Concurrency limits should exist per worker, job type, tenant, and dependency when needed.
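The claim-and-lease rules above can be expressed as a compare-and-set. This sketch operates on a job dict in memory; in a real store the same check-and-update would be one conditional UPDATE. `try_claim` and `heartbeat` are hypothetical names, and the 30-second lease is an arbitrary illustrative value.

```python
# States from which a fresh claim is allowed.
CLAIMABLE = {"Created", "Enqueued", "RetryScheduled"}

def try_claim(job: dict, worker_id: str, now: float,
              lease_seconds: float = 30.0) -> bool:
    """Atomically claim a job by moving it to Running with lock metadata.

    Receiving the queue message alone does not confer ownership; only a
    successful claim in the job store does. In a real system this whole
    function is a single conditional UPDATE.
    """
    claimable = job["state"] in CLAIMABLE or (
        # A Running job is reclaimable only once its lease has expired,
        # which covers workers that crashed mid-job.
        job["state"] == "Running" and job.get("lock_expires_at", 0.0) <= now
    )
    if not claimable:
        return False
    job["state"] = "Running"
    job["lock_owner"] = worker_id
    job["lock_expires_at"] = now + lease_seconds
    return True

def heartbeat(job: dict, worker_id: str, now: float,
              lease_seconds: float = 30.0) -> bool:
    """Extend the lease; long-running handlers call this before expiry."""
    if job["state"] != "Running" or job.get("lock_owner") != worker_id:
        return False  # lost the lease: stop doing side effects
    job["lock_expires_at"] = now + lease_seconds
    return True
```

A worker whose heartbeat fails must assume another worker now owns the job and abandon further side effects; this is exactly why handlers have to be idempotent.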
LLD Note
Retry Policy, Backoff, DLQ, and Replay
Retry policy decides whether a failure deserves another attempt. Transient timeouts, HTTP 429 rate-limit responses, and network errors usually warrant a retry; validation errors, permission failures, and permanent provider rejections should move directly to a terminal Failed or DLQ state.
Retries need exponential backoff with jitter so a dependency outage does not cause every worker to retry at the same time. Once attempts are exhausted, the job moves to DLQ with enough context for an operator to inspect and replay safely.
- Bound retries by max attempts and maximum age.
- Store error class and last failure message for support and alerting.
- Replay must preserve idempotency and should not bypass validation casually.
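The classification and backoff rules can be sketched as two small functions. The error-class names and the base/cap values are illustrative assumptions; the jitter scheme shown is "full jitter" (a uniform draw from zero up to the exponential ceiling).

```python
import random

# Illustrative error taxonomy; real systems map exceptions and
# provider responses into classes like these.
PERMANENT = {"validation", "permission", "provider_rejected"}

def next_action(error_class: str, attempt: int, max_attempts: int) -> str:
    """Decide whether a failed attempt retries or goes to the DLQ."""
    if error_class in PERMANENT:
        return "dlq"  # no number of retries will ever succeed
    if attempt >= max_attempts:
        return "dlq"  # attempts exhausted: park it with context for replay
    return "retry"

def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 300.0,
                    rng=None) -> float:
    """Exponential backoff with full jitter, bounded by a cap.

    The random spread prevents every worker from retrying a recovering
    dependency at the same instant (a thundering herd).
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A retrying job would be moved to RetryScheduled with its next-eligible time set to now plus `backoff_seconds(attempt)`, and a maximum-age check alongside `max_attempts` bounds how long a job can keep retrying in total.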
LLD Note
Observability and Operational Controls
A job system without telemetry becomes impossible to operate under load. Queue depth, queue lag, claim failures, attempt count, processing latency, success rate, retry rate, DLQ count, and worker heartbeat health should be visible per job type and tenant.
Operational tools should support cancellation, priority changes, DLQ inspection, targeted replay, worker draining, and dependency circuit-breaker responses. These controls keep failures contained instead of letting a backlog spread through the whole platform.
- Emit one trace across API creation, queue publish, worker execution, and dependency call.
- Alert on queue age, DLQ growth, repeated poison jobs, and worker heartbeat gaps.
- Use dashboards to distinguish queue backlog from dependency outage and worker capacity shortage.
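As a small illustration of that last distinction, the sketch below classifies a backlog from a few of the signals listed above. The function name, inputs, and thresholds are all hypothetical; the point is only that queue depth alone cannot separate a dependency outage from a worker capacity shortage.

```python
def queue_age_seconds(oldest_enqueued_at: float, now: float) -> float:
    """Age of the oldest unclaimed message: a lag signal that, unlike
    raw depth, keeps growing when nothing is being drained."""
    return max(0.0, now - oldest_enqueued_at)

def classify_backlog(queue_depth: int, success_rate: float,
                     busy_workers: int, total_workers: int) -> str:
    """Rough triage of a backlog. Thresholds here are illustrative,
    not recommendations."""
    if queue_depth > 0 and success_rate < 0.5:
        # Work is being attempted but failing: suspect the dependency,
        # and consider a circuit breaker rather than more workers.
        return "dependency_outage"
    if queue_depth > 0 and busy_workers >= total_workers:
        # Work is succeeding but every worker is busy: capacity shortage.
        return "worker_capacity"
    return "healthy"
```

Dashboards that chart these signals per job type and tenant let an operator pick the right control (drain, breaker, scale-out, targeted replay) instead of guessing.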