
03. Distributed Jobs

Distributed Job Queue & Retry LLD

A background-work design: durable job records, queue publishing, worker leases, idempotent handlers, bounded retries, dead-letter replay, and operational visibility.


Async Job Entry

The API accepts a request, validates intent, and creates a background job instead of keeping the user waiting for slow work.

Durable Job Record

The database stores job id, type, payload reference, status, attempts, schedule time, and idempotency key.

Worker Lease

Workers claim jobs with a lock or visibility timeout so only one active worker owns processing at a time.

Retry and DLQ

Transient failures retry with backoff and jitter; exhausted or poison jobs move to a dead-letter queue for controlled replay.

Flow Canvas

Execution map


Distributed Job Queue Execution Map

Create durable jobs, publish messages, claim leases, run idempotent handlers, retry transient failures, and move exhausted jobs to the DLQ.

  • [entry] Receive async request: the request starts background work.
  • [decision] Validate job type and payload: reject invalid work.
  • [decision] Deduplicate idempotency key: reuse the existing job.
  • [state] Insert durable job row: the database becomes the source of truth.
  • [state] Publish queue message: make the job visible to workers.
  • [service] Worker polls queue: the worker receives the job id.
  • [lock] Claim job lease: one active owner.
  • [state] Load payload and context: build the handler context.
  • [service] Execute idempotent handler: run repeat-safe logic.
  • [external] Call external dependency: perform the side effect.
  • [state] Mark success: close the successful job.
  • [decision] Classify failure: decide retry eligibility.
  • [recovery] Schedule retry with backoff: retry later.
  • [lock] Extend or release lease: maintain ownership truth.
  • [recovery] Move to DLQ: stop automatic retries.
  • [state] Emit metrics, logs, trace: make work observable.
16 Nodes
17 Connections

Static map based on job, queue, worker, retry, and observability modules

LLD Note

Why Async Jobs Exist

Some work is too slow, unreliable, or bursty for the request path. Sending emails, generating reports, calling third-party APIs, transcoding media, syncing search indexes, and processing webhooks should not keep a user request open until every downstream dependency succeeds.

The API should accept the intent, create a durable job, and return a stable job id. From that point onward, workers can process the job independently while the caller checks status, receives a callback, or observes the final result later.

  • Use synchronous APIs for validation and durable acceptance.
  • Use background workers for slow, retryable, and dependency-heavy work.
  • Keep the job store as the source of truth, not the queue message.
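The accept-then-defer shape above can be sketched as follows. This is a minimal in-process illustration, assuming an in-memory job store and queue; names such as `submit_job`, `JOBS`, and `QUEUE` are stand-ins, not a real framework API.

```python
import uuid

JOBS = {}    # durable job store (stand-in for a database table)
QUEUE = []   # queue broker (stand-in for SQS, RabbitMQ, etc.)

def submit_job(job_type, payload):
    """Validate intent, persist a durable job, then publish; return the job id."""
    if job_type not in {"send_email", "generate_report"}:  # validate intent up front
        raise ValueError(f"unknown job type: {job_type}")
    job_id = str(uuid.uuid4())
    # 1. Write the durable row first: the store, not the queue message, is truth.
    JOBS[job_id] = {"type": job_type, "payload": payload,
                    "status": "Created", "attempts": 0}
    # 2. Only then publish the message and mark the job Enqueued.
    QUEUE.append(job_id)
    JOBS[job_id]["status"] = "Enqueued"
    return job_id  # caller polls status or receives a callback later

job_id = submit_job("send_email", {"to": "user@example.com"})
```

The caller gets a stable job id back immediately; everything after the durable insert is the workers' problem.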

LLD Note

Durable Job Record and Idempotency

The job row is created before any queue message is trusted. It stores the state machine, payload reference, idempotency key, attempt count, max attempts, schedule time, lock metadata, error summary, and result reference.

Idempotency prevents duplicate API submissions and duplicate worker execution from creating duplicate side effects. A duplicate create request should return the existing job id; a duplicate worker attempt should observe prior side-effect records or use provider idempotency keys.

  • Unique caller plus idempotency key prevents duplicate job creation.
  • Handlers must be safe to run more than once after worker crash or lease expiry.
  • Payload should be versioned so old queued jobs can still be interpreted after deploys.
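Deduplicated creation can be sketched like this. It assumes a unique constraint on (caller, idempotency key); the `jobs_by_key` dict below stands in for that database index, and all field names are illustrative.

```python
import uuid

jobs_by_key = {}   # (caller_id, idempotency_key) -> job_id, i.e. the unique index
jobs = {}          # job_id -> durable job row

def create_job(caller_id, idempotency_key, job_type, payload):
    """Return the existing job on a duplicate key instead of inserting twice."""
    dedupe_key = (caller_id, idempotency_key)
    if dedupe_key in jobs_by_key:              # duplicate submission: reuse
        return jobs_by_key[dedupe_key], False
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"type": job_type, "payload_version": 1, "payload": payload,
                    "status": "Created", "attempts": 0, "max_attempts": 5}
    jobs_by_key[dedupe_key] = job_id
    return job_id, True                        # True = newly created

first, created = create_job("tenant-a", "req-123", "send_email", {"to": "a@b.c"})
second, created_again = create_job("tenant-a", "req-123", "send_email", {"to": "a@b.c"})
```

A caller that retries after a timeout gets `first == second` back, so no second job and no second side effect exist.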

LLD Note

Worker Claiming, Leases, and Visibility Timeout

Workers should not assume that receiving a queue message means they own the job. Ownership is created by an atomic claim in the job store, usually by moving an eligible job into Running with lockOwner and lockExpiresAt.

The lease protects against worker crashes. If a worker dies, the lock expires and another worker can reclaim the job. Long-running jobs must heartbeat or extend the lease before it expires.

  • Only jobs in Created, Enqueued, RetryScheduled, or expired Running states can be claimed.
  • A queue ack should happen after the durable job state is updated.
  • Concurrency limits should exist per worker, job type, tenant, and dependency when needed.
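The atomic claim can be sketched as below. In a real system this is a single compare-and-set UPDATE guarded by status and lease expiry; here a plain dict stands in for the job row, and the lease length is an illustrative value.

```python
import time

# States from which a job may be claimed, per the rules above.
CLAIMABLE = {"Created", "Enqueued", "RetryScheduled"}

def try_claim(job, worker_id, lease_seconds=30, now=None):
    """Claim the job iff it is claimable or its Running lease has expired."""
    now = now if now is not None else time.time()
    expired = job["status"] == "Running" and job["lock_expires_at"] <= now
    if job["status"] not in CLAIMABLE and not expired:
        return False                       # another worker owns it
    job["status"] = "Running"
    job["lock_owner"] = worker_id
    job["lock_expires_at"] = now + lease_seconds
    return True

job = {"status": "Enqueued", "lock_owner": None, "lock_expires_at": 0}
assert try_claim(job, "worker-1", now=100.0)      # first claim wins the lease
assert not try_claim(job, "worker-2", now=110.0)  # lease still active: refused
assert try_claim(job, "worker-2", now=200.0)      # expired lease: reclaimed
```

The third call is the crash-recovery path: worker-1 died, its lease lapsed, and worker-2 takes over safely only because the handler is repeat-safe.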

LLD Note

Retry Policy, Backoff, DLQ, and Replay

Retry policy decides whether a failure deserves another attempt. Transient timeouts, 429s, and network errors usually retry; validation errors, permission failures, and permanent provider rejections should move directly to a terminal failed or DLQ state.

Retries need exponential backoff with jitter so a dependency outage does not cause every worker to retry at the same time. Once attempts are exhausted, the job moves to DLQ with enough context for an operator to inspect and replay safely.

  • Bound retries by max attempts and maximum age.
  • Store error class and last failure message for support and alerting.
  • Replay must preserve idempotency and should not bypass validation casually.
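A minimal sketch of this policy, with illustrative base, cap, and max-attempt values, and "full jitter" (a uniform draw over the backoff window) as the jitter scheme:

```python
import random

BASE, CAP, MAX_ATTEMPTS = 2.0, 300.0, 5   # illustrative tuning values

def next_retry_delay(attempt):
    """Exponential backoff capped at CAP seconds, with full jitter in [0, delay)."""
    delay = min(CAP, BASE * (2 ** attempt))
    return random.uniform(0, delay)

def on_failure(job, transient):
    """Classify, then either schedule a retry or dead-letter the job."""
    job["attempts"] += 1
    if not transient or job["attempts"] >= MAX_ATTEMPTS:
        job["status"] = "DeadLettered"    # exhausted or permanent: stop retrying
    else:
        job["status"] = "RetryScheduled"
        job["next_run_in"] = next_retry_delay(job["attempts"])

job = {"attempts": 0, "status": "Running"}
on_failure(job, transient=True)   # transient failure: scheduled with jitter
```

The jitter draw is what spreads a fleet's retries across the backoff window instead of letting them land on the dependency in one synchronized wave.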

LLD Note

Observability and Operational Controls

A job system without telemetry becomes impossible to operate under load. Queue depth, queue lag, claim failures, attempt count, processing latency, success rate, retry rate, DLQ count, and worker heartbeat health should be visible per job type and tenant.

Operational tools should support cancellation, priority changes, DLQ inspection, targeted replay, worker draining, and dependency circuit-breaker responses. These controls keep failures contained instead of letting a backlog spread through the whole platform.

  • Emit one trace across API creation, queue publish, worker execution, and dependency call.
  • Alert on queue age, DLQ growth, repeated poison jobs, and worker heartbeat gaps.
  • Use dashboards to distinguish queue backlog from dependency outage and worker capacity shortage.
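A tiny sketch of per-type counters feeding one such signal; the metric names are assumptions, not a standard scheme, and a real system would export these to a metrics backend rather than keep them in process.

```python
from collections import Counter

counters = Counter()   # stand-in for an exported metrics registry

def record(event, job_type):
    """Count one lifecycle event per job type, e.g. jobs.claimed.send_email."""
    counters[f"jobs.{event}.{job_type}"] += 1

def retry_rate(job_type):
    """Retries per claim: one of the per-type signals worth alerting on."""
    claimed = counters[f"jobs.claimed.{job_type}"]
    return counters[f"jobs.retried.{job_type}"] / claimed if claimed else 0.0

record("claimed", "send_email")
record("retried", "send_email")
record("claimed", "send_email")
record("succeeded", "send_email")
```

Here two claims and one retry give a retry rate of 0.5 for `send_email`; a sudden jump in that ratio distinguishes a dependency outage from an ordinary backlog.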

Failure Modes

Edge cases handled

API creates job but queue publish fails

Trigger

The database insert succeeds, but the queue broker is unavailable or publish times out.

System response

The job remains durable as Created or PendingEnqueue, and an outbox dispatcher republishes it later without creating a duplicate job.
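The dispatcher half of that outbox pattern can be sketched as below, assuming the job row and its outbox entry were written in one database transaction elsewhere; the lists here stand in for the outbox table and the broker.

```python
outbox = [{"job_id": "j-1", "published": False},   # pending entries written
          {"job_id": "j-2", "published": False}]   # transactionally with the jobs
queue = []

def publish(job_id):
    """Stand-in for the broker call; may raise while the broker is down."""
    queue.append(job_id)

def dispatch_outbox():
    """Republish pending entries; safe to run repeatedly after broker outages."""
    for entry in outbox:
        if entry["published"]:
            continue
        try:
            publish(entry["job_id"])
            entry["published"] = True    # mark only after a successful publish
        except Exception:
            pass                         # stays pending; retried on the next sweep

dispatch_outbox()
```

Because the entry is only marked after a successful publish, a crash mid-sweep at worst republishes a message, which the idempotent claim path absorbs; it never loses one.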

Worker crashes while processing

Trigger

A worker claimed the job, then died before marking success, failure, or retry.

System response

The lease expires, the queue message becomes visible or the scheduler reclaims the job, and idempotent handler logic prevents duplicated side effects.

External dependency timeout

Trigger

The job handler calls a provider that does not respond before the worker deadline.

System response

The error is classified as transient, the attempt is recorded, and the job is scheduled with exponential backoff and jitter.

Poison job keeps failing

Trigger

The same job fails repeatedly because payload, tenant config, or provider state is permanently invalid.

System response

After max attempts or max age, the job moves to DLQ with failure context instead of consuming worker capacity forever.

Duplicate request with same idempotency key

Trigger

The caller retries the create-job API after timeout or network failure.

System response

The API returns the existing job id and current state instead of inserting another job or publishing another side effect.

Queue broker unavailable

Trigger

Workers cannot receive messages or the publisher cannot enqueue ready jobs.

System response

Jobs stay in the durable store, queue health alerts fire, and a dispatcher resumes publishing when the broker recovers.

Worker pool overloaded

Trigger

Queue depth and queue lag grow because workers cannot keep up with incoming jobs.

System response

Autoscaling, producer throttling, priority queues, and per-job-type concurrency limits protect critical work from starvation.
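One of those protections, per-job-type concurrency limits, can be sketched with semaphores; the limit numbers and the "throttled" return value are illustrative choices.

```python
import threading

limits = {"send_email": 10, "generate_report": 2}   # illustrative per-type caps
semaphores = {t: threading.BoundedSemaphore(n) for t, n in limits.items()}

def run_with_limit(job_type, handler):
    """Refuse to start more than the allowed concurrent handlers per type."""
    sem = semaphores[job_type]
    if not sem.acquire(blocking=False):
        return "throttled"            # leave the job queued; try again later
    try:
        return handler()
    finally:
        sem.release()                 # always free the slot, even on failure

# Exhaust both generate_report slots, then observe the limiter refuse a third.
held = [semaphores["generate_report"].acquire(blocking=False) for _ in range(2)]
result = run_with_limit("generate_report", lambda: "ran")
```

Capping heavy report jobs at two slots keeps a report backlog from starving the cheap, critical email work that shares the same worker pool.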

DLQ replay risks duplicate side effects

Trigger

An operator replays a failed job whose dependency call may have partially succeeded.

System response

Replay uses the same idempotency key and handler guards, records replay metadata, and requires inspection before restoring automatic processing.

Job State Machine

Job, queue, and worker states

| Job State | Queue | Worker | Retry/DLQ | System Note |
|---|---|---|---|---|
| Created | Not published | None | None | The job is durably recorded before the queue is trusted. |
| Enqueued | Ready | None | Not due | A queue message points to the durable job id and can be delivered to workers. |
| Claimed | Invisible or leased | Active owner | Lease active | Exactly one worker should own the job through lockOwner and lockExpiresAt. |
| Running | Hidden from peers | Executing | Lease can extend | The handler may heartbeat while doing long work, and all side effects must remain idempotent. |
| Succeeded | Acked | Completed | Closed | Result and completion time are stored before the message is acknowledged. |
| RetryScheduled | Delayed | Released | Backoff | nextRunAt is set from retry policy, attempt count is incremented, and the job returns later with jitter. |
| DeadLettered | DLQ | None | Exhausted | Automatic retries stop so ops can inspect, fix, cancel, or replay the job deliberately. |
| Cancelled | Removed or ignored | None | Closed | Caller or ops stopped the job before completion, and workers must not continue side effects. |
| Stuck/Reclaimed | Visible after timeout | Unknown or crashed | Lease expired | A new worker can safely reclaim the job only because handler execution is repeat-safe. |
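A state machine like this is worth enforcing in code, not just in a table. The sketch below derives a transition guard from the states above; the exact set of allowed edges is an assumption (it simplifies Stuck/Reclaimed to Reclaimed and lets operators replay from DeadLettered), and a real system would apply the same check inside the database update.

```python
# Allowed edges between job states; an assumption consistent with the table above.
ALLOWED = {
    "Created":        {"Enqueued", "Cancelled"},
    "Enqueued":       {"Claimed", "Cancelled"},
    "Claimed":        {"Running", "Enqueued"},
    "Running":        {"Succeeded", "RetryScheduled", "DeadLettered",
                       "Cancelled", "Reclaimed"},
    "RetryScheduled": {"Claimed", "Cancelled", "DeadLettered"},
    "Reclaimed":      {"Claimed"},
    "DeadLettered":   {"RetryScheduled"},   # deliberate operator replay only
    "Succeeded": set(), "Cancelled": set(),
}

def transition(job, new_state):
    """Reject illegal moves so bugs surface as errors, not silent corruption."""
    if new_state not in ALLOWED[job["status"]]:
        raise ValueError(f'illegal transition {job["status"]} -> {new_state}')
    job["status"] = new_state

job = {"status": "Created"}
transition(job, "Enqueued")
transition(job, "Claimed")
```

Attempting `transition(job, "Succeeded")` from Claimed raises, because success may only be recorded from Running; that is exactly the class of race a guard like this catches.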