
03. Distributed Jobs

Distributed Job Queue & Retry LLD

A background-work design: durable job records, queue publishing, worker leases, idempotent handlers, bounded retries, dead-letter replay, and operational visibility.


Async Job Entry

The API accepts a request, validates intent, and creates a background job instead of keeping the user waiting for slow work.

Durable Job Record

The database stores job id, type, payload reference, status, attempts, schedule time, and idempotency key.

Worker Lease

Workers claim jobs with a lock or visibility timeout so only one active worker owns processing at a time.

Retry and DLQ

Transient failures retry with backoff and jitter; exhausted or poison jobs move to a dead-letter queue for controlled replay.

Flow Canvas

Execution map


Distributed Job Queue Execution Map

Create durable jobs, publish messages, claim leases, run idempotent handlers, retry transient failures, and move exhausted jobs to the DLQ.

  • [entry] Receive async request: the request starts background work.
  • [decision] Validate job type and payload: reject invalid work.
  • [decision] Deduplicate idempotency key: reuse the existing job.
  • [state] Insert durable job row: the database becomes the source of truth.
  • [state] Publish queue message: make the job visible to workers.
  • [service] Worker polls queue: the worker receives the job id.
  • [lock] Claim job lease: one active owner.
  • [state] Load payload and context: build the handler context.
  • [service] Execute idempotent handler: run repeat-safe logic.
  • [external] Call external dependency: perform the side effect.
  • [state] Mark success: close the successful job.
  • [decision] Classify failure: decide retry eligibility.
  • [recovery] Schedule retry with backoff: retry later.
  • [lock] Extend or release lease: maintain ownership truth.
  • [recovery] Move to DLQ: stop automatic retries.
  • [state] Emit metrics, logs, trace: make work observable.
16 Nodes
17 Connections

Static map based on job, queue, worker, retry, and observability modules

LLD Note

Why Async Jobs Exist

Some work is too slow, unreliable, or bursty for the request path. Sending emails, generating reports, calling third-party APIs, transcoding media, syncing search indexes, and processing webhooks should not keep a user request open until every downstream dependency succeeds.

The API should accept the intent, create a durable job, and return a stable job id. From that point onward, workers can process the job independently while the caller checks status, receives a callback, or observes the final result later.

  • Use synchronous APIs for validation and durable acceptance.
  • Use background workers for slow, retryable, and dependency-heavy work.
  • Keep the job store as the source of truth, not the queue message.
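The accept-then-defer shape above can be sketched as follows. This is a minimal in-process illustration, assuming an in-memory job store and queue; names such as `submit_job`, `JOBS`, and `QUEUE` are stand-ins, not a real framework API.

```python
import uuid

JOBS = {}    # durable job store (stand-in for a database table)
QUEUE = []   # queue broker (stand-in for SQS, RabbitMQ, etc.)

def submit_job(job_type, payload):
    """Validate intent, persist a durable job, then publish; return the job id."""
    if job_type not in {"send_email", "generate_report"}:  # validate intent up front
        raise ValueError(f"unknown job type: {job_type}")
    job_id = str(uuid.uuid4())
    # 1. Write the durable row first: the store, not the queue message, is truth.
    JOBS[job_id] = {"type": job_type, "payload": payload,
                    "status": "Created", "attempts": 0}
    # 2. Only then publish the message and mark the job Enqueued.
    QUEUE.append(job_id)
    JOBS[job_id]["status"] = "Enqueued"
    return job_id  # caller polls status or receives a callback later

job_id = submit_job("send_email", {"to": "user@example.com"})
```

The caller gets a stable job id back immediately; everything after the durable insert is the workers' problem.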

LLD Note

Durable Job Record and Idempotency

The job row is created before any queue message is trusted. It stores the state machine, payload reference, idempotency key, attempt count, max attempts, schedule time, lock metadata, error summary, and result reference.

Idempotency prevents duplicate API submissions and duplicate worker execution from creating duplicate side effects. A duplicate create request should return the existing job id; a duplicate worker attempt should observe prior side-effect records or use provider idempotency keys.

  • Unique caller plus idempotency key prevents duplicate job creation.
  • Handlers must be safe to run more than once after worker crash or lease expiry.
  • Payload should be versioned so old queued jobs can still be interpreted after deploys.
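Deduplicated creation can be sketched like this. It assumes a unique constraint on (caller, idempotency key); the `jobs_by_key` dict below stands in for that database index, and all field names are illustrative.

```python
import uuid

jobs_by_key = {}   # (caller_id, idempotency_key) -> job_id, i.e. the unique index
jobs = {}          # job_id -> durable job row

def create_job(caller_id, idempotency_key, job_type, payload):
    """Return the existing job on a duplicate key instead of inserting twice."""
    dedupe_key = (caller_id, idempotency_key)
    if dedupe_key in jobs_by_key:              # duplicate submission: reuse
        return jobs_by_key[dedupe_key], False
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"type": job_type, "payload_version": 1, "payload": payload,
                    "status": "Created", "attempts": 0, "max_attempts": 5}
    jobs_by_key[dedupe_key] = job_id
    return job_id, True                        # True = newly created

first, created = create_job("tenant-a", "req-123", "send_email", {"to": "a@b.c"})
second, created_again = create_job("tenant-a", "req-123", "send_email", {"to": "a@b.c"})
```

A caller that retries after a timeout gets `first == second` back, so no second job and no second side effect exist.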

LLD Note

Worker Claiming, Leases, and Visibility Timeout

Workers should not assume that receiving a queue message means they own the job. Ownership is created by an atomic claim in the job store, usually by moving an eligible job into Running with lockOwner and lockExpiresAt.

The lease protects against worker crashes. If a worker dies, the lock expires and another worker can reclaim the job. Long-running jobs must heartbeat or extend the lease before it expires.

  • Only jobs in Created, Enqueued, RetryScheduled, or expired Running states can be claimed.
  • A queue ack should happen after the durable job state is updated.
  • Concurrency limits should exist per worker, job type, tenant, and dependency when needed.
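The atomic claim can be sketched as below. In a real system this is a single compare-and-set UPDATE guarded by status and lease expiry; here a plain dict stands in for the job row, and the lease length is an illustrative value.

```python
import time

# States from which a job may be claimed, per the rules above.
CLAIMABLE = {"Created", "Enqueued", "RetryScheduled"}

def try_claim(job, worker_id, lease_seconds=30, now=None):
    """Claim the job iff it is claimable or its Running lease has expired."""
    now = now if now is not None else time.time()
    expired = job["status"] == "Running" and job["lock_expires_at"] <= now
    if job["status"] not in CLAIMABLE and not expired:
        return False                       # another worker owns it
    job["status"] = "Running"
    job["lock_owner"] = worker_id
    job["lock_expires_at"] = now + lease_seconds
    return True

job = {"status": "Enqueued", "lock_owner": None, "lock_expires_at": 0}
assert try_claim(job, "worker-1", now=100.0)      # first claim wins the lease
assert not try_claim(job, "worker-2", now=110.0)  # lease still active: refused
assert try_claim(job, "worker-2", now=200.0)      # expired lease: reclaimed
```

The third call is the crash-recovery path: worker-1 died, its lease lapsed, and worker-2 takes over safely only because the handler is repeat-safe.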

LLD Note

Retry Policy, Backoff, DLQ, and Replay

Retry policy decides whether a failure deserves another attempt. Transient timeouts, 429s, and network errors usually retry; validation errors, permission failures, and permanent provider rejections should move directly to a terminal failed or DLQ state.

Retries need exponential backoff with jitter so a dependency outage does not cause every worker to retry at the same time. Once attempts are exhausted, the job moves to DLQ with enough context for an operator to inspect and replay safely.

  • Bound retries by max attempts and maximum age.
  • Store error class and last failure message for support and alerting.
  • Replay must preserve idempotency and should not bypass validation casually.
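A minimal sketch of this policy, with illustrative base, cap, and max-attempt values, and "full jitter" (a uniform draw over the backoff window) as the jitter scheme:

```python
import random

BASE, CAP, MAX_ATTEMPTS = 2.0, 300.0, 5   # illustrative tuning values

def next_retry_delay(attempt):
    """Exponential backoff capped at CAP seconds, with full jitter in [0, delay)."""
    delay = min(CAP, BASE * (2 ** attempt))
    return random.uniform(0, delay)

def on_failure(job, transient):
    """Classify, then either schedule a retry or dead-letter the job."""
    job["attempts"] += 1
    if not transient or job["attempts"] >= MAX_ATTEMPTS:
        job["status"] = "DeadLettered"    # exhausted or permanent: stop retrying
    else:
        job["status"] = "RetryScheduled"
        job["next_run_in"] = next_retry_delay(job["attempts"])

job = {"attempts": 0, "status": "Running"}
on_failure(job, transient=True)   # transient failure: scheduled with jitter
```

The jitter draw is what spreads a fleet's retries across the backoff window instead of letting them land on the dependency in one synchronized wave.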

LLD Note

Observability and Operational Controls

A job system without telemetry becomes impossible to operate under load. Queue depth, queue lag, claim failures, attempt count, processing latency, success rate, retry rate, DLQ count, and worker heartbeat health should be visible per job type and tenant.

Operational tools should support cancellation, priority changes, DLQ inspection, targeted replay, worker draining, and dependency circuit-breaker responses. These controls keep failures contained instead of letting a backlog spread through the whole platform.

  • Emit one trace across API creation, queue publish, worker execution, and dependency call.
  • Alert on queue age, DLQ growth, repeated poison jobs, and worker heartbeat gaps.
  • Use dashboards to distinguish queue backlog from dependency outage and worker capacity shortage.
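A tiny sketch of per-type counters feeding one such signal; the metric names are assumptions, not a standard scheme, and a real system would export these to a metrics backend rather than keep them in process.

```python
from collections import Counter

counters = Counter()   # stand-in for an exported metrics registry

def record(event, job_type):
    """Count one lifecycle event per job type, e.g. jobs.claimed.send_email."""
    counters[f"jobs.{event}.{job_type}"] += 1

def retry_rate(job_type):
    """Retries per claim: one of the per-type signals worth alerting on."""
    claimed = counters[f"jobs.claimed.{job_type}"]
    return counters[f"jobs.retried.{job_type}"] / claimed if claimed else 0.0

record("claimed", "send_email")
record("retried", "send_email")
record("claimed", "send_email")
record("succeeded", "send_email")
```

Here two claims and one retry give a retry rate of 0.5 for `send_email`; a sudden jump in that ratio distinguishes a dependency outage from an ordinary backlog.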

Failure Modes

Edge cases handled

API creates job but queue publish fails

Trigger

The database insert succeeds, but the queue broker is unavailable or publish times out.

System response

The job remains durable as Created or PendingEnqueue, and an outbox dispatcher republishes it later without creating a duplicate job.
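The dispatcher half of that outbox pattern can be sketched as below, assuming the job row and its outbox entry were written in one database transaction elsewhere; the lists here stand in for the outbox table and the broker.

```python
outbox = [{"job_id": "j-1", "published": False},   # pending entries written
          {"job_id": "j-2", "published": False}]   # transactionally with the jobs
queue = []

def publish(job_id):
    """Stand-in for the broker call; may raise while the broker is down."""
    queue.append(job_id)

def dispatch_outbox():
    """Republish pending entries; safe to run repeatedly after broker outages."""
    for entry in outbox:
        if entry["published"]:
            continue
        try:
            publish(entry["job_id"])
            entry["published"] = True    # mark only after a successful publish
        except Exception:
            pass                         # stays pending; retried on the next sweep

dispatch_outbox()
```

Because the entry is only marked after a successful publish, a crash mid-sweep at worst republishes a message, which the idempotent claim path absorbs; it never loses one.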

Worker crashes while processing

Trigger

A worker claimed the job, then died before marking success, failure, or retry.

System response

The lease expires, the queue message becomes visible or the scheduler reclaims the job, and idempotent handler logic prevents duplicated side effects.

External dependency timeout

Trigger

The job handler calls a provider that does not respond before the worker deadline.

System response

The error is classified as transient, the attempt is recorded, and the job is scheduled with exponential backoff and jitter.

Poison job keeps failing

Trigger

The same job fails repeatedly because payload, tenant config, or provider state is permanently invalid.

System response

After max attempts or max age, the job moves to DLQ with failure context instead of consuming worker capacity forever.

Duplicate request with same idempotency key

Trigger

The caller retries the create-job API after timeout or network failure.

System response

The API returns the existing job id and current state instead of inserting another job or publishing another side effect.

Queue broker unavailable

Trigger

Workers cannot receive messages or the publisher cannot enqueue ready jobs.

System response

Jobs stay in the durable store, queue health alerts fire, and a dispatcher resumes publishing when the broker recovers.

Worker pool overloaded

Trigger

Queue depth and queue lag grow because workers cannot keep up with incoming jobs.

System response

Autoscaling, producer throttling, priority queues, and per-job-type concurrency limits protect critical work from starvation.
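One of those protections, per-job-type concurrency limits, can be sketched with semaphores; the limit numbers and the "throttled" return value are illustrative choices.

```python
import threading

limits = {"send_email": 10, "generate_report": 2}   # illustrative per-type caps
semaphores = {t: threading.BoundedSemaphore(n) for t, n in limits.items()}

def run_with_limit(job_type, handler):
    """Refuse to start more than the allowed concurrent handlers per type."""
    sem = semaphores[job_type]
    if not sem.acquire(blocking=False):
        return "throttled"            # leave the job queued; try again later
    try:
        return handler()
    finally:
        sem.release()                 # always free the slot, even on failure

# Exhaust both generate_report slots, then observe the limiter refuse a third.
held = [semaphores["generate_report"].acquire(blocking=False) for _ in range(2)]
result = run_with_limit("generate_report", lambda: "ran")
```

Capping heavy report jobs at two slots keeps a report backlog from starving the cheap, critical email work that shares the same worker pool.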

DLQ replay risks duplicate side effects

Trigger

An operator replays a failed job whose dependency call may have partially succeeded.

System response

Replay uses the same idempotency key and handler guards, records replay metadata, and requires inspection before restoring automatic processing.

Job State Machine

Job, queue, and worker states

| Job State | Queue | Worker | Retry/DLQ | System Note |
|---|---|---|---|---|
| Created | Not published | None | None | The job is durably recorded before the queue is trusted. |
| Enqueued | Ready | None | Not due | A queue message points to the durable job id and can be delivered to workers. |
| Claimed | Invisible or leased | Active owner | Lease active | Exactly one worker should own the job through lockOwner and lockExpiresAt. |
| Running | Hidden from peers | Executing | Lease can extend | The handler may heartbeat while doing long work, and all side effects must remain idempotent. |
| Succeeded | Acked | Completed | Closed | Result and completion time are stored before the message is acknowledged. |
| RetryScheduled | Delayed | Released | Backoff | nextRunAt is set from retry policy, attempt count is incremented, and the job returns later with jitter. |
| DeadLettered | DLQ | None | Exhausted | Automatic retries stop so ops can inspect, fix, cancel, or replay the job deliberately. |
| Cancelled | Removed or ignored | None | Closed | Caller or ops stopped the job before completion, and workers must not continue side effects. |
| Stuck/Reclaimed | Visible after timeout | Unknown or crashed | Lease expired | A new worker can safely reclaim the job only because handler execution is repeat-safe. |
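A state machine like this is worth enforcing in code, not just in a table. The sketch below derives a transition guard from the states above; the exact set of allowed edges is an assumption (it simplifies Stuck/Reclaimed to Reclaimed and lets operators replay from DeadLettered), and a real system would apply the same check inside the database update.

```python
# Allowed edges between job states; an assumption consistent with the table above.
ALLOWED = {
    "Created":        {"Enqueued", "Cancelled"},
    "Enqueued":       {"Claimed", "Cancelled"},
    "Claimed":        {"Running", "Enqueued"},
    "Running":        {"Succeeded", "RetryScheduled", "DeadLettered",
                       "Cancelled", "Reclaimed"},
    "RetryScheduled": {"Claimed", "Cancelled", "DeadLettered"},
    "Reclaimed":      {"Claimed"},
    "DeadLettered":   {"RetryScheduled"},   # deliberate operator replay only
    "Succeeded": set(), "Cancelled": set(),
}

def transition(job, new_state):
    """Reject illegal moves so bugs surface as errors, not silent corruption."""
    if new_state not in ALLOWED[job["status"]]:
        raise ValueError(f'illegal transition {job["status"]} -> {new_state}')
    job["status"] = new_state

job = {"status": "Created"}
transition(job, "Enqueued")
transition(job, "Claimed")
```

Attempting `transition(job, "Succeeded")` from Claimed raises, because success may only be recorded from Running; that is exactly the class of race a guard like this catches.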