Using Leases to Manage Multi-Instance Environments

Published on June 20, 2025

Senior Software Engineer

Imagine a deployed application consisting of a single process hosting a web service and background job processor within a single service component. The background job processor accesses shared resources, such as tasks stored in a database table acting as a work queue. However, when multiple replicas of the application run simultaneously, each replica may attempt to process the same task, causing race conditions or duplicated work. These problems are compounded when the application is implemented as a monolith and cannot split the worker code from the web service or reduce the number of replicas, making resource coordination even more challenging.

While different solutions exist, leases offer a simple and scalable solution to resource management challenges. In this article, we’ll cover how to use leases to share resources in multi-instance environments to ensure high availability and scalability in modern applications.

Exploring solutions

Centralized Locks: Using a central service to coordinate access to resources. While effective, this can become a bottleneck or single point of failure.
Distributed Transactions: Ensure atomic updates across multiple instances. This can be complex and challenging to scale.
Leases: Grant temporary, time-bound access to resources. Leases offer a simple and scalable solution to many resource management challenges.

Leases provide temporary, time-bound access to resources, automatically revoking access after a fixed duration unless explicitly renewed. This approach prevents race conditions and ensures that resources are released even in the case of failure.

Key features of leases

Time-Bound Access: Grants exclusive access for a specified duration.
Renewal Mechanism: Allows lease-holders to extend access before expiration.
Automatic Expiration: Ensures resources are released without manual intervention.
Failure Resilience: Resources become available for others when a lease-holder crashes or becomes unresponsive.

Tools that support leases

Redis: Use time-to-live (TTL) settings to manage resource leases. For example, a key representing the resource is set with a TTL, and automatic expiration ensures resource availability when the lease expires.
Consul: Provides a distributed key-value store that can be used to implement lease-based coordination. Developers can leverage session-based locking mechanisms to manage shared resources and ensure consistent state across instances.
Zookeeper: Provides primitives for lease management in distributed systems, such as ephemeral nodes that automatically expire when the client session ends. This allows developers to implement resource coordination and leader election mechanisms tailored to their distributed workloads.

Advantages of leases

Simplifies resource management by automating resource release.
Avoids indefinitely held locks and limits split-brain scenarios, because every grant is time-bounded.
Enhances fault tolerance by ensuring resources are reclaimed upon failure.

Sample Application: Task processor with managed PostgreSQL

The sample application demonstrates an end-to-end implementation of a centralized lease service, a task management service, and task processors. Also included is a tool that generates fake work (random sleep intervals) for the processors to consume.

Below we explore the various components of the sample application that demonstrate how leases can be used in a Task Processor service deployed on DigitalOcean’s App Platform.

Database

We are using PostgreSQL because it is a battle-tested implementation of a fully ACID-compliant transactional database with strong guarantees. Instead of implementing ACID ourselves, we delegate the complexity of the locking mechanism to the database.

The application uses Prisma to:

Manage database schema via migrations.
Generate models for the entities.
Interact with the database.

The Prisma abstractions do not support all of the Postgres-specific functionality we’re using, so in those cases we fall back to raw SQL via queryRaw.

Lease service

This service is crucial for managing the lifecycle of worker processes. It ensures that only one worker instance can hold a lease on a task (resource) at any given time, which prevents race conditions and ensures safe state transitions.

The implementation found in the repo is designed to be generalized and has no knowledge of any of the task processing code. The API routes enable the creation, renewal, release, and retrieval of leases.

Leases client

The leases client is a crucial component of the lease management service, responsible for interacting with the lease API to acquire, renew, and release leases. It provides a structured way to manage leases, ensuring that resources are properly leased and renewed as needed. As a bonus, the leases client can optionally renew the lease for you automatically.

Sample app Database schema

Below is the schema for the leases table:

CREATE TABLE IF NOT EXISTS public.leases
(
    id integer NOT NULL DEFAULT nextval('leases_id_seq'::regclass),
    resource text COLLATE pg_catalog."default" NOT NULL,
    holder text COLLATE pg_catalog."default",
    created_at timestamp(3) without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    renewed_at timestamp(3) without time zone,
    released_at timestamp(3) without time zone,
    expires_at timestamp(3) without time zone,
    CONSTRAINT leases_pkey PRIMARY KEY (id)
)

Sample app API routes

/api/leases

GET: Fetches the status of the worker and the list of all leases. POST: Creates a new lease or updates an existing one if it has expired.

Request body:

{
    "resource": "resource_name",
    "holder": "holder_name"
}

How lease creation works

The POST method in the lease service is designed to create a new lease or update an existing one if it has expired. Here’s how it works and why it’s implemented this way:

INSERT INTO leases (resource, holder, expires_at)
VALUES (${resource}, ${holder}, NOW() + INTERVAL '30 seconds')
ON CONFLICT (resource)
DO UPDATE
    SET
        holder = ${holder},
        created_at = NOW(),
        renewed_at = null,
        released_at = null,
        expires_at = NOW() + INTERVAL '30 seconds'
    WHERE leases.expires_at <= NOW()
RETURNING *;

Insert New Lease: The INSERT INTO leases statement attempts to insert a new lease with the specified resource, holder, and an expiration time set to 30 seconds from the current time (NOW() + INTERVAL '30 seconds').
Conflict Handling: The ON CONFLICT (resource) clause specifies that if a lease with the same resource already exists, the conflict should be resolved by updating the existing lease.
Update Existing Lease: The DO UPDATE clause updates the existing lease with the new holder, resets the created_at timestamp to the current time, and clears the renewed_at and released_at fields. The expires_at field is also updated to 30 seconds from the current time.
Conditional Update: The WHERE leases.expires_at <= NOW() condition ensures that the update only occurs if the existing lease has already expired. This prevents overwriting an active lease.
Return Updated Lease: The RETURNING * clause returns the newly inserted or updated lease record.
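The acquire-or-take-over behavior of this statement can be sketched with a small in-memory model. This is only an illustration under stated assumptions: Lease, LeaseTable, and acquireLease are hypothetical names that do not come from the sample repo, and time is a plain number on a fake clock. The object key plays the role of the unique index on resource that ON CONFLICT (resource) relies on.

```typescript
// Hypothetical in-memory model of the upsert; none of these names come
// from the sample repo.
interface Lease {
  resource: string;
  holder: string;
  createdAt: number; // milliseconds on a fake clock
  renewedAt: number | null;
  releasedAt: number | null;
  expiresAt: number;
}

// Keyed by resource, like a unique index on the resource column.
type LeaseTable = Record<string, Lease | undefined>;

const LEASE_TTL_MS = 30000; // mirrors INTERVAL '30 seconds'

// Returns the granted lease, or null when an unexpired lease exists
// (the SQL statement returns zero rows in that case).
function acquireLease(
  table: LeaseTable,
  resource: string,
  holder: string,
  now: number,
): Lease | null {
  const existing = table[resource];
  // ON CONFLICT ... WHERE leases.expires_at <= NOW()
  if (existing !== undefined && existing.expiresAt > now) {
    return null;
  }
  const lease: Lease = {
    resource,
    holder,
    createdAt: now,
    renewedAt: null,
    releasedAt: null,
    expiresAt: now + LEASE_TTL_MS,
  };
  table[resource] = lease;
  return lease;
}

const table: LeaseTable = {};
const first = acquireLease(table, "task-1", "worker-a", 0);      // granted to worker-a
const second = acquireLease(table, "task-1", "worker-b", 10000); // null: lease still active
const third = acquireLease(table, "task-1", "worker-b", 30000);  // granted: lease expired
```

A caller treats a null result the same way it would treat an empty result set from the database: someone else holds the resource, so back off and try again later.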

Why lease creation?

Atomic Operations: Using a single SQL statement with ON CONFLICT ensures that the creation or update of a lease is atomic. This means that the operation is completed in a single step, reducing the risk of race conditions.
Conflict Resolution: The ON CONFLICT clause allows the system to handle cases where multiple requests might try to create a lease for the same resource simultaneously. By updating the existing lease if it has expired, we ensure that only one active lease exists for a resource at any given time.
Lease Expiration Management: The WHERE leases.expires_at <= NOW() condition ensures that only expired leases are updated. This prevents active leases from being prematurely overwritten, maintaining the integrity of the lease system.
Efficiency: Combining the insert and update operations into a single SQL statement reduces the number of database queries, improving the efficiency of the lease management process.

This approach ensures that the lease service can reliably create and update leases while preventing race conditions and maintaining the integrity of the lease system.

/api/leases/active GET: Fetches the list of active leases.

/api/leases/expired GET: Fetches the list of expired leases.

/api/leases/renew PUT: Renews a lease by extending its expiration time. Body:

{
    "resource": "resource_name",
    "holder": "holder_name"
}

How the lease renewals work

The PUT method in the lease service is designed to renew an existing lease by extending its expiration time. Here’s how it works and why it’s implemented this way:

UPDATE leases
SET
    renewed_at = NOW(),
    expires_at = NOW() + INTERVAL '30 seconds'
WHERE
    holder = ${holder}
    AND resource = ${resource}
    AND expires_at > NOW()
    AND released_at is null
RETURNING *;

Update Statement: The UPDATE leases statement specifies that we are updating the leases table.
Set Clause: SET renewed_at = NOW(), expires_at = NOW() + INTERVAL '30 seconds':
- renewed_at is set to the current timestamp (NOW()), indicating when the lease was last renewed.
- expires_at is set to 30 seconds from the current timestamp (NOW() + INTERVAL '30 seconds'), extending the lease’s expiration time.
Where Clause: WHERE holder = ${holder} AND resource = ${resource} AND expires_at > NOW() AND released_at is null:
- holder = ${holder}: Matches the lease with the specified holder.
- resource = ${resource}: Matches the lease with the specified resource.
- expires_at > NOW(): Ensures that only leases that have not yet expired are updated.
- released_at is null: Ensures that only leases that have not been released are updated.
Returning Clause: RETURNING *: Returns the updated lease record. This is useful for confirming the update and providing the updated lease details in the response.
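The four conditions in the WHERE clause can be expressed as a single predicate. The sketch below is an in-memory illustration only: LeaseRow and renewLease are hypothetical names, and timestamps are plain numbers on a fake clock. A null result corresponds to the UPDATE matching zero rows.

```typescript
// Hypothetical in-memory model of the renewal UPDATE.
interface LeaseRow {
  resource: string;
  holder: string;
  renewedAt: number | null;
  releasedAt: number | null;
  expiresAt: number; // milliseconds on a fake clock
}

const RENEW_TTL_MS = 30000; // mirrors INTERVAL '30 seconds'

function renewLease(
  leases: LeaseRow[],
  resource: string,
  holder: string,
  now: number,
): LeaseRow | null {
  for (const lease of leases) {
    const matches =
      lease.holder === holder &&
      lease.resource === resource &&
      lease.expiresAt > now && // an expired lease cannot be revived
      lease.releasedAt === null; // a released lease cannot be revived either
    if (matches) {
      lease.renewedAt = now;
      lease.expiresAt = now + RENEW_TTL_MS;
      return lease;
    }
  }
  return null; // zero rows updated
}

const leases: LeaseRow[] = [
  { resource: "task-1", holder: "worker-a", renewedAt: null, releasedAt: null, expiresAt: 30000 },
];
const renewed = renewLease(leases, "task-1", "worker-a", 20000);     // expiresAt moves to 50000
const wrongHolder = renewLease(leases, "task-1", "worker-b", 25000); // null: not the holder
const tooLate = renewLease(leases, "task-1", "worker-a", 50000);     // null: already expired
```

Because the holder check is part of the same predicate, a worker that lost its lease cannot extend a lease that another worker has since acquired.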

Why lease releases?

Atomic Update
- The query performs an atomic update, ensuring that the lease release is completed in a single step, which reduces the risk of race conditions.
Conditional Update
- The WHERE clause ensures that only leases that have not already been released are updated. This prevents the release of leases that are already marked as released.
Timestamp Management
- Updating the released_at and expires_at timestamps ensures that the lease’s release time and expiration time are accurately tracked.
Efficient Feedback
- The RETURNING * clause provides immediate feedback on the updated lease, allowing the application to respond with the updated lease details.

/api/leases/renewed GET: Fetches the list of renewed leases.

/api/leases/released GET: Fetches the list of released leases.

/api/leases/[id] GET: Fetches a lease by its ID. DELETE: Releases (deletes) a lease by its ID. The behavior is the same as the DELETE /api/leases/release route, except the lease’s ID is included in the WHERE clause of the UPDATE statement. Body:

{
    "resource": "resource_name",
    "holder": "holder_name"
}

Task service

The Task Service is responsible for managing tasks within the application. It provides various API endpoints to create, update, retrieve, and manage the lifecycle of tasks. The Task Service wraps the Leases API to guarantee that a task is processed by only one worker at a time. Of course, this assumes that the workers obey the rules of the API and do not perform work on the task once the lease has expired.

Task service database schema

Below is the database schema for the tasks table:

CREATE TABLE IF NOT EXISTS public.tasks
(
    id integer NOT NULL DEFAULT nextval('tasks_id_seq'::regclass),
    task_data jsonb NOT NULL,
    scheduled_at timestamp(3) without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processor text COLLATE pg_catalog."default",
    last_heartbeat_at timestamp(3) without time zone,
    must_heartbeat_before timestamp(3) without time zone,
    started_at timestamp(3) without time zone,
    processed_at timestamp(3) without time zone,
    task_output jsonb,
    CONSTRAINT tasks_pkey PRIMARY KEY (id)
)

Task service API routes

Get next tasks

POST /api/tasks/next

The POST method in this route retrieves the next available task for processing, ensuring that a lease is acquired for the task.

The Get Next Task query is designed to retrieve the next available task for processing:

SELECT * 
FROM tasks 
WHERE 
    processed_at is null
ORDER BY scheduled_at ASC
LIMIT 1
OFFSET ${tasksToSkip}
FOR UPDATE;

SELECT * FROM tasks
- This part of the query selects all columns from the tasks table.
WHERE processed_at is null
- This condition filters the tasks to only include those that have not yet been processed. The processed_at column is null for tasks that are still pending processing.
ORDER BY scheduled_at ASC
- This clause orders the tasks by their scheduled_at timestamp in ascending order. This ensures that tasks scheduled earlier are prioritized for processing.
LIMIT 1
- This limits the result to a single task. The query will return only one task that meets the criteria.
OFFSET ${tasksToSkip}
- This clause skips a specified number of tasks before returning the result. The ${tasksToSkip} variable is used to dynamically adjust the number of tasks to skip. This is useful for iterating through tasks if the first few are already leased or otherwise unavailable.
FOR UPDATE
- This clause locks the selected row for update. Other transactions that try to modify the row, or to lock it with their own SELECT ... FOR UPDATE, must wait until the current transaction completes (plain SELECT statements are not blocked). This ensures that the task is not picked up by another processor while it is being assigned.

Why the get next task?

Prioritization
- By ordering tasks by their scheduled_at timestamp, the query ensures that tasks scheduled earlier are processed first. This helps in maintaining the correct order of task execution.
Concurrency Control
- The FOR UPDATE clause is crucial for preventing race conditions. It ensures that once a task is selected, it is locked for the current transaction, preventing other processors from picking it up simultaneously, allowing us to acquire a lease for the resource (task).
Dynamic Skipping
- The OFFSET ${tasksToSkip} clause allows the query to skip tasks that might already be leased or otherwise unavailable. This helps in efficiently finding the next available task without repeatedly querying the same tasks.
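One way the query and the lease acquisition could fit together is a loop that increases the offset whenever the candidate task is already leased. The sketch below is an assumption about that flow, not the repo's code: Task, selectNextTask, claimNextTask, and the tryAcquire callback (standing in for a POST to /api/leases) are hypothetical, and the query is modeled over an in-memory array.

```typescript
// Hypothetical in-memory model of the "get next task" flow.
interface Task {
  id: number;
  scheduledAt: number;
  processedAt: number | null;
}

// Stand-in for the SELECT: unprocessed tasks, oldest first, one row at OFFSET.
function selectNextTask(tasks: Task[], tasksToSkip: number): Task | null {
  const pending = tasks
    .filter((t) => t.processedAt === null)
    .sort((a, b) => a.scheduledAt - b.scheduledAt);
  return pending[tasksToSkip] ?? null;
}

// tryAcquire returns true when a lease on the task was granted.
function claimNextTask(
  tasks: Task[],
  tryAcquire: (taskId: number) => boolean,
): Task | null {
  let tasksToSkip = 0;
  while (true) {
    const candidate = selectNextTask(tasks, tasksToSkip);
    if (candidate === null) {
      return null; // nothing left to try
    }
    if (tryAcquire(candidate.id)) {
      return candidate;
    }
    tasksToSkip += 1; // leased by someone else: look one row further
  }
}

const queue: Task[] = [
  { id: 1, scheduledAt: 100, processedAt: 150 },
  { id: 2, scheduledAt: 200, processedAt: null },
  { id: 3, scheduledAt: 300, processedAt: null },
];
const leasedElsewhere = [2];
const claimed = claimNextTask(queue, (id) => leasedElsewhere.indexOf(id) === -1);
// claimed is task 3: task 1 is already processed, task 2 is leased by another worker.
```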

Heartbeat task

PUT /api/tasks/[id]/heartbeat

The PUT method route is used to update the heartbeat of a task. This ensures that the task is still being processed by the correct processor and renews the lease to prevent it from being picked up by another processor. If a task processor fails to heartbeat in time, the underlying lease expires, and the processor should abandon and discard all work and get a new task to process.

Transaction
- The code executes a transaction using Prisma’s $transaction method to ensure atomicity and consistency.
Task Retrieval
- Within the transaction, it retrieves the task with the specified id using a raw SQL query with the FOR UPDATE clause to lock the row.
Processor Validation
- It checks if the task is assigned to the specified processor. If not, it returns a 200 OK response with a message indicating that the task is not assigned to the processor.
Processed Check
- It checks if the task has already been processed. If it has, it returns a 409 Conflict response with a message indicating that the task has already been processed.
Lease Renewal
- It sends a PUT request to the lease service to renew the lease for the task.
- If the lease renewal fails and the status is not 404, it returns a 500 Internal Server Error response with an appropriate error message.
- If the lease has expired (status 404), it returns a 409 Conflict response with a message indicating that the task lease has expired.
Task Update
- If the lease renewal is successful, it updates the task’s lastHeartBeatAt and mustHeartBeatBefore fields with the renewed lease timestamps.
- It returns the updated task with a 202 Accepted status.
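The order of these checks and the status code each one produces can be summarized as a pure function. This is a sketch under assumptions: TaskRow, HeartbeatResult, and heartbeat are hypothetical names, the messages are illustrative, the renewLease callback stands in for the PUT request to the lease service and returns its HTTP status, and "renewal fails" is interpreted here as any non-2xx status. The database transaction and row lock are left out.

```typescript
// Hypothetical model of the heartbeat route's decision logic.
interface TaskRow {
  processor: string | null;
  processedAt: number | null;
}

interface HeartbeatResult {
  status: number;
  message: string;
}

function heartbeat(
  task: TaskRow,
  processor: string,
  renewLease: () => number, // returns the lease service's HTTP status
): HeartbeatResult {
  if (task.processor !== processor) {
    return { status: 200, message: "task is not assigned to this processor" };
  }
  if (task.processedAt !== null) {
    return { status: 409, message: "task has already been processed" };
  }
  // The lease service is only contacted once the task itself checks out.
  const leaseStatus = renewLease();
  if (leaseStatus === 404) {
    return { status: 409, message: "task lease has expired" };
  }
  if (leaseStatus < 200 || leaseStatus >= 300) {
    return { status: 500, message: "lease renewal failed" };
  }
  return { status: 202, message: "heartbeat accepted" };
}

const active: TaskRow = { processor: "worker-a", processedAt: null };
const accepted = heartbeat(active, "worker-a", () => 200); // status 202
const expired = heartbeat(active, "worker-a", () => 404);  // status 409
const stranger = heartbeat(active, "worker-b", () => 200); // status 200
```

A worker that receives anything other than 202 should treat the task as no longer its own and stop working on it.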

Why the heartbeat task?

Atomic Operations
- Using a transaction ensures that the heartbeat update and lease renewal are performed atomically, reducing the risk of race conditions.
Concurrency Control
- The FOR UPDATE clause locks the task row, preventing other transactions from modifying it simultaneously.
Lease Management
- Renewing the lease ensures that the task remains assigned to the processor and prevents other processors from picking it up.

Complete task

PUT /api/tasks/[id]/complete

The PUT method route is used to mark a task as completed. It ensures that the task is assigned to the correct processor, renews the lease to prevent other processors from picking it up, and updates the task’s status.

Transaction
- The code executes a transaction using Prisma’s $transaction method to ensure atomicity and consistency.
Task Retrieval
- Within the transaction, it retrieves the task with the specified ID using a raw SQL query with the FOR UPDATE clause to lock the row.
- If the task is not found, it returns a 200 OK response with a message indicating that the task was not found.
Processor Validation
- It checks if the task is assigned to the specified processor. If not, it returns a 200 OK response with a message indicating that the task is not assigned to the processor.
Processed Check
- It checks if the task has already been processed. If it has, it returns a 409 Conflict response with a message indicating that the task has already been processed.
Lease Renewal
- It sends a PUT request to the lease service to renew the lease for the task.
- If the lease renewal fails and the status is not 404, it returns a 500 Internal Server Error response with an appropriate error message.
- If the lease has expired (status 404), it returns a 409 Conflict response with a message indicating that the task lease has expired.
Task Update
- If the lease renewal is successful, it updates the task’s lastHeartBeatAt, mustHeartBeatBefore, processedAt, and taskOutput fields with the renewed lease timestamps and the provided task output.
- It returns the updated task with a 202 Accepted status.

Why the complete task?

Atomic Operations
- Using a transaction ensures that the task completion and lease renewal are performed atomically, reducing the risk of race conditions.
Concurrency Control
- The FOR UPDATE clause locks the task row, preventing other transactions from modifying it simultaneously.
Lease Management
- Renewing the lease ensures that the task remains assigned to the processor and prevents other processors from picking it up while we’re trying to complete the task. This matters when the SQL update hits a transient issue and fails: the failed transaction releases the row lock acquired via FOR UPDATE, but the freshly renewed lease still protects the task, so the task processor can retry without worrying about the lease expiring out from under it.

These routes use simple queries to work with the task list:

Get Task by ID (/api/tasks/[id])
Get Processed Tasks (/api/tasks/processed)
Get Started Tasks (/api/tasks/started)
Get All Tasks (/api/tasks)

Task worker

The task worker is a worker process used to continuously fetch, process, and complete tasks from the task service. It ensures that tasks are processed reliably by periodically sending heartbeats to renew leases and handling potential errors gracefully.

How the worker handles task execution and lease management

The worker follows a structured loop to continuously fetch and process tasks while ensuring safe execution through a leasing mechanism. This prevents multiple workers from handling the same task simultaneously and ensures that tasks are completed reliably. Below, we break down how the worker operates.

Fetching a task

The worker begins by querying the task queue to retrieve the next available task. If no task is found, it waits for a short period before retrying. This prevents unnecessary load on the task queue while ensuring new tasks are picked up promptly.

Lease management with heartbeats

Once a task is assigned, the worker starts a heartbeat loop to maintain its lease. This lease prevents other workers from claiming the same task while it’s being processed. The worker must send periodic heartbeat signals to renew its lease. If it fails to do so, the lease expires, and the task becomes available for reassignment.

If the lease expires, the worker stops processing and abandons the task.
If the lease is renewed, the worker continues execution, occasionally simulating high latency to test resilience.
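The effect of a latency spike on the lease can be shown with a deterministic simulation. This is a simplified sketch, not the worker's actual code: runTask and Outcome are hypothetical names, the clock is a plain number, and a heartbeat is attempted after each step of work rather than on a separate timer.

```typescript
// Hypothetical simulation of a worker heartbeating between work steps.
type Outcome = "completed" | "abandoned";

interface RunResult {
  outcome: Outcome;
  stepsDone: number;
}

function runTask(stepDurationsMs: number[], ttlMs: number): RunResult {
  let now = 0;
  let expiresAt = now + ttlMs; // lease granted when the task is claimed
  let stepsDone = 0;
  for (const duration of stepDurationsMs) {
    now += duration; // one unit of work, possibly slowed by a latency spike
    if (now >= expiresAt) {
      // The renewal would be rejected because the lease has expired,
      // so the worker discards its progress and gives up the task.
      return { outcome: "abandoned", stepsDone };
    }
    expiresAt = now + ttlMs; // heartbeat accepted: lease extended
    stepsDone += 1;
  }
  return { outcome: "completed", stepsDone };
}

const smooth = runTask([10000, 10000, 10000], 30000); // completed after 3 steps
const spiky = runTask([10000, 45000, 10000], 30000);  // abandoned after 1 step
```

In the second run the 45-second step outlasts the 30-second lease, so the next heartbeat arrives too late and the task becomes available to other workers.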

Task execution and failure handling

During execution, the worker processes the task step by step. It introduces:

Random failures (5% chance) to simulate unexpected crashes.
Latency spikes (10% chance) to model network delays or performance bottlenecks which may cause heartbeats to be missed.

If the worker encounters a simulated failure, it exits immediately, forcing the task to be retried later. Otherwise, it continues processing until completion.

Completing the task

Once the worker finishes executing the task:

It marks it as complete, allowing the system to remove it from the queue.
It stops the heartbeat timer.
It immediately looks for a new task to process, restarting the loop.

This design ensures:

Only one worker processes a task at a time
Automatic failure recovery:
- via lease expiration.
- In the event of an application crash, App Platform guarantees your process will be restarted.
Periodic heartbeats prevent abandoned tasks from remaining unprocessed or in a stuck state
Resilience testing with simulated failures and latency
- Latency is there to simulate missed heartbeats

By using this approach, we achieve fault-tolerant, distributed task execution that can scale efficiently.

Task generator

It’s essential to test how workers handle tasks under real-world conditions. The Task Generator is designed to simulate a workload by continuously creating tasks with random execution times. These tasks mimic real jobs that a worker might process, such as resizing images, transcoding video, sending emails, or any other work that must be done reliably.

By using a lease-based mechanism, we ensure that only one instance of the generator is actively producing tasks at any time, preventing duplication while still allowing multiple instances to be deployed. This setup helps us observe system behavior, validate worker performance, and uncover potential issues in task execution and lease management.

Handling task generation in a load-balanced environment

In a distributed system where multiple instances of a service are running behind a load balancer, maintaining a singleton process (one that should run only once at a time) is challenging. The Task Generator demonstrates this problem: only one instance actively generates tasks, while every instance responds to status queries, so the answer reflects only the state of whichever instance happened to receive the request.

How the task generator works

The task generator is responsible for:

Ensuring Only One Instance Runs at a Time
- It acquires a lease, which acts as a lock, allowing only one instance to actively generate tasks.
- Other instances remain idle but continue responding to API requests.
Continuously Creating Tasks
- It inserts new tasks into a PostgreSQL table, each containing a JSON object (e.g., {"sleep_duration_seconds": 5}) to define processing duration.
- These tasks are picked up by worker processes.
Handling Load-Balanced Requests
- Since multiple instances handle API requests, different responses may be returned depending on which instance is queried.
- Only the instance holding the lease reports that it is generating tasks, while others remain in a passive state.
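The singleton behavior can be illustrated with a small simulation in which several instances share one lease. This is a sketch under assumptions rather than the generator's real code: GeneratorLease and tick are hypothetical names, the lease lives in a variable instead of the leases table, and time is a fake clock.

```typescript
// Hypothetical simulation of several generator instances sharing one lease.
interface GeneratorLease {
  holder: string;
  expiresAt: number; // milliseconds on a fake clock
}

const GENERATOR_TTL_MS = 30000;

// One scheduling round for one instance. Returns the lease as it stands
// afterwards and records the instance name whenever it generates a task.
function tick(
  lease: GeneratorLease | null,
  instance: string,
  now: number,
  generatedBy: string[],
): GeneratorLease {
  if (lease !== null && lease.expiresAt > now && lease.holder !== instance) {
    return lease; // another instance holds an active lease: stay idle
  }
  // The lease is free, expired, or already ours: take or renew it and work.
  generatedBy.push(instance);
  return { holder: instance, expiresAt: now + GENERATOR_TTL_MS };
}

const instances = ["gen-1", "gen-2", "gen-3"];
const generatedBy: string[] = [];
let current: GeneratorLease | null = null;
for (const now of [0, 10000, 20000]) {
  for (const instance of instances) {
    current = tick(current, instance, now, generatedBy);
  }
}
// generatedBy is ["gen-1", "gen-1", "gen-1"]: the other instances stayed idle.
```

If gen-1 stopped ticking, its lease would run out after 30 seconds and the next instance to tick would take over.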

Why status fluctuates

When checking the generator’s status, you may notice inconsistent reports:

One request might return STARTED, indicating an instance has the lease.
Another request might return STOPPED, because that request hit an instance without the lease.
Pressing “Stop” might not work immediately because the request might be handled by an instance that isn’t holding the lease. Multiple attempts may be needed to reach the lease holder.

This behavior highlights a fundamental challenge of running a singleton service in a load-balanced environment.

How to stop the generator

To stop the generator, the request must reach the instance holding the lease. Since API requests are distributed randomly:

Stopping might require multiple attempts until the lease holder processes the request.
A more reliable approach would be to leverage the centrally managed lease and query the leases service to see which specific instance holds the lease, and then send the request to that specific instance.
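The difference between asking a random instance and asking the lease holder can be sketched as follows. This is illustrative only: Instance, reportStatus, and stopViaLeaseHolder are hypothetical names, the holder ID is passed in as if it had been looked up from the leases service, and the sketch assumes a request can be addressed to one specific instance, which the article does not cover.

```typescript
// Hypothetical model of generator instances behind a load balancer.
interface Instance {
  id: string;
  generating: boolean;
}

// Status as reported by whichever instance receives the request.
function reportStatus(instance: Instance): "STARTED" | "STOPPED" {
  return instance.generating ? "STARTED" : "STOPPED";
}

// Sends the stop request to the instance named by the lease, instead of
// hoping the load balancer picks it. Returns true when a holder was stopped.
function stopViaLeaseHolder(
  instances: Instance[],
  leaseHolderId: string | null,
): boolean {
  if (leaseHolderId === null) {
    return false; // nobody holds the lease, so nothing is generating
  }
  for (const instance of instances) {
    if (instance.id === leaseHolderId) {
      instance.generating = false;
      return true;
    }
  }
  return false;
}

const fleet: Instance[] = [
  { id: "gen-1", generating: true },
  { id: "gen-2", generating: false },
  { id: "gen-3", generating: false },
];
const before = fleet.map(reportStatus);                // ["STARTED", "STOPPED", "STOPPED"]
const stopped = stopViaLeaseHolder(fleet, "gen-1");    // true
const after = fleet.map(reportStatus);                 // ["STOPPED", "STOPPED", "STOPPED"]
```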

Deploy the task processor

The complete application is available in the GitHub repository. To deploy the application on DigitalOcean’s App Platform:

Navigate to the App Platform dashboard, or use the Deploy to DO button in the repository’s README.
Two managed PostgreSQL development databases will be provisioned and used to store tasks and manage leases.
The Task Generator service automatically creates the necessary database tables and populates the task queue.

Note: Be sure to delete the app when you’re finished. Not doing so will result in real charges to your DigitalOcean account.

Conclusion

Leases provide an elegant solution to managing shared resources in multi-instance environments. By implementing leases with PostgreSQL, you can achieve safe, time-bound access to resources, automated expiration, and simplified recovery from failures. DigitalOcean’s managed database offerings make adopting this approach straightforward and reliable. Try out the sample application to experience the benefits of lease-based resource management firsthand.
