Building Systems That Survive: Availability, Reliability & Durability

Designing large-scale distributed systems isn't just about choosing a database or adding more servers. At true scale, one reality stands out: failures are inevitable, and systems must be built to handle them.

Our responsibility as engineers is to design systems that fail gracefully, recover predictably, and continue delivering value with minimal disruption.

In practice, this comes down to three foundational properties: Availability, Reliability, and Durability. They're often used interchangeably, but each represents a distinct, closely related aspect of how a system behaves under failure.

1. Understanding Availability

Availability defines how consistently a system is accessible to users. When your service is up and running at any given moment, it is considered available. Most organizations express availability as a percentage.

Availability = (Actual Uptime / Total Time) × 100%

Even tiny improvements in availability percentages can require massive architectural changes. Here's a practical mapping.

What Availability Percentages Actually Mean

Availability %         | Allowable Downtime / Year | Typical Use Case
99% (two nines)        | ~3.65 days                | Small internal tools
99.9% (three nines)    | ~8.76 hours               | Mid-scale SaaS
99.99% (four nines)    | ~52 minutes               | Consumer-grade platforms
99.999% (five nines)   | ~5 minutes                | Banking, telco-grade apps
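
As a sanity check, the downtime figures above fall straight out of the availability formula. Here is a minimal Java sketch of that arithmetic (the percentages are just the standard "nines" tiers):

public class DowntimeCalculator {
    // Minutes in a non-leap year: 365 days x 24 hours x 60 minutes
    private static final double MINUTES_PER_YEAR = 365 * 24 * 60;

    // Allowable downtime per year implied by an availability percentage
    static double allowedDowntimeMinutes(double availabilityPercent) {
        return (1 - availabilityPercent / 100.0) * MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        for (double nines : new double[] {99.0, 99.9, 99.99, 99.999}) {
            System.out.printf("%.3f%% -> %.1f minutes/year%n", nines, allowedDowntimeMinutes(nines));
        }
    }
}
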
During my early years running a payments subsystem, I learned the classic lesson: achieving four nines is less about infrastructure and more about eliminating single points of failure—in code, operations, people and processes.

How Availability Is Achieved

You engineer availability through:

1) Redundancy: Redundancy means adding extra nodes or copies of data so the system keeps working even if parts fail. It removes single points of failure and enables automatic failover during outages.

Example: To avoid downtime during a node failure, an e-commerce system keeps three replicas of the order service. If the primary node crashes, traffic automatically switches to a healthy replica without user impact. This ensures no single point of failure.
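
To make the failover idea concrete, here is a minimal client-side sketch: try the primary first, then fall back to the remaining replicas. OrderRequest, OrderResponse, and OrderServiceClient are hypothetical placeholders, and real systems usually push this logic into a load balancer or service mesh.

import java.util.List;

class OrderRequest {}
class OrderResponse {}

interface OrderServiceClient {
    OrderResponse placeOrder(OrderRequest request) throws Exception;
}

class FailoverOrderClient {
    private final List<OrderServiceClient> replicas; // primary first, backups after

    FailoverOrderClient(List<OrderServiceClient> replicas) {
        this.replicas = replicas;
    }

    OrderResponse placeOrder(OrderRequest request) {
        Exception lastFailure = null;
        for (OrderServiceClient replica : replicas) {
            try {
                return replica.placeOrder(request); // first healthy replica serves the request
            } catch (Exception e) {
                lastFailure = e;                    // node failed, try the next replica
            }
        }
        throw new IllegalStateException("All replicas failed", lastFailure);
    }
}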

2) Load balancing (traffic routing): These techniques distribute incoming requests across multiple servers to avoid overload. They also route traffic intelligently based on health, geography, or capacity for optimal performance.

Example: A web API runs on six application servers behind an Nginx or AWS ALB load balancer. The load balancer continuously performs health checks and sends requests only to healthy instances. If one server slows down, it is automatically removed from rotation.
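
A rough sketch of what the load balancer does internally: round-robin selection that skips instances whose health checks are failing (the Backend interface is an assumption standing in for real server handles):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

interface Backend {
    boolean isHealthy();   // e.g., the result of a periodic GET /health probe
}

class HealthAwareRoundRobin {
    private final List<Backend> backends;
    private final AtomicInteger cursor = new AtomicInteger();

    HealthAwareRoundRobin(List<Backend> backends) {
        this.backends = backends;
    }

    // Pick the next healthy backend; unhealthy instances are skipped (removed from rotation)
    Backend next() {
        for (int attempts = 0; attempts < backends.size(); attempts++) {
            Backend candidate = backends.get(Math.floorMod(cursor.getAndIncrement(), backends.size()));
            if (candidate.isHealthy()) {
                return candidate;
            }
        }
        throw new IllegalStateException("No healthy backends available");
    }
}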

3) Stateless architectures: Stateless systems keep no user/session data on individual servers, allowing any server to handle any request. This makes horizontal scaling simpler and reduces coupling between components.

Example: A login API stores session tokens in Redis or issues JWTs instead of storing user session data in memory. Now, any server can handle the next request—no server "owns" the session. This allows seamless horizontal scaling by simply adding more instances.
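
A minimal sketch of the stateless pattern: the handler keeps nothing user-specific in local memory and resolves the session from a shared store on every request. SessionStore and Session are assumed placeholders for Redis-backed sessions or JWT validation.

import java.util.Optional;

record Session(String userId) {}

interface SessionStore {                       // backed by Redis or another shared store
    Optional<Session> find(String token);
}

class ProfileHandler {
    private final SessionStore sessions;

    ProfileHandler(SessionStore sessions) {
        this.sessions = sessions;
    }

    // Any instance can serve any request because no session data lives on this server
    String handle(String bearerToken) {
        return sessions.find(bearerToken)
                .map(session -> "profile for user " + session.userId())
                .orElse("401 Unauthorized");
    }
}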

4) Graceful degradation: Graceful degradation lets a system reduce non-critical features when overloaded instead of failing completely. Users get a lighter but functional experience while the system protects itself from collapse.

Example: When a streaming platform is overloaded, it temporarily disables:
- HD video quality
- real-time recommendations
- background analytics
But basic streaming continues uninterrupted. The system intentionally drops non-critical features to protect core functionality.
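
In code, graceful degradation often looks like a load-aware branch that sheds optional work. A minimal sketch follows; LoadMonitor and the HD/SD split are illustrative assumptions.

import java.util.List;

interface LoadMonitor {
    boolean isOverloaded();   // e.g., driven by CPU, queue depth, or error-rate metrics
}

record StreamPlan(String quality, List<String> recommendations) {}

class PlaybackService {
    private final LoadMonitor load;

    PlaybackService(LoadMonitor load) {
        this.load = load;
    }

    StreamPlan startStream(String videoId) {
        if (load.isOverloaded()) {
            // Shed non-critical features but keep core playback working
            return new StreamPlan("SD", List.of());
        }
        return new StreamPlan("HD", recommendFor(videoId));
    }

    private List<String> recommendFor(String videoId) {
        return List.of("related-1", "related-2"); // placeholder for the real recommendation call
    }
}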

2. Understanding Reliability

Where availability asks "Can I reach the system?", reliability asks "Will the system perform correctly each time?"

A highly available system can still be unreliable if requests fail or return incorrect data. In engineering terms, reliability is often expressed through Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR).

Mean Time Between Failures (MTBF): MTBF measures how long a system runs on average before a failure occurs. A higher MTBF means failures are infrequent, indicating a more reliable system.

MTBF = (Total Operational Time / Number of Failures)

Mean Time To Repair (MTTR): MTTR measures how long it takes, on average, to restore a system after a failure. A lower MTTR means issues are resolved quickly, reducing downtime and improving reliability.

MTTR = Total Repair Time / Number of Failures
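
A quick worked example with illustrative numbers, plugging into the two formulas above:

public class ReliabilityMetrics {
    public static void main(String[] args) {
        double operationalHours = 720;   // a 30-day month of runtime (illustrative)
        int failures = 3;                // failures observed in that window
        double totalRepairHours = 6;     // total time spent restoring service

        double mtbf = operationalHours / failures;   // 240 hours between failures
        double mttr = totalRepairHours / failures;   // 2 hours to recover, on average

        System.out.printf("MTBF = %.0f h, MTTR = %.1f h%n", mtbf, mttr);
    }
}
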
Example: Imagine a service that technically "never goes down" — it responds to every request — but a large percentage of those responses are 500 errors or, worse, incorrect results.

Your uptime dashboards proudly display 99.99% availability, yet from the customer's perspective the system is failing constantly. This disconnect between being up and being reliable is exactly why error budgets, popularized by Google SRE, became one of the most transformative tools for many engineering teams.

When the error budget is consumed, it means the system has already experienced as many failures as its SLO allows for that period. At that point, the team deliberately slows down or pauses new feature releases, because shipping new code increases risk.

Instead, the focus shifts to stabilizing the system—fixing bugs, improving monitoring, tightening rollback processes—until reliability returns to acceptable levels. Only then has the team "earned back the right" to ship new features safely.

Service Level Objective (SLO): An SLO is a target level of reliability or performance a service must meet—e.g., "99.9% of requests must succeed over 30 days." It defines what "good enough" looks like for the customer.

SLOs can measure both — availability and reliability — depending on what you choose to track.

Service Level Indicator (SLI): An SLI is the actual measured metric that tells you how the system is performing—e.g., success rate, latency, availability, throughput. If the SLO is the goal, the SLI is the scoreboard.

Service Level Agreement (SLA): An SLA is the external, contractual commitment made to customers (often with penalties). SLAs are stricter because they carry legal or financial consequences when breached.

Error Budget: An error budget is the maximum allowed failures your service can have while still meeting the SLO—e.g., at 99.9% SLO, your error budget is 0.1% failure or downtime. Error budgets help teams balance reliability vs. innovation by creating a controlled amount of acceptable risk.
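
To make the error budget concrete, here is a small sketch that converts an SLO into an allowed amount of downtime and failed requests over a 30-day window (the request volume is made up):

public class ErrorBudgetCalculator {
    public static void main(String[] args) {
        double sloPercent = 99.9;                    // target success rate
        double windowMinutes = 30 * 24 * 60;         // 30-day rolling window

        double budgetMinutes = (1 - sloPercent / 100.0) * windowMinutes;
        System.out.printf("At %.1f%% over 30 days, the downtime budget is %.1f minutes%n",
                sloPercent, budgetMinutes);          // ~43.2 minutes

        long totalRequests = 10_000_000;             // request-based view of the same budget
        long allowedFailures = Math.round((1 - sloPercent / 100.0) * totalRequests);
        System.out.println("Equivalently, about " + allowedFailures
                + " failed requests out of " + totalRequests);
    }
}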

Improving Reliability

Reliability is achieved by:

1. Idempotent APIs: Idempotent APIs ensure repeated requests produce the same result, preventing duplicated operations during retries. This makes recovery from transient failures predictable and safe.

Example: A payment service exposes a CreatePayment(requestId) API where every request carries a unique requestId. If the client retries due to a timeout, the server simply returns the same payment result instead of charging the user twice.
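
A minimal sketch of the idea, using an in-memory map as a stand-in for a persistent idempotency-key store (PaymentResult and the charge step are simplified placeholders):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

record PaymentResult(String requestId, String status) {}

class IdempotentPaymentService {
    // In production this would be a durable store; a map is enough to show the idea
    private final Map<String, PaymentResult> processed = new ConcurrentHashMap<>();

    PaymentResult createPayment(String requestId, long amountCents) {
        // Retries with the same requestId return the original result instead of charging twice
        return processed.computeIfAbsent(requestId, id -> charge(id, amountCents));
    }

    private PaymentResult charge(String requestId, long amountCents) {
        // ... call the payment gateway exactly once for this requestId ...
        return new PaymentResult(requestId, "CHARGED " + amountCents);
    }
}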

2. Circuit breakers: Circuit breakers stop calls to failing dependencies once error thresholds are crossed. This protects the system by allowing upstream services to degrade gracefully instead of collapsing.

Example: If a service repeatedly fails while calling an external fraud-check API, the circuit breaker “opens” and stops all outgoing calls for a cooldown period. During this time, the system returns a fallback response instead of waiting for failures, protecting upstream services from cascading collapse.
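
A stripped-down sketch of the open/closed logic; production code would typically use a library such as Resilience4j and add a half-open probe state.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

class CircuitBreaker {
    private final int failureThreshold;
    private final Duration cooldown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;           // non-null while the circuit is open

    CircuitBreaker(int failureThreshold, Duration cooldown) {
        this.failureThreshold = failureThreshold;
        this.cooldown = cooldown;
    }

    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (openedAt != null && Instant.now().isBefore(openedAt.plus(cooldown))) {
            return fallback.get();             // open: fail fast instead of calling the dependency
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;           // success closes the circuit again
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();      // threshold crossed: trip the breaker
            }
            return fallback.get();
        }
    }
}

A call such as breaker.call(() -> fraudClient.check(order), () -> FraudResult.unknown()) would then return the fallback while the fraud-check API is misbehaving (both names are hypothetical).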

3. Graceful backoff + retry mechanisms: Backoff strategies slow down retry frequency during failures, reducing pressure on unstable services. Combined with smart retries, they let the system recover without causing retry storms.

Example: A messaging service retries publishing messages using exponential backoff (e.g., 1s → 2s → 4s → 8s). If the broker is unstable, this reduces load instead of triggering a retry storm. Once the downstream service recovers, messages resume normally.
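
A minimal retry helper with exponential backoff and a little jitter; the caller passes in the publish operation, and the jitter keeps many clients from retrying in lockstep.

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

class RetryWithBackoff {
    // Retries the action with exponentially growing delays (1s, 2s, 4s, ...) plus random jitter
    static <T> T run(Callable<T> action, int maxAttempts) throws Exception {
        long delayMillis = 1_000;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e;                             // out of attempts, surface the failure
                }
                long jitter = ThreadLocalRandom.current().nextLong(delayMillis / 2 + 1);
                Thread.sleep(delayMillis + jitter);      // give the broker room to recover
                delayMillis *= 2;
            }
        }
    }
}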

4. Health checks and canary deployments: Health checks detect unhealthy instances early and remove them from traffic rotation. Canary deployments release new code to a small subset of users to validate stability before full rollout.

Example: Each application instance exposes liveness and readiness endpoints. When an instance becomes unhealthy, the load balancer automatically stops routing traffic to it. During deployments, only 5% of traffic is directed to the new version first; if metrics remain stable, rollout continues.
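
A bare-bones sketch of liveness and readiness endpoints using only the JDK's built-in HTTP server; real services usually get these from their framework, and the /livez and /readyz paths are just conventions.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

public class HealthEndpoints {
    private static final AtomicBoolean ready = new AtomicBoolean(false);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Liveness: the process is running at all
        server.createContext("/livez", exchange -> {
            byte[] body = "alive".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        // Readiness: the instance can actually take traffic (caches warm, dependencies reachable)
        server.createContext("/readyz", exchange -> {
            int status = ready.get() ? 200 : 503;
            byte[] body = (status == 200 ? "ready" : "not ready").getBytes();
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        server.start();
        ready.set(true); // flip after warm-up; the load balancer polls /readyz before routing traffic
    }
}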

5. Automated rollbacks and chaos testing: Automated rollbacks immediately revert bad deployments when error rates spike. Chaos testing intentionally breaks components in controlled ways to verify the system's ability to withstand failures.

Example: Chaos Engineering tools like Chaos Monkey kill random instances in staging to ensure the system can survive failures without user impact.

At scale, reliability is more an organizational habit than a technical switch. Teams that prioritize observability—latency histograms, request patterns, saturation metrics—tend to build more reliable systems naturally.

A transient failure is a temporary, short-lived error that occurs not because the system is fundamentally broken, but because of momentary conditions like network glitches, brief service overload, or timeout spikes.

These failures often resolve themselves quickly, which is why strategies like retries, backoff, and idempotent APIs are designed specifically to handle them safely and automatically.

Backoff is a retry strategy where the system waits longer between each retry attempt after a failure. Instead of hammering a struggling service with rapid retries, backoff spreads out attempts—often doubling the wait time each time (exponential backoff)—to give the system room to recover and prevent overload or retry storms.

3. Understanding Durability

Durability answers the question: once data has been successfully written, will it remain correct, safe, and intact even if the system crashes, loses power, or nodes fail?

While availability and reliability deal with how well a system handles requests in the moment, durability is concerned with the long-term correctness of stored data—ensuring nothing is lost, corrupted, or partially written over time.

A simple example from the database world:
1. Redis (non-AOF) prioritizes speed over durability.
2. PostgreSQL or DynamoDB prioritize data preservation via WALs, replication, and quorum writes.

AOF (Append-Only File) is a durability feature where Redis logs every write operation to a file on disk. During a restart, Redis replays this log to rebuild the in-memory dataset exactly as it was.

Why does it matter?
- With AOF enabled, Redis can recover far more data after a crash.
- With AOF disabled ("non-AOF"), Redis relies mostly on snapshots (RDB), so you may lose recent writes between snapshots.

What are RDB (Redis Database) snapshots?

Redis periodically takes a point-in-time snapshot of the entire dataset and saves it to a compact binary file (dump.rdb). If Redis restarts, it loads this snapshot to restore the data as it existed at the time of that save.

Key characteristics:
- Fast to read and restore, but not as durable as AOF.
- You can lose all writes made since the last snapshot, which is why non-AOF Redis is not fully crash-safe.

Durability in Practice

A durable system ensures data survives:

1. Server crashes: A server crash occurs when a machine or process stops unexpectedly due to bugs, resource exhaustion, or hardware issues. Durable systems ensure no committed data is lost even if the node disappears instantly.

2. Region failures: Region failures happen when an entire data center or cloud region becomes unavailable due to outages or disasters. Durable architectures replicate data across regions so it can survive large-scale failures.

3. Disk corruption: Disk corruption means stored data becomes unreadable or incorrect due to hardware faults or write errors. Durability mechanisms use checksums, replication, or journaling to detect and recover from corruption.

4. Power outages: A sudden loss of power can interrupt writes and leave data in a partial or inconsistent state. Durable systems use logs, write-ahead techniques, or battery-backed storage to preserve committed data.

5. Network partitions: A network partition splits the system into isolated groups that cannot communicate. Durable systems prevent conflicting writes and maintain data correctness even when parts of the cluster are temporarily isolated.

Durability Techniques

You achieve durability with:

1. Write-Ahead Logs (WAL): WAL records every intended change to a durable log before applying it to the main data. This guarantees recovery after crashes by replaying logged operations.

Example: Apache Cassandra writes every incoming update to a commit log (WAL) on disk before acknowledging the request, ensuring durability even if a node crashes. The data is later flushed to SSTables, with the commit log serving as the authoritative source for crash recovery.
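
A minimal sketch of the write-ahead pattern itself (not Cassandra's actual code): append and fsync the log record first, and only then apply the change to the in-memory state, so a crash can be recovered by replaying the log.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

class WriteAheadStore implements AutoCloseable {
    private final FileChannel log;
    private final Map<String, String> state = new HashMap<>(); // the "main data"

    WriteAheadStore(Path logFile) throws Exception {
        this.log = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    void put(String key, String value) throws Exception {
        byte[] record = (key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);
        log.write(ByteBuffer.wrap(record)); // 1. record the intended change in the log
        log.force(true);                    // 2. fsync: the change is now durable on disk
        state.put(key, value);              // 3. only then apply it to the main data
    }

    @Override
    public void close() throws Exception {
        log.close();
    }
}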

2. Multi-AZ or multi-region replication: Data is synchronously or asynchronously replicated across availability zones or geographic regions. Even if an entire zone fails, durable replicas ensure data is still safe and accessible.

3. RAID and Immutable logs: RAID (Redundant Array of Independent Disks) uses multiple disks arranged with redundancy (mirroring, parity, or striping with parity) so that even if one or more drives fail, the system can reconstruct the data without loss. It strengthens durability by ensuring hardware failures don't translate into data corruption or unavailability.

Mirroring stores exact copies of the same data on two or more disks.

Parity is a mathematical checksum stored alongside data, allowing the system to reconstruct missing data if a disk fails. It provides durability with less storage overhead than full mirroring.

Striping with parity is a RAID technique, most commonly RAID 5, that improves performance by writing data across multiple drives (striping) while using a calculated value called parity to provide data redundancy and fault tolerance. If a single drive fails, the system can use the parity information and the data on the remaining drives to reconstruct the missing data.
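
Parity in RAID 5 is essentially a bitwise XOR across the data stripes. This toy example shows how a lost block can be rebuilt from the surviving block and the parity block (the byte values are arbitrary):

import java.util.Arrays;

public class XorParity {
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] disk1 = {1, 2, 3, 4};
        byte[] disk2 = {9, 8, 7, 6};
        byte[] parity = xor(disk1, disk2);       // written to the parity stripe

        // disk1 "fails": reconstruct its contents from disk2 and the parity block
        byte[] rebuilt = xor(disk2, parity);
        System.out.println(Arrays.equals(rebuilt, disk1)); // true
    }
}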

Immutable logs store data in an append-only manner, where past entries cannot be modified or deleted. This prevents accidental or malicious corruption, guarantees a verifiable history of changes, and provides a reliable foundation for auditing, replication, and recovery.

4. Consensus algorithms: Consensus algorithms like Raft or Paxos ensure a majority of nodes agree on the correct data state, even during failures. They provide strong guarantees for durability and ordering of writes in distributed systems.

Paxos is a mathematically rigorous consensus protocol that ensures a cluster of unreliable nodes can agree on a single value even with failures. It is powerful but hard to understand, which led to simpler alternatives like Raft.

Raft is a consensus algorithm designed to be easier to understand while providing the same guarantees as Paxos.

5. Quorum writes: A write is considered successful only when acknowledged by a required number of replicas. This ensures data survives individual node failures and remains consistent across multiple copies.

Amazon's original Dynamo design and Dynamo-inspired systems such as Apache Cassandra let you configure N (replication factor), W (write quorum), and R (read quorum); DynamoDB applies the same ideas internally but does not expose these knobs directly.
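
A minimal sketch of a quorum write: the coordinator sends the write to all N replicas and reports success once W of them acknowledge (the Replica interface, transport, and error handling are simplified). Choosing W and R so that W + R > N ensures every read quorum overlaps the latest acknowledged write.

import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface Replica {
    boolean write(String key, String value);   // true once the replica has persisted the write
}

class QuorumWriter {
    private final List<Replica> replicas;      // N = replicas.size()
    private final int writeQuorum;             // W
    private final ExecutorService pool = Executors.newCachedThreadPool();

    QuorumWriter(List<Replica> replicas, int writeQuorum) {
        this.replicas = replicas;
        this.writeQuorum = writeQuorum;
    }

    boolean write(String key, String value) throws InterruptedException {
        CompletionService<Boolean> acks = new ExecutorCompletionService<>(pool);
        for (Replica replica : replicas) {
            acks.submit(() -> replica.write(key, value));
        }
        int confirmed = 0;
        for (int i = 0; i < replicas.size(); i++) {
            try {
                if (Boolean.TRUE.equals(acks.take().get())) {
                    confirmed++;
                }
            } catch (ExecutionException e) {
                // replica failed or timed out; keep waiting for the remaining acknowledgments
            }
            if (confirmed >= writeQuorum) {
                return true;                   // quorum reached: the write is considered durable
            }
        }
        return false;                          // fewer than W replicas acknowledged
    }
}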

4. The Interplay Between Availability, Reliability, and Durability

Although each represents a different dimension, you rarely optimize them independently. Improving one often impacts the others.

How They Influence Each Other:

Goal              | Possible Trade-Off      | Explanation
High Availability | Lower consistency       | Replicating widely increases system reachability but may introduce stale reads
High Reliability  | Higher latency          | More validation, retries, and consensus steps
High Durability   | Lower write throughput  | WALs and quorum writes slow down operations

In my experience, no system gets all three at maximum levels; you design around what matters most for your business domain.

Payment systems → prioritize durability
Social media feeds → prioritize availability
Real-time bidding → prioritize reliability + availability

Understanding trade-offs early helps avoid painful re-architectures later.

5. A Practical Example: Designing a Payment Order Service

To bring these ideas together, consider a simplified version of a real payment workflow. The goal is to accept a user's payment request immediately, store it safely, and then process the actual payment through downstream services without blocking the customer.

Pseudocode for a Reliable & Durable Write Path

public PaymentResponse processPayment(PaymentRequest request) {
    validate(request); // reliability

    PaymentEvent event = PaymentEvent.from(request);

    // durable write (event store)
    eventStore.append(event); 

    // asynchronous workflow, high availability
    sagaOrchestrator.trigger(event);

    return PaymentResponse.accepted(event.getId());
}

Here's what each component guarantees:

The event store provides durability by ensuring the payment event is safely persisted before anything else happens.

The saga orchestrator ensures reliability by coordinating multi-step, distributed operations with compensating actions if failures occur.

Asynchronous processing improves availability because the system can accept payments even when downstream gateways are slow or temporarily offline.

This pattern has proven extremely effective in real-world systems. For example, when external payment gateways were down, customers still received instant acknowledgment that their request was accepted. Once the gateway recovered, the saga resumed and completed the transaction without user intervention.

Limitations of this Example

This example illustrates the core ideas but hides many real-world complexities. It does not cover retries, idempotency, or handling duplicate events. It assumes the event store is highly available without explaining replication or consistency.

The saga logic is abstracted; real implementations need compensation steps, failure handling, and state tracking. It also omits mechanisms like dead-letter queues, backpressure, and circuit breakers.

6. Conclusion

Availability, Reliability, and Durability are not buzzwords; they are the backbone of resilient system design. As systems grow more distributed, these foundations become even more important.

With experience, you learn that engineering is less about preventing failures and more about preparing your system to survive them. After working with multiple software development teams, I can confidently say:

1. Availability is visible. You can temporarily lose availability and users may tolerate brief outages.
2. Reliability is felt. Lose reliability and users get frustrated.
3. Durability is trusted. Lose durability and your business loses money and confidence—sometimes permanently.

Teams that understand these principles build systems that stand the test of scale, failures, and time.

If there is one principle I consistently apply as an engineer, it is this: "Design systems assuming everything will fail, but ensure your users never notice."