Fault Tolerance
Failures are unavoidable in any system and will happen all the time, so we need to build systems that can tolerate failures or recover from them.
- In systems, failure is the norm rather than the exception.
- “Anything that can go wrong will go wrong” — Murphy’s Law
- “Complex systems contain changing mixtures of failures latent within them” — How Complex Systems Fail.
Fault Tolerance — Failure Metrics
Common failure metrics that get measured and tracked for any system.
Mean time to repair (MTTR): The average time to repair and restore a failed system.
Mean time between failures (MTBF): The average operational time between one device failure or system breakdown and the next.
Mean time to failure (MTTF): The average time a device or system is expected to function before it fails.
Mean time to detect (MTTD): The average time between the onset of a problem and when the organization detects it.
Mean time to investigate (MTTI): The average time between the detection of an incident and when the organization begins to investigate its cause and solution.
Mean time to restore service (MTRS): The average elapsed time from the detection of an incident until the affected system or component is again available to users.
Mean time between system incidents (MTBSI): The average elapsed time between the detection of two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS (MTBSI = MTBF + MTRS).
Failure rate: Another reliability metric, which measures the frequency with which a component or system fails. It is expressed as a number of failures over a unit of time.
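To make these definitions concrete, here is a minimal sketch (with a made-up incident log) that computes MTRS, MTBF, and MTBSI from detection/restoration timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure detected, service restored) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 2, 10, 14, 0), datetime(2024, 2, 10, 16, 0)),
    (datetime(2024, 3, 22, 1, 30), datetime(2024, 3, 22, 2, 0)),
]

# MTRS: average time from detection to restoration.
mtrs = sum((up - down for down, up in incidents), timedelta()) / len(incidents)

# MTBF: average operational time between one restoration and the next failure.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

# MTBSI = MTBF + MTRS (average time between consecutive incident detections).
mtbsi = mtbf + mtrs

print(f"MTRS:  {mtrs}")
print(f"MTBF:  {mtbf}")
print(f"MTBSI: {mtbsi}")
```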
Reference:
- https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html
Fault Tolerance — Fault Isolation Terms
Systems should be able to short-circuit failing components. Say in our content sharing system, if “Notifications” is not working, the site should gracefully handle that failure by disabling the feature instead of taking the whole site down.
Swimlaning is one of the most commonly used fault isolation methodologies. A swimlane places a barrier between a service and other services so that a failure in one does not affect the others. Say we roll out a new ‘Advertisement’ feature in our content sharing app. We can choose between two architectures.
If ads are generated on the fly, synchronously during each Newsfeed request, faults in the Ads feature propagate to the Newsfeed feature. If we instead swimlane the “Generation of Ads” service and use shared storage to populate the Newsfeed app, Ads failures won’t cascade to Newsfeed; worst case, if Ads misses its SLA, we can serve the Newsfeed without ads.
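As a sketch of that second architecture (hypothetical function and SLA names, not a real framework): the Newsfeed handler reads pre-generated ads with a strict time budget and falls back to a feed without ads if the Ads lane is slow or down.

```python
from concurrent.futures import ThreadPoolExecutor

AD_SLA_SECONDS = 0.2  # hypothetical time budget for the ad lookup
_pool = ThreadPoolExecutor(max_workers=4)


def fetch_pregenerated_ads(user_id: str) -> list[str]:
    # Stand-in for reading ads that the swimlaned Ads service wrote to
    # shared storage (e.g. a cache); may be slow or raise if the Ads
    # lane is unhealthy.
    return ["ad-1", "ad-2"]


def build_newsfeed(user_id: str) -> dict:
    posts = [f"post-{i}" for i in range(3)]  # stand-in for the feed query
    future = _pool.submit(fetch_pregenerated_ads, user_id)
    try:
        ads = future.result(timeout=AD_SLA_SECONDS)
    except Exception:
        ads = []  # worst case: serve the Newsfeed without ads
    return {"posts": posts, "ads": ads}


print(build_newsfeed("u123"))
```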
Let’s take another example: we come up with a new model for our content sharing app, an enterprise edition where enterprises pay for the service and content must never be shared outside the enterprise. Here, each enterprise can be given its own swimlane, with nothing shared between tenants, so a failure in one enterprise’s lane cannot affect any other.
Swimlane Principles
Principle 1: Nothing is shared (also known as “share as little as possible”). The less that is shared within a swim lane, the more fault isolative the swim lane becomes. (as shown in Enterprise use-case)
Principle 2: Nothing crosses a swim lane boundary. Synchronous (defined by expecting a request—not the transfer protocol) communication never crosses a swim lane boundary; if it does, the boundary is drawn incorrectly. (as shown in Ads feature)
Swimlane Approaches
Approach 1: Swim lane the money-maker. Never allow your cash register to be compromised by other systems. (Tier 1 vs Tier 2 in enterprise use case)
Approach 2: Swim lane the biggest sources of incidents. Identify the recurring causes of pain and isolate them. (if Ads feature is in code yellow, swim laning it is the best option)
Approach 3: Swim lane natural barriers. Customer boundaries make good swim lanes. (Public vs Enterprise customers)
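A minimal sketch of Principles 1 and 2 for the enterprise use case (all names hypothetical): each tenant maps to its own isolated stack, and a request is served entirely inside its own lane.

```python
# Hypothetical lane registry: each enterprise tenant gets its own app
# endpoint and database, so nothing is shared across lane boundaries.
SWIMLANES = {
    "acme-corp": {"app": "https://acme.app.internal", "db": "db.acme.internal"},
    "globex": {"app": "https://globex.app.internal", "db": "db.globex.internal"},
    "public": {"app": "https://public.app.internal", "db": "db.public.internal"},
}


def handle_in_lane(lane: dict, request: dict) -> dict:
    # Stand-in for dispatching to the lane's own app and database;
    # nothing here calls into another lane (Principle 2).
    return {"served_by": lane["app"], "path": request["path"]}


def route_request(tenant_id: str, request: dict) -> dict:
    # Unknown tenants fall back to the shared public lane.
    lane = SWIMLANES.get(tenant_id, SWIMLANES["public"])
    return handle_in_lane(lane, request)


print(route_request("acme-corp", {"path": "/files"}))
```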
High Availability vs. Fault Tolerance vs. Disaster Recovery
As a solution architect, it is crucial to understand the concepts of High Availability, Fault Tolerance, and Disaster Recovery. These terms are often confused, and their distinctions are not always apparent, yet understanding the differences is essential for designing robust, reliable systems.
High Availability (HA)
First, let’s try to give the definition of high availability. Wikipedia has a pretty good one:
High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
Many people assume that a highly available system means that it will never fail and that users will never experience outages. However, this is not entirely true. High availability (HA) is designed to keep a system online and providing services as often as possible. It is not about preventing user disruption, but rather about maximizing a system’s online time.
HA is not a fail-safe mechanism that guarantees a system will never fail. Instead, it is a system designed to quickly replace or fix components when they fail, often using automation to bring systems back into service. This means that if a system fails and a component is replaced, causing a few seconds of disruption, it is still considered highly available.
System availability is generally expressed as a percentage of uptime. For example, 99.9% uptime means that a system can have 8.77 hours of downtime per year. Some systems require even higher levels of availability, such as 99.999% uptime, which only allows for 5.26 minutes of downtime per year.
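The arithmetic behind those numbers, as a quick sketch:

```python
HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours


def max_downtime_hours(availability_pct: float) -> float:
    """Maximum yearly downtime for a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for nines in (99.0, 99.9, 99.99, 99.999):
    hours = max_downtime_hours(nines)
    print(f"{nines}% uptime -> {hours:.2f} h/year ({hours * 60:.2f} min)")
# 99.9%  -> 8.77 h/year; 99.999% -> 5.26 min/year, matching the figures above.
```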
Implementing HA requires design decisions to be made in advance, such as having redundant servers ready to switch customers over to in case of failure. However, it is important to note that HA comes with costs.
In summary, HA is about keeping a system operational and quickly recovering from issues. It is not about preventing user disruption, but rather maximizing a system’s online time. While a highly available system can still experience disruption, it is designed to quickly recover and minimize downtime.
Fault Tolerance (FT)
When it comes to ensuring system reliability, two terms are often confused: high availability and fault tolerance. While they share some similarities, fault tolerance is a more comprehensive approach.
Fault tolerance refers to a system’s ability to continue functioning properly even if some of its components fail. This means that the system must be able to operate seamlessly despite the presence of faults, and without any negative impact on customers.
Achieving fault tolerance is a complex and expensive process, as it requires a high level of redundancy and the ability to route traffic and sessions around any failed components. In contrast, high availability can be achieved by simply having spare equipment or standby components ready to go. By automating processes and having these backups in place, outages can be minimized. However, high availability alone may not be enough to ensure system reliability in the face of faults.
It’s important to note that implementing fault tolerance when high availability would suffice is a waste of resources, as it is a more complex and costly approach. On the other hand, implementing only high availability when fault tolerance is necessary can have serious consequences; in safety-critical systems, it can put lives at risk.
In summary, while high availability and fault tolerance share some similarities, fault tolerance is a more comprehensive approach that ensures system reliability even in the face of faults. Achieving fault tolerance is a complex and expensive process, but it is necessary in situations where system failure could have serious consequences.
Disaster Recovery (DR)
Disaster recovery is a crucial set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. It’s about planning for the worst-case scenario and knowing what to do when disaster strikes and knocks out your system.
What happens when high availability (HA) and fault tolerance (FT) fail? That’s where disaster recovery comes in. It’s a multi-stage process that involves pre-planning, building a set of processes and documentation, and planning for staffing and physical issues when a disaster happens.
A major disaster is the worst possible time for a business to improvise its recovery. That’s why a good set of disaster recovery processes needs to include regular backups and offsite backup storage. Storing backups at the same site as your system is a recipe for disaster: if your main site is damaged, your primary data and backups are lost together. An offsite backup storage location ensures that backups can be restored at the standby location in the event of a disaster.
Effective disaster recovery planning isn’t just about the technology, though. It’s also about knowledge. Make sure that you have copies of all your processes available and that all your logins to key systems are accessible to staff at the standby site. By doing this in advance, you can avoid a chaotic process when an issue inevitably occurs.
Ideally, you should run periodic disaster recovery testing to ensure that you have everything you need. If you identify anything missing, you can refine the processes and run the test again. With a solid disaster recovery plan in place, you can rest assured that your business will be able to recover quickly and efficiently in the event of a disaster.
Conclusion
High Availability refers to a system’s ability to remain operational and accessible even in the event of hardware or software failures. Fault Tolerance, on the other hand, involves designing a system to continue functioning even if a component fails. Finally, Disaster Recovery is the process of restoring a system to its previous state after a catastrophic event.
Fault Tolerance Software: vSphere Fault Tolerance
vSphere Fault Tolerance (FT) provides a live shadow instance of a virtual machine (VM) that mirrors the primary VM to prevent data loss and downtime during outages.
What is Fault Tolerance?
Protect Your Applications Regardless of Operating System or Underlying Hardware
vSphere Fault Tolerance safeguards any virtual machine (with up to four virtual CPUs), including homegrown and custom applications that traditional high-availability products cannot protect. Key capabilities include the following:
- Compatible with all types of shared storage, including Fibre Channel, Internet Small Computer Systems Interface (iSCSI), Fibre Channel over Ethernet (FCoE) and network-attached storage (NAS).
- Compatible with all operating systems supported by vSphere.
- Works with existing VMware vSphere Distributed Resource Scheduler and VMware vSphere High Availability (HA) clusters for advanced load balancing and optimized initial placement of virtual machines.
- Contains a version-control mechanism that allows primary and secondary virtual machines to run on vSphere FT-compatible hosts at different, but compatible, patch levels.
Simple to Set Up, Start and Stop
vSphere FT can safeguard any number of virtual machines in a cluster because it leverages existing vSphere HA clusters. Administrators can start or stop vSphere FT for specific virtual machines with a point-and-click action in the vSphere web client. Use vSphere FT for applications that require continuous protection during critical times, such as quarter-end processing.
What is fault tolerance, and how to build fault-tolerant systems

November 25, 2020. If you work in tech infrastructure, that’s a date you probably remember. On that day, AWS’s US-east-1 experienced a significant outage, and it broke a pretty significant percentage of the internet.
Adobe, League of Legends, Roku, Sirius XM, Amazon, Flickr, Giphy, and many, many more experienced issues or went offline completely as a result of the outage.
That kind of outage costs time and money. It also does something that’s arguably even more expensive in the long run: it erodes customer confidence in your product.
Outages like that are one of the reasons why fault tolerance is an integral part of most modern application architectures.
What is fault tolerance?
Fault tolerance describes a system’s ability to handle errors and outages without any loss of functionality.
For example, here’s a simple demonstration of comparative fault tolerance in the database layer. In the diagram below, Application 1 is connected to a single database instance. Application 2 is connected to two database instances — the primary database and a standby replica.

In this scenario, Application 2 is more fault tolerant. If its primary database goes offline, it can switch over to the standby replica and continue operating as usual.
Application 1 is not fault tolerant. If its database goes offline, all application features that require access to the database will cease to function.
Of course, this is just a simple example. In reality, fault tolerance must be considered in every layer of a system (not just the database), and there are degrees of fault tolerance. While Application 2 is more fault tolerant than Application 1, it’s still less fault tolerant than many modern applications. (See examples of fault-tolerant application architecture.)
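As a client-side sketch of the Application 2 setup (hostnames are hypothetical, and the actual promotion of a standby to primary is handled by the database or its orchestration, not by this code), using psycopg2 against PostgreSQL-compatible instances:

```python
import psycopg2  # assumes PostgreSQL-compatible primary and standby instances

# Hypothetical connection strings for the two database instances.
PRIMARY_DSN = "host=db-primary.internal dbname=app user=app"
STANDBY_DSN = "host=db-standby.internal dbname=app user=app"


def connect_with_failover():
    """Try the primary first; fall back to the standby replica if it is down."""
    for dsn in (PRIMARY_DSN, STANDBY_DSN):
        try:
            return psycopg2.connect(dsn, connect_timeout=2)
        except psycopg2.OperationalError:
            continue  # this instance is unreachable, try the next one
    raise RuntimeError("both database instances are unavailable")
```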
Fault tolerance can also be achieved in a variety of ways. These are some of the most common approaches to achieving fault tolerance:
Multiple hardware systems capable of doing the same work. For example, Application 2 in our diagram above could have its two databases located on two different physical servers, potentially in different locations. That way, if the primary database server experiences an error, a hardware failure, or a power outage, the other server might not be affected.
Multiple instances of software capable of doing the same work. For example, many modern applications make use of containerization platforms such as Kubernetes so that they can run multiple instances of software services. One reason for this is so that if one instance encounters an error or goes offline, traffic can be routed to other instances to maintain application functionality.
Backup sources of power, such as generators, are often used in on-premises systems to protect the application from being knocked offline if power to the servers is impacted by, for example, the weather. That type of outage is more common than you might expect.
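A minimal sketch of the second approach above, routing traffic around unhealthy software instances (instance URLs are hypothetical, and the /healthz probe is a common convention rather than a standard):

```python
import itertools
import urllib.request

# Hypothetical pool of identical service instances.
INSTANCES = [
    "http://app-1.internal:8080",
    "http://app-2.internal:8080",
    "http://app-3.internal:8080",
]
_cycle = itertools.cycle(INSTANCES)


def healthy(base_url: str) -> bool:
    """Probe the instance's health endpoint with a short timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=1) as resp:
            return resp.status == 200
    except OSError:
        return False  # unreachable or timed out


def pick_instance() -> str:
    """Round-robin over instances, skipping any that fail their health check."""
    for _ in range(len(INSTANCES)):
        candidate = next(_cycle)
        if healthy(candidate):
            return candidate
    raise RuntimeError("no healthy instances available")

# pick_instance() would be called by the request router for each request.
```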
Fault tolerance vs. high availability
High availability refers to a system’s total uptime, and achieving high availability is one of the primary reasons architects look to build fault-tolerant systems.
Technically, fault tolerance and high availability are not exactly the same thing. Keeping an application highly available is not simply a matter of making it fault tolerant. A highly fault-tolerant application could still fail to achieve high availability if, for example, it has to be taken offline regularly to upgrade software components, change the database schema, etc.
In practice, however, the two are often closely connected, and it’s difficult to achieve high availability without robust, fault-tolerant systems.
Fault tolerance goals
Building fault-tolerant systems is more complex, and generally more expensive, than building systems without fault tolerance. If we think back to our simple example from earlier, Application 2 is more fault tolerant, but it also has to pay for and maintain an additional database server. Thus, it’s important to assess the level of fault tolerance your application requires and build your system accordingly.
Normal functioning vs. graceful degradation
When designing fault-tolerant systems, you may want the application to remain online and fully functional at all times. In this case, your goal is normal functioning — you want your application, and by extension the user’s experience, to remain unchanged even if an element of your system fails or is knocked offline.
Another approach is aiming for what’s called graceful degradation, where outages and errors are allowed to impact functionality and degrade the user experience, but not knock the application out entirely. For example, if a software instance encounters an error during a period of heavy traffic, the application experience may slow for other users, and certain features might become unavailable.
Building for normal functioning obviously provides for a superior user experience, but it’s also generally more expensive. The goals for a specific application, then, might depend on what it’s used for. Mission-critical applications and systems will likely need to maintain normal functioning in all but the most dire of disasters, whereas it might make economic sense to allow less essential systems to degrade gracefully.
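A sketch of graceful degradation (the service and fallbacks are hypothetical): if the personalized path fails, serve the last known good result, and failing that a generic default, rather than an error page.

```python
import time

_last_good: dict[str, tuple[float, list[str]]] = {}  # per-user last good result


def fetch_personalized(user_id: str) -> list[str]:
    # Stand-in for a call to a recommendation service that may fail
    # under heavy load or during an outage.
    raise TimeoutError("recommendation service unavailable")


def recommendations(user_id: str) -> list[str]:
    try:
        recs = fetch_personalized(user_id)  # normal functioning
        _last_good[user_id] = (time.time(), recs)
        return recs
    except Exception:
        if user_id in _last_good:
            return _last_good[user_id][1]  # degrade: stale but usable
        return ["top-seller-1", "top-seller-2"]  # degrade further: generic list


print(recommendations("u123"))  # falls back to the generic list
```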
Setting survival goals
Achieving 100% fault tolerance isn’t really possible, so the question architects generally have to answer when designing fault-tolerant systems is how much they want to be able to survive.
Survival goals can vary, but here are some common ones for applications that run on one or more of the public clouds, in ascending order of resilience:
- Survive node failure. Running instances of your software on multiple nodes (often different physical servers) within the same AZ (data center) can allow your application to survive faults (such as hardware failures or errors) on one or more of those nodes.
- Survive AZ failure. Running instances of your software across multiple availability zones (data centers) within a cloud region will allow you to survive AZ outages, such as a specific data center losing power during a storm.
- Survive region failure. Running instances of your software across multiple cloud regions can allow you to survive an outage affecting an entire region, such as the AWS US-east-1 outage mentioned at the beginning of this post.
- Survive cloud provider failure. Running instances of your software both in the cloud and on-premises, or across multiple cloud providers, can allow you to survive even a full cloud provider outage.
These are not the only possible survival goals, of course, and fault tolerance is only one aspect of surviving outages and other disasters. Architects also need to consider factors such as recovery time objective (RTO) and recovery point objective (RPO) to minimize the negative impact when outages do occur. But considering your fault tolerance goals is also important, as they will affect both the architecture of your application and its costs.
The cost of fault tolerance
When architecting fault-tolerant systems, another important consideration is cost. This is a difficult and very case-specific factor, but it’s important to remember that while there are costs inherent with choosing and using more fault-tolerant architectures and tools, there are also significant costs associated with not choosing a high level of fault tolerance.
For example, operating multiple instances of your database across multiple cloud regions is likely to cost more on the balance sheet than operating a single instance in a single region. However, there are a few things you must also consider:
- What does an outage cost in dollars? For mission-critical systems, even a few minutes of downtime can lead to millions in lost revenue.
- What does an outage cost in reputation damage? Consumers are demanding, particularly in certain business verticals. An application outage of just a few minutes, for example, could be enough to scare millions of customers away from a bank.
- What does an outage cost in engineering hours? Any time your team spends recovering from an outage is time they’re not spending building new features or doing other important work.
- What does an outage cost in team morale and retention / hiring? Outages also often come at inconvenient times. The US-east-1 outage, for example, came the day before Thanksgiving, when most US-based engineers were on vacation, forcing them to rush into the office on a holiday or in the middle of the night to deal with an emergency. Great engineers generally have a lot of choices when it comes to where they work, and will avoid working anywhere where those sorts of emergencies are likely to disrupt their lives.
These are just a few of the costs associated with not achieving a high level of fault tolerance.
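To make that tradeoff concrete, a back-of-the-envelope comparison (all numbers are illustrative, not data from any real company):

```python
# Illustrative inputs; plug in your own estimates.
revenue_per_minute = 5_000             # dollars of revenue at risk per minute down
expected_outage_minutes_per_year = 60  # without added redundancy
redundancy_cost_per_year = 120_000     # extra infrastructure + operational cost

expected_outage_cost = revenue_per_minute * expected_outage_minutes_per_year
print(f"Expected outage cost: ${expected_outage_cost:,}/year")
print(f"Redundancy cost:      ${redundancy_cost_per_year:,}/year")
print("Redundancy pays off" if redundancy_cost_per_year < expected_outage_cost
      else "Redundancy costs more than expected losses")
```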
Consider, for example, a company running a large, business-critical MySQL database. The company could have made that database more fault tolerant by manually sharding it, but that approach is technically complex and requires a lot of work to execute and maintain. Instead, the company chose to migrate to CockroachDB Dedicated, a managed database-as-a-service that is inherently distributed and fault tolerant.
Although CockroachDB Dedicated is more expensive than MySQL (which is free), migrating to CockroachDB enabled the company to save millions in labor costs, because it automates the labor-intensive manual sharding process and resolves many of the technical complexities that manual sharding would introduce.
Ultimately, the company achieved a database that is at least as fault tolerant as manually sharded MySQL while spending millions of dollars less than manual sharding would ultimately have cost.
This is not to say that CockroachDB or any specific tool or platform will be the most affordable option for all use cases. However, it’s important to recognize that the methods you choose for achieving your fault tolerance goals can have a significant impact on your costs in both the short and long term.
Fault-tolerant architecture examples
There are many ways to achieve fault tolerance, but let’s take a look at a very common approach for modern applications: adopting a cloud-based, multi-region architecture built around containerization services such as Kubernetes.

An example of a fault-tolerant multi-region architecture.
This application could survive a node, AZ, or even region failure affecting its application layer, its database layer, or both. Let’s take a closer look at how that’s possible.
Achieving fault tolerance in the application layer
In the diagram above, the application is spread across multiple regions, with each region having its own Kubernetes cluster.
Within each region, the application is built with microservices that execute specific tasks, and these microservices are typically operated inside Kubernetes pods. This allows for much greater fault tolerance, since a new pod with a new instance can be started up whenever an existing pod encounters an error. This approach also makes the application easier to scale horizontally — as the load on a specific service increases, additional instances of that service can be added in real time to handle the load, and then removed when the load dies down again and they’re no longer needed.
Achieving fault tolerance in the persistence (database) layer
The application in the diagram above takes a similar approach in the database layer. Here, CockroachDB is chosen because its distributed, node-based nature naturally provides a high level of fault tolerance and the same flexibility when it comes to scaling up and down horizontally. Being a distributed SQL database, it also allows for strong consistency guarantees, which is important for most transactional workloads.
CockroachDB also makes sense for this architecture because although it’s a distributed database, it can be treated like a single-instance Postgres database by the application — almost all the complexity of distributing the data to meet your application’s availability and survival goals happens under the hood.
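For example, because CockroachDB speaks the PostgreSQL wire protocol, an application can connect to a multi-node cluster with an ordinary Postgres driver; the hostname and credentials below are hypothetical.

```python
import psycopg2  # CockroachDB is compatible with PostgreSQL drivers

# Hypothetical connection string; a multi-node cluster sits behind it.
conn = psycopg2.connect(
    "postgresql://app@cockroach-lb.internal:26257/appdb?sslmode=require"
)
conn.autocommit = True

with conn.cursor() as cur:
    # The application issues ordinary SQL; data distribution and
    # replication across nodes happen under the hood.
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)")
    cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100)")
    cur.execute("SELECT id, balance FROM accounts")
    print(cur.fetchall())
```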