Fault Tolerance Definition

KZero Staff

Jul 27, 2023

What is Fault Tolerance?

Fault tolerance is a system’s ability to keep doing its job when faults or adverse events affect one or more of its components. For example, a critical web application should continue operating even if the power supply fails on a web server supporting it.

What are Single Points of Failure?

One of the key concepts when discussing fault tolerance or resiliency is the single point of failure (SPOF). Implementing fault tolerance is largely a matter of identifying and eliminating SPOFs.

An SPOF is a component that is critical to a system’s operation. If the SPOF breaks or operates less efficiently, then the entire system suffers.
For example, a computer has a number of potential SPOFs. If the power supply, central processing unit (CPU), network interface card (NIC), or any of a number of other components dies, then the computer may become unusable until that component is repaired or replaced.
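
One way to make the idea concrete is to list a system's components by role and flag every role that only one component fills. The following is a minimal sketch, not a complete audit method; the inventory layout and the component names are hypothetical.

```python
def find_spofs(inventory):
    """Return the roles that are filled by exactly one component.

    `inventory` maps a role (e.g. "power_supply") to the list of
    components that can fill it. This structure is an assumption
    made for illustration.
    """
    return sorted(role for role, parts in inventory.items() if len(parts) == 1)


# Hypothetical single computer: only the NIC has a spare.
computer = {
    "power_supply": ["psu-a"],
    "cpu": ["cpu-0"],
    "nic": ["nic-a", "nic-b"],
}
print(find_spofs(computer))  # ['cpu', 'power_supply']
```

A role with two or more components is not automatically safe, since the spares may share a hidden dependency, but a role with only one is always worth a closer look.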

Designing for Fault Tolerance

A fault-tolerant or resilient system is one where no single component is essential. To achieve fault tolerance, it’s necessary to identify potential SPOFs and create redundancy for them.

For example, consider a crucial web application. An organization may need to take multiple steps to make it fault tolerant.

Server Redundancy

If that web application is running on a single server, then anything that causes that server to go down will bring the web application down as well. For this reason, an organization may set up multiple web servers behind a load balancer. If one server fails, then the load balancer will send traffic to the other servers in the cluster.
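
The routing behavior described here can be sketched as a toy round-robin balancer that skips servers marked as down. This is an illustration under simplifying assumptions, not how any particular load balancer product works; the class name, method names, and server names are hypothetical.

```python
class LoadBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = {server: True for server in self.servers}
        self._cursor = 0  # index of the next server to try

    def mark_down(self, server):
        self.healthy[server] = False

    def mark_up(self, server):
        self.healthy[server] = True

    def pick(self):
        # Try each server at most once, starting from the rotation cursor.
        for _ in range(len(self.servers)):
            server = self.servers[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.servers)
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["web-1", "web-2", "web-3"])
lb.mark_down("web-2")  # simulate a failed server
print([lb.pick() for _ in range(4)])  # ['web-1', 'web-3', 'web-1', 'web-3']
```

Note that in this sketch the balancer itself is a single component, so a real deployment would also need redundancy at that layer.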

Alternatively, the organization may implement a backup system for that server. The backup could regularly check whether the primary server is offline and take over if it is. The organization could also switch to the backup manually when it sees that the primary has gone down.
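
The automatic check-and-take-over approach might look like the sketch below. It assumes a caller-supplied `is_alive` health check and a threshold of consecutive failed checks before switching; neither detail comes from the text above, and both are only one reasonable way to avoid failing over on a single missed check.

```python
class FailoverMonitor:
    """Switch from a primary to a backup after repeated failed health checks."""

    def __init__(self, primary, backup, is_alive, max_failures=3):
        self.primary = primary
        self.backup = backup
        self.is_alive = is_alive          # hypothetical callable: server -> bool
        self.max_failures = max_failures  # assumed threshold, tune as needed
        self.failures = 0
        self.active = primary

    def check(self):
        """Run one health check and return the server that should take traffic."""
        if self.active == self.backup:
            # Already failed over. Switching back is left as a manual step here.
            return self.active
        if self.is_alive(self.primary):
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.active = self.backup
        return self.active
```

In practice `check` would run on a timer. Requiring several consecutive failures trades a slightly longer outage for fewer false alarms.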

Backup Power and Internet Connectivity

If all of the servers hosting the critical web application are in the same data center, they rely on the same physical infrastructure. An event that brings down power or Internet connectivity for one will affect the others as well. For this reason, an organization might choose to implement redundant power and Internet connectivity or geographically distribute the servers to reduce the risk of all of them being affected by a localized incident.
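
A quick way to check for this kind of shared-location risk is to group servers by site and confirm that they span more than one. The sketch below assumes a simple server-to-site mapping; the server and data center names are hypothetical.

```python
from collections import defaultdict


def servers_by_site(placement):
    """Group servers by the site (data center) that hosts them."""
    sites = defaultdict(list)
    for server, site in placement.items():
        sites[site].append(server)
    return dict(sites)


def survives_single_site_outage(placement):
    """True if losing any one site still leaves at least one server running."""
    return len(servers_by_site(placement)) >= 2


placement = {"web-1": "dc-east", "web-2": "dc-east"}
print(survives_single_site_outage(placement))  # False

placement["web-3"] = "dc-west"
print(survives_single_site_outage(placement))  # True
```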

Dependencies

An organization may design a web application to be resilient and fault tolerant by using redundant servers, power, network connectivity, etc. However, that application may still be vulnerable to disruption if services that it relies on are potential SPOFs.

For example, a web application may need to make regular queries to a database to function properly. If that database is not fault tolerant, then the application could be rendered unusable by any event that brings down the database.
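
One common mitigation is to give the application more than one database endpoint to try. The sketch below assumes a primary and a replica that both accept the same read query, and a caller-supplied `run_query` function that raises `ConnectionError` when an endpoint is unreachable; all of these names are hypothetical, and keeping the replica in sync is outside the scope of the example.

```python
def query_with_fallback(sql, endpoints, run_query):
    """Try each database endpoint in order and return the first result.

    `run_query(endpoint, sql)` is a hypothetical callable supplied by the
    caller. It should raise ConnectionError if the endpoint is down.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return run_query(endpoint, sql)
        except ConnectionError as exc:
            errors.append((endpoint, str(exc)))  # remember why it failed
    raise ConnectionError(f"all database endpoints failed: {errors}")


# Simulated driver: the primary is down, the replica answers.
def fake_run_query(endpoint, sql):
    if endpoint == "db-primary":
        raise ConnectionError("connection refused")
    return f"{endpoint} ran: {sql}"


print(query_with_fallback("SELECT 1", ["db-primary", "db-replica"], fake_run_query))
# db-replica ran: SELECT 1
```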

Conclusion

As IT systems become more vital to companies’ operations and their customers, fault tolerance for critical systems has become essential. Achieving fault tolerance involves identifying potential single points of failure and implementing redundancy.
