What is high availability (HA)?
In the context of IT operations, the term High Availability refers to a system (a network, a server array or cluster, etc.) that is designed to avoid loss of service by reducing or managing failures and minimizing planned downtime.
A system is expected to be highly available when life, health, and well-being – including economic well-being – are at stake.
In information technology, system or component availability is expressed as a percentage of yearly uptime. Service Level Agreements (SLAs) generally refer to these availability percentages when calculating billing. Using the unachievable ideal of 100% availability as a baseline, the goal of the highest levels of service availability is considered to be “five nines” – 99.999% availability.
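To make these percentages concrete, here is a minimal sketch that converts an availability figure such as “five nines” into the maximum downtime it permits per year (assuming a 365-day year and ignoring scheduled maintenance windows):

```python
# Convert an availability percentage into maximum allowed yearly downtime.
# Assumes a 365-day year for simplicity.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def max_downtime_minutes(availability_pct: float) -> float:
    """Maximum yearly downtime (in minutes) allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {max_downtime_minutes(pct):.2f} min downtime/yr")
```

At “five nines,” the system may be down only about 5.26 minutes in an entire year, which illustrates why that level of availability is reserved for the most critical services.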
High availability management
High availability can be achieved only with thorough planning and consistent monitoring.
A good starting point for high availability planning involves the identification of services that must be available for business continuity, and those that should be available.
For each level of service, from must to should, it is also worthwhile to decide how far the organization is willing to go to ensure availability. This should be based on budget, staff expertise, and overall tolerance for service outages.
Next, identify the systems or components that comprise each service, and list the possible points of failure for these systems. Each point of failure should be initially checked, a failure tolerance baseline established, and frequency of ongoing monitoring defined. Some key questions to ask about common points of failure include:
- Network availability: How available is your network, compared to the SLA with your Internet Service Provider (ISP)? Check this with Internet Control Message Protocol (ICMP) echo requests (pings), via your network monitoring software.
- Bandwidth usage: How much bandwidth does your system consume, at both peak and idle times? Get this information from managed routers and Internet Information Services (IIS) log analysis. Use it to plan bandwidth allocation for known peaks (end-of-year crushes, key shopping days, etc.), and avoid inadequate bandwidth scenarios.
- HTTP availability and visibility: Are you monitoring system HTTP requests – internally, per ISP, and per geographic location? Problems with internal requests can serve as an early warning of outward-facing problems. Track HTTP requests from ISP networks to determine whether or not users of these networks can access your service, and monitor requests from different geographic locations to ensure users from anywhere in the world are able to use your services.
- System availability: Are you keeping track of abnormal and normal operating system, database, and enterprise server system shutdowns?
- Performance metrics: Do you monitor the number of users that visit your site or use enterprise applications, and compare these numbers to latency of requests and historical CPU utilization? Have you grouped servers by function, and do you monitor disk capacity and I/O throughput? Do you check fiber channel controller and switch bandwidth, and keep an eye on overall system memory usage?
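The HTTP availability checks described above can be sketched with the standard library alone. Real monitoring software would run such probes on a schedule, from several networks and geographic locations; the URL and timeout used here are placeholder assumptions.

```python
# A minimal HTTP availability probe: one request, returning whether the
# endpoint responded successfully and how long the request took.
import time
from urllib.request import urlopen
from urllib.error import URLError

def check_http(url: str, timeout: float = 5.0) -> dict:
    """Probe a URL once; report availability and observed latency."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (URLError, OSError):
        ok = False
    return {
        "url": url,
        "available": ok,
        "latency_s": round(time.monotonic() - start, 3),
    }
```

Recording the latency alongside availability lets the same probe feed the performance metrics discussed above, such as request latency compared against user counts.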
High availability and disaster recovery
High availability planning is designed to prevent downtime, and disaster recovery planning is designed to minimize its duration and impact when it does occur. These are two sides of the same business continuity coin, which are defined via:
- Recovery Time Objective (RTO) – the maximum amount of time the business can tolerate operating without the system
- Recovery Point Objective (RPO) – the maximum acceptable age of the data once systems recover; in other words, how much recent data the business can afford to lose.
During the planning stages these two metrics should be used to establish goals and priorities. For example, systems that are defined as mission-critical during high availability planning will of necessity have the lowest possible RTO in disaster recovery planning.
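One way to turn the RPO into an operational check is to verify that the most recent successful backup is still inside the RPO window, meaning a recovery started now would not lose more data than the objective allows. This is a hedged sketch; the one-hour RPO and the timestamps are example values.

```python
# Check whether the newest recovery point satisfies the RPO.
from datetime import datetime, timedelta, timezone
from typing import Optional

def rpo_satisfied(last_backup: datetime, rpo: timedelta,
                  now: Optional[datetime] = None) -> bool:
    """True if recovering now would lose no more data than the RPO allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup <= rpo

rpo = timedelta(hours=1)  # example objective: at most one hour of data loss
last_backup = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(rpo_satisfied(last_backup, rpo, now))  # 30 minutes of data at risk -> True
```

A mission-critical system with a near-zero RPO would fail this check with any batch backup schedule, which is exactly why such systems tend to require continuous replication rather than periodic backups.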
The same is true with data synchronization and replication, backup, and failover – all key aspects of disaster recovery planning. How your organization chooses to synchronize data for a given system should be a direct outcome of that system’s importance as determined in high availability planning.
Whether a hot or warm failover option is maintained for a given system should depend on that system’s must-versus-should status, as discussed above.
High availability planning, much like disaster recovery planning, should also include the right combination of internal resources and vendor-supported solutions.
For example, maintaining an off-premises failover system, which monitors mission-critical system health and reroutes traffic in real time to a backup system or data center in the event of failure, can be crucial to high availability. Also, cloud-based synchronization options can ensure that “hot-hot” failover solutions for crucial systems always have access to the most updated data.
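The failover decision itself can be reduced to a simple rule: route to the primary while it is healthy, otherwise fall back to the standby. The sketch below illustrates only that logic; the endpoint names and the health-check function are assumptions, and real deployments reroute traffic via DNS updates or load balancers rather than application code.

```python
# Simplified failover routing: prefer the primary endpoint, fall back to
# the standby when the health check reports the primary as down.
from typing import Callable

def choose_endpoint(primary: str, standby: str,
                    is_healthy: Callable[[str], bool]) -> str:
    """Return the primary endpoint if healthy, otherwise the standby."""
    return primary if is_healthy(primary) else standby

# Example with a stubbed health check that reports the primary as down:
active = choose_endpoint("app.primary.example", "app.dr.example",
                         is_healthy=lambda host: host != "app.primary.example")
print(active)  # -> app.dr.example
```

In a hot-hot configuration, both endpoints serve traffic continuously and this decision becomes load balancing rather than failover, which is why such setups depend on the synchronized data mentioned above.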
Similar to disaster recovery planning, high availability planning ensures that systems crucial to your organization will continue to provide optimal service.
Keeping disaster recovery and high availability planning in line with each other can help ensure that downtime stays consistently minimal and that mean time between failures (MTBF – the predicted elapsed time between system failures) stays as long as possible.