The Horrific Reality of Catastrophic Failure
The Exorcist doesn’t hold a candle to the idea of a catastrophic failure wiping out your data, your web presence… your entire operation (cue the vomit). It should scare you.
Our livelihoods—our lives—are increasingly digital. Your IT infrastructure is integral to your operations. Whether it’s your website, your database, or your inter-office communications and operations, downtime is intolerable. A catastrophe-level shutdown is unfathomable.
Fortunately, there are plenty of ways to safeguard your business from the worst. You can read about how to prevent a disaster with redundancies, a high availability (HA) infrastructure, and other solutions, here, here, and here. However, things happen. Even the best-laid plans go awry, and sometimes a tornado comes through and takes out your data center.
In the event that something catastrophic does occur, you need to be ready and the best way to be ready is to understand exactly what happens if (and with the right protection that’s a pretty big if) the walls you’ve built around your business come tumbling down. You need to expect the unexpected, so you’re prepared for anything that comes your way.
Failures Occur. When?
There isn’t an infrastructure out there (no matter how well designed, implemented, or maintained) that is impervious to failure. They happen. That’s why HA systems are a thing; it’s why you have redundancies, backups, and other preventative measures. But, where do they occur? When do they occur?
Well, there are 5 particularly vulnerable points in your infrastructure—housing, hardware, ISP, software, and data.
Your first vulnerable point, housing, is your physical accommodations and includes the building that houses your servers/computers, your climate controls, and your electrical supply. Your housing is only vulnerable in highly specific instances (natural disasters, brownouts, blackouts, etc.), and those risks are fairly easy to mitigate.
For example, two separate sources of power, uninterruptible power supplies, battery backups, restricted access to server rooms, routine building maintenance, etc. can reliably safeguard this vulnerability in your infrastructure. This goes for your ISP (fiber, cable, wireless) and other vendors, as well. Thoroughly vetted, high-quality vendors will have their own HA systems in place, making this vulnerability in your infrastructure a low probability for catastrophic failure.
However, your hardware, software, and data are significantly more vulnerable even though there are steps your company can take to prevent failures. Servers, computers, peripherals, and network equipment age, break down, and fail; it’s just the reality of physical systems. But non-physical systems (productivity and communication software, websites, applications, etc.) are also open to failure from bugs, viruses, human error, and external attacks such as DDoS and hacking.
Finally, your data can get corrupted by itself or can fail as a result of another failure in the chain; a hardware failure, for example, could wipe out your data. While some failures can be predicted and prevented—regular maintenance and replacement of equipment to prevent breakdowns, for example—others simply can’t be anticipated. A sudden equipment failure, power outages, natural disasters, a DDoS attack; these can all occur seemingly out of nowhere. You simply have to have a plan in place to react to these events in case they do (almost inevitably) happen.
A good rule of thumb is to create an infrastructure that doesn’t have (or at least attempts to eliminate) a single point of failure. All of these vulnerability points—housing, hardware, ISP, software, and data—are susceptible to single points of failure.
- Housing? Make sure you have a physical space you can use in case the first space becomes unviable.
- Hardware? Make sure you have redundant equipment you can swap in, in case of a failure.
- ISP, software, data? Redundancies, backups, and backups of backups. Be prepared.
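To make the rule of thumb concrete, here is a minimal sketch of a single-point-of-failure check. The inventory, the component names, and the "fewer than two components means a single point of failure" rule are all assumptions made for illustration; a real audit would look at dependencies, not just counts.

```python
from collections import Counter

# The five vulnerable points named in this article.
CATEGORIES = ["housing", "hardware", "isp", "software", "data"]

# Hypothetical inventory: (category, component) pairs for an imaginary company.
INVENTORY = [
    ("housing", "primary data center"),
    ("hardware", "server A"),
    ("hardware", "server B"),
    ("isp", "fiber provider"),
    ("software", "web app instance 1"),
    ("software", "web app instance 2"),
    ("data", "nightly backup"),
]


def single_points_of_failure(inventory):
    """Return every category that has fewer than two independent components."""
    counts = Counter(category for category, _ in inventory)
    return [category for category in CATEGORIES if counts[category] < 2]


spofs = single_points_of_failure(INVENTORY)
print(spofs)
```

In this made-up inventory, housing, ISP, and data each rely on a single component, so those are the places to add a second site, a second provider, and a second copy.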
What is the Worst Case Scenario?
In 2007, according to the Los Angeles Times, “a malfunctioning network interface card on a single desktop computer in the Tom Bradley International Terminal at LAX” brought international air travel to an absolute standstill for nine hours. During that time, 17,000 passengers were stranded on board: because the affected system was used by U.S. Customs to authorize entry and exit, no one was allowed to disembark. Beyond stopping international travel in its tracks, the outage meant U.S. Customs and the airlines themselves had to supply food, water, and diapers to passengers, and aircraft had to keep refueling to keep their environmental controls operating. Oh, and shortly after the system was restored, again according to the Los Angeles Times, it gave out again: “The second outage was caused by a power supply failure.” Now that’s a worst case scenario. You’re not U.S. Customs or LAX, but you can relate.
Almost nine hours of downtime in a single day exceeds what 81% of businesses said they could tolerate in a single year (thanks Information Technology and Intelligence Corp).
Everyone’s worst case scenario is different, but a massive failure that cripples your infrastructure for even a few hours in a single day can have irrevocably adverse effects on your revenue, your workflow, and your relationship with your clients/customers. Any significant downtime should be a cause for concern.
Is a few hours of downtime in a single day a worst case scenario? Maybe not, but a few such days in a row, or even spread over the course of a year, could be.
Automatic Failover vs. No Automatic Failover
While a systems failure is a spectrum of what can go wrong, there are two scenarios on either end—an automatic failover and a catastrophic failure in which a failover doesn’t take place either manually or automatically. Failover systems themselves can fail, but it’s more likely that there isn’t a system in place to automate a switch to a redundant system.
What follows is a look into what actually happens during an automatic failover and what would happen if such a system wasn’t in place.
What Happens During an Automatic Failover
Several scenarios can trigger a failover: a secondary node does not receive a heartbeat signal; a primary node experiences a hardware failure; a network interface fails; your HA monitor detects a significant dip in performance; or a failover command is manually sent. When a secondary node does not receive a heartbeat signal (a synchronous, two-way monitor of server operation and performance), the possible causes include a network failure, a hardware failure, or a software crash/reboot.
As you can see, an automatic failover is triggered (predominantly) by an equipment failure. Any time a piece of equipment stops operating, or even begins to perform below its expected values, a failover will be triggered.
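The trigger conditions above can be sketched as a simple check. This is only an illustration, not how any particular HA monitor is implemented: the `NodeStatus` fields, the five-second heartbeat timeout, and the 500 ms response threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed thresholds for illustration only; real monitors make these configurable.
HEARTBEAT_TIMEOUT_S = 5.0
MAX_RESPONSE_MS = 500.0


@dataclass
class NodeStatus:
    """What a secondary node currently knows about the primary (hypothetical)."""
    seconds_since_heartbeat: float
    response_ms: float
    manual_failover_requested: bool = False


def failover_reason(status: NodeStatus) -> Optional[str]:
    """Return why a failover should start, or None if the primary looks healthy."""
    if status.manual_failover_requested:
        return "manual failover command"
    if status.seconds_since_heartbeat > HEARTBEAT_TIMEOUT_S:
        # Could be a network, hardware, or software failure; the monitor
        # only sees that the heartbeat stopped.
        return "heartbeat lost"
    if status.response_ms > MAX_RESPONSE_MS:
        return "performance below threshold"
    return None


print(failover_reason(NodeStatus(seconds_since_heartbeat=1.0, response_ms=120.0)))
print(failover_reason(NodeStatus(seconds_since_heartbeat=12.0, response_ms=120.0)))
```

Note that a lost heartbeat tells the secondary that something is wrong, not what is wrong, which is why the error log created during failover matters.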
It should be noted that there is a difference between a switchover and failover. A switchover is simply a role reversal of the primary and a secondary node; a secondary node is chosen to become the primary node and the primary node becomes a secondary node. This is almost always anticipated and done intentionally. A common switchover scenario is maintenance and upgrading. In a switchover, there is no data loss.
A failover, on the other hand, is a role reversal of the primary node and a secondary node in the event of a systems failure (network, hardware, software, power, etc.). A failover may result in data loss depending on the safeguards in place.
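A toy model can make the data-loss difference easier to see. It assumes asynchronous replication, where the secondary has received only the first `replicated_count` writes from the primary; the function names and the sample writes are invented for this sketch.

```python
def switchover(primary_writes, replicated_count):
    """Planned role swap: the remaining writes are copied over before handover."""
    already_copied = list(primary_writes[:replicated_count])
    still_pending = list(primary_writes[replicated_count:])
    new_primary_writes = already_copied + still_pending
    return new_primary_writes, 0  # nothing is lost


def failover(primary_writes, replicated_count):
    """Unplanned role swap: the primary is gone, so pending writes are lost."""
    new_primary_writes = list(primary_writes[:replicated_count])
    lost = len(primary_writes) - replicated_count
    return new_primary_writes, lost


writes = ["order-1", "order-2", "order-3", "order-4"]
replicated = 3  # the secondary has not yet received "order-4"

planned_state, planned_lost = switchover(writes, replicated)
unplanned_state, unplanned_lost = failover(writes, replicated)
print(planned_lost, unplanned_lost)
```

If every write is replicated before it is acknowledged (so `replicated` equals `len(writes)`), the failover in this model loses nothing, which is one example of the "safeguards in place" mentioned above.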
So, what does happen in an automatic failover? Let’s break it down:
- An event occurs that initiates failover. This could be a network failure, a power outage, a software failure, or a hardware failure. In all cases, the heartbeat link between the primary node and the elected secondary node is severed and failover is initiated.
- An error log (why was a failover initiated?) is created.
- The elected secondary node takes on the role of the primary node.
- The primary node is removed from the cluster.
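The four steps above can be sketched as a small state change on a cluster. This is a simplified model under stated assumptions: the `Cluster` class and node names are hypothetical, and "elected" here simply means the first secondary in the list, whereas real clusters use an election or priority scheme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """A hypothetical cluster with one primary and an ordered list of secondaries."""
    primary: str
    secondaries: List[str]
    error_log: List[str] = field(default_factory=list)

    def fail_over(self, reason: str) -> str:
        """Run the failover steps and return the node removed from the cluster."""
        if not self.secondaries:
            raise RuntimeError("no secondary node available to take over")
        failed = self.primary
        # Step 2: record why the failover was initiated.
        self.error_log.append(f"failover from {failed}: {reason}")
        # Step 3: the elected secondary (assumed: first in the list) becomes primary.
        self.primary = self.secondaries.pop(0)
        # Step 4: the failed primary is not re-added, so it is out of the cluster.
        return failed


cluster = Cluster(primary="node-a", secondaries=["node-b", "node-c"])
removed = cluster.fail_over("heartbeat lost")  # Step 1: the triggering event
print(cluster.primary, cluster.secondaries, removed)
```

After the call, `node-b` is serving as primary, `node-c` remains as the only secondary, and `node-a` stays out of the cluster until someone repairs it and adds it back.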
What Happens With No Automatic Failover
Ok, so you don’t have an automatic failover safeguard in place and something breaks—or, even worse, a lot of things break. What happens? Well, that’s going to depend on what systems you have in place. If you have working backups, but no automatic failover systems in place, you’ll retain your data. However, depending on your infrastructure, the amount of time it will take to recognize a failure and the amount of time it takes to manually switch over will be much longer than an automatic solution. However, if your system is sketchy and there are vulnerabilities throughout, things get significantly more complicated and need to be addressed on a case-by-case basis. We can, though, examine what happens in systems with one or more single points of failure at critical junctures.
You’re sure to remember housing, hardware, ISP, software, and data.
- Housing. In May of 2011, a major tornado ripped through Joplin, MO. In the tornado’s path were a hospital and the hospital’s adjoining data center. The data center held both electronic and physical records. Serendipitously, the hospital IT staff was in the middle of mass digitization and data migration to an off-site central center with redundant satellites, which meant that most of the data was saved (although some records were irrevocably destroyed) and the hospital was able to mobilize services quickly. However, if the tornado had come any earlier, the data loss would have been extreme. While this scenario (indeed, any IT housing disaster) is rare, it does happen and there are ways to safeguard your equipment and your data. According to Pergravis, and offsite backups notwithstanding, the best data center is constructed from reinforced concrete and is designed as a box (the data center) within a shell (the structure surrounding the data center), which creates a secondary barrier. This is, obviously, a pie-in-the-sky scenario, but Pergravis does offer simpler solutions for shoring up an existing data center. For example, they suggest locating your data center in the middle of your facility, away from exterior walls. If that’s not an option, however, removing and sealing exterior windows will help safeguard your equipment from weather damage.
- Hardware. The key to any secure system (the key to HA, as we’ve discussed here and here) is redundancy. That includes redundant hardware that you might not immediately think of. A few years ago, Microsoft Azure Cloud services in Japan went down for an extended period of time because of a bad rotary uninterruptible power supply (RUPS). As the temperatures in the data center rose, equipment began shutting itself off in order to preserve data, disrupting cloud service in the Japan East region. It’s not always going to be a storage device that fails or even a network appliance. Besides, most systems are over-engineered in terms of server component, data backup, and network equipment redundancies. It’s up to you to work with your company to conceive of, prepare for, and shore up any weaknesses in your IT infrastructure—if you prepare for the worst, it will never come.
- ISP. According to the Uptime Institute, between 2016 and 2018, 27% of all data center outages were network-related. As more and more systems migrate to the cloud and more and more services are network-dependent, redundant network solutions are becoming increasingly important. In some cases, that could mean two or more providers or two or more kinds of services—fiber, cable, and wireless, as an example.
- Software. Whether it’s unintended consequences (Y2K) or a straight-up engineering faceplant (in 1999, NASA lost the Mars Climate Orbiter because a contractor’s software used imperial units instead of metric like it was supposed to), software is vulnerable. When software goes bad, there’s usually a human to blame, and that’s true for cyber attacks, too; DDoS attacks and other cyber intrusions are on the rise. According to IndustryWeek, in 2018 there was, “…a 350% increase in ransomware attacks, a 250% increase in spoofing or business email compromise (BEC) attacks and a 70% increase in spear-phishing attacks in companies overall.” What does this mean for you? It means defensive redundancies: threat detection, firewalls, encryption, etc. It also means having a robust HA infrastructure in case you do come under attack. With an HA system with automatic failover, you can quickly take down the affected systems and bring up clean ones.
- Data. In 2015, a Google data center in Belgium was struck—multiple times in quick succession—by lightning. While most of the servers were unaffected, some users lost data. Data redundancy is the cornerstone of any HA infrastructure and new and improved options for data retention are constantly emerging. With the increase in virtual networks, virtual machines, and cloud computing, your company needs to consider both physical and virtual solutions—redundant physical servers, redundant virtual servers—in addition to multiple geographical locations.
How the Right Protection Saves You
As has been mentioned, it’s up to you and your company to examine and identify single points of failure—and other weak spots—in your infrastructure.
A firm grasp of where vulnerabilities most often occur (housing, hardware, ISP, software, and data) will give you a better understanding of your own system’s limitations, flaws, and gaps.
While you can’t prepare for (or predict) everything, you can eliminate single points of failure and shore up your IT environment. With an HA system that has plenty of redundancies, no single points of failure, and automatic failover, you’ll not only safeguard your revenue stream, you’ll also maintain productivity and inter-office operations, keep staff on other tasks, and get better sleep at night (you know, from less anxiety about everything coming to a grinding halt).
What We Offer at Liquid Web
At Liquid Web, we worry about catastrophic failures (preventing them, primarily, but recovering from them too) so you don’t have to. To this end, we make automatic failovers—and cluster monitoring for the shortest and most seamless transitions—a top priority. Heartbeat, our multi-node monitor, and the industry standard, keeps a close eye on the health of your systems, automatically performing failovers when needed. Heartbeat can quickly and accurately identify critical failures and seamlessly transition to an elected secondary node.
The automatic failover system in place at Liquid Web is one of many components that comprise our HA infrastructure and uptime guarantee. We offer 1000% compensation as outlined in our SLA’s 100% uptime guarantee.
What does this mean?
This means that if you experience downtime we will credit you at 10x the amount of time you were down. At Liquid Web, we also continue to operate at 99.999% (or five 9s), a gold standard for the industry—this equates to only 5.26 minutes of downtime a year, 25.9 seconds of downtime per month, and 6.05 seconds of downtime a week. Five 9s is incredibly efficient and we are proud to operate in that range. However, we are constantly striving for more efficiency, more uptime, and optimization.
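The downtime figures for five 9s follow from simple arithmetic, sketched below. The period lengths are assumptions chosen to match the numbers quoted above (a 365.25-day year, a 30-day month, a 7-day week), and `sla_credit_minutes` is a hypothetical helper that just applies the 10x multiplier; the SLA itself defines the actual terms.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(availability_percent, period_days):
    """Maximum downtime allowed in a period at a given availability level."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def sla_credit_minutes(downtime_minutes, multiplier=10):
    """Hypothetical helper: credit at 10x the time you were down."""
    return downtime_minutes * multiplier


per_year_min = downtime_seconds(99.999, 365.25) / 60  # assumed 365.25-day year
per_month_s = downtime_seconds(99.999, 30)            # assumed 30-day month
per_week_s = downtime_seconds(99.999, 7)

print(round(per_year_min, 2), "minutes per year")
print(round(per_month_s, 1), "seconds per month")
print(round(per_week_s, 2), "seconds per week")
print(sla_credit_minutes(30), "minutes credited for 30 minutes down")
```

Swapping in 99.9 or 99.99 for the availability shows how quickly the downtime budget grows with each 9 you give up.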
A Final Reminder: Failures Do Happen
Failures do happen. If Google is susceptible to a catastrophic failure, everyone is susceptible to a catastrophic failure. You can, however, mitigate the frequency and severity of catastrophic failures with a thorough accounting of your infrastructure, a shoring up of your systems, a solid and sensible recovery plan, and plenty of redundancies. Oh, and don’t forget an automatic failover system; it will save you time (and data) when you have to transition from a failing primary node to a healthy secondary node.