How Our Carrier Failover Kept Data Moving Last Week

Last Monday, web users in Europe saw their internet connections slow—some to a crawl. There were reports of connection lags with many popular sites. The connectivity issues were initially attributed to cable faults with Telia Carrier, one of the top two global IP transit providers carrying over 1 Exabyte of data per month.

The issues were big enough to lead some to speculate that a cable had been cut. Along with other service providers, we reported connectivity and packet loss issues as well. We identified the issue with our transit provider and took action to mitigate the effect on our clients.

The internet is amazing in its complexity. To make it work, it has to change—constantly. The heart of the internet is its routing system. While routers and cables move the data, it takes people to set it all up and make it all work. As a service provider, we can empathize that Telia had “one of those days”. It turns out the problem was caused by a human who had misconfigured a router, causing Europe’s data to be sent to the Far East.

We understand that our clients expect the highest levels of response time and availability. So we built the Incapsula network with no single point of failure—not only within our data centers but among our upstream service providers. Consistent with this goal is our multi-carrier architecture that’s designed to insulate our clients from glitches among our carriers.

We choose our data center locations carefully. We’ve previously written about the importance of deploying our network points of presence (PoPs) at strategic internet hubs, such as Frankfurt and London, to take advantage of peering agreements with Tier 1 providers, other internet service providers (ISPs), leading hosting providers and other major network entities. We’ve also written about important nuts-and-bolts considerations for deciding when, where, and with whom to peer.

Incapsula works with regional internet exchanges such as AMS-IX, DE-CIX, HKIX, and others to minimize latency. These exchanges sit on the network backbone and enable our PoPs to benefit from direct connections to other CDNs and Tier 1 carriers. As a result, our customers enjoy the highest levels of network performance and can provide their end users with the best possible experience.

The key to insulating our clients from the inevitable transit provider disruption is carrier monitoring, redundancy and failover. Before explaining how our monitoring works, here’s an overview of how a connection to a provider works: When establishing a connection, our provider assigns one IP address to a port on their router and another IP address to a port on our router—these are known as the “endpoints”. A BGP peering session is then established between these endpoints, and the local endpoint on our side is defined as the next hop in the BGP announcement. When establishing a transit agreement, we also order multiple ports and use them as separate “pipes”, each with its own /30 subnet. We can also bind them together into a link aggregation group (LAG) using LACP in order to get larger pipes. With this architecture, each PoP has a set of pipes from different providers, each with its own set of endpoints (/30).
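The /30 scheme above can be illustrated with a short sketch using Python’s standard `ipaddress` module: a /30 subnet has exactly two usable host addresses, one for each end of the pipe. The subnet below is drawn from RFC 5737 documentation space, not a real transit link.

```python
# Sketch of the /30 "pipe" endpoint scheme: each transit pipe is a /30
# subnet whose two usable addresses are the provider's router port and ours.
import ipaddress

def pipe_endpoints(subnet: str):
    """Return the two usable host addresses of a /30 pipe subnet:
    one for the provider's router port, one for our router port."""
    net = ipaddress.ip_network(subnet)
    if net.prefixlen != 30:
        raise ValueError("transit pipes are provisioned as /30 subnets")
    provider_end, local_end = list(net.hosts())
    return provider_end, local_end

provider_ip, local_ip = pipe_endpoints("192.0.2.0/30")
print(provider_ip, local_ip)  # 192.0.2.1 192.0.2.2
```

The local address returned here is what gets advertised as the next hop in the BGP announcement for that pipe.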

Now for the monitoring part: We use a combination of Pingdom, ThousandEyes and internally developed monitoring services to monitor both endpoints described above. Our monitoring scheme also gives us a clear view of how traffic is flowing to different locations, drawn from multiple vendors and internal systems. From the inside, we use the monitoring services on our core routers (RPM on Junos OS or IP SLA on Cisco IOS) to test the availability of internet resources—like Google’s 8.8.8.8 public DNS servers. Using multiple monitoring services allows us to direct the traffic through different interfaces and endpoints in order to test routing via different pipes and vendors. We then export the monitoring results via SNMP to our monitoring system.
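The per-pipe test can be sketched as follows. This is a minimal illustration, not our production tooling: the `probe` callable stands in for the router-side RPM/IP SLA test that sources a probe from a pipe’s local endpoint, and all names and addresses are assumptions.

```python
# Sketch: test a well-known target (e.g. 8.8.8.8) through each pipe by
# sourcing the probe from that pipe's local /30 endpoint, collecting a
# per-pipe reachability result for export to the monitoring system.
from typing import Callable, Dict

def check_pipes(pipes: Dict[str, str], target: str,
                probe: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Map each pipe name to whether `target` was reachable when the
    probe was sourced from that pipe's local endpoint address."""
    return {name: probe(src_ip, target) for name, src_ip in pipes.items()}

# Example with a stubbed probe: the pipe whose endpoint is "down" fails.
pipes = {"carrier-a-1": "192.0.2.2", "carrier-b-1": "198.51.100.2"}
down_endpoints = {"192.0.2.2"}
results = check_pipes(pipes, "8.8.8.8",
                      lambda src, dst: src not in down_endpoints)
print(results)  # {'carrier-a-1': False, 'carrier-b-1': True}
```

In practice the per-pipe results would be exported via SNMP rather than printed, but the shape of the data—one reachability verdict per pipe—is the same.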

The frequency of our network tests depends on the capabilities of the systems we use, but it can be every 30 seconds, every minute, or at longer intervals. All of these tests run in parallel on a continuous basis, allowing us to identify issues and begin investigating within seconds of their onset. When we identify network issues, we use the information our monitoring systems have gathered to tell our Behemoth device in the PoP which pipes to use. We can also take the additional step of halting BGP announcements through vendors suspected of having a major issue.
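The failover decision described above can be sketched as simple health bookkeeping over the periodic test results. The three-interval threshold below is an illustrative assumption, not our production value, and pipe names are hypothetical.

```python
# Sketch: track consecutive failed test intervals per pipe and declare a
# pipe unusable after a threshold, so traffic can be steered to the rest.
from collections import defaultdict
from typing import List

FAIL_THRESHOLD = 3  # assumed: consecutive failed intervals before failover

class PipeHealth:
    def __init__(self):
        # pipe name -> count of consecutive failed test intervals
        self.failures = defaultdict(int)

    def record(self, pipe: str, ok: bool) -> None:
        """Record one test interval's result; success resets the count."""
        self.failures[pipe] = 0 if ok else self.failures[pipe] + 1

    def usable_pipes(self) -> List[str]:
        """Pipes still below the failure threshold."""
        return [p for p, n in self.failures.items() if n < FAIL_THRESHOLD]

health = PipeHealth()
for ok in (False, False, False):   # one pipe fails three intervals in a row
    health.record("carrier-a-1", ok)
health.record("carrier-b-1", True)
print(health.usable_pipes())  # ['carrier-b-1']
```

The output of `usable_pipes()` corresponds to what the monitoring system would hand to the routing layer; withdrawing BGP announcements through a suspect vendor is the coarser-grained version of the same decision.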

The human element

In some cases, the rerouting is done automatically. In all cases, the situation is reported in real time to our NOC team. The Incapsula NOC engineer is fed a lot of data, including direct monitoring of every PoP, monitoring of pseudo sites on the network, exercising of CDN processes, monitoring of the various providers and connections described above, and monitoring of bandwidth and potential attacks. Using the information provided by the monitoring system, a NOC engineer can change how BGP announcements are made, divert traffic away from problematic pipes, and keep traffic flowing. There is no substitute for human eyes and brains when it comes to dealing with the unexpected.

In the stacked graphs below, you can see the traffic switchover from Telia to other carriers in our Zurich and Paris data centers, and then back to Telia when the incident was resolved.

[Stacked graphs: traffic by carrier in the Zurich and Paris data centers during the incident]

We understand the nature of the internet will lend itself to occasional hiccups—no service provider is immune. We’ll continue to build and manage our network to help keep our clients insulated from the inevitable glitch.