Frequent software updates are one of the cornerstones of the Imperva Incapsula service. We use DevOps practices to roll out a constant stream of new features and improvements. Rapid network-wide deployment also lets us respond quickly to threats like Heartbleed by patching the entire network.
Deploying each of these updates is a monumental task, involving hundreds of servers spread across five continents. Adding to the complexity, these servers perform different roles, run on very different hardware and software, and are managed by different teams within Incapsula.
A lot of hard work goes on behind the scenes to maintain our massive, constantly-evolving network with zero disruption to our customers, while ensuring that the rollout process is conducted at DevOps speed.
Here’s an inside look at the agile methodology we use to deploy updates across our global network.
Gradual Rollout Methodology
The Change Control board, which comprises representatives from our R&D, operations, product, support and security teams, determines the updates and features for the weekly production rollout. Once the scope of the rollout is finalized, it goes to QA for final approval before being released.
A new production version is deployed through a gradual, structured process (described below). The process typically takes four to five days to complete and always begins with a server at one of the smaller data centers. This makes it easy to halt an update or isolate an issue if we encounter a problem.
After each step of the process, we verify the results of the update. Once the update is approved, it is propagated to a larger number of servers.
We roll out a new production update to all components on the Incapsula network on a weekly basis by going through the following steps:
- Step 0: The update is deployed on one to two servers at a low-traffic PoP. The objective of this preliminary step is to make sure the update doesn’t conflict with the existing Incapsula environment. The deployment is monitored for half a day prior to approval.
- Step 1: The deployment is now expanded to an additional 5-10 servers. This gives us a much better representation of the entire network and allows us to see how the updated system is responding to a broad sampling of customer scenarios. It also lets us verify that traffic trends and metrics are “normal”. For instance, a sudden spike in challenges or change in caching performance will trigger an investigation. In this step, the servers are monitored for one day prior to approval.
- Steps 2-4: We continue to ramp up the deployment during these steps, each of which includes dozens of servers. The goal is to gradually increase the number of use cases and continue to check for red flags in the system metrics. This expansion is done in three steps to minimize risk. Each step is monitored for up to a day prior to moving on to the next step.
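The steps above can be sketched as a staged plan with an approval gate between stages. This is a minimal illustration rather than Incapsula's actual tooling: `Stage`, `WEEKLY_PLAN`, `metrics_look_normal`, `run_rollout`, the 20% tolerance and the figure of 36 servers standing in for "dozens" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    max_servers: int   # upper bound on servers added in this stage
    soak_hours: int    # monitoring window before approval


# Steps 0 and 1 use the figures from the text. The 36 servers for
# steps 2-4 are an assumed stand-in for "dozens".
WEEKLY_PLAN: List[Stage] = [
    Stage("step 0", 2, 12),
    Stage("step 1", 10, 24),
    Stage("step 2", 36, 24),
    Stage("step 3", 36, 24),
    Stage("step 4", 36, 24),
]


def metrics_look_normal(
    baseline: Dict[str, float],
    observed: Dict[str, float],
    tolerance: float = 0.2,  # assumed relative tolerance
) -> bool:
    """Return True if every baseline metric was observed and stayed
    within `tolerance` (relative) of its baseline value."""
    for name, expected in baseline.items():
        actual = observed.get(name)
        if actual is None:
            return False
        if abs(actual - expected) > tolerance * abs(expected):
            return False
    return True


def run_rollout(
    plan: List[Stage],
    deploy: Callable[[Stage], None],
    approve: Callable[[Stage], bool],
) -> List[str]:
    """Deploy stage by stage and stop at the first stage that is not
    approved. Returns the names of the approved stages."""
    approved: List[str] = []
    for stage in plan:
        deploy(stage)
        if not approve(stage):
            break  # halt here; later stages are never touched
        approved.append(stage.name)
    return approved
```

With the assumed monitoring windows, the plan soaks for 108 hours in total (4.5 days), in line with the four to five days mentioned earlier. A check such as `metrics_look_normal` would flag a sudden spike in challenges, which in practice triggers an investigation rather than an automatic decision.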
At each step, the R&D team(s) responsible for the updated component(s) approves the deployment. For example, if an update affects both the CDN proxy server and the Behemoth server, each R&D team needs to approve the update for its respective component. We conduct separate updates for each component to remove any dependencies between the teams.
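One way to picture this per-component approval is the sketch below. It assumes that each component's update advances as soon as its own team has signed off, which is one reading of "separate updates for each component". The function name and the component identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def components_cleared(
    affected: Iterable[str],
    approvals: Dict[str, bool],
) -> List[str]:
    """Return the affected components whose own R&D team has approved.

    Each component is judged independently, so a pending approval for
    one component does not hold back another. A missing entry in
    `approvals` counts as "not yet approved".
    """
    return sorted(c for c in affected if approvals.get(c, False))


# Hypothetical identifiers for the two components named in the text.
affected = ["cdn_proxy", "behemoth"]
print(components_cleared(affected, {"cdn_proxy": True}))
```

Here only the proxy update would move to the next step, while the Behemoth update waits for its own team.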
In addition to these regular updates, there are urgent scenarios that cannot wait for the weekly update. These include vulnerability patches, critical bug fixes and downgrades, all of which must be rolled out immediately.
In the case of a newly-disclosed vulnerability, such as Shellshock or Heartbleed, hackers start to run scripts to exploit it within a few hours of disclosure. In these scenarios, rapid patching is crucial for securing customer websites against opportunistic hackers.
We use the same deployment methodology here, but with much shorter intervals between each of the steps. Since these types of patches can be as localized as one line of code, there is less risk of collateral impact and less need for checking potential customer scenarios.
In most cases, rapid patches are rolled out across the entire network within four to five hours. Downgrades can be deployed in minutes, since the previous version that we're rolling back to has already been approved.
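To illustrate what "the same methodology with much shorter intervals" could look like, the sketch below shrinks every monitoring window by the same factor so that the whole plan fits a target duration. Proportional scaling is an assumption made for the example, as are the function name and the weekly windows (12 hours for step 0, 24 hours for each later step).

```python
from typing import List, Tuple

# (step name, monitoring window in hours); assumed weekly values.
WEEKLY_WINDOWS: List[Tuple[str, float]] = [
    ("step 0", 12),
    ("step 1", 24),
    ("step 2", 24),
    ("step 3", 24),
    ("step 4", 24),
]


def compress_plan(
    plan: List[Tuple[str, float]],
    target_total_hours: float,
) -> List[Tuple[str, float]]:
    """Keep the same steps in the same order, but scale every
    monitoring window so the windows sum to `target_total_hours`."""
    total = sum(hours for _, hours in plan)
    if total <= 0:
        raise ValueError("plan must contain a positive amount of monitoring time")
    scale = target_total_hours / total
    return [(name, hours * scale) for name, hours in plan]


# Squeeze the weekly plan into a 4.5-hour emergency rollout.
for name, hours in compress_plan(WEEKLY_WINDOWS, 4.5):
    print(f"{name}: monitor for about {hours * 60:.0f} minutes")
```

Under these assumptions, step 0 is monitored for roughly half an hour and each later step for roughly an hour, while the order of the steps stays unchanged.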
Looking ahead, the Incapsula DevOps team is busy thinking about new ways to streamline our internal processes and further reduce potential risk to our customers.
One possibility we’re looking into is further automating the deployment process. While the software updates themselves are already automated, our team is examining tools that will let us automate additional facets of the process.
Another option being explored is the implementation of scheduled deployments using a “follow the moon” approach. Such an option would allow our 24x7 team to perform pre-scheduled rollouts on weekends, at night and during other non-working hours, further reducing the chance of disruption should an issue be discovered.
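A "follow the moon" scheduler could, for example, pick the data centers whose local time currently falls inside a night-time window. The sketch below is only an illustration of that idea: the function name, the PoP names, the fixed UTC offsets (which ignore daylight saving time) and the 01:00 to 05:00 window are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def pops_in_night_window(
    now_utc: datetime,
    pop_utc_offsets: Dict[str, int],
    start_hour: int = 1,   # assumed window start, local time
    end_hour: int = 5,     # assumed window end (exclusive), local time
) -> List[str]:
    """Return the PoPs whose local hour is inside [start_hour, end_hour)."""
    eligible = []
    for pop, offset_hours in pop_utc_offsets.items():
        local_time = now_utc + timedelta(hours=offset_hours)
        if start_hour <= local_time.hour < end_hour:
            eligible.append(pop)
    return sorted(eligible)


# Hypothetical PoPs with fixed UTC offsets in hours.
POPS = {"tokyo": 9, "london": 0, "new_york": -5}

now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
print(pops_in_night_window(now, POPS))  # 18:00 UTC is 03:00 in Tokyo
```

Running the same check as the clock advances would move the eligible set westward, so each region is updated during its own night.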
Have any questions about our system updates and rollout process? Leave us a comment.