Incapsula Management Console Event on August 29: Root Cause Analysis

On August 29, starting at 12:55 UTC, the Incapsula management console was unavailable for close to eight hours, until 20:45 UTC. The Incapsula service itself continued to operate normally, including security mitigation and all CDN functionality, but our clients were unable to log in to their Incapsula consoles or use functionality that depends on GUI or API access. This included viewing their network and application layer traffic, making changes to their accounts, and using the support portal to communicate with our support team.

Because the management console was unavailable, our centrally managed Netflow monitoring service was not available for some Infrastructure Protection customers. Our support team contacted these customers by email and phone and asked them to monitor their traffic manually and to call support for immediate escalation and manual mitigation if they noticed any suspicious DDoS activity.

In some cases, centrally managed load balancing functionality, such as failover between master and DR, may have been temporarily unavailable.

This issue stemmed from human error during a database migration that was part of a routine database administration task. While additional database slave servers were being set up in our disaster recovery data center, a “drop database” command was executed to clean up an old copy of the database. The command was run on a server that had recently served as the master and was expected to be disconnected from the cluster. It was in fact still connected, so the command propagated via database replication to the database cluster in the primary data center, causing a system-wide failure of the Incapsula management console.
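To make the failure mode concrete, here is a minimal Python sketch of a pre-flight check that refuses a destructive statement while the target server still has replicas attached. It is an illustration only and does not describe Incapsula's actual tooling: ServerState, check_statement, the list of destructive prefixes, and the server name are all hypothetical, and a real check would read the replica count from the database's own replication status rather than from a hand-built object.

```python
from dataclasses import dataclass

# Statement prefixes treated as destructive in this sketch.
# This is an assumption for illustration, not an exhaustive list.
DESTRUCTIVE_PREFIXES = ("drop database", "drop table", "truncate")


@dataclass
class ServerState:
    """Hypothetical snapshot of one database server's replication status."""
    name: str
    connected_replicas: int  # replicas currently receiving changes from this server


class UnsafeStatementError(RuntimeError):
    """Raised when a destructive statement could propagate through replication."""


def is_destructive(sql: str) -> bool:
    # Collapse whitespace and ignore case so "  DROP   DATABASE x" is still caught.
    normalized = " ".join(sql.lower().split())
    return normalized.startswith(DESTRUCTIVE_PREFIXES)


def check_statement(sql: str, state: ServerState) -> str:
    """Return the statement unchanged if it is safe to run on this server."""
    if is_destructive(sql) and state.connected_replicas > 0:
        raise UnsafeStatementError(
            f"{state.name} still has {state.connected_replicas} connected "
            "replica(s); refusing to run a destructive statement"
        )
    return sql


if __name__ == "__main__":
    still_connected = ServerState(name="dr-former-master", connected_replicas=2)
    try:
        check_statement("DROP DATABASE old_copy", still_connected)
    except UnsafeStatementError as exc:
        print(f"blocked: {exc}")
```

The check verifies the assumption that the server is isolated instead of trusting it, which is the assumption that did not hold in this incident. A guard of this kind only reduces the risk and is not a substitute for restricted permissions and tested backups.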

Because the command reached the master database server, it was replicated to all of its slaves in the primary data center. As a result, the R&D and Operations teams had to restore the database from backup instead of promoting a slave server to master, which delayed full resolution of the issue by a few hours. Users were kept updated about the incident via the Incapsula status page.

The teams completed the restore from backup and testing, resolving the console outage at 20:45 UTC with no data loss detected. All users are now able to access the management console. To protect against a recurrence, we are reviewing our database administration policies and database user permissions, and we are reviewing and improving our database cluster topology to ensure that slave servers are available for promotion when needed.