Why a configuration management system was a must for our network, and how we chose SaltStack
When we planned and designed the network automation at Imperva Cloud, we split our automation into three separate systems, each with a different set of requirements:
- Infrastructure configuration – managing the configuration of the network infrastructure in our cloud. This includes configuration for connectivity with our service provider, networking for the cloud services located inside each PoP, and many other general-purpose settings (e.g. login permissions, SNMP, flows, Syslog and more). Because changes to our infrastructure are predictable, we usually have adequate time to prepare before we need to apply them in production. The requirements for this system are a well-defined, orderly process that includes source control, review, CI testing, a deployment procedure, and the ability to roll back.
- Product configuration – the configuration that belongs to our customers, including BGP session definitions, cross-connect configuration, tunnels toward the customers, and more. Here, the customer expects the configuration they enter to be applied and working within a few minutes at most.
- Ad-hoc configuration – automation that responds to network events, such as a service provider going down, a DDoS attack that congests our pipes toward an internet provider, or an interface that starts flapping. In this case, the response to the event should happen in a matter of seconds, or else it would impact our network and our customers.
In this post, I’ll focus only on the first part – how we chose to manage our network infrastructure configuration in the cloud. I’ll explain why it’s become necessary to automate our configuration deployment and why we chose SaltStack to do it.
In the second part I’ll get a little bit more into the technical details – the structure of our configuration and the toolset we used for modeling, automating, reviewing, deploying and validating the network configuration in our devices.
The need – ability to scale fast
For a few years, the configuration of our network devices was managed manually and, at first, this was sufficient for our needs. The Imperva network (Incapsula as it was then) was small – we had only a handful of PoPs and the number of services in each of these wasn’t particularly large. At that time you only had to reach out to Dima, our network architect, with any questions, concerns, or requests related to the network.
But over the years the network began to grow, both in the number of PoPs and in the number of services inside each PoP. As part of the expansion, the network engineering team grew larger, and it became much harder to keep all that knowledge in one place. Even Dima was showing signs of congestion.
Over time the configuration piled up and accumulated many pieces that seemed irrelevant. Deleting them was too scary, though: no one could remember why they were there in the first place, so it was hard to predict whether removing them would break something. The adage "if it's not broken, don't fix it" was being used more and more frequently. In addition, deploying a new feature had become a very complex and sensitive task: it required logging into devices across production and adding the configuration manually, and devices could sometimes be missed altogether during the process.
Other problems we encountered along the way included:
- Copy-and-paste culture – theoretically speaking, a significant part of a configuration is common to many of the appliances, so it makes sense to copy-and-paste those common sections. That quickly proved to be problematic, however, when it was done so frequently, and often without care.
- Hard-to-create automation – when configuration was set manually it was very hard to enforce a convention. And without a convention, it was virtually impossible to create automation.
- Hard to achieve a high-level view of our network – when the single source of truth for our networks was the de-facto configuration on the devices, getting a high-level view involved fetching and analyzing many pieces of configuration from across production.
The requirements we defined for the system
In order to resolve the problems mentioned above, we started to design an automated system for managing our configuration, defining the following requirements for our system:
- Every change should be documented – we need the ability to see who made a change, when, and why.
- Any part of the configuration that’s identical across multiple devices should be written only once.
- Deployment of a cross-cutting change should happen quickly but safely – that means we should be able to monitor the deployment process and, if something breaks, roll back with a single click.
- All our network configuration should be defined in a single code repository that should be our source of truth.
- The system should be able to run in parallel alongside other systems that control other aspects of our network (as defined above).
- Logging in to a device by hand should happen only in case of an emergency.
So why Salt?
After defining the requirements, we started searching for a tool that would meet them. In recent years, the availability of open-source tools and libraries for managing network-device configuration has improved significantly – we came across several Python libraries such as PyEZ, CiscoConfParse, NAPALM and netmiko, as well as network-device support in configuration management (CM) and automation tools like Salt, Ansible, and Nornir.
After carrying out a small market survey we decided to go with SaltStack. Its design is outside the scope of this blog post (even though I like it very much), partly because there's an abundance of reading material about the tool. Instead, I'll describe some of Salt's traits that met our requirements really well – and led us to choose it as our CM tool.
- Ability to manage the configuration in a convenient and flexible way – the common way to manage configuration data in Salt is through its pillar system, whose files pass through a flexible rendering pipeline (Jinja-templated YAML by default). Pillars enabled us to easily build a structured configuration that met the requirement of writing configuration shared between multiple devices only once. (In the technical part of the article I'll elaborate on our pillar structure.)
- Ability to manage everything with Git – since one of our requirements was to record and audit each configuration change, the option to work with a version control system like Git was perfect for us. Every change is documented via its commit message, and the blame feature tells us when a change was introduced and why. The Git repo became the single source of truth and, by writing a simple Jenkins job, we could validate that the configuration on each device matched the configuration described by our repo. Git's branching model also gave us a good platform for managing reviews and deployments.
- Fast deployment to large numbers of devices – in Salt, each device is managed by a small process called a proxy minion, responsible for maintaining the connection with the device. The proxy minion receives orders from the main component – the salt-master – which communicates in parallel with many minions over a ZeroMQ message bus. Because the proxy minion sits close to the device, the network latency of the configuration commands it sends over NETCONF is low. The combination of these traits lets us run commands on large numbers of network devices and get a fast response.
- NAPALM module and the network-to-code community – NAPALM is a Python library that connects to multiple types of network devices in a vendor-agnostic manner. Salt ships many modules written on top of NAPALM, which makes it very convenient to use with network devices. In addition, there's an active community for SaltStack in the NetworktoCode Slack workspace, which supported us during the development phase.
- The flexibility of Salt – in Salt, almost everything is pluggable, including filesystems, modules, states, renderers, database communication and more. So whenever we needed something that wasn't implemented by the Salt team, we could just implement a custom Python plugin and drop the code into the appropriate folder to be used as part of the system. From these plugins we could also call other Salt modules, which made them pretty easy to write.
Continue reading about the technical aspect of the implementation and design of our system in Adding Some Salt to Our Network – Part 2.