Clustering App Attacks with Machine Learning Part 1: A Walk Outside the Lab
A lot of research has been done on clustering attacks of different types using machine learning algorithms, with high rates of success. Much of it was done from the comfort of a research lab, with specific datasets and no performance limitations.
At Imperva, our research is done for the benefit of real customers, solving real problems. Datasets vary, and performance problems are important, if not critical, to avoid. We were recently tasked by our engineering team with clustering application attacks in near real-time scenarios where performance is a key factor. The requirements list was long. So were the challenges. Bottom line, we found reality punches lab statistics in the face. (Not that any actual punches were thrown. The first rule of Imperva research is you talk about Imperva research!)
With that said, in this three-part blog series we’ll share interesting insights and discuss some of the challenges we met—and overcame—as part of our research, such as:
- Applying a clustering algorithm to a stream of data
- Extracting meaningful features from limited data
- Translating different features and determining distance calculations
In this first blog post we start with the motivation for clustering attacks. We’ll discuss the data used for this task, and how we enriched it by adding meaningful features to the raw data, specifically the IP and the URL.
Why Cluster Attacks
Our goal in clustering attacks on web applications was two-fold: 1) finding interesting patterns inside the attacks, and 2) making it feasible to navigate the massive number of attacks. Clustering can help us create a “story” out of the attacks (naming them based on behavior), making them easier for a human observer to understand and analyze. For example, when seeing a cluster called “SQL injection attack from China using a Havij scanner”, the story behind it is much clearer than analyzing the hundreds of attacks this cluster contains and trying to find the common ground between them.
The Raw Data
The raw data that entered our algorithm was an HTTP request (see Figure 1) with additional metadata fields added by the WAF that stopped or alerted on the attack. These extra fields include the time the request was received, the IP of the attacker, the attack that was found in the request and sometimes additional information about the attacked application.
Figure 1: HTTP request – Each request contains a request line with method, URL and protocol, the headers of the requests and the parameters
We can’t just feed this data as is into a machine learning algorithm and expect it to cluster correctly. We first need to convert the data into a structured object which contains all the fields that are interesting to us. After that we need to enrich the data to get as much meaningful information out of it as we can in order to improve our clustering results.
The goal of the enrichment process is to extract more meaningful features from the raw data. This way we can feed our algorithm more features, which hopefully will give better results. In this phase we extract features which may be correlated. In general, it’s best practice to reduce correlation between different features before feeding the data into a machine learning algorithm. But correlation isn’t a priority at this stage; we’ll deal with it later on. Here our goal is only to extract as many meaningful features as we can.
Almost every part of the raw data can be structured as a feature and enriched into other features. For example, the headers of the HTTP request may imply which tool was used to attack, and the type of attack that was found may imply which system the attacker was trying to target. Here we’ll dive into two important features, the IP and the URL, and how we can extract additional features out of them, although every other feature in the data also offers many possibilities for enrichment.
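To make the idea of a structured object concrete, here is a minimal Python sketch. The ParsedRequest class, its field names and the parse_http_request function are our own illustration and not the production code; the sketch also assumes the body, if present, is form-encoded.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlsplit


@dataclass
class ParsedRequest:
    """Hypothetical structured form of one raw HTTP request."""
    method: str
    url: str
    protocol: str
    headers: dict = field(default_factory=dict)
    params: dict = field(default_factory=dict)


def parse_http_request(raw: str) -> ParsedRequest:
    # Split the head (request line + headers) from the body.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    lines = head.split("\n")

    # Request line: method, URL and protocol.
    method, url, protocol = lines[0].split(" ", 2)

    # Header names are lower-cased so later lookups are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    # Parameters from the query string, then from the (assumed form-encoded) body.
    params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
    params.update(parse_qsl(body.strip(), keep_blank_values=True))

    return ParsedRequest(method, url, protocol, headers, params)
```

The metadata fields added by the WAF (time, source IP, attack type) would sit next to these fields in the same object.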
All About the IP
In each request we receive the source IP, that is, the IP from which the attack originated. This is a very important feature as it indicates the origin of the attack. The attacker may use a proxy or an anonymity framework to hide, or may reveal their true origin. Either way, the features we can derive from the IP are important to us and can enrich the data significantly.
Class A, B and C
Say we have an IP of the form 157.42.65.x. Its class C is 157.42.65.* (all the IPs that start with 157.42.65), and each class C contains 256 different IP addresses. In the same manner its class B is 157.42.*.*, which contains 256 to the power of 2 IPs, and its class A is 157.*.*.*, which contains 256 to the power of 3 IPs. In prefix notation these are the /24, /16 and /8 blocks around the address. Two attacks originating from the same class of IPs may indicate a connection between them, and the smaller the class, the stronger the connection. In our data we saw many attacks with the same attributes originating from different IPs in the same class.
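As a rough sketch, the three class features can be derived from the address with Python's standard ipaddress module. The function names are ours, and the addresses in the usage note come from ranges reserved for documentation, not from our data.

```python
import ipaddress
from typing import Optional


def ip_class_features(ip: str) -> dict:
    """Return the class A, B and C prefixes (first 1, 2 and 3 octets) of an IPv4 address."""
    # IPv4Address validates the input and raises ValueError on malformed addresses.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return {
        "class_a": ".".join(octets[:1]),
        "class_b": ".".join(octets[:2]),
        "class_c": ".".join(octets[:3]),
    }


def smallest_shared_class(ip1: str, ip2: str) -> Optional[str]:
    """Return the smallest class shared by two IPs: "C", "B", "A", or None."""
    f1, f2 = ip_class_features(ip1), ip_class_features(ip2)
    # Check from the smallest (most specific) class to the largest.
    for label, key in (("C", "class_c"), ("B", "class_b"), ("A", "class_a")):
        if f1[key] == f2[key]:
            return label
    return None
```

For example, smallest_shared_class("192.0.2.10", "192.0.2.200") returns "C", while two addresses that differ in the first octet return None.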
The IP corresponds to a geographic location, whose extraction requires the use of a GIS or geolocation source. This feature enables us to find similarity between IPs, even from different classes. The geolocation may include country, subdivision or region (such as state or county), city, and geographic coordinates (see Figure 2).
Figure 2: IP as geolocation taken from an online database
In our experience, using a combination of country and subdivision gives the best results, while using the city or the coordinates is too fine-grained and loses the bigger picture.
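A minimal sketch of that choice, assuming the geolocation lookup has already returned a record as a plain dictionary. The key names ("country", "subdivision") are hypothetical and depend on the geolocation source used.

```python
from typing import Optional


def geo_feature(record: dict) -> Optional[str]:
    """Collapse a geolocation record to a country/subdivision feature.

    City and coordinates are deliberately ignored as too fine-grained.
    """
    country = record.get("country")
    if not country:
        return None  # no usable geolocation for this IP
    subdivision = record.get("subdivision")
    # Fall back to the country alone when the subdivision is unknown.
    return f"{country}/{subdivision}" if subdivision else country
```

Two IPs from different classes then compare as similar whenever this feature is equal.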
Many attackers launch their attacks from anonymous origins, or use proxies. This practice enables them to cover their tracks, and launch attacks without being identified by their target. An attacker from the US can launch an attack using Tor and identify himself as if he is from Romania, and a few seconds later launch another attack identifying himself as being from Argentina. The geolocation of an IP that launches attacks using some sort of anonymity framework is usually not important because it doesn’t give any information about the real origin of the attacker. What does matter in our case is whether the attacker uses an anonymity framework at all, and if so, which kind. In our experience attackers who use an anonymity framework and change their geolocation between attacks tend to keep to the same framework. Hence this is also an important feature that can be extracted from the IP. This feature can be extracted using the “X-Forwarded-For” header for proxies or using outside sources, like the Tor network or anonymous proxy databases.
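One way to sketch this feature in Python is shown below. The two address sets stand in for the outside sources mentioned above (a Tor exit node list and an anonymous proxy database); how they are obtained and refreshed is outside this sketch, and the label names are our own.

```python
def anonymity_feature(source_ip: str, headers: dict,
                      tor_exit_nodes: set, anonymous_proxies: set) -> str:
    """Label which kind of anonymity framework, if any, a request came through."""
    # Outside sources first: they identify the specific framework.
    if source_ip in tor_exit_nodes:
        return "tor"
    if source_ip in anonymous_proxies:
        return "anonymous_proxy"
    # A non-empty X-Forwarded-For header suggests the request passed through a proxy.
    for name, value in headers.items():
        if name.lower() == "x-forwarded-for" and value.strip():
            return "proxy"
    return "none"
```

The returned label is the feature itself: two attacks with different geolocations but the same label can still be treated as related.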
The URL is Greater Than the Sum of its Parts
The attacked URL indicates the target of the attack, that is, which page of the web application the attacker targeted. An attack on a login page has different features than an attack on a search page. Also, the URL may contain hints on the resources that were attacked. See Figure 3 for the parts a URL may contain. These are some of the features we can extract from the URL:
Figure 3: Different parts of the URL – protocol, domain name, directory/folder, web page and file extension
The resource extension is the final part of the URL, after the last “.” (dot), and it indicates which kind of resource the URL points to. For example, a URL may end with “.jpg” or “.png”, which indicates that it points to a picture, or it may end with “.php” or “.aspx”, which indicates the server-side technology behind the page. Two attacks targeting different URLs but with the same resource extension may indicate a scan of the site for vulnerabilities, especially if the resource extension is not a common one for that application.
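Extracting the extension takes a few lines with Python's standard library. This is a sketch under our own naming; it looks only at the last segment of the path, so dots in the domain name or the query string are ignored.

```python
import posixpath
from urllib.parse import urlsplit


def resource_extension(url: str) -> str:
    """Return the lower-cased resource extension of a URL, or "" if there is none."""
    path = urlsplit(url).path
    # Only the last path segment can carry the resource extension.
    last_segment = path.rsplit("/", 1)[-1]
    _, ext = posixpath.splitext(last_segment)
    return ext.lstrip(".").lower()
```

For example, resource_extension("http://example.com/images/logo.PNG?x=1") returns "png", and a URL such as "/news/economics/article" returns an empty string.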
Many web applications contain different URLs but with the same directories or patterns. For example, a news site can put every article about economics in the URL prefix “/news/economics/[article-name]”. Finding the patterns of these URLs, and especially finding attacks on different URLs with the same patterns, may indicate a phenomenon that our algorithm is trying to discover. This way we may discover a scraping attempt, even when each attack comes from a different IP (maybe even a different country) and with large time gaps between attacks.
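A simple form of URL pattern is the directory prefix, that is, everything up to the last “/” of the path. The sketch below extracts it and groups attacked URLs by it; the function names are ours, and real pattern discovery would likely need more than a fixed prefix.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def directory_prefix(url: str) -> str:
    """Return the directory part of the URL path, always ending with "/"."""
    path = urlsplit(url).path
    # Everything before the last "/" is the directory; the rest is the page.
    return path.rsplit("/", 1)[0] + "/"


def group_by_prefix(urls: list) -> dict:
    """Group URLs that share the same directory prefix."""
    groups = defaultdict(list)
    for url in urls:
        groups[directory_prefix(url)].append(url)
    return dict(groups)
```

Attacks whose URLs land in the same group share a pattern even when their source IPs and timestamps have nothing in common.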
Injecting malicious code inside the URL is very common, and was seen a lot in our data. It is possible to clean the code injected in the URL using heuristic methods, like looking for special characters that should not appear in the URL and are used to delimit scripts. See Figure 4 for example.
By cleaning this code we may reveal a pattern in the attacked URLs. For example, attackers may try to inject malicious code into the same URL, making minor changes to the code on every attack. The result may look like a set of completely different URLs, but the cleaned URLs reveal the pattern.
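One possible version of such a heuristic is sketched below. The set of suspicious characters is an assumption chosen for illustration, not the list we used; the sketch drops the query string, decodes percent-encoding, and cuts the URL at the first suspicious character.

```python
import re
from urllib.parse import unquote

# Characters assumed not to appear in a legitimate URL path and
# commonly used to delimit injected scripts or SQL (illustrative list).
SUSPICIOUS = re.compile(r"[<>'\"`;(){}\s]")


def clean_url(url: str) -> str:
    """Return the URL path with the query string and any injected code removed."""
    # Drop the query string, then decode percent-encoded characters.
    path = unquote(url.split("?", 1)[0])
    match = SUSPICIOUS.search(path)
    # Keep only the part before the first suspicious character.
    return path[:match.start()] if match else path
```

With this sketch, "/products/view.php%27%20OR%201=1--" and "/products/view.php<script>alert(1)</script>" both clean to "/products/view.php", so the two attacks can be matched on the same target.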
Next Up: Calculating Distance
In this post we discussed the data we used to cluster attacks and how to extract meaningful features from it. In the next post we’ll discuss one of the core stages of our algorithm – how to measure the distance between features. Calculating distance is not always an easy task, especially with complex features like URL or IP. Our final goal will be to cluster like attacks together, so we’ll need to find a way to determine when two attacks are similar based on the extracted features.