In the past two years, enterprises have created more data than was created in all of prior human history. Securing data at this scale requires a rethink of how we grant and revoke access to sensitive files and, more importantly, how we identify and track the inevitable anomalies in access behavior to understand which ones are dangerous.
In this post, we’ll explore innovative ways of applying machine learning to answer data security’s most crucial question: “Is this behavior normal, and if it’s not, is it OK?” We’ll shed some light on how Imperva technology goes beyond surface-level insight to snuff out insider threats by applying pattern recognition algorithms to every SQL query by every user.
What is Machine Learning?
To begin, let’s review the definition of machine learning. Machine learning is a type of artificial intelligence that enables computers to detect patterns and establish baseline behavior using algorithms that learn through training or observation. It can process and analyze data at a scale far beyond what humans can review manually, and it produces analysis that humans can comprehend.
Machine learning tasks fall into two main categories: supervised and unsupervised. With supervised learning, the algorithm is trained on examples whose correct outputs (labels) are already known. With unsupervised learning, computers “teach themselves” by finding structure in unlabeled data, without explicit input from a human as to what the patterns might look like. Understanding the problem domain (e.g., facial recognition, likes/preferences, fraud detection) is key to choosing correctly between the two.
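The difference between the two categories can be sketched in a few lines. This is a toy illustration only, not anything taken from a real product: the feature (rows read per session), the labels, and the function names are all assumptions made for the example. The supervised half learns one average per human-supplied label; the unsupervised half splits the same numbers into two groups without ever seeing a label.

```python
from statistics import mean

# Hypothetical training data: rows read per session, with a human-supplied label.
LABELED = [(12, "normal"), (15, "normal"), (9, "normal"), (480, "risky"), (510, "risky")]

def train_supervised(examples):
    """Supervised: labels are known, so learn one centroid (average) per label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Assign the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def cluster_unsupervised(values, iterations=10):
    """Unsupervised: no labels; split values into two groups (1-D two-means)."""
    low, high = min(values), max(values)
    near_low, near_high = list(values), []
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values identical; nothing to split
            break
        low, high = mean(near_low), mean(near_high)
    return sorted(near_low), sorted(near_high)

centroids = train_supervised(LABELED)
print(predict(centroids, 400))
print(cluster_unsupervised([value for value, _ in LABELED]))
```

Note that the unsupervised version finds the two groups but cannot name them; deciding that the high-volume group is the “risky” one still takes a person who understands the domain.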
Machine Learning in Data Security
Traditional security controls are typically based on a model of least-privilege access. This model is theoretically sound but seldom works in practice at scale. Manually determining who should have access to what is daunting all by itself. Add the next step of sifting through access logs to identify potential bad actors, and even a small enterprise environment of 50 to 200 data repositories would overwhelm an IT department of 20, much less a lean team of two.
And in a large enterprise, the number of data repositories can reach 10,000 and beyond.
Enter the promise of machine learning to ease the load. And the good news is that with machine learning, more information means more fuel for learning. The more inputs the system has to learn from, the more it can apply that learning to produce higher quality results.
However, effectively applying machine learning still requires a human brain – someone who thoroughly understands the problem they’re trying to solve, and who can apply the appropriate algorithms to that problem domain. The algorithms aren’t one-size-fits-all, and enterprise architectures are not all the same. Truly innovative machine learning applications go one step further.
Using Pattern Recognition to Learn Contextually
In the security space, we’ve already seen simple machine learning applications that process log files to interpret patterns of access behavior. However, creating a behavior model based simply on who typically logs in to which resource (e.g., login/logout of a database) at what time is not enough. In the unique problem domain of data security, the real need is to scale the early identification of potentially malicious data abuse. This requires a deeper dive into the exact data that is being accessed (e.g., after login, what records are and are not accessed).
Machine learning can be used to automate the previously manual process of establishing a baseline of data access patterns. And, using pattern-recognition, machine learning can examine unique usage to identify what’s normal for individuals in specific peer groups within an organization. Furthermore, machine learning can dynamically learn true working peer groups, rather than relying upon static “org chart” structures that rarely reflect how people actually work.
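As a rough sketch of what “learning a baseline” and “learning peer groups” could look like, the snippet below builds a per-user set of accessed tables from an access log and then groups users whose sets overlap. It is an illustration under stated assumptions, not Imperva’s algorithm: the event format, the Jaccard similarity measure, the 0.5 threshold, and every name in the code are hypothetical.

```python
from collections import defaultdict

def build_baselines(events):
    """Map each user to the set of tables they have accessed (their baseline)."""
    baselines = defaultdict(set)
    for user, table in events:
        baselines[user].add(table)
    return dict(baselines)

def jaccard(a, b):
    """Overlap between two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def peer_groups(baselines, threshold=0.5):
    """Group users by behavior: connect users whose baselines are similar enough."""
    unvisited = set(baselines)
    groups = []
    for user in sorted(baselines):
        if user not in unvisited:
            continue
        unvisited.discard(user)
        group, frontier = {user}, [user]
        while frontier:
            current = frontier.pop()
            similar = {
                other for other in unvisited
                if jaccard(baselines[current], baselines[other]) >= threshold
            }
            unvisited -= similar
            group |= similar
            frontier.extend(similar)
        groups.append(sorted(group))
    return groups

# Hypothetical access log: (user, table) pairs.
EVENTS = [
    ("alice", "orders"), ("alice", "customers"),
    ("bob", "orders"), ("bob", "customers"), ("bob", "invoices"),
    ("carol", "payroll"), ("carol", "employees"),
    ("dave", "payroll"), ("dave", "employees"), ("dave", "benefits"),
]
print(peer_groups(build_baselines(EVENTS)))
```

The groups here come entirely from observed behavior; no department or reporting line appears anywhere in the input, which is the point of preferring learned peer groups over a static org chart.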
Over the past few years researchers have made significant strides in applying AI and machine learning to pattern-recognition problems. Facebook’s machine learning application, for example, can recognize not only what’s in an image but also the context of what’s going on in the scene and whether or not it contains any other known entities or landmarks. Massive amounts of information have been examined to determine this context and to recognize patterns that suggest ‘this is a cat,’ or ‘this is a wedding.’
Similarly, by applying the appropriate machine learning algorithms to a data set consisting of every SQL query by every unique user, Imperva developers have created a system that examines usage patterns of peer-group segments. This approach goes beyond logins and session times to recognize and establish normal user data access behaviors specific to that user’s organization and to that organization’s architecture. Once normal data access patterns are identified, it becomes easy to filter potentially risky behaviors that could compromise enterprise data (Figure 1).
Figure 1: Imperva’s process of detecting suspicious activity using dynamic peer group analysis
For security teams, the crucial questions are: what did an individual access, and does that make sense? The goal is to create a list of events that a reasonably-sized SOC team can investigate. To be usable, the resulting dataset must be:
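Once a baseline exists, filtering for risky behavior can be as simple as asking whether an access falls outside what the user and their peers normally touch. The sketch below is a minimal illustration of that idea and not the product’s actual logic; the data structures, names, and sample values are assumptions.

```python
def flag_unusual_access(user, accessed, baselines, peer_group):
    """Return tables the user touched that neither they nor their peers normally access."""
    normal = set(baselines.get(user, set()))
    for peer in peer_group:
        normal |= baselines.get(peer, set())
    return sorted(set(accessed) - normal)

# Hypothetical baselines learned earlier, and a hypothetical peer group.
BASELINES = {
    "alice": {"orders", "customers"},
    "bob": {"orders", "customers", "invoices"},
}
PEERS = ["alice", "bob"]

# "invoices" is new for alice but normal for her peer bob, so only "payroll" is flagged.
print(flag_unusual_access("alice", ["orders", "invoices", "payroll"], BASELINES, PEERS))
```

Comparing against the peer group rather than the individual alone is what keeps the noise down: a first-time access that is routine for colleagues is not reported.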
Finite – so the team can easily digest the information
Accurate – to instill confidence it includes appropriate events, without excess noise
Context-rich – so any subsequent investigations won’t need to start from scratch
Imperva developers achieve this by combining deep knowledge of machine learning algorithms with specific expertise in what constitutes improper data access behavior for different types of users. This information is processed using pattern-recognition algorithms similar to those used in the Facebook image recognition example. Instead of identifying images, however, these algorithms identify contextual data access patterns across tens of thousands of employee accounts, and across billions of individual data accesses a day. By automatically identifying groups based on behavior, file access permissions can be accurately defined for each user and dynamically removed based on changes in user interaction with enterprise files over time.
We documented a few real examples from customers who allowed us to access their file logs containing highly granular data access activity. Applying the machine learning dynamic peer group analysis algorithm uncovered incidents that would otherwise have gone unnoticed.
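The three properties listed earlier (finite, accurate, context-rich) translate fairly directly into code. The following sketch is an assumption-laden illustration, not a description of how Imperva scores events: the `Event` fields, the ratio-based `risk_score`, the cutoff of 5.0, and the limit of three events are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Event:
    user: str
    table: str
    rows_read: int
    peer_median_rows: int  # hypothetical context: what peers typically read

def risk_score(event):
    """Toy score: how many times more rows were read than is typical for peers."""
    return event.rows_read / max(event.peer_median_rows, 1)

def triage(events, min_score=5.0, limit=3):
    """Build the investigation list for the SOC team."""
    scored = [(risk_score(event), event) for event in events]
    kept = [pair for pair in scored if pair[0] >= min_score]   # accurate: drop noise
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "user": event.user,
            "table": event.table,
            "score": round(score, 1),
            # context-rich: record why the event was flagged
            "why": f"read {event.rows_read} rows; peers typically read {event.peer_median_rows}",
        }
        for score, event in kept[:limit]                       # finite: cap the list
    ]

EVENTS = [
    Event("tim", "salaries", 5000, 50),
    Event("ann", "orders", 120, 100),
    Event("raj", "customers", 900, 100),
    Event("lee", "invoices", 700, 100),
    Event("kim", "orders", 600, 100),
]
for item in triage(EVENTS):
    print(item)
```

The cap and the cutoff are policy choices rather than technical ones; a team would tune both to the number of events it can realistically investigate.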
Unleash Machine Learning
Most machine learning applications in the security space look at data access from a high level: Tim logged in to a particular database on Tuesday at 8:12 am and logged out at 8:39 am. What they aren’t tracking is what Tim actually did during those 27 minutes. Without this further granular insight, it is impossible to determine whether or not the recognized pattern of activity could be considered normal, or potentially malicious data abuse.
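To make the contrast concrete, the sketch below puts the two views side by side: a session-level summary that knows only the 27 minutes, and a query-level summary that knows which tables were touched. The regular expression is deliberately naive (a real SQL parser would be needed for subqueries, quoted identifiers, and so on), and the calendar date, the queries, and the table names are hypothetical.

```python
import re
from datetime import datetime

# Naive pattern: an identifier following FROM, JOIN, INTO, or UPDATE.
TABLE_PATTERN = re.compile(r"\b(?:from|join|into|update)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def tables_in_query(sql):
    """Return the set of table names referenced by one SQL statement (best effort)."""
    return {name.lower() for name in TABLE_PATTERN.findall(sql)}

def session_view(login, logout):
    """What a login/logout log can tell us: only how long the session lasted."""
    return {"duration_minutes": int((logout - login).total_seconds() // 60)}

def query_view(queries):
    """What query-level monitoring adds: how many queries touched each table."""
    touched = {}
    for sql in queries:
        for table in tables_in_query(sql):
            touched[table] = touched.get(table, 0) + 1
    return touched

# Tim's Tuesday session from the example above (the date itself is made up).
LOGIN = datetime(2024, 1, 2, 8, 12)
LOGOUT = datetime(2024, 1, 2, 8, 39)
QUERIES = [
    "SELECT name FROM customers WHERE id = 7",
    "SELECT c.name, o.total FROM customers c JOIN orders o ON o.customer_id = c.id",
    "SELECT * FROM payroll.salaries",
]

print(session_view(LOGIN, LOGOUT))
print(query_view(QUERIES))
```

Both summaries describe the same session, but only the second one shows that a payroll table was read, which is the detail a peer-group comparison needs.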
With an understanding of pattern-recognition algorithms, machine learning can work smarter, identifying user patterns that may pose threats to your data. Examining every SQL query by every user means we can learn not only when and for how long Tim was logged in but, more importantly, what he accessed. We can then compare what he did with what others in his peer groups did, ultimately determining which data access is normal and which is not. Learn how to recognize the top ten indicators of data abuse and unleash machine learning to identify insider threats.