Clustering and Dimensionality Reduction: Understanding the “Magic” Behind Machine Learning


These days we hear about machine learning and artificial intelligence (AI) in all aspects of life. We see machines that learn and imitate the human brain in order to automate human processes. There are autonomous cars that learn the road conditions to drive, personal assistants we can converse with and machines that can predict what stock markets will do. In some respects, it can appear as “magic.”

However, it’s not. Behind machine learning there are some fundamental, well-studied and understood techniques. To the extent that there is any magic, it is in knowing how to apply these techniques to solve a certain problem. We’ll take a look at some of these techniques, and then illustrate how some of them are applied to solve a specific problem for identifying improper access to unstructured data.

Machine Learning, Defined.

Machine learning is a type of artificial intelligence that enables computers to detect patterns and establish baseline behavior using algorithms that learn through training or observation. It can process and analyze vast amounts of data that are simply impractical for humans.

Machine learning tasks are classified into two main categories:

  • Supervised learning – the machine is presented with a set of inputs and expected outputs, later given a new input the output is predicted.
  • Unsupervised learning – the machine aims to find patterns, within a dataset without an explicit input from a human as to what these patterns might look like.

More importantly, however, is that within unsupervised machine learning, there are several different techniques that can be used to identify patterns, and ultimately yield valuable analysis. Understanding the problem domain is key to being able to correctly choose which of these techniques to use. One of the key decisions data scientists make is which approach to use. And if a data scientist doesn’t understand the problem domain, they cannot choose the right approach.


Clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. Clustering is considered an unsupervised task as it aims to describe the hidden structure of the objects.

Each object is described by a set of characters called features. The first step of dividing objects into clusters is to define the distance between the different objects. Defining an adequate distance measure is crucial for the success of the clustering process.


There are many clustering algorithms, each has its advantages and disadvantages. A popular algorithm for clustering is k-means, which aims to identify the best k cluster centers in an iterative manner. Cluster centers are served as “representative” of the objects associated with the cluster. k-means’ key features are also its drawbacks:

  • The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
  • k-means iterative nature might lead to an incorrect result due to convergence to a local minimum.
  • The clusters are assumed to be spherical.

Despite these drawbacks, k-means remains the right and popular choice in many cases. An example for clustering using k-means on spherical data can be seen in Figure 1.

k-means clustering on spherical data - 1v2

Figure 1: k-means clustering on spherical data


A different clustering algorithm is OPTICS, which is a density-based clustering algorithm. Density-based clustering, unlike centroid-based clustering, works by identifying “dense” clusters of points, allowing it to learn clusters of arbitrary shape and densities. OPTICS can also identify outliers (noise) in the data by identifying scattered objects.

k-means versus OPTICS on moon-like data - 2

Figure 2: k-means versus OPTICS on moon-like data

The OPTICS approach yields a very different grouping of data points than k-means; it classifies outliers and more accurately represents clusters that are by nature not spherical.An example of running k-means versus OPTICS on moon-like data is presented in Figure 2:

Dimensionality Reduction

In the field of machine learning, it is useful to apply a process called dimensionality reduction to highly dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects.

Why is dimensionality reduction important? As more features are added, the data becomes very sparse and analysis suffers from the curse of dimensionality. Additionally, it is easier to process smaller data sets.

Dimensionality reduction can be executed using two different methods:

  • Selecting from the existing features (feature selection)
  • Extracting new features by combining the existing features (feature extraction)

The main technique for feature extraction is the Principle Component Analysis (PCA). PCA guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Sometimes the information that was lost is regarded as noise – information that does not represent the phenomena we are trying to model, but is rather a side effect of some usually unknown processes. PCA process can be visualized as follows (Figure 3):

PCA process visualized - 3

Figure 3: PCA process visualized

Following the process in the example, we might be content with just PC1 – one feature instead of originally two.

There is a great choice of dimensionality reduction techniques: some are linear like PCA, some are nonlinear and lately methods using deep learning are gaining popularity (word embedding).

Applying the Techniques to Dynamically Learn True Peer Groups

A recent Hacker Intelligence Initiative (HII) research report from the Imperva Defense Center describes a new innovative approach to file security. This approach uses unsupervised machine learning to dynamically learn peer groups. Once these peer groups are learned, they are then used to determine appropriate virtual permissions for which users should/should not be allowed to access files within an organization. This dynamic peer group functionality is available in Imperva’s breach prevention solution, CounterBreach.

Figure 4 illustrates how machine learning is used to detect suspicious file access activity based upon dynamic peer group analysis.

Dynamic Peer Groups - suspicious file access - 4

Figure 4: The process of detecting suspicious activity using dynamic peer group analysis

First, Imperva performs dimensionality reduction on the data. In the previous step the audit data was transformed into a matrix with users as rows and folders as columns. The values in the matrix cells are the amount of activity performed by the given user in a folder. The first reason we use PCA is the sparseness of the matrix, more than 99% of its cells are empty. Secondly, many of the folders’ access patterns are correlated, which causes multicollinearity in our matrix. Practically in our case, correlation can occur due to groups of users working on a similar project therefore working on a similar group of folders. Lastly, after using PCA the matrix is 90% smaller and hence easier to process.After collecting and preparing the data, the machine learning is applied to build the virtual peer groups. In order to build these dynamic peer groups, Imperva uses the machine learning techniques mentioned above – PCA and density-based clustering.

Second, Imperva chose the OPTICS algorithm as the clustering algorithm. The users are clustered based on densities. The number of peer groups is unknown. Since k-means requires knowing the number of clusters – peer groups in this case – it cannot be used.  OPTICS helps us overcome this limitation. OPTICS also allows us to handle noisy users with special care – each noisy user resides in a cluster alone. In addition to the above, by a great deal of trial and error it was confirmed that OPTICS is the right choice for this dataset.


The strength of a successful algorithm based on data analysis lays in the combination of three building blocks. The first is the data itself, the second is data preparation—cleaning and choosing the exact features that represent the data characters—and the third is using the right machine learning methods in order to correctly profile the data.

In this case, PCA and OPTICS proved particularly useful for learning peer working groups. However, the “machine” didn’t magically determine this itself. Rather, it was a person (well, really a team) that understood the problem, and the data being analyzed, that performed the “magic” of selecting the right machine learning building blocks.