Data Anonymization: Motivation and Mechanics


Data is one of the most valuable assets a company holds. Although it is not listed as a line item on the balance sheet, a data breach can hurt the bottom line through damage to the company’s stock price, reputation and brand.

One approach to protecting the large majority of an organization’s data, both from being breached and from being useful to an attacker if it is breached, is data anonymization, variously known as data masking, obfuscation, pseudonymization, de-identification or scrambling. In this post, we’ll explain how copying production data, while a common and understandable business practice, increases risk; the advantages of using data anonymization (aka data masking) to protect data; and how to effectively mask sensitive data with the Imperva Camouflage data masking solution.

Copy Data: An Asset and a Liability

When companies look to leverage the value of the data they hold, they often copy their production databases and use those copies for non-production processes such as application and product development, testing, training and analytics. According to an IDC report, 82% of organizations have more than 10 copies of each database. To put that in perspective, an organization with 100 production databases and 10 copies of each has 1,000 replicas. Stated differently, replicated data represents about 90% of the data within those organizations (see Figure 1).


Figure 1:  Percentage of database copies relative to source/production databases
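The arithmetic behind that ratio can be checked with a short sketch. The function name `copy_share` is ours, and the inputs are simply the illustrative figures from the paragraph above (100 production databases, 10 copies each), not measured data.

```python
def copy_share(production_dbs: int, copies_per_db: int) -> float:
    """Fraction of all databases (production plus copies) that are copies."""
    copies = production_dbs * copies_per_db
    return copies / (production_dbs + copies)


# 100 production databases with 10 copies each -> 1,000 copies out of 1,100
share = copy_share(production_dbs=100, copies_per_db=10)
print(f"{share:.1%}")  # 90.9%
```

With more than 10 copies per database, the share of copied data only goes up from there.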

Beyond the difference in scale and the larger attack surface those production replicas present, there is a contrast in the security controls typically applied to each environment. Protections are often focused on the production data and are markedly absent once the data has been copied.

Copied data represents a huge and largely undefended attack surface. It also looms large in the data minimization rules contained in regulations such as GDPR, which require that personal data be limited to what is necessary and so discourage copying it as freely as has traditionally been done.

Addressing Copy Data Risk with Data Anonymization (Data Masking)

The basic approach to addressing the risk posed by that huge pool of replicated production data is to use data masking to replace sensitive data with realistic but fictional equivalents.
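As a rough illustration of what "realistic but fictional" means, the sketch below replaces a name with one drawn from a small list of made-up names and swaps each digit of an identifier for a random digit while keeping its length and punctuation. This is not how Imperva Camouflage works internally; the field names, name lists and helper functions are all hypothetical.

```python
import random

# Hypothetical pools of fictional names used as replacements.
FICTIONAL_FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley", "Taylor"]
FICTIONAL_LAST_NAMES = ["Baker", "Carter", "Hayes", "Nolan", "Porter"]


def mask_digits(value: str, rng: random.Random) -> str:
    """Replace every digit with a random digit; keep length and punctuation."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)


def mask_record(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with the sensitive fields replaced."""
    masked = dict(record)  # non-sensitive fields are carried over unchanged
    first = rng.choice(FICTIONAL_FIRST_NAMES)
    last = rng.choice(FICTIONAL_LAST_NAMES)
    masked["name"] = f"{first} {last}"
    masked["ssn"] = mask_digits(record["ssn"], rng)
    return masked


rng = random.Random()
original = {"name": "Jane Example", "ssn": "123-45-6789", "plan": "gold"}
print(mask_record(original, rng))
```

The masked record still looks like a valid row to an application or a tester, which is the point: format and length survive, the real values do not. A production tool would also guard against a random value coinciding with the original.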

Simplified, the process looks like this:

  1. Classify – identify and categorize sensitive data, the foundation for making security decisions about that data
  2. Configure – configure, refine and verify appropriate masking rules
  3. Automate – ensure data masking is applied every time production data is cloned

Step-by-Step

Let’s walk through the process using the Imperva Camouflage data masking solution:

1. Classify

Starting with CX-Discover, define a connection to the database you wish to search, choose the appropriate search rules from the search rule library and click search (see Figure 2).


Figure 2: Camouflage displays search settings that allow you to configure and run the classification process.

Review and confirm the findings, taking note of the default classification and masking rules applied (see Figure 3). Generally speaking, consistent masking techniques will be applied to ensure the integrity of application data once masked. These settings can easily be changed here as well.


Figure 3: Camouflage displays search results that allow you to review and adjust categorizations and masking techniques.
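To make the idea of rule-based classification concrete, here is a minimal sketch of how a search rule library might flag sensitive columns by pattern-matching sampled values. It is not CX-Discover's implementation: the rule names, regular expressions and the 80% match threshold are assumptions chosen for illustration.

```python
import re

# Hypothetical search rule library: category name -> pattern for one value.
SEARCH_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "credit_card": re.compile(r"^\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}$"),
}


def classify_columns(sample_rows, threshold=0.8):
    """Map column -> category when enough non-empty sampled values match a rule."""
    findings = {}
    columns = list(sample_rows[0].keys()) if sample_rows else []
    for col in columns:
        values = [str(r[col]) for r in sample_rows if r.get(col) not in (None, "")]
        if not values:
            continue
        for category, pattern in SEARCH_RULES.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                findings[col] = category
                break  # first matching rule wins in this sketch
    return findings


rows = [
    {"id": 1, "contact": "a@example.com", "tax_id": "123-45-6789", "notes": "call back"},
    {"id": 2, "contact": "b@example.org", "tax_id": "987-65-4321", "notes": ""},
]
print(classify_columns(rows))  # {'contact': 'email', 'tax_id': 'us_ssn'}
```

Pattern matches like these are only candidates, which is why the review step matters: a human confirms or corrects each finding before any masking rule is attached to it.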

Once you have reviewed the findings, the dashboard provides a convenient summary of where you are in the overall process of classifying your data (see Figure 4). This includes the number and size of data sources, search status and the overall sensitive data ratio.


Figure 4: Camouflage displays a dashboard summarizing current search status and findings.

2. Configure Masking Rules

For a detailed list of the masking rules, view the Functional Masking Document (FMD) for the data source you just searched (see Figure 5). You can accept the default configuration or make adjustments here, for example applying masking only to specific rows using filters or using custom algorithms.


Figure 5: The Functional Masking Document (FMD) allows for fine-grained control of masking rules.
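The kind of control described here, a masking technique per column with an optional row filter, can be sketched as a small data structure. The `MaskingRule` class, its fields and the sample columns are hypothetical and do not reflect the actual FMD format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


@dataclass
class MaskingRule:
    column: str                                         # column to mask
    mask: Callable[[str], str]                          # masking technique (may be custom)
    row_filter: Optional[Callable[[Row], bool]] = None  # None means every row


def apply_rules(rows: List[Row], rules: List[MaskingRule]) -> List[Row]:
    """Return masked copies of the rows; the input rows are left untouched."""
    masked_rows = []
    for row in rows:
        new_row = dict(row)
        for rule in rules:
            selected = rule.row_filter is None or rule.row_filter(row)
            if rule.column in new_row and selected:
                new_row[rule.column] = rule.mask(new_row[rule.column])
        masked_rows.append(new_row)
    return masked_rows


rules = [
    # Mask email addresses only for rows belonging to German customers.
    MaskingRule("email", lambda v: "masked@example.com", lambda r: r["country"] == "DE"),
    # Mask phone numbers in every row with a custom function.
    MaskingRule("phone", lambda v: "X" * len(v)),
]
rows = [
    {"email": "anna@firma.de", "phone": "030-1234", "country": "DE"},
    {"email": "bob@corp.com", "phone": "555-0100", "country": "US"},
]
for masked in apply_rules(rows, rules):
    print(masked)
```

Keeping the filter separate from the masking function means either one can be adjusted without touching the other, which is roughly what editing a single rule in a rules document amounts to.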

3. Automate

Automating the data masking process ensures that masking becomes an integral step in creating production clones for DevOps purposes. To do so, convert the FMD into the XML format understood by the masking engine, then run the masking engine from the job scheduling tool of your choice.
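As a loose sketch of the conversion step, the snippet below serializes a list of masking rules to XML with Python's standard library. The element and attribute names (`maskingJob`, `rule`, `filter` and so on) are invented for illustration and are not the actual schema the Camouflage masking engine consumes; consult the product documentation for the real format and for how to invoke the engine.

```python
import xml.etree.ElementTree as ET


def rules_to_xml(data_source: str, rules: list) -> str:
    """Serialize masking rules to an XML string (hypothetical schema)."""
    root = ET.Element("maskingJob", {"dataSource": data_source})
    for rule in rules:
        el = ET.SubElement(
            root,
            "rule",
            {
                "table": rule["table"],
                "column": rule["column"],
                "technique": rule["technique"],
            },
        )
        if rule.get("filter"):
            # Optional row filter carried as element text.
            ET.SubElement(el, "filter").text = rule["filter"]
    return ET.tostring(root, encoding="unicode")


rules = [
    {"table": "customers", "column": "email", "technique": "substitute"},
    {"table": "customers", "column": "ssn", "technique": "scramble",
     "filter": "country = 'US'"},
]
print(rules_to_xml("crm_clone", rules))
```

A script like this would typically run as the first task of a scheduled job, with the masking engine run as the next task, so that no clone is handed to a non-production team before it has been masked.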

Additional Considerations

Data never exists in isolation within your organization. There are many different types of data sources (relational, flat file, XML, big data) that link together in different ways. It is important to consider these different sources and the data relationships that must be maintained when masking them. Consistent masking techniques should be used to ensure those links are retained. In addition, make efficient use of the masking process by masking source systems that then feed downstream systems with masked data. This approach can save time and reduce complexity, for example, by masking the source files used to populate Hadoop, MongoDB or other big data environments as well as more traditional data warehouse systems.
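One common way to keep links intact is deterministic masking: the same input always maps to the same masked output, so a key masked in one source still joins to the same key masked in another. The sketch below uses a keyed hash (HMAC-SHA256) for this. It is a generic technique, not necessarily the one Camouflage applies, and the key, prefix and table layouts are assumptions.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, prefix: str = "CUST-") -> str:
    """Deterministically map a value to a pseudonym using a secret key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a longer slice lowers the collision risk.
    return prefix + digest[:12]


key = b"demo-key-not-for-production"  # assumption: kept secret and out of source control

# Two separate sources that share a customer identifier.
customers = [{"customer_id": "C1001"}, {"customer_id": "C1002"}]
orders = [
    {"order_id": "O1", "customer_id": "C1001"},
    {"order_id": "O2", "customer_id": "C1002"},
    {"order_id": "O3", "customer_id": "C1001"},
]

masked_customers = [
    {**c, "customer_id": pseudonymize(c["customer_id"], key)} for c in customers
]
masked_orders = [
    {**o, "customer_id": pseudonymize(o["customer_id"], key)} for o in orders
]

# Every masked order still points at a masked customer.
known_ids = {c["customer_id"] for c in masked_customers}
print(all(o["customer_id"] in known_ids for o in masked_orders))  # True
```

Because the mapping depends only on the value and the key, each source can be masked independently, or masked once upstream and fed downstream, and the relationships still line up.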

In closing, organizations have legitimate reasons for making all of those database copies. As noted at the outset, data has tremendous value, and by masking it that value can be leveraged without tremendous risk.

Learn more about Imperva Camouflage data masking.