What is Data Obfuscation?
Data obfuscation is the process of replacing sensitive information with data that looks like real production information, making it useless to malicious actors. It is primarily used in test or development environments—developers and testers need realistic data to build and test software, but they do not need to see the real data.
There are three primary data obfuscation techniques:
- Masking out is a way to create different versions of the data with a similar structure. The data type does not change; only the value changes. Data can be modified in a number of ways, for example by shifting numbers or letters, replacing words, or switching partial data between records.
- Data encryption uses cryptographic methods, usually symmetric or public/private key systems, to encode the data, making it completely unusable until decrypted. Encryption is very secure, but while your data is encrypted, you cannot manipulate or analyze it.
- Data tokenization replaces certain data with meaningless values. However, authorized users can connect the token to the original data. Token data can be used in production environments, for example, to execute financial transactions without the need to transmit a credit card number to an external processor.
Why is Data Obfuscation Important?
Here are a few of the key reasons organizations rely on data obfuscation methods:
- Third parties can’t be trusted—sending personal data, payment card information or health information to any third party is dangerous. There is a dual risk—it increases the number of people who have access to the data beyond the organization’s control, and it exposes the organization to violations of regulations and standards.
- Business operations may not need real data—any use of customer, employee, or user data is risky because it exposes the data to employees, contractors, and others. Many business processes, such as development, testing, analytics, and reporting, do not necessarily need to process real personal data. By obfuscating the data, the organization can maintain the business process but eliminate the risk.
- Compliance—many compliance standards require or encourage data to be obfuscated under certain conditions. For example, the European Union’s General Data Protection Regulation (GDPR) names pseudonymisation as an appropriate safeguard for personal data, and data masking is one common way to achieve it.
What is Data Masking?
Data masking is the process of replacing real data with fake data that has the same structure and data type. For example, the phone number 212-648-3399 can be replaced with another valid, but fake, phone number, such as 567-499-3788.
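The phone number example can be sketched in a few lines of Python. This is a minimal illustration, not a production masking routine: the function name `mask_phone` is hypothetical, and the sketch only preserves the layout of the value. It does not check that the result is a valid number under any real numbering plan, which a real masking tool would need to do.

```python
import random

def mask_phone(phone: str, rng: random.Random) -> str:
    """Replace every digit with a random digit, keeping separators in place."""
    return "".join(
        str(rng.randrange(10)) if ch.isdigit() else ch
        for ch in phone
    )

rng = random.Random()  # unseeded, so the output differs on every run
masked = mask_phone("212-648-3399", rng)
print(masked)  # same NNN-NNN-NNNN layout, random digits
```

Because the structure and data type are unchanged, code that parses or validates the layout of the field keeps working on the masked value.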
There are two main types of data masking: static and dynamic.
Static Data Masking
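Static data masking creates a masked copy of the production database, typically by copying the original, applying the masking rules to the copy, and then loading the result into a development or testing environment. Because the copy no longer contains real values, it is safe to share with contractors or with employees who are not authorized to see production data.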
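As a rough sketch of static masking, the following Python example uses the standard-library `sqlite3` module with in-memory databases. It shows one possible variant, masking rows while they are copied so that the production database is never modified. The `customers` table, its columns, and the helper names are assumptions made up for the illustration.

```python
import random
import sqlite3

def mask_digits(value: str, rng: random.Random) -> str:
    """Replace each digit with a random digit, keeping separators."""
    return "".join(str(rng.randrange(10)) if c.isdigit() else c for c in value)

def build_masked_copy(source: sqlite3.Connection, rng: random.Random) -> sqlite3.Connection:
    """Copy the hypothetical `customers` table into a new database, masking on the way."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
    rows = source.execute("SELECT id, name, phone FROM customers")
    for row_id, _name, phone in rows:
        # The real name is dropped; a generic label takes its place.
        target.execute(
            "INSERT INTO customers VALUES (?, ?, ?)",
            (row_id, f"Customer {row_id}", mask_digits(phone, rng)),
        )
    target.commit()
    return target

# Stand-in for the production database.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
production.execute("INSERT INTO customers VALUES (1, 'Alice Example', '212-648-3399')")

test_db = build_masked_copy(production, random.Random())
print(test_db.execute("SELECT * FROM customers").fetchall())
```

Only `test_db` would be handed to developers or testers; the source connection is read but never written.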
Dynamic Data Masking
Dynamic data masking (DDM) is a more advanced technique that leaves the original, sensitive data in place and masks it on the fly as it is read. By default, applications and users see the masked data, and the real values are only accessible to authorized roles. DDM is often implemented by serving the data to unauthorized parties through a reverse proxy.
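The role-based behavior can be illustrated with a small Python sketch in which masking is applied at read time, so only one stored record is needed. The `read_customer` function stands in for the proxy layer, and the role name `fraud_analyst` and the `Customer` type are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    name: str
    card_number: str

AUTHORIZED_ROLES = {"fraud_analyst"}  # hypothetical role name

def read_customer(record: Customer, role: str) -> Customer:
    """Return the real record to authorized roles and a masked view to everyone else."""
    if role in AUTHORIZED_ROLES:
        return record
    # Keep only the last four characters of the card number visible.
    masked_card = "#" * (len(record.card_number) - 4) + record.card_number[-4:]
    return Customer(name=record.name, card_number=masked_card)

stored = Customer("Alice Example", "4111111111111111")
print(read_customer(stored, "developer").card_number)      # ############1111
print(read_customer(stored, "fraud_analyst").card_number)  # 4111111111111111
```

Masking is the default path here, and access to the real value has to be granted explicitly, which mirrors the "masked by default" behavior described above.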
What is Data Encryption?
Encryption involves scrambling data or plain text using an encryption algorithm, in such a way that it cannot be deciphered without the encryption key. Modern encryption algorithms are very secure and require infeasible amounts of computing power to crack.
There are two main types of encryption: symmetric encryption and asymmetric encryption, also known as public-key cryptography.
Symmetric Key Encryption
Symmetric key encryption encrypts and decrypts a message or file using the same key. It is much faster than asymmetric encryption, but the sender and the receiver must securely share the key before the receiver can decrypt the message.
Symmetric encryption requires users to distribute and securely manage a large number of keys, which is hard to do at scale and creates security concerns. This is why many modern encryption solutions rely on public-key cryptography to exchange keys, often in combination with symmetric encryption for the bulk of the data.
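The defining property of symmetric encryption, one shared key for both directions, can be shown with a toy one-time pad in Python. This is an illustration only: the standard library has no general-purpose cipher such as AES, so the sketch uses XOR with a random key as long as the message. A real system would use a vetted cryptography library instead, and the helper name `xor_bytes` is made up for this example.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte (a one-time pad)."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"card=4111111111111111"
key = secrets.token_bytes(len(message))  # must be shared with the receiver securely

ciphertext = xor_bytes(message, key)     # sender encrypts with the key
recovered = xor_bytes(ciphertext, key)   # receiver decrypts with the same key
print(recovered == message)
```

The same function and the same key perform both encryption and decryption, which is why anyone who obtains the key can read the data and why key distribution becomes the main operational problem.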
Public Key Cryptography
Public key cryptography (also known as asymmetric encryption) uses two keys: a public key and a private key. The public key can be shared with anyone, while the private key is protected. Data encrypted with the public key can only be decrypted with the matching private key, so anyone can send a protected message but only the holder of the private key can read it.
The RSA algorithm is a widely used public-key cryptography system. It is used both for encryption, which protects confidentiality, and for digital signatures, which help ensure the integrity and authenticity of electronic communications.
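The relationship between the two RSA keys can be shown with textbook RSA on deliberately tiny numbers. This is a sketch for intuition only: real RSA uses keys of 2048 bits or more together with padding schemes, and should always come from a vetted library. The primes and function names below are chosen for the illustration.

```python
# Textbook RSA with tiny primes: for illustration only, never for real data.
p, q = 61, 53
n = p * q                 # modulus, part of both keys
phi = (p - 1) * (q - 1)   # Euler's totient of n
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # private exponent: modular inverse of e (Python 3.8+)

def encrypt(m: int) -> int:
    """Anyone can encrypt with the public key (e, n)."""
    return pow(m, e, n)

def decrypt(c: int) -> int:
    """Only the holder of the private key (d, n) can decrypt."""
    return pow(c, d, n)

def sign(m: int) -> int:
    """A signature is produced with the private key."""
    return pow(m, d, n)

def verify(m: int, signature: int) -> bool:
    """Anyone can check a signature with the public key."""
    return pow(signature, e, n) == m

message = 65  # must be smaller than n in this toy setting
print(decrypt(encrypt(message)) == message)
print(verify(message, sign(message)))
```

Encryption uses the public key and is undone by the private key, while signing runs the other way around: the private key signs and the public key verifies.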
What is Data Tokenization?
Tokenization replaces sensitive information with equivalent, non-confidential information. The replacement data is called a token.
Tokens can be generated in a number of ways:
- Using encryption, which can be reversed using a cryptographic key
- Using a hash function—a mathematical operation that is not reversible
- Using random numbers or index numbers
Once the original data is replaced with tokens (tokenized), the token can be handled as non-sensitive information, while the sensitive information it represents is stored securely in the “token vault”, a well-protected server. Only someone with access to the token vault can make the connection between the token and the original data it represents.
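A token vault based on random tokens can be sketched as a pair of lookup tables. The `TokenVault` class below is a hypothetical, in-memory illustration; a real vault would be a hardened, access-controlled service with persistent storage.

```python
import secrets

class TokenVault:
    """Toy in-memory vault mapping random tokens to the values they replace."""

    def __init__(self):
        self._by_token = {}
        self._by_value = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or create a new random one."""
        if value in self._by_value:
            return self._by_value[value]
        token = secrets.token_hex(8)
        while token in self._by_token:  # regenerate on the unlikely collision
            token = secrets.token_hex(8)
        self._by_token[token] = value
        self._by_value[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only callers with vault access can do this."""
        return self._by_token[token]

vault = TokenVault()
token = vault.tokenize("4111111111111111")
print(token)                    # 16 random hex characters, unrelated to the card number
print(vault.detokenize(token))  # the original value, available only through the vault
```

Because the token is random, it carries no information about the original value. The mapping exists only inside the vault, which is why the vault itself must be well protected.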
Other Data Obfuscation Techniques
Here are several other techniques your organization can use to obfuscate data in non-production environments:
- Non-deterministic randomization—replacing the real value with another, random value, within certain constraints that ensure the value is still valid. For example, ensuring the new value of a credit card expiration date is a valid month in the next five years.
- Shuffling—changing the order of digits in a number or code that does not have semantic meaning. For example, changing a phone number from 912-8876 to 876-8129.
- Blurring—adding variance to a number, while remaining in the general vicinity of the original number. For example, changing the amount of funds in a bank account to a random value within 10% of the original amount.
- Nulling—replacing original values, in whole or in part, with a placeholder symbol, for example, ####-####-####-9887 for a credit card number.
- Repeatable masking—replacing a value with another, random value, but ensuring that the original values are always mapped to the same replacement values. This maintains referential integrity.
- Substitution—replacing the original value with one value from a closed dictionary of values, for example, replacing a name with a name randomly selected from a list of 10,000 possible names.
- Custom rules—it is important to specify rules to retain the validity of special data formats, such as social security numbers, addresses, phone numbers, etc. For example, to perform obfuscation of addresses, you will need to use a geographical database and ensure you are replacing each element of the address with a valid value—street number, street name, city, country, etc.
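Several of the techniques in this list can be sketched with the Python standard library alone. The function names and parameters below are hypothetical and the implementations are deliberately simple. In particular, `repeatable_mask` uses a keyed hash (HMAC-SHA256) as one possible way to get a stable mapping, which is an assumption of this sketch rather than a method prescribed above.

```python
import hashlib
import hmac
import random
from datetime import date

def randomize_expiry(today: date, rng: random.Random) -> str:
    """Non-deterministic randomization: a valid MM/YY within the next five years."""
    month_index = today.year * 12 + (today.month - 1) + rng.randint(1, 60)
    year, month0 = divmod(month_index, 12)
    return f"{month0 + 1:02d}/{year % 100:02d}"

def shuffle_digits(value: str, rng: random.Random) -> str:
    """Shuffling: permute the digits, leaving separators where they are."""
    digits = [c for c in value if c.isdigit()]
    rng.shuffle(digits)
    remaining = iter(digits)
    return "".join(next(remaining) if c.isdigit() else c for c in value)

def blur(amount: float, rng: random.Random, spread: float = 0.10) -> float:
    """Blurring: move the amount by a random factor of at most +/- spread."""
    return round(amount * (1 + rng.uniform(-spread, spread)), 2)

def null_out(value: str, keep_last: int = 4) -> str:
    """Nulling: replace every digit except the last keep_last with '#'."""
    total = sum(c.isdigit() for c in value)
    seen = 0
    out = []
    for c in value:
        if c.isdigit():
            seen += 1
            out.append(c if seen > total - keep_last else "#")
        else:
            out.append(c)
    return "".join(out)

def repeatable_mask(value: str, secret: bytes) -> str:
    """Repeatable masking: the same input and secret always give the same output."""
    digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
    out = []
    used = 0
    for c in value:
        if c.isdigit():
            # Derive each replacement digit from the next hex character of the digest.
            out.append(str(int(digest[used % len(digest)], 16) % 10))
            used += 1
        else:
            out.append(c)
    return "".join(out)

def substitute(rng: random.Random, dictionary: list) -> str:
    """Substitution: pick a replacement from a closed list of values."""
    return rng.choice(dictionary)

rng = random.Random()
print(shuffle_digits("912-8876", rng))
print(blur(1000.00, rng))
print(null_out("1234-5678-9012-9887"))  # ####-####-####-9887
print(repeatable_mask("912-8876", b"demo-secret"))
```

Since `repeatable_mask` always maps a given value to the same replacement under one secret, the same customer identifier masked in two tables still joins correctly, which is the referential-integrity property mentioned above.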
A 4-Step Data Obfuscation Strategy
To succeed in a data obfuscation project, your organization should develop a holistic approach to planning, data management, and execution.
1. Data Discovery
The first step in a data obfuscation plan is to determine what data needs to be protected. Each company has specific security requirements, data complexity, internal policies, and compliance requirements. The end result of this step is a set of data classes, an estimate of the breach risk for each class, and an assessment of how far data obfuscation can reduce that risk.
In the data discovery stage, the organization may classify data based on business classes, functional classes, or classes mandated by a compliance standard like PCI DSS. A typical classification is into public, sensitive, and classified data.
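Part of data discovery can be automated by scanning sample values for recognizable formats. The sketch below is a simplistic, hypothetical illustration: the pattern names and regular expressions are assumptions, and real discovery tools combine many more signals, such as column names, checksums, and context.

```python
import re

# Hypothetical, deliberately simplistic patterns for a few sensitive formats.
PATTERNS = {
    "card_number": re.compile(r"^\d{4}-?\d{4}-?\d{4}-?\d{4}$"),
    "us_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_column(samples: list) -> str:
    """Return the first pattern name that every sample matches, else 'unclassified'."""
    for label, pattern in PATTERNS.items():
        if samples and all(pattern.match(s) for s in samples):
            return label
    return "unclassified"

columns = {
    "contact": ["212-648-3399", "567-499-3788"],
    "login": ["alice@example.com"],
    "comment": ["called on Tuesday"],
}
for name, samples in columns.items():
    print(name, "->", classify_column(samples))
```

A column that comes back as "unclassified" is not necessarily public. It only means that none of the known patterns matched, so it still needs a manual review before being assigned to a class.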
2. Architecture
For the classes that need to be protected by obfuscation, carefully test how different types of obfuscation will affect the application. The business operation must be able to function normally while its data is continuously obfuscated.
3. Build
In this step, the organization builds a solution to perform obfuscation in practice and configures it according to the data classes and architecture that were previously defined. This includes:
- Integrating the data obfuscation component with existing data stores and applications
- Preparing datasets and storage infrastructure for the obfuscated versions of the data
- Starting the change management process
- Defining obfuscation rules for different types of data
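Obfuscation rules for different types of data can be expressed as a mapping from column name to masking function. The sketch below is one possible shape for such a configuration. The column names, the rule functions, and the choice to redact columns that have no rule are all assumptions of this example.

```python
from typing import Callable, Dict

def keep(value: str) -> str:
    """Pass non-sensitive values through unchanged."""
    return value

def null_card(value: str) -> str:
    """Show only the last four characters of a card number."""
    return "#" * (len(value) - 4) + value[-4:]

def redact(value: str) -> str:
    """Replace the value entirely."""
    return "REDACTED"

# Hypothetical rule set: one masking function per column.
RULES: Dict[str, Callable[[str], str]] = {
    "id": keep,
    "card_number": null_card,
    "notes": redact,
}

def obfuscate_row(row: Dict[str, str], rules: Dict[str, Callable[[str], str]] = RULES) -> Dict[str, str]:
    """Apply the rule for each column; columns without a rule are redacted."""
    return {column: rules.get(column, redact)(value) for column, value in row.items()}

row = {"id": "7", "card_number": "4111111111111111", "email": "alice@example.com"}
print(obfuscate_row(row))
```

Redacting columns that have no rule means a newly added column is hidden until someone classifies it, rather than leaking by default.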
4. Testing and Deployment
Once the system is built, it should be carefully tested on all relevant data and applications, to ensure that obfuscation is effective and does not impact business operations. Testing involves creating one or more test datastores and attempting to obfuscate at least part of the production dataset.
As the project moves towards deployment, the organization must perform user acceptance testing (UAT), define organizational roles to take responsibility for obfuscation, and produce scripts that can automate obfuscation as part of routine business processes.
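Some of these checks can be automated. The following sketch compares an original dataset with its obfuscated counterpart and reports two kinds of problems: real values that survived obfuscation, and obfuscated values that no longer match the expected format. The function name, the row layout, and the checks chosen are assumptions for illustration, not a complete test plan.

```python
import re

def check_obfuscation(original_rows: list, masked_rows: list, column: str, pattern: str) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if len(original_rows) != len(masked_rows):
        problems.append("row count changed")
    originals = {row[column] for row in original_rows}
    for index, row in enumerate(masked_rows):
        value = row[column]
        if value in originals:
            problems.append(f"row {index}: real value present in obfuscated data")
        if not re.fullmatch(pattern, value):
            problems.append(f"row {index}: value does not match expected format")
    return problems

original = [{"phone": "212-648-3399"}, {"phone": "646-555-0100"}]
masked = [{"phone": "567-499-3788"}, {"phone": "646-555-0100"}]
for problem in check_obfuscation(original, masked, "phone", r"\d{3}-\d{3}-\d{4}"):
    print(problem)
```

A script like this could run as part of a routine obfuscation job, failing the job whenever the returned list is not empty.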
Data Obfuscation with Imperva
Imperva’s security solution uses data masking and encryption to obfuscate core data, so that it is worthless to a threat actor even if it is somehow obtained.
Beyond obfuscation, Imperva’s data security solution protects your data wherever it lives—on premises, in the cloud, and in hybrid environments. It also provides security and IT teams with full visibility into how the data is being accessed, used, and moved around the organization.
Our comprehensive approach relies on multiple layers of protection, including:
- Database firewall—blocks SQL injection and other threats, while evaluating for known vulnerabilities.
- User rights management—monitors data access and activities of privileged users to identify excessive, inappropriate, and unused privileges.
- Data loss prevention (DLP)—inspects data in motion, at rest on servers, in cloud storage, or on endpoint devices.
- User behavior analytics—establishes baselines of data access behavior and uses machine learning to detect and alert on abnormal and potentially risky activity.
- Data discovery and classification—reveals the location, volume, and context of data on-premises and in the cloud.
- Database activity monitoring—monitors relational databases, data warehouses, big data, and mainframes to generate real-time alerts on policy violations.
- Alert prioritization—Imperva uses AI and machine learning technology to look across the stream of security events and prioritize the ones that matter most.