The risks of a data leak have never been higher. Over the last year, the average cost of a data breach rose from $3.86 million to $4.24 million, a record high. Data exfiltration, sophisticated hacker attacks, and even insider threats are forcing organizations across the board to take a more sophisticated, multi-layered approach to data security.
Enter: data masking. Data masking is a simple technique that can help organizations continue to work productively while keeping sensitive data stored safely. Here’s what data masking is, how it’s useful, and some data masking best practices to consider for your business.
What is data masking?
Data masking is a data security practice in which a fake, but still realistic, version of your company’s data is created to be used as an alternative to real, sensitive information. For instance, the false data set created by data masking can be used in user training, sales demos, or testing.
Data masking changes the information coded in the data to protect PII, PHI, and other sensitive personal and intellectual property. However, data masking processes create data in the same format, allowing teams to test security protocols, train on a piece of software, or share a live demo without compliance risk. Data masking creates a version of the organization’s information that can’t be reverse-engineered or deciphered to reveal the real proprietary data.
Data masking offers a number of advantages for healthcare companies, the banking industry, and any business that needs to adhere to GDPR regulations.
Benefits of data masking
Data masking allows companies to follow common compliance regimes while mitigating the risks of sensitive data exposure during key business activities. Data masking can be used to defend against data loss, data exfiltration, insider threat, and even vulnerabilities that may occur from using third-party applications. By masking data for most employees or partners, the breach of one system will not give hackers unlimited access to all of the organization’s data.
Simultaneously, authorized users, such as testers and developers, can still share data and collaborate on projects without exposing data records. Data masking can help businesses outsource specific projects without putting their entire data universe at risk.
And, finally, data masking may be required by law. GDPR, for instance, requires all personal data held by a third party to be “protected” which can be interpreted to mean anonymized or encrypted so it cannot be compromised following a breach. Data masking allows organizations to minimize the risk of sensitive data exposure by creating an alternate data set that is virtually useless to hackers.
Common data masking techniques
There are two common types of data masking: static and on-the-fly. Static data masking works on a copy of a production database. An IT team backs up the production database to a different environment, removes any unnecessary data, masks the rest, and saves the masked copy to the location where it will be used for training or testing.
On-the-fly masking takes place when data is transferred from the production environment to the testing or development environment. This approach is ideal for organizations that deploy software continuously and can't practically maintain an up-to-date masked backup copy. On-the-fly masking sends only a subset of masked data when it is needed.
With these two approaches in mind, there are a number of data masking techniques that organizations can use to protect sensitive information.
Data encryption
Here, an encryption algorithm masks the data, making it useless without a decryption key. Data encryption is one of the most common and more secure forms of data masking; however, it can be complicated to implement, requiring technology to perform the encryption and to manage and share decryption keys.
Data scrambling or tokenization
Data scrambling tools reorganize characters into a random order, changing ID numbers from 12345 to 53214, for instance. This method is simple to implement but only works with certain types of data, limiting its applicability.
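A minimal sketch of character scrambling in Python, using the standard library's `random` module (the function name and optional seed are illustrative, not from any particular tool):

```python
import random

def scramble_digits(value, seed=None):
    """Reorder the characters of an ID into a random permutation."""
    rng = random.Random(seed)
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

masked = scramble_digits("12345", seed=7)
print(masked)  # some permutation of the original digits
```

Note the limitation mentioned above: the masked value still contains exactly the same characters, so scrambling only suits data where the character set itself isn't sensitive.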
Tokenization is a form of data scrambling that replaces a piece of data with a random string of characters that has no meaningful value. Tokenization does not mathematically transform the sensitive data into the token: there is no key or algorithm from which to derive the original data. Instead, tokenization uses a database to store the relationship between the sensitive data and the token. This database is secured with encryption and other security protocols.
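The vault-based lookup described above can be sketched in a few lines of Python. This is an illustration only: a real token vault would encrypt the mapping at rest and control access to it, whereas here it is a plain in-memory dict.

```python
import secrets

class TokenVault:
    """Minimal token vault: maps sensitive values to random tokens."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token if this value was seen before.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is pure randomness; it carries no information about the value.
        token = secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        # Only the vault can map a token back to the original value.
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")
print(token)  # 32 random hex characters
```

Because the token is random rather than derived from the data, a stolen token is worthless without access to the vault itself.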
Pseudonymization
Pseudonymization is defined in the GDPR as any method that ensures data can't be used for personal identification. It covers data masking, encryption, and hashing. This general category requires removing any direct identifiers and avoiding the use of multiple indirect identifiers that, taken together, could identify the individual.
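One common pseudonymization pattern is to drop direct identifiers outright and replace the remaining one with a salted hash. The record fields and salt below are hypothetical, for illustration only:

```python
import hashlib

# Assumption for this sketch: the salt is a secret stored separately
# from the dataset, so the hashes can't be trivially brute-forced.
SALT = "replace-with-a-secret-salt"

def pseudonymize(record):
    """Strip direct identifiers and replace the name with a salted hash."""
    masked = dict(record)
    digest = hashlib.sha256((SALT + record["name"]).encode()).hexdigest()
    masked["name"] = digest[:12]     # stable pseudonym for the same person
    masked.pop("email", None)        # remove direct identifiers entirely
    masked.pop("phone", None)
    return masked

print(pseudonymize({"name": "Ada Lovelace", "email": "ada@example.com", "plan": "pro"}))
```

The same person always maps to the same pseudonym, so analysts can still join records without ever seeing the real identity.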
Data value variance
Data value variance replaces the original data value with a function of it. For instance, if a customer purchased several items, the purchase prices can be replaced with the range between the highest and lowest price paid. Or, exact purchase dates can be replaced with the difference between someone's first and last dates of purchase. Often this is a more manual, low-tech way to practice data masking, but it can work in a pinch.
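The price example above can be sketched as a tiny aggregation function (the field names are illustrative):

```python
def mask_purchase_amounts(purchases):
    """Replace individual purchase prices with an aggregate range."""
    return {
        "low": min(purchases),    # lowest price paid
        "high": max(purchases),   # highest price paid
        "count": len(purchases),  # how many purchases there were
    }

print(mask_purchase_amounts([19.99, 4.50, 120.00]))
# {'low': 4.5, 'high': 120.0, 'count': 3}
```

The individual line items are gone; only the range survives, which is often enough for testing or analytics.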
Data shuffling
Values in a data set are shuffled, making the dataset appear accurate when in reality none of the identifying details match the individuals to whom they belong. This can be achieved by rearranging each column of a spreadsheet using a random sequence, for instance.
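A sketch of column-wise shuffling in Python, treating each row as a dict (the column names are illustrative): each column is permuted independently, so the aggregate statistics survive but no row still describes a real individual.

```python
import random

def shuffle_columns(rows, seed=None):
    """Shuffle each column independently so rows no longer line up."""
    rng = random.Random(seed)
    # Pivot rows into columns of values.
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    for values in columns.values():
        rng.shuffle(values)
    # Pivot back into rows of recombined, mismatched values.
    return [dict(zip(columns, combo)) for combo in zip(*columns.values())]

rows = [
    {"name": "Alice", "zip": "10001"},
    {"name": "Bob", "zip": "94105"},
    {"name": "Carol", "zip": "60601"},
]
print(shuffle_columns(rows, seed=1))
```

Every name and every ZIP code still appears exactly once, but the pairings are broken.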
Data masking best practices
By itself, data masking usually isn’t sufficient to prevent the risk of a data breach. It should be integrated as part of a layered approach to data loss prevention. As such, there are other tools that can help defend sensitive information against hackers or insider threats.
Data loss prevention (DLP) can identify where sensitive data lives and whether or not it's been accessed. A cloud DLP solution, like Nightfall, specifically discovers, classifies, and protects PII, PHI, other unique identifiers, credentials, and secrets. Nightfall uses machine learning-based detectors to identify sensitive data in a variety of contexts, such as Slack messages, strings within your codebase, and files.
With custom workflows, you can automatically redact, delete, or quarantine any sensitive data Nightfall identifies before any irreversible damage is done. This data-centric approach has the added benefit of cutting down on cloud data sprawl by illuminating where your most valuable data lives. From there, you can take a more targeted approach with data masking techniques and other security protocols.
To learn more about Nightfall, set up a demo using the calendar below.