What's the difference between Regex and AI-based Detection?

Data loss prevention (DLP) is a crucial area of concern for organizations of all sizes, and preventing the exposure of sensitive data is essential. Regular expressions (regex) and AI detection methods are two common approaches to DLP, but which one is more effective? In this article, we'll compare regex and machine learning for DLP and provide clear examples of how AI can lead to increased accuracy and better outcomes for security teams managing sensitive data.

What are Regex and AI-Powered Detection?

A regular expression is a sequence of characters and operators that describes a pattern in text, such as a required length or a particular mix of letters and digits. For example, a regex pattern might be used to identify a Social Security number or a credit card number in a block of text. While regex is effective at finding text that fits a known shape, it has limitations: anything that falls outside the predefined pattern will generally be missed. Regexes also tend to produce false positives, because they match on the shape of a string and cannot check whether the value is actually valid.

Credit card numbers, for example, carry a check digit computed with the Luhn algorithm. Since a regex can only constrain character classes (which digits and letters may appear) and string length, it will frequently capture digit strings that look like card numbers but are “invalid” because they don’t pass Luhn validation. This can be frustrating for teams who need accurate findings to quickly and reliably address the exposure of sensitive data like credit cards.
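
To make the gap concrete, here is a minimal Python sketch that pairs a naive card-number regex with a separate Luhn check. The pattern, the function names, and the sample numbers are illustrative assumptions of ours, not taken from any particular DLP product.

```python
import re

# Naive pattern (an assumption for illustration): 13 to 19 digits,
# optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str):
    """Return (candidate, passes_luhn) pairs for every regex match."""
    return [(m.group(), luhn_valid(m.group())) for m in CARD_CANDIDATE.finditer(text)]


if __name__ == "__main__":
    sample = "cards: 4111111111111111 and 4111111111111112"
    for candidate, ok in find_cards(sample):
        print(candidate, "passes Luhn" if ok else "fails Luhn")
```

Both strings match the regex, but only the first passes the checksum, so a regex-only detector would report both as findings.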

On the other hand, AI-powered DLP uses algorithms and models to detect patterns and classify data based on what they have learned from prior examples. This approach allows for the analysis of large amounts of data and the detection of complex patterns that may not be apparent to humans, including patterns that are not static or that look random. Machine learning algorithms can adapt to new threats and changes in data patterns, making them an essential tool in a constantly evolving threat landscape.

Where AI-powered DLP Detection Shines

Let's take a closer look at a few examples to see how AI-powered detection can lead to increased accuracy in DLP:

1. AI-powered DLP is needed to detect credentials and keys

One big area of concern for both DevSecOps and security teams is leaking encryption keys, authentication tokens, API keys, and passwords. Strings like these are considered high-entropy strings (see: Shannon entropy), which means that even when they are constrained to some criteria (like a 20-character alphanumeric string), there is huge variation in what counts as a “valid” API key. Additionally, in the case of hard-coded credentials, developers might name the variables that hold this content in many different ways.
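
For readers who haven't met Shannon entropy before, the short Python sketch below computes it in bits per character from character frequencies. The function name and the sample strings are our own illustrative choices; real detectors typically combine a score like this with other signals.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of `s` in bits per character, from character frequencies."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical samples: a dictionary word and a random-looking 20-character token.
    for sample in ("password", "k9Xv2LqP7mZr4TbY8sWd"):
        print(sample, round(shannon_entropy(sample), 2))
```

The random-looking token scores higher than the dictionary word because its characters are spread more evenly, which is the property that makes keys hard to describe with a fixed pattern.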

Digging deeper into the keys and credentials example, consider the generic API key regex in TruffleHog:


/[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].*['|\"][0-9a-zA-Z]{32,45}['|\"]/

Now, consider the following API keys:


api_key = '12345678901234567890123456789012'
api_token = '12345678901234567890123456789012'

When you apply TruffleHog v3’s regex to the example above, the API key in the first line will be captured, while the second will not. This demonstrates that regex searching is limited by the naming conventions used in the code. Because variable naming conventions differ from developer to developer, it is easy to end up with regexes that do not match the variable names actually in use. Ideally, however, you will want a solution that can identify a specific token (API key, password, etc.) regardless of a developer’s or a service provider’s conventions for labeling or generating tokens.
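
You can reproduce this behavior with a few lines of Python. The sketch below assumes the generic API key pattern carries a `{32,45}` length quantifier on the key body, which is how we understand the rule to be written upstream, and it uses straight quotes around the sample values, as they would appear in source code. The variable names are ours.

```python
import re

# Generic API key pattern. The {32,45} quantifier is our assumption about
# how the rule is written upstream.
GENERIC_API_KEY = re.compile(
    r"""[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].*['|\"][0-9a-zA-Z]{32,45}['|\"]"""
)

# The two sample assignments from the article, with straight quotes.
lines = [
    "api_key = '12345678901234567890123456789012'",
    "api_token = '12345678901234567890123456789012'",
]

if __name__ == "__main__":
    for line in lines:
        matched = GENERIC_API_KEY.search(line) is not None
        print(f"{line!r}: {'match' if matched else 'no match'}")
```

The secret value is identical in both lines; only the variable name differs, and that alone decides whether the regex fires.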

This is where AI comes in. AI-based detection can take the semantic meaning of content into account. Rather than relying solely on predefined patterns, machine learning lets you account for entropy as well as the context surrounding a finding. Effectively, it doesn’t matter how a developer labels an API key in code; all that matters is that there is an API key.

2. AI in DLP helps with basic alphanumeric strings & combinations

We talked above about credit cards, which require Luhn algorithm validation to be identified accurately. This is another great example of where AI-based DLP can shine. Credit cards, however, aren’t the only numeric strings that regex may struggle to identify. It’s true that basic strings, like IP addresses, driver's license numbers, Social Security numbers, and phone numbers, can be captured satisfactorily with regex. Doing so, though, often requires chaining long and complex regexes together to eliminate invalid instances (like 999.999.999.9 as an IP address or 999-999-9999 as a phone number), and those regexes are very time-consuming to build. For more free-form sensitive data, like a person’s name or street address, there’s simply no way to capture the necessary context via regex, because the contents can be any name, word, or phrase.
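
The IP address case is easy to demonstrate. In the Python sketch below, a short "shape only" regex happily accepts 999.999.999.9, while rejecting it takes either a noticeably longer pattern or a separate validation step (here, the standard library's `ipaddress` module). The pattern and function names are illustrative assumptions.

```python
import ipaddress
import re

# Short pattern: four dot-separated groups of 1 to 3 digits. Matches invalid addresses.
NAIVE_IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# Stricter pattern: each octet must be in the range 0-255, which makes the regex much longer.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
STRICT_IPV4 = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")


def is_valid_ipv4(candidate: str) -> bool:
    """Validate with the standard library instead of a longer regex."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ipaddress.AddressValueError:
        return False


if __name__ == "__main__":
    for candidate in ("192.168.1.10", "999.999.999.9"):
        print(
            candidate,
            "naive:", NAIVE_IPV4.fullmatch(candidate) is not None,
            "strict:", STRICT_IPV4.fullmatch(candidate) is not None,
            "stdlib:", is_valid_ipv4(candidate),
        )
```

Even this stricter pattern only covers one data type; every additional format needs its own hand-built rule.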

Additionally, you might only wish to classify some types of data as sensitive when they appear together. For example, a street address found by itself in your internal systems may not be a problem, but if it appears next to a specific person’s name and an IP address, you might wish to flag the content as identifying a specific individual. Unfortunately, regex on its own offers no coherent way to search for combinations of data like this. AI, though, excels at it, because context is exactly what AI-based detection takes into account.
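
As a rough illustration of what "sensitive only in combination" means in practice, the Python sketch below flags content only when a name, a street address, and an IP address are all found close together. It assumes some upstream detector has already produced typed findings with character offsets; the `Finding` class, the type labels, and the 200-character window are hypothetical choices of ours, not a description of how any product works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    kind: str   # hypothetical label, e.g. "PERSON_NAME", "STREET_ADDRESS", "IP_ADDRESS"
    start: int  # character offset where the finding begins


# Combination that, taken together, may identify a specific individual.
REQUIRED = {"PERSON_NAME", "STREET_ADDRESS", "IP_ADDRESS"}


def identifies_individual(findings, window: int = 200) -> bool:
    """True if all required kinds start within `window` characters of one another."""
    ordered = sorted(findings, key=lambda f: f.start)
    for i, first in enumerate(ordered):
        nearby = {f.kind for f in ordered[i:] if f.start - first.start <= window}
        if REQUIRED <= nearby:
            return True
    return False


if __name__ == "__main__":
    together = [
        Finding("PERSON_NAME", 10),
        Finding("STREET_ADDRESS", 40),
        Finding("IP_ADDRESS", 120),
    ]
    print(identifies_individual(together))
    print(identifies_individual([Finding("STREET_ADDRESS", 40)]))
```

The hard part is producing reliable name and address findings in the first place, which is where context-aware detection does the work.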

3. AI can scan images & other file types

Regex can only capture strings in text or in files that explicitly contain text. Text in images or non-searchable PDFs cannot be parsed by regex. Extracting it requires Optical Character Recognition (OCR), which is an AI-based technology. The absence of this ability limits the real-world utility of regex even further than the examples listed above.

Comparing Regex accuracy to AI-Powered detection

When Nightfall ran our AI detection against a competitor's heuristics-plus-regex approach for API keys, we found substantial benefits in recall and accuracy.

These differences can be hidden during gated testing scenarios, but once deployed to production environments where variation is higher, you will see substantial benefits from an AI-based detection approach.
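
If you want to run a comparison like this on your own data, precision and recall are the usual starting metrics. The Python sketch below computes both from a set of detector findings and a set of labeled true secrets. The function name and the toy identifiers are made up purely for illustration; they are not results from the comparison described above.

```python
def precision_recall(detected: set, actual: set):
    """Return (precision, recall) for a detector's findings against labeled truth."""
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data (hypothetical): four real keys, and a detector that finds two
    # of them plus one false positive.
    actual = {"key_a", "key_b", "key_c", "key_d"}
    detected = {"key_a", "key_b", "not_a_key"}
    p, r = precision_recall(detected, actual)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Low recall means missed secrets, while low precision means noisy alerts that someone has to triage.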

TLDR

While regex is a powerful tool for identifying patterns in text data, it has limitations in detecting and classifying sensitive data. AI-powered DLP offers increased accuracy and adaptability, making it an essential tool for organizations looking to prevent data loss. In our comparison on API keys, AI-powered detection outperformed a heuristics-plus-regex approach on recall and accuracy, making it the stronger choice for DLP in today's threat landscape.
