Entity Detection Plus Protection: Nightfall's New Approach to Comprehensive DLP

For years, data loss prevention has meant one thing: finding sensitive entities. Social Security numbers, credit card numbers, API keys—if you could pattern-match it, you could protect it. But this approach has always had fundamental limits.

What happens when you need to protect customer IDs unique to your business? What about proprietary source code that doesn't contain any traditional PII? How do you distinguish between a mortgage application and a business loan document when both contain similar financial data?

These aren't edge cases. They're the reality of modern data protection. And traditional DLP, built on regex patterns and sample-based training, struggles to solve them.

Why 95% Precision Changes Everything

Before exploring what's new, it's worth understanding why Nightfall approaches detection differently in the first place.

Traditional DLP systems achieve precision rates around 5-20%. This means 80-95% of alerts are false positives. At that accuracy level, automation is impossible. You can't auto-remediate when most detections are wrong. You can't trust the system to make decisions. You end up with alert fatigue, ignored policies, and shadow IT.

Nightfall's multimodal AI approach, which combines convolutional neural networks, computer vision, large language models, and deterministic validation, achieves 90-95% precision out of the box. This isn't a marginal improvement. At 95% precision, 19 out of 20 detections are accurate. That's human-level accuracy. It changes what's possible.
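To make the arithmetic concrete, here is a minimal sketch of what those precision figures mean for an analyst's queue. The function name and the 1,000-alert volume are illustrative assumptions, not Nightfall figures or part of any Nightfall API.

```python
def false_positives(total_alerts: int, precision: float) -> int:
    """Expected number of false-positive alerts at a given precision.

    Precision = true positives / (true positives + false positives),
    so the false-positive share of all alerts is (1 - precision).
    """
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be between 0 and 1")
    return round(total_alerts * (1.0 - precision))


if __name__ == "__main__":
    alerts = 1000  # hypothetical alert volume, for illustration only
    for precision in (0.05, 0.20, 0.95):
        noise = false_positives(alerts, precision)
        print(f"precision {precision:.0%}: {noise} of {alerts} alerts are false positives")
```

At the low end of the traditional range, nearly every alert in the queue is noise. At 95% precision, the noise is small enough that acting on an alert automatically becomes a reasonable default.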

With this level of confidence, you can enable automated remediation, trust end-user workflows, and actually enforce policies at scale across billions of assets.

Where Traditional Entity Detection Falls Short

But even with industry-leading entity detection across 150+ data types, we kept hearing the same requests:

"Can you detect our internal member IDs?" Every organization has custom identifiers—patient IDs, customer numbers, order references—with unique patterns. The traditional answer was "build a regex." But regex doesn't understand context. A transaction ID might match the same pattern as a medical record number. Without contextual awareness, false positives multiply.

"How do we protect our intellectual property?" Source code, financial projections, customer lists, and strategic plans are often more sensitive than any single PII field. The old approach required hundreds of sample documents, lengthy training cycles, and constant retraining as data patterns evolved. Even then, accuracy wasn't guaranteed.

Entity detection, no matter how accurate, can't solve these problems. You need a different approach.

Expanding the Detection Toolkit

We've introduced three new ways to identify sensitive data, each addressing a specific gap in traditional DLP:

Describe What You're Looking For, Not How to Find It

Instead of writing regex patterns, you describe what you're looking for in natural language. The system understands both the pattern and the context.

Here's a real example: A healthcare organization needs to detect medical record numbers with a specific format—"MRN-YYYY-####" where the year must be 2020 or later. Their transaction IDs happen to follow a similar pattern and were generating false positives with regex-based detection.

With prompt-based detection, you simply describe the entity:

"A medical record number is used to track patient records. It starts with 'MRN-' followed by a 4-digit year (2020 or later), then a dash, then 4 numbers. Keywords to look for: patient, medical record, MRN."

You provide a few positive and negative examples. The system validates that it has enough information to achieve 75%+ confidence. Then it works.

When it scans a document, it sees: "Patient John Smith calling to discuss lab results. MRN-2023-4567." It recognizes this as a medical record number based on both pattern and context.

When it sees: "Transaction ID: ABC-2023-8901," it knows this isn't a medical record number, even though the pattern is similar. No false positive.
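The difference between pattern-only and pattern-plus-context matching can be illustrated with a toy example. The sketch below is a rule-based stand-in, not how Nightfall's models work: the EntityDefinition class, the loose ID pattern, and the keyword check are all assumptions made for illustration. It only shows why a format match alone flags both identifiers while a format match combined with context flags one.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityDefinition:
    """Hypothetical container for a described entity (not Nightfall's schema)."""
    name: str
    prefix: str
    min_year: int
    # Context words only; the "MRN" prefix is checked separately below,
    # otherwise the identifier itself would always count as context.
    context_keywords: tuple[str, ...]


# Loose pattern: three capital letters, a 4-digit year, then 4 digits.
# It matches medical record numbers and transaction IDs alike.
LOOSE_ID = re.compile(r"\b([A-Z]{3})-(\d{4})-(\d{4})\b")


def naive_matches(text: str) -> list[str]:
    """Pattern-only detection: every ID-shaped string is reported."""
    return [m.group(0) for m in LOOSE_ID.finditer(text)]


def contextual_matches(text: str, definition: EntityDefinition) -> list[str]:
    """Report a match only if prefix, year, and surrounding context all agree."""
    lowered = text.lower()
    has_context = any(k.lower() in lowered for k in definition.context_keywords)
    hits = []
    for m in LOOSE_ID.finditer(text):
        prefix, year, _ = m.groups()
        if (
            has_context
            and prefix == definition.prefix
            and int(year) >= definition.min_year
        ):
            hits.append(m.group(0))
    return hits


MRN = EntityDefinition(
    name="medical record number",
    prefix="MRN",
    min_year=2020,
    context_keywords=("patient", "medical record"),
)

if __name__ == "__main__":
    samples = [
        "Patient John Smith calling to discuss lab results. MRN-2023-4567.",
        "Transaction ID: ABC-2023-8901",
    ]
    for text in samples:
        print(naive_matches(text), contextual_matches(text, MRN))
```

The pattern-only function reports both identifiers. The contextual function reports only the medical record number, because the transaction ID has the wrong prefix and no patient-related context around it.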

No regex. No custom development. Just accurate detection of the identifiers that matter to your business.

Protecting Documents, Not Just Data Fields

Some documents are inherently sensitive regardless of whether they contain traditional PII. We now offer 23 pre-built file classifiers for common document types:

Background checks
Customer lists
Internal source code
HR files and performance reviews
Patient data compilations
Payroll and tax documents
Financial projections
And more

These aren't keyword searches. The models understand document structure, content, and purpose.

When we tested this on real files, the results were striking. The system analyzed a performance review and noted: "Contains performance ratings, strengths, areas of development—this is an HR file." It examined two code files and correctly distinguished internal proprietary code from a public open-source library based on copyright notices and code recognition.

This is classification based on understanding, not pattern matching.

When Your Documents Need Special Treatment

When you need specificity beyond the 23 pre-built classifiers, you can create your own using the same prompt-based approach.

Consider a mortgage lender that needs to identify mortgage applications specifically: not just any financial document, but the applications that require special handling and compliance measures.

The prompt describes what the document contains: "Collects borrower information including employment details, salary, assets, liabilities. Used in the mortgage origination process. Keywords: mortgage application, borrower, assets and liabilities, loan amount."

You upload a single sample file. The system validates its understanding. Then it works across your entire environment.

When tested, the classifier correctly identified mortgage applications while ignoring business loan applications and grant applications—documents that contain similar financial data but serve different purposes. This is the kind of nuanced classification that traditional DLP simply cannot achieve.

From Detection to Protection: What This Enables

These three capabilities together represent a fundamental shift in what DLP can protect.

For custom business identifiers: No more regex debugging. No more false positives from pattern matches without context. Just describe what you need to find.

For intellectual property: No need to collect hundreds of samples. No lengthy training cycles. No retraining when document formats evolve. Protection that works out of the box.

For compliance and risk management: The ability to identify specific document types means you can apply appropriate controls. Mortgage applications get handled differently than business loans. Internal source code gets different policies than open-source libraries.

And because the underlying detection maintains 90-95% precision, you can actually automate responses. Quarantine sensitive files in Slack. Block uploads to personal cloud storage. Redact before sharing. With confidence that you're not creating friction from false positives.
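One way to picture the link between document type and control is a simple lookup from classifier label to remediation action. The sketch below is purely illustrative: the label strings, the Action enum, and the specific pairings are assumptions chosen for the example, not Nightfall's policy schema or recommended settings.

```python
from enum import Enum


class Action(Enum):
    """Remediation actions named in this post, plus a no-op default."""
    ALLOW = "allow"
    REDACT = "redact"
    QUARANTINE = "quarantine"
    BLOCK_UPLOAD = "block_upload"


# Hypothetical mapping from classifier label to control.
# The pairings are arbitrary examples, not recommendations.
POLICY: dict[str, Action] = {
    "mortgage_application": Action.QUARANTINE,
    "business_loan_application": Action.REDACT,
    "internal_source_code": Action.BLOCK_UPLOAD,
    "open_source_library": Action.ALLOW,
}


def action_for(document_class: str, default: Action = Action.ALLOW) -> Action:
    """Look up the control for a classified document, falling back to a default."""
    return POLICY.get(document_class, default)


if __name__ == "__main__":
    for label in ("mortgage_application", "open_source_library", "meeting_notes"):
        print(f"{label}: {action_for(label).value}")
```

Because each document type resolves to its own action, a mortgage application and a business loan application can share similar financial content and still be handled differently.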

Getting Started Is Straightforward

Each capability takes only a few steps:

For custom entities: Write a description of what you're looking for, provide a few examples, test that the system has sufficient confidence, and add it to a policy.
For pre-built classifiers: Simply add them to your policies. No configuration needed.
For custom classifiers: Describe the document type, provide keywords, upload a sample, validate, and deploy.

All three integrate with the same policy engine and remediation workflows you're already using. Detection happens in real time as files are shared, uploaded, or modified across SaaS applications, endpoints, and cloud storage.

Solving the Right Problem

Data loss prevention started with entity detection because that was the tractable problem. Find the credit card number, protect it. But the actual problem has always been broader: protect the data that matters to your organization.

Sometimes that's a Social Security number. Sometimes it's a customer ID specific to your business. Sometimes it's a document that doesn't contain any traditional PII but represents significant intellectual property or compliance risk.

Comprehensive data protection requires all three: accurate entity detection, context-aware custom entity detection, and intelligent file classification.

Watch our full session, AI-Powered Detection & Classification, from our product launch webinar to see Nightfall in action.

Ready to get started? Request a demo or reach out to us at sales@nightfall.ai to learn more about prompt-based detection and file classification.