## Introduction

There are various ways to measure any given machine learning (ML) model’s ability to produce correct predictions, depending on the task that the system performs. Named Entity Recognition (NER) is one such task, in which a model identifies spans of sensitive data within a document. Nightfall uses NER models extensively to detect sensitive data across cloud apps like Slack, Microsoft Teams, GitHub, Jira, ChatGPT, and more.

In order to evaluate these ML models, Nightfall uses industry-standard metrics, including precision and recall.

- **Precision** refers to the percentage of model predictions that are correct.
- **Recall** refers to the percentage of relevant items identified by the model.

These metrics are evaluated over a collection of documents by matching predicted spans with human-annotated reference spans and counting the number of matched and unmatched predicted and reference spans.
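This matching step can be sketched as exact span matching, where each span is a (start, end) character-offset pair. This is a simplification for illustration; production evaluators may also compare entity labels or allow partial overlaps.

```python
def count_matches(predicted, reference):
    """Count TP/FP/FN for exact-match span evaluation.

    Spans are (start, end) offset tuples.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted spans that match a reference
    fp = len(predicted - reference)   # predictions with no matching reference
    fn = len(reference - predicted)   # reference spans the model missed
    return tp, fp, fn
```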

But how do precision and recall affect the experience of using an ML model? In the case of cloud DLP, higher precision translates to fewer false positive alerts for security teams to triage. Higher recall, on the other hand, translates to stronger security, as more true positive findings are detected.

Let’s dive into an example. Say that we’re trying to identify social security numbers (SSNs) in the following document:

Tax Form

Your Social Security Number: 359-95-7430

Spouse's Security Number: 213 - 65 - 5321

Transaction ID: 858 - 68 - 0493

This document contains two SSNs, as well as a similarly formatted transaction ID. We’ll run this example through two sample models to see how they perform.

## Model #1

Let’s look at a prediction from a simple regex-based model using the regex `\d{3}\s?-\s?\d{2}\s?-\s?\d{4}`. It seems like this model accidentally predicted that the transaction ID is an SSN. Because this predicted span has no matching reference span in the original document, it’s considered a **false positive**.
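Here’s a minimal sketch of Model #1 in Python, run against the example document (the document string is transcribed from the example above, with whitespace simplified):

```python
import re

# The example document from above
doc = (
    "Tax Form\n"
    "Your Social Security Number: 359-95-7430\n"
    "Spouse's Security Number: 213 - 65 - 5321\n"
    "Transaction ID: 858 - 68 - 0493\n"
)

# Model #1: optional whitespace allowed around the dashes
pattern = re.compile(r"\d{3}\s?-\s?\d{2}\s?-\s?\d{4}")
matches = pattern.findall(doc)
print(matches)  # both SSNs, plus the transaction ID (a false positive)
```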

When evaluating NER tasks, it’s necessary to keep track of:

- **False Positives:** Predicted spans without a corresponding reference span.
- **False Negatives:** Reference spans without a corresponding predicted span.
- **True Positives:** Matching predicted and reference spans.

To tie this back to our example, the first model provided two **true positives** and one **false positive**.

As we collect more data, we’ll also amass more true positives, false positives, and false negatives. In order to separate the influence of the amount of data we have from the correctness of our model’s predictions, ML practitioners typically compute summary statistics that describe the model’s general behavior independent of the dataset size. The two most prevalent metrics used for NER models are **precision** and **recall**.

These metrics describe the types of predictions a security analyst would encounter when examining the output from the ML model. For example, a model with a precision of 95% would be right 19 out of 20 times.

In this example, the model made **3** predictions, and **2** were correct, meaning that the model’s precision was **2 / 3 = 67%**. Additionally, since the model identified all of the SSNs, its recall was **2 / 2 = 100%**.
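These two computations can be expressed directly (the helper names here are illustrative, not from any library):

```python
def precision(tp, fp):
    """Fraction of the model's predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of the reference spans that the model found."""
    return tp / (tp + fn)

# Model #1: 2 true positives, 1 false positive, 0 false negatives
print(f"precision: {precision(2, 1):.0%}")  # 67%
print(f"recall: {recall(2, 0):.0%}")        # 100%
```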

## Model #2

Say we try to modify our regex to get rid of the transaction ID by not allowing whitespace around the dashes: `(\d{3}-\d{2}-\d{4})`.

This time, the model missed the second social security number. As a result, we have one **true positive** and one **false negative** (the missed social security number). Therefore:

- **Precision:** TP / (TP + FP) = 1 / (1 + 0) = 100%
- **Recall:** TP / (TP + FN) = 1 / (1 + 1) = 50%
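The stricter pattern can be checked the same way; a minimal sketch, assuming the same example document as before:

```python
import re

# The example document from above
doc = (
    "Tax Form\n"
    "Your Social Security Number: 359-95-7430\n"
    "Spouse's Security Number: 213 - 65 - 5321\n"
    "Transaction ID: 858 - 68 - 0493\n"
)

# Model #2: no whitespace allowed around the dashes
pattern = re.compile(r"\d{3}-\d{2}-\d{4}")
matches = pattern.findall(doc)
print(matches)  # only the first SSN matches; the second is missed
```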

## F1 Scores

You may have noticed that neither of the above models is perfect. One is precise but has low recall, while the other has high recall but is imprecise. This is a pattern frequently seen in machine learning, where precision can be “traded off” for recall, and vice versa. Tuning the right balance between precision and recall can be difficult, but it is an important part of producing an ML model that meets the business use case for which it is intended.

In order to describe a model’s performance more concisely, we can use a single metric called an F-score, which incorporates both precision and recall. Typically the F1 score is used, representing equal weighting of precision and recall.

F1 = 2 × Precision × Recall / (Precision + Recall) = 2 / (1 / Precision + 1 / Recall)

This formula represents the harmonic mean of the model’s precision and recall, which emphasizes the effect of the lower of the two scores.

Let’s compute the F1 score for both of the models we’ve looked at so far on our dataset:

#### Model #1

F1 score = 2 / (1 / Precision + 1 / Recall) = 2 / (1 / (2 / 3) + 1 / (2 / 2)) = 2 / (3 / 2 + 1) = 2 / (5 / 2) = 4 / 5 = **80%**

#### Model #2

F1 score = 2 / (1 / Precision + 1 / Recall) = 2 / (1 / (1 / 1) + 1 / (1 / 2)) = 2 / (1 + 2) = 2 / 3 = **67%**
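Both computations can be reproduced in a few lines. Note that the standard F1 formula carries a factor of 2, since it is the harmonic mean of exactly two quantities (the `f1_score` helper name here is ours, not a library function):

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

print(f"Model #1: {f1_score(2 / 3, 1.0):.0%}")  # 80%
print(f"Model #2: {f1_score(1.0, 0.5):.0%}")    # 67%
```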

## False Discovery Rates

When comparing two models, precision and recall can help us understand which model will produce better results. However, these metrics are subtly different in how they relate to the experience of actually using the model.

For example, let’s say we have two models: **Model A** with a precision of 60%, and **Model B** with a precision of 80%. We can say that model **B** has a precision that is 33% higher than that of model **A**. However, does this mean that model **B** will have 33% fewer false positives?

Let’s calculate the number of false positives that each model would be expected to produce on a sample of 100 model predictions:

- **A:** 100 x (100% - 60%) = 100 x 40% = 40
- **B:** 100 x (100% - 80%) = 100 x 20% = 20

In fact, model **B** would produce **50%** fewer false positives than model **A**.

So why was this number different than when we directly compared precision values? It’s because **precision is not linearly related to the expected number of false positives**. There’s actually a different metric that’s more suitable for this type of comparison: False Discovery Rate (FDR).

So, since model **A** has a precision of 60%, its FDR is 40%. In line with this, model **B’s** FDR is 20%. Since model **B**’s FDR is half of model **A**’s, model **B** produces **50%** fewer false positives than model **A**.
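A quick sketch of this comparison, using the Model A and Model B precision values from the example above:

```python
def false_discovery_rate(precision):
    """FDR: the fraction of a model's predictions that are wrong."""
    return 1 - precision

fdr_a = false_discovery_rate(0.60)  # Model A
fdr_b = false_discovery_rate(0.80)  # Model B

# Relative reduction in expected false positives when choosing B over A
reduction = 1 - fdr_b / fdr_a
print(f"Model B produces {reduction:.0%} fewer false positives")
```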

Comparing the false discovery rate of two models can be a more effective way of describing how choosing one model over another will affect the number of false positives that the ML system is expected to produce.

## Conclusion

It’s crucial to evaluate precision, recall, and false discovery rate when comparing machine learning models. Precision and recall provide insights into the accuracy and coverage of a model's predictions, while F1 scores combine both metrics to provide a single performance measure. However, F1 scores may not accurately reflect the quantity of false positive findings—making False Discovery Rates a more suitable metric for comparing one ML model to another. By considering FDRs, data scientists can assess and optimize the performance of machine learning models for real-world applications—including, say, detecting sensitive data in cloud apps.