Data stewardship and the protection of customer data are likely to be among the largest challenges for businesses this decade, with a growing number of countries considering data privacy legislation. For example, in 2020, 30 states within the US considered data privacy bills; that number increased to 38 in 2021. This wave of legislation, kicked off by GDPR, started around 2016 when the world entered the Zettabyte era. Since then, both organizations and individuals have become more aware of the massive amounts of data they may be storing in the cloud. By 2025, we can expect there to be nearly 100 zettabytes of cloud data in the world. That’s about 100 billion terabytes, roughly on par with the estimated number of stars in the Milky Way.
The prodigious growth of cloud data and the legislation that has followed are far from the only bellwethers indicating the growing importance of data stewardship. During the hot IPO season of the past year, cybersecurity and data protection have become increasingly important parts of boardroom conversations as companies go public.
This means that data stewardship and data protection are areas where all stakeholders’ interests ought to align: Customers trust your organization to handle their data with care and to protect their privacy. Your board expects you to avoid data exposure incidents, which are highly disruptive to a growing business. And ultimately, your business’s longevity relies on managing customer data and other types of sensitive data properly.
Everyone agrees: data breaches are bad, customer data is sacred, and companies should proactively secure customer data in order to prevent exposure incidents. Yet, despite this, the data security problem is becoming increasingly difficult to define and solve, especially when it comes to data stored in the cloud. But why is this the case?
What I learned about data security in the trenches at Uber
I started thinking about this problem in 2016, at Uber. Back then, the modern, cloud-centric compliance landscape was just beginning to take shape, with GDPR on the horizon. As Uber Eats scaled rapidly to become a multibillion-dollar business in under two years, we unsurprisingly experienced a surge in the number of engineering services being built, as well as a proliferation of customer data across SaaS systems and cloud silos. This data sprawl issue ballooned as our business grew, making it a problem that would be easier to solve the sooner we addressed it.
It became apparent that the problem required three core approaches:
- Increasing data visibility. The data that we were aggregating couldn’t be easily identified, and in most cases it was only semi-structured. In a lot of ways, it was a known unknown; we only had a rough idea of what we were looking for, but we didn’t really know where to look. Additionally, as our environments continued to grow and change, we knew that the types of sensitive data we stored and the places where this data was stored would only increase.
- Improving data hygiene. As we added more engineers to the team, it became more difficult to validate that everyone was following best practices in order to limit the opportunities for sensitive data being exposed.
- Remediating past and future violations. Creating a standard reporting system and taking the time to discover and remove sensitive data would prove very time-consuming, reducing bandwidth for other engineering projects. In practice, this process required some degree of automation.
What we learned firsthand was that rapid growth and adoption of cloud tools make it hard to know if or when best practices for ensuring the security of sensitive data are being followed. This is something many of Nightfall’s customers can attest to today. However, at the time we identified this problem, there were no solutions on the market that were agile enough to scale with us, so we realized we would need to build internally to tackle the issue.
I realized that for a lot of organizations, what we were attempting to do at Uber Eats was simply unfeasible, and so in the back of my mind, the seeds of the idea that would become Nightfall began to form.
What does it take to solve the challenge of cloud data security?
The problem of protecting consumer data – really all business-sensitive data – is challenging, both conceptually and technically. One of the first hurdles is understanding the relationship between data visibility, data hygiene, and remediation. Unless all three are addressed simultaneously, it’s impossible to ensure sensitive data is protected. Much like the CIA triad, which is foundational for many infosec practitioners, these three areas of data security are deeply interlocked.
- Visibility. Gaining data visibility is crucial to understanding historical violations, meaning sensitive data that's already been introduced to your environments but hasn’t surfaced yet.
- Hygiene. Enforcing data hygiene is important for preventing the introduction of inappropriate content that violates data security best practices. This will ensure that future violations don’t happen.
- Remediation. Remediation of sensitive data exposure is critical for removing historical violations and ensuring that future violations of data hygiene don’t lead to data exposure events down the line.
Conceptualizing the problem is one thing, but solving it is something else entirely. The technical challenge can’t be overstated. Not only do all three of these aspects need to be encompassed in a solution, but the solution has to be applied across multiple types of cloud environments, widening the scope of the problem. Although my team and I deliberated about how to solve this issue at Uber, I don’t think I fully appreciated just how big – and common – this problem really was until I left to co-found Nightfall.
Why I left Uber to start Nightfall
Once I left Uber Eats, I became obsessed with the cloud data security problem. Despite the broad scope of the problem, my co-founder and I settled on a simple and elegant solution: by authenticating into cloud environments via API, you can view sensitive data and remove it in a way that closely resembles native functionality. This idea serves as the backbone for what Nightfall is today.
The challenge of discovering and remediating sensitive data exposure required an additional solution. This turned out to be a data classification problem; if data could be accurately detected and classified, security teams would be spared from having to engage in extensive data mapping exercises across thousands of tables, applications, and systems. My co-founder and I realized that supervised machine learning could help make detectors capable of picking up on the context likely to indicate the presence of sensitive data, regardless of where it lives.
Machine learning and API connectivity inform how Nightfall’s native integrations work across SaaS applications like Slack, GitHub, Google, Atlassian, and more. I think this approach works exceptionally well to address the data stewardship issues of today, but I see a future where both security and compliance obligations encourage companies to be more proactive. Companies will not just want to remediate existing instances of sensitive data or manage employee behavior; they will also look for ways to prevent their customers from even submitting certain types of sensitive information in the first place. We’ve already seen apps like Airbnb automatically redact phone numbers and other contact information within their messengers. Functionality like this not only protects customers, but also reduces the overall exposure risk for companies.
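To make the redaction idea concrete, here is a minimal Python sketch of masking contact details in a message before it is stored. It assumes two deliberately simple regular expressions as a stand-in for real detection; the function name `redact_contact_info`, the patterns, and the placeholder strings are all illustrative, and this is not how Airbnb or Nightfall implement the feature.

```python
import re

# Illustrative patterns only (assumptions for this sketch). Real detection
# needs context: a 10-digit order number would match PHONE_RE, for example.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
)


def redact_contact_info(message: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    message = EMAIL_RE.sub("[email hidden]", message)
    return PHONE_RE.sub("[phone hidden]", message)


print(redact_contact_info("Call me at (415) 555-0123 or bob@example.com"))
```

Pattern matching like this is cheap to run at submission time, but it is brittle, which is part of why context-aware detection matters for anything beyond a toy case.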
My co-founder and I understood this, and thus have been hard at work on what we’re calling the Nightfall Developer Platform. This platform unleashes our machine learning detectors and allows users to send data directly to our detection engine over API. You can use our APIs to identify sensitive data, be it strings in files, messages, or content in images.
Data security APIs from one developer to another
What is the Nightfall Developer Platform? The idea is simple: Nightfall will do the hard work of detecting, classifying, and remediating sensitive data, allowing developers to focus on building applications that are secure and don't leak data. This means that developers can bake in workflows within their applications or custom environments to have sensitive data identified and classified so that it can be removed or otherwise remediated.
Our customers have already begun building amazing functionality into their applications using our APIs. Some customers are leveraging our platform to discover and remove PII in logs. Other customers have begun building in functionality that will flag and remove PII that is inappropriately entered into text fields within their app.
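As a rough picture of what scrubbing PII from logs can look like, the sketch below attaches a filter to a standard Python `logging` handler so that each message is cleaned before it is written. The regular expressions are simplistic stand-ins for a real detection engine, and `RedactingFilter` and `make_redacting_logger` are hypothetical names chosen for this example, not part of any Nightfall SDK.

```python
import logging
import re

# Simplistic stand-in detectors (assumptions for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class RedactingFilter(logging.Filter):
    """Rewrite each log record so matched PII never reaches the output."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so PII passed as arguments is covered too.
        message = record.getMessage()
        message = EMAIL_RE.sub("[REDACTED_EMAIL]", message)
        message = SSN_RE.sub("[REDACTED_SSN]", message)
        record.msg = message
        record.args = ()  # already formatted; avoid formatting twice
        return True  # keep the record, now redacted


def make_redacting_logger(name: str, stream) -> logging.Logger:
    """Build a logger whose handler redacts PII before writing to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactingFilter())
    logger.addHandler(handler)
    return logger
```

Calling `make_redacting_logger("app", sys.stderr).info("login ok for %s", email)` would then emit the line with the address replaced. Putting the filter on the handler rather than at each call site means individual engineers don't have to remember to sanitize their own log statements.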
The best part is that since this is all done over API, getting started is easy. All you have to do is sign up for an account and create an API key, which you can do in just seconds. From there, any files or content you send over the wire will be parsed and classified, returning JSON with the sensitive data detectors the content triggered alongside confidence thresholds. You can also de-identify and redact data over API using multiple techniques, including substitution and encryption.
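To illustrate the general shape of working with a detection response, here is a small Python sketch that takes a JSON list of findings and redacts the matched spans by substitution. The response format is invented for this example: the field names (`findings`, `detector`, `confidence`, `start`, `end`), the detector names, and the confidence labels are all assumptions and should not be read as Nightfall's actual schema. Check the official documentation for the real one.

```python
import json

# Hypothetical response body, made up for illustration only.
SAMPLE_TEXT = "Contact alice@example.com SSN 123-45-6789"
SAMPLE_RESPONSE = """
{
  "findings": [
    {"detector": "EMAIL_ADDRESS", "confidence": "VERY_LIKELY", "start": 8, "end": 25},
    {"detector": "US_SSN", "confidence": "POSSIBLE", "start": 30, "end": 41}
  ]
}
"""

# Assumed ordering of the hypothetical confidence labels, lowest to highest.
CONFIDENCE_RANK = {"POSSIBLE": 1, "LIKELY": 2, "VERY_LIKELY": 3}


def redact_by_substitution(text: str, response_json: str,
                           min_confidence: str = "LIKELY") -> str:
    """Replace each sufficiently confident finding with its detector name."""
    findings = json.loads(response_json)["findings"]
    threshold = CONFIDENCE_RANK[min_confidence]
    selected = [f for f in findings
                if CONFIDENCE_RANK[f["confidence"]] >= threshold]
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(selected, key=lambda f: f["start"], reverse=True):
        text = text[:f["start"]] + "[" + f["detector"] + "]" + text[f["end"]:]
    return text


print(redact_by_substitution(SAMPLE_TEXT, SAMPLE_RESPONSE))
print(redact_by_substitution(SAMPLE_TEXT, SAMPLE_RESPONSE, "POSSIBLE"))
```

The confidence cutoff is the main design choice here: a stricter cutoff leaves more text intact but risks missing real sensitive data, while a looser one redacts more aggressively.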
When it comes to building with the developer platform, the sky’s the limit. If you can imagine the functionality you want, then you can build it. If you want to learn more about the Developer Platform, we have a ton of materials prepared. Check out our documentation, a library of rich integration tutorials, guides, and SDKs.
You can learn more on our Product Hunt listing and get started for free here.