Preventing data loss in data warehouses with the Nightfall Developer Platform

Chris Martinez
July 16, 2021
Preventing data loss in data warehouses with the Nightfall Developer PlatformPreventing data loss in data warehouses with the Nightfall Developer Platform
Chris Martinez
July 16, 2021
On this page

Data warehouses power your data analysis and business intelligence operations so you can level up your knowledge and progress toward bigger business goals. Like any key component of your tech stack, using data warehouses effectively also requires care and caution — especially when uploading and sharing sensitive information. 

To understand data protection for data warehouses better, we sat down with Nightfall Co-Founder and CTO Rohan Sathe for a question and answer session. Rohan provides answers to the most important questions around data security for data warehouses like Snowflake, Amazon Redshift, and others that are in frequent use. Read where exfiltration risk originates when using data warehouses, why encryption is only part of the solution, and how cloud-native DLP can help solve the problems that lead to data exfiltration in data warehouses.

Q: In data warehouses, data is always moving around and being transformed so it can be made available for users. How can we reduce risk of data exposure in data warehouses? 

A: A data warehouse is a place where folks will aggregate information from many different data sources or data silos. For example, a big retail company would bring in data from marketing, customer orders, finance, and other departments into one place, so they can build analyses or applications on top of that data to run the business better. The data warehouse is effectively the warehouse of all the different data silos that you would want to bring together in one place. 

Data warehouses present slightly different use cases from traditional databases. A traditional database is used by your applications. It's used more for access programmatically through applications, and less often by end users. Typically the use case for data warehouses is an analytics team or a data science team running queries against the data within the warehouse. So the question is, what are the risks in using a data warehouse when you start to store data in the data warehouse? That data could be sensitive.

In a data warehouse, because people are loading in data from different sources, you might have a lot of replicated copies of sensitive data that you don't know about. That’s where the risk starts. One example is a user in your organization imported order history or customer log data into your warehouse, but didn't realize that there was already a version of that information in there, so you end up with multiple versions of the data in places that you didn't expect. 

Or when you load data into your warehouse, you might want to only add to certain fields, but accidentally end up importing other sensitive information. One example of this is adding de-identified patient data from your electronic health system, but accidentally loading in the whole patient record that contains all their sensitive data. This happens all the time and most companies have no idea what kind of sensitive data is sitting in that data warehouse. There’s a rising concern for security leaders about what data could be loaded into data warehouses. Also, it’s difficult to maintain legacy systems in a secure manner. An enterprise company always faces risk when plugging in a legacy system to gather necessary information in their quest to keep up with the competition.

Q: I have encryption in place for my data warehouse. Why would I need additional protection?

A: Encryption could be good enough if you have some sort of field level encryption that only specific people can access or query those fields. If information is encrypted at that level then you'd have to know exactly which of those fields have sensitive data. I think that's part of the problem. If I know there's a credit card number field, then maybe I'll encrypt that at the field level so that someone who has access to that with the key can see it and everyone else can't. But to do that, you would need to have awareness of all the places where the sensitive data is, and have tight access controls in place.

If there's sensitive data somewhere in your data warehouse, you could give access to queries or even query columns or tables only for specific people or groups. But that can get messy very quickly because inevitably people will say they need a wider level of permissions to do their job. It becomes cumbersome to manage exactly who can access and see specific things in these environments.

It also depends on how many layers of security you want. Certainly encryption is one mechanism to accomplish that. The risk there comes from somebody getting access to the encryption keys. They can decrypt the data and then make use of it however they want. So the question is always, does that data need to exist in the warehouse if it's encrypted? What is the utility of something being encrypted in the data warehouse, if you're not going to query against it and make use of it?

Q: What types of information do users need to protect in their data warehouses?

A: It’s important to protect data like personally identifiable information (PII), anything that should be protected according to PCI-DSS or HIPAA compliance such as financial information like credit card numbers or healthcare records. Secrets and credentials are less commonly exposed in data warehouses. However, if you're cloning items to add to your data warehouse, you could accidentally copy information like passwords or authentication tokens and unintentionally introduce them into an environment where they could be exposed. You should always be aware of what's actually in your data warehouses as a first step to protecting that information. That warehouse can quickly become a repository for every piece of data your company uses.

Q: Would using DLP help an organization better understand what data exists within their data warehouses?

A: Using a data protection platform like Nightfall will help you classify the data. From there you can take steps to then prevent exfiltration or inappropriate access, and enforce compliance requirements as needed.

Q: How should SaaS companies approach data security within data warehouses like Snowflake?

A: The first thing is to understand where your sensitive data is in a data warehouse. Are you proliferating sensitive data into your data warehouse? Is that intended? If it's not intended then you have to understand what data is going into the data warehouse so you can change your processes to prevent this. If it is intended then you should add on extra layers of security onto those specific fields, like encryption or redaction or in some cases audit logging for who is accessing specific columns.

Q: Why is Nightfall different from other solutions that protect data within data warehouses? How can companies protect data in data warehouses?

A: This is an emerging category. I think you can say the general concept of data warehouses has exploded in the last decade. And thus security has been sort of an afterthought because this space is moving so quickly. Traditionally, there is no strategy for making sure sensitive data doesn't exist in data warehouses. It's a manual effort where somebody has to look at the data and make sure that certain columns or rows aren't being copied over. There's never really been an automated approach before Nightfall.

Q: As teams are demanding larger data sets and other developments for data warehouses, that can be a big strain on the team. How does this type of stress impact data security?

A: Typically the problem starts when an analytics team asks for the ability to query data quickly, and they need the data to be formatted properly for them to complete the task. Security is more of an afterthought in the beginning of a company's journey in building a data warehouse. Those problems tend to be hard to clean up once the size of the data in your data warehouse or data lake starts to expand over time.

Q: Should the same security principles and applications we’re talking about be applied to data lakes?

A: Data lakes and data warehouses work in the same way. The main difference is that data lakes host unstructured data and raw information, while a data warehouse typically hosts structured data that has been slightly processed. You could end up with unstructured data that remains unprocessed in your data warehouse if you aren’t aware of what’s being imported and shared within the data warehouse. That’s why you should always classify the content in your data warehouses, regardless of if it’s unstructured or structured. And sometimes, data lakes might pose a greater risk because companies could dump everything they have in there without having a specific goal.

Q: I’m interested in using Nightfall to protect my data in my data warehouse. What kind of things could I do with Nightfall?

A: With the current version of the Nightfall Developer Platform, you can leverage a programmatic API or classification API to categorize rows and columns inside your data warehouse. Here’s an example of what that could look like within your Nightfall dashboard when working with Microsoft Azure SQL.

if __name__ == '__main__':
  print(f"Scanning {table_name} Azure Database with Nightfall")
  nightfall = Nightfall(nightfall_api_key, condition_set_uuid)
  data, columns, primary_key = get_from_azure()
  primary_key_col = []
  if len(data) == 0:
      msg = f"Table {table_name} is empty! No data to scan."
  all_findings = []
  for col_idx, col in enumerate(columns):
      print(f"Scanning {col} column.")
      payload = [str(i[col_idx]) for i in data]
      if col == primary_key:
          primary_key_col = payload
      findings = nightfall.scan(payload)
      for item_idx, finding in enumerate(findings):
          if finding is not None:
              for item in finding:
                      item['index'] = primary_key_col[int(item_idx)]
                      item['index'] = int(item_idx)
  print(f"{len(all_findings)} sensitive findings in {table_name}.")
  with open('azure_findings.json', 'w') as output_file:
      json.dump(all_findings, output_file)

We are adding native integrations for the most popular data warehouses like Snowflake, Amazon Redshift, Google’s Big Query, or Azure SQL Data Warehouse, to automate a lot of that process. 

For more information about the Nightfall Developer Platform for data warehouses and to see a demo, click on the Calendly below to schedule a call with our team.

Nightfall Mini Logo

Getting started is easy

Install in minutes to start protecting your sensitive data.

Get a demo