Dave Wells October 21st, 2022

Tips to Secure Shadow Data

Shadow data is, first and foremost, a concern for the security and compliance teams. What can they do to ensure the security and compliance of data stores they are unaware of or have little or no control over? 

There is an inherent tension between DevOps and security teams, and it is amplified by regulatory frameworks such as GDPR and CCPA that set and enforce high privacy standards for data.

In this blog post, we will discuss what shadow data is, why you should not ignore it, and how to secure it effectively.

What Does Shadow Data Mean?

The term shadow data refers to organizational data that has been copied, backed up, or placed in a data store that is not governed, does not fall under the same security structure as the rest of your data, and is not updated on a regular basis. The following are examples of shadow data:

  • Customer data that has been copied from production to development for testing purposes
  • Data stores that contain sensitive information for an application that is no longer in use
  • An application's byproducts, such as log files, that may collect sensitive information
  • Applications that use hard-to-find local databases
  • The data generated by shadow IT 
  • A siloed set of data that is only available to a specific line of business

Shadow data is primarily a problem for security and compliance professionals. If they are unaware of data stores, or have little or no control over them, how can they be responsible for the security and compliance of those stores? Regulations such as GDPR and CCPA, which establish and enforce high data privacy standards, have amplified the tension between DevOps and security teams.

Shadow data also affects operations teams, since unmanaged data sprawl can increase infrastructure costs. Cloud budgets are exceeded with little or no visibility into how the overspend was incurred or how it can be contained. Here are some real-life examples of shadow data:

  • Public, unmanaged databases: A developer implemented a SQLite database to store sensitive data entered by web application users. What is the problem? The database was deployed on a standard web server, which was, by definition, publicly accessible. The company's security and compliance guardrails were violated, and sensitive data was exposed to threats without the security team's knowledge.
  • The data generated by backend applications: Backup files, log files, and debug dumps serve the needs of DevOps engineers but are typically not monitored by security professionals. This shadow data, however, may contain sensitive information.
  • Unmanaged cloud resources: Developers may create an S3 bucket in a restricted geolocation as part of internal testing procedures that are not audited. If these testing resources are not properly decommissioned, they add unnecessary infrastructure costs and pose security and compliance risks.

It's time to stop ignoring shadow data

It is now challenging to ignore shadow data due to the prevalence of hybrid and multi-cloud environments. According to a recent report, 92% of enterprises today have a multi-cloud strategy, of which 82% have adopted a hybrid approach. A lack of visibility in these environments makes it difficult to monitor them effectively, and shadow data is likely to accumulate as a result.

Shadow data is also increasing due to the adoption of cloud-based continuous integration and delivery methods. In today's market, developers have more freedom to introduce new products and features. Additionally, the self-service cloud model allows developers to provision data stores with just a few clicks, often without consideration of the organization's governance or compliance policies. 

The proliferation of cloud-native applications based on microservices, containers, and serverless functions has brought the issue of shadow data to the fore, since decentralized workload-based data stores contribute significantly to data sprawl.

What is the difference between shadow data and dark data?

The term dark data refers to all the data within an organization that is unused, untapped, and unknown. It accumulates as a result of users' daily online interactions with countless devices and systems, and it includes everything from machine data and server log files to unstructured data derived from social media interactions.

The data may be considered outdated, incomplete, redundant, or inaccessible because it is in a format that available tools cannot read. Most of the time, the organization does not even know it exists.

However, it is essential to note that dark data may be one of an organization's most valuable untapped resources. Data has become a major organizational asset, and competitive organizations must capitalize on its full potential. Furthermore, more stringent data regulations may require organizations to manage all of their data.

Dark data is created within an organization's IT infrastructure during routine business operations, serves no other purpose, and becomes unaccounted for over time. It may be sensitive information that was once used by legacy applications, or irrelevant information generated by an application. Shadow data can be viewed as a subset of dark data.

Shadow data, on the other hand, is created in two main ways: by shadow IT, which is intentionally developed outside an organization's IT infrastructure to leverage cloud-managed and SaaS applications that DevOps teams, DBAs, and others would not be able to access otherwise; or by over-sharing within an organization. In either case, shadow data is unaccounted-for data that poses the same security risks.

A three-step process for securing shadow data

  1. Visibility: It is essential that your security teams identify every cloud-managed environment and SaaS application in which your organization may have sensitive data. There is no way to apply security controls to data that is stored in repositories that you cannot access.
  2. Discovery and classification of data: Data in all of your repositories must be identified and classified so that security controls can be applied. Discovery and classification capabilities must extend beyond traditional structured data; semi-structured and unstructured data must also be classifiable. You can quickly detect anomalous behavior by consolidating your data repositories into a single source and providing dashboard access that shows what is happening across all data sources.
  3. Control the privileges of data access: Shadow data can only be mitigated by preventing insiders from creating it inadvertently. When it comes to rooting out malicious user behavior, a rigorous analysis of anomalous behavior is very effective. It is possible to baseline typical access for privileged users and send alerts when access deviates from that baseline. It is also possible to use machine learning analytics to determine which data is business-critical and whether privileged users can access it.
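
Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not a production classifier or monitoring tool: the two regular expressions, the label names, and the helper functions (classify_text, build_baseline, find_deviations) are hypothetical and exist only for illustration. The first function tags unstructured text, such as a log line, with the kinds of sensitive data it appears to contain; the other two record which data stores each privileged user normally touches and flag access outside that baseline.

```python
import re

# Hypothetical patterns for illustration only; a real classifier
# needs far broader coverage and validation.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify_text(text):
    """Return the sorted labels of every sensitive pattern found in text."""
    return sorted(
        label
        for label, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(text)
    )


def build_baseline(access_log):
    """Map each user to the set of data stores seen in the baseline period."""
    baseline = {}
    for user, store in access_log:
        baseline.setdefault(user, set()).add(store)
    return baseline


def find_deviations(baseline, new_events):
    """Return the (user, store) events that fall outside the user's baseline."""
    return [
        (user, store)
        for user, store in new_events
        if store not in baseline.get(user, set())
    ]


# Step 2: classify an unstructured log line.
log_line = "2022-10-21 user=jane.doe@example.com ssn=123-45-6789 action=signup"
print(classify_text(log_line))

# Step 3: alert when a privileged user touches a store outside the baseline.
baseline = build_baseline([("dba1", "orders_db"), ("dba1", "billing_db")])
for user, store in find_deviations(
    baseline, [("dba1", "orders_db"), ("dba1", "hr_backup")]
):
    print(f"ALERT: {user} accessed {store}, which is outside the baseline")
```

In practice, the baseline would be built from audited access logs over a representative period, and alerts would feed whatever monitoring system the security team already uses.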

Minimizing the risks associated with shadow data

Data management best practices can mitigate shadow data risks in several ways, including:

  • Understand your data: Maintain a catalog of data assets that are categorized according to their sensitivity and criticality as you scan your workloads continuously. It is important that the data catalog is comprehensive, accessible to all stakeholders, and searchable according to a variety of parameters, such as owner, sensitivity, used by, version, and so on.  
  • Follow your data: A next-generation data catalog should also incorporate visualizations that highlight relationships, flows, and dependencies among data stores in a typical mid-sized to large organization. The ability to map data flows and identify who is interacting with what data will allow you to identify shadow data that is not being used. Storing this data is not only a waste of storage resources; because nobody is watching unused data, it can also pose a cybersecurity threat, such as data exfiltration.
  • Clean your data: A key component of IT hygiene is keeping your data assets lean and mean. In general, each time a developer replicates a data store for testing, or a database is mirrored before an upgrade, the developer or operations engineer should delete the copy once the testing environment is decommissioned or the upgrade has been completed successfully. In practice, however, environments fill up with redundant, incomplete, or low-value data. Ensure that shadow data is regularly disposed of.
  • Protect your data: A risk-based data protection policy and process must include shadow data. Risk assessments, which should be as automated as possible, should take into account the type of shadow data and its location, as well as any compliance requirements related to its sensitivity. After establishing the appropriate guardrails, you should put in place access controls, least-privilege permissions, monitoring for anomalous behavior, threat alerting, and remediation of misconfigurations.
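
As a rough illustration of the "understand" and "clean" practices above, the following Python sketch models a tiny searchable data catalog and flags assets that nobody has accessed recently. Everything in it is an assumption made for the example: the DataAsset fields, the sensitivity labels, the asset names, and the 90-day idle threshold are hypothetical, and a real catalog would be populated by continuous scanning rather than by hand.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DataAsset:
    name: str
    owner: str
    sensitivity: str      # hypothetical labels: "high", "medium", "low"
    last_accessed: date


def search(catalog, **criteria):
    """Return assets whose attributes equal every given criterion."""
    return [
        asset
        for asset in catalog
        if all(getattr(asset, key) == value for key, value in criteria.items())
    ]


def stale_assets(catalog, today, max_idle_days=90):
    """Return assets that have not been accessed for more than max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [asset for asset in catalog if asset.last_accessed < cutoff]


# Hypothetical catalog entries for illustration.
catalog = [
    DataAsset("orders_prod", "payments-team", "high", date(2022, 10, 20)),
    DataAsset("orders_test_copy", "payments-team", "high", date(2022, 3, 1)),
    DataAsset("web_debug_logs", "platform-team", "medium", date(2022, 10, 1)),
]

# Find highly sensitive assets that look abandoned and may need cleanup.
today = date(2022, 10, 21)
for asset in stale_assets(search(catalog, sensitivity="high"), today):
    print(f"Review for cleanup: {asset.name} (owner: {asset.owner})")
```

Searching by owner or sensitivity first and then filtering by idle time keeps the two concerns separate, so the same catalog can answer both "what do we have?" and "what can we delete?".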


By eliminating shadow data and including new data that can be useful, businesses and organizations can significantly improve their analytics, reporting, machine learning, and AI. This, in turn, leads to faster and more intelligent decision-making across the enterprise.

The best way to combat shadow data is to begin by educating your entire staff about the issue.

Educate the people around you about shadow data and why they shouldn't ignore it. If you want to gain a greater understanding of shadow data, engage experts who have demonstrated their ability to manage data-intensive operations successfully.

Featured image by Shubham Dhage on Unsplash

Dave Wells

A tech enthusiast and SaaS marketing expert who has helped many startups grow from the ground up. Dave loves to read about tech and share his findings to help businesses grow.
