What is shadow data?
Shadow data is any data, anywhere in your data repositories, that is not visible to the tools you use to monitor and log data access. Shadow data may include:
- Customer data that DevOps teams copied into an unknown database to test applications they are developing
- Sensitive data that was once used by legacy applications and is no longer visible to your data management system
- Irrelevant output from an application, such as log files containing sensitive data
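One way to make this definition concrete is to treat it as a set difference: candidates for shadow data are the data stores that exist but are missing from the inventory your monitoring tools cover. Below is a minimal Python sketch under that assumption. The store names and the `find_unmonitored` helper are hypothetical; in practice the two lists would come from a discovery scan and from your monitoring configuration.

```python
def normalize(name):
    """Lower-case and trim a store identifier so comparisons are consistent."""
    return name.strip().lower()


def find_unmonitored(discovered, monitored):
    """Return data stores that exist but are not covered by monitoring.

    Both arguments are iterables of store identifiers (hypothetical names).
    The result is a sorted list of normalized identifiers.
    """
    covered = {normalize(name) for name in monitored}
    found = {normalize(name) for name in discovered}
    return sorted(found - covered)


# Hypothetical example data: stores found by a scan vs. stores being logged.
discovered_stores = ["prod-customers", "Test-Customers-Copy", "legacy-billing", "app-logs"]
monitored_stores = ["prod-customers", "app-logs"]

shadow_candidates = find_unmonitored(discovered_stores, monitored_stores)
print(shadow_candidates)
```

Anything this returns is only a candidate: it still has to be inspected to see whether it actually holds sensitive data.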
What’s the relationship between shadow data and shadow IT?
Shadow IT is the deployment and use of systems and applications without the knowledge or consent of an organization’s data security and IT teams. Typically, shadow IT is generated by DevOps teams, DBAs, and others in the organization who have the means and expertise to leverage cloud-managed or software-as-a-service (SaaS) applications to “work around” what they characterize as constraints to innovation that data security and IT teams often present.
In these instances, shadow IT lets DevOps teams, DBAs, and others develop applications faster by using services that run outside the organization’s IT architecture. Those services generate local databases that are hard for data management systems to find. So it’s fair to say that shadow IT is a principal producer of shadow data, but certainly not the only one.
Shadow IT is not the only generator of shadow data
In some cases, the same sloppy organizational practices that make insider threats a major factor in conventional data breach incidents can also inadvertently lead to the creation of shadow data. For example, administrators who cannot keep up with “who’s who” may share documents or data with contractors or with employees who have left the organization. Other times, a user inherits data access privileges they do not need and should not have. Still other times, geography plays a role: some organizations struggle to ensure that the sensitive data for which they are responsible is not stored in ways that violate their own policies or local laws and regulations.
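Access that outlives a person’s role can be checked mechanically by comparing sharing grants against a list of current staff. The Python sketch below illustrates the idea. The `Grant` record, the `find_stale_grants` helper, and the sample users are all hypothetical; real grants would come from your file-sharing or database permission exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """A single access grant: which user can reach which resource."""
    user: str
    resource: str


def find_stale_grants(grants, active_users):
    """Return grants held by users who are not in the active-user list."""
    active = set(active_users)
    return [grant for grant in grants if grant.user not in active]


# Hypothetical example data.
grants = [
    Grant("alice", "customer-db"),
    Grant("bob", "customer-db"),        # bob has left the organization
    Grant("contractor-7", "hr-share"),  # contract has ended
]
active_users = ["alice"]

stale = find_stale_grants(grants, active_users)
for grant in stale:
    print(f"review: {grant.user} still has access to {grant.resource}")
```

A check like this only flags grants for review. Whether a privilege is actually unnecessary is still a human decision.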
How is shadow data different from dark data?
Gartner defines dark data as “any information assets that organizations collect, process, and store during regular business activities but generally fail to use for other purposes.” Organizations often start off keeping dark data to satisfy compliance requirements. It can include past employee or financial information, transaction logs, confidential intelligence data, emails, internal presentations, downloaded attachments, and even surveillance video footage. Dark data is any data generated by a user’s daily digital interactions in the course of general business processes. It is often forgotten, unknown, and unused, and it can exist anywhere across an organization’s complete portfolio of data repositories, from data lakes to applications. For more on dark data and why managing it effectively is critical to overall data security management, check out this post.
The principal difference between dark data and shadow data is that dark data is generated within an organization’s IT infrastructure in the course of regular digital business operations, serves no other purpose, and becomes unaccounted for over time. Dark data may be seen as a subset of shadow data. For example, sensitive data that was once used for legacy applications and irrelevant output from an application are both dark data and shadow data. By contrast, shadow data is generated in two principal ways: purposely, outside an organization’s IT infrastructure, by shadow IT that leverages cloud-managed and SaaS applications DevOps teams, DBAs, and others would not otherwise be able to use; or inadvertently, through organizational over-sharing. Either way, shadow data is unaccounted-for data and presents the same security risks.
Three keys to securing shadow data
- Visibility is key. The overarching goal must be for your security teams to identify every cloud-managed environment and SaaS application in which your organization may have sensitive data. You cannot apply security controls to data in repositories you cannot see.
- Data discovery and classification. You must be able to identify the data in all your repositories and classify the sensitive data so you can apply security controls to it. Discovery and classification capabilities must extend beyond traditional structured data; you also need to be able to classify semi-structured and unstructured data. The best way to do this is to consolidate your data repositories into a single source with a dashboard that shows activity across all data sources, so you can quickly detect anomalous behavior.
- Control data access privileges. This is really the only way to mitigate the part that insiders inadvertently play in creating shadow data. A rigorous analysis of anomalous behavior is also very effective at rooting out malicious user activity. Machine learning algorithms can baseline typical access for a privileged user and send alerts on deviations from that behavior. Machine learning analytics can also learn what data is business-critical and check whether a privileged user can access that data.
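The last two keys can be illustrated with a small Python sketch. It is a simplified stand-in, not a description of any product: classification here uses two regular expressions plus a Luhn checksum rather than a full classifier, and the "baseline" is a plain mean and standard deviation rather than a machine learning model. The function names, labels, threshold, and sample values are all assumptions chosen for illustration.

```python
import re
from statistics import mean, stdev

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(candidate):
    """Check a digit string against the Luhn checksum used by payment cards."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_text(text):
    """Return the set of sensitive-data labels found in unstructured text."""
    labels = set()
    if EMAIL_RE.search(text):
        labels.add("email")
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        labels.add("payment_card")
    return labels


def is_anomalous(history, today, threshold=3.0):
    """Flag today's access count if it deviates strongly from the baseline.

    history: past daily access counts for one user (the baseline).
    threshold: number of standard deviations treated as a deviation.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return today != baseline
    return abs(today - baseline) / spread > threshold


# Hypothetical usage.
print(classify_text("Contact jane@example.com, card 4111 1111 1111 1111"))
daily_reads = [100, 110, 95, 105, 90, 102, 98]
print(is_anomalous(daily_reads, 104))
print(is_anomalous(daily_reads, 5000))
```

Pattern matching of this kind produces false positives and misses many data types, and a fixed threshold ignores seasonality, so treat both functions as a starting point for experimentation rather than a control you can rely on.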
Digital transformation and the drive to leverage cloud-managed development environments in spite of security concerns, coupled with the dramatic changes in the way people work, have generated volumes of shadow data that the average organization is not prepared to manage effectively and keep secure. The solutions are straightforward but difficult to implement enterprise-wide. Work with experts with a proven track record of success with some of the largest, most data-heavy organizations in the world. If you would like to know more, please contact us for a chat.