Gartner defines dark data as “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes such as analytics, business relationships, and direct monetizing.” This data takes many forms, but it is often simply text documents, spreadsheets, presentations, and the like. Organizations accumulate this data for many reasons, such as regulatory and legal compliance, but just as often it’s because they lack the policies or the means to identify and manage it.
On the surface, the accumulation of this data may not seem like much of an issue. After all, what’s the problem with having a bunch of old, unused files sitting on an array or server? Unfortunately, the problems and risks are plentiful.
Many companies retain data for compliance or regulatory requirements. Regulations such as HIPAA, SOX, GDPR, CCPA, and others govern how particular types of data like PII and PHI are stored and/or how long they are retained. Without realizing it, companies may be at substantial risk of violations or fines due to the contents of their dark data. Moreover, retaining some data longer than a particular requirement allows may constitute a violation in and of itself.
As with most things, the more you have, the more you must protect, and the same is true of an organization’s data. When it comes to security breaches, the concern isn’t just whether you are meeting regulatory or legal requirements, but what attackers might be able to do with your dark data! Does it contain sensitive intellectual property, or perhaps passwords to confidential files or systems? It’s imperative to keep as little of this data as possible in places where attackers might have the easiest access to it.
While there are potential costs in the form of fines or legal liability associated with dark data, the numbers can be staggering when you consider the total costs of storing and managing it. Consider the fact that most of this data is:
- Backed up at least once (if not several times) depending on the backup methodology
- Replicated for business continuity and disaster planning
- Stored on expensive storage arrays and servers in both primary and backup data centers
- Managed by many levels of IT personnel including systems engineers, administrators, and data center operators
The point here is that independent of the contents of an organization’s dark data, the costs of simply storing and protecting that data can be substantial. Remember, this is data that is infrequently (if ever) accessed and may be a substantial liability even though nobody is using it.
Now that we’ve identified that dark data can be problematic on many different levels, let’s examine some ways companies can address this problem. At Data Dynamics, we recommend a three-stage approach: discover, analyze, and archive.
In the discovery stage, we leverage StorageX’s Analytics module to scan selected shares or exports and log each file’s metadata. In addition to common metadata fields like owner, creation time, access time, and whether the file is compressed or encrypted, in many cases StorageX can also log the platform’s operating system version, the filesystem type (e.g., on NetApp, a flexible volume), and the hostname of the array, among other details.
During this phase, StorageX also provides the ability to apply custom tags to the data scans. For example, let’s say you know the shares you intend to scan belong to the Finance department in the NYC office. When you create the scan, you can apply tags like “Department” is “Finance” and “Office Location” is “NYC.” The real value of these custom tags becomes apparent in the next stage.
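To make the idea of a metadata scan with custom tags concrete, here is a minimal sketch in plain Python. This is only an illustration of the concept, not StorageX’s implementation; the field names, tag values, and `scan_share` function are assumptions for the example.

```python
import json
import os
import tempfile

# Hypothetical discovery scan: walk a share, record common metadata fields for
# each file, and attach the same custom tags to every record in the scan.
def scan_share(root, tags):
    records = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            records.append({
                "path": path,
                "size_bytes": st.st_size,
                "owner_uid": st.st_uid,    # resolve to a username via pwd on POSIX
                "last_accessed": st.st_atime,
                "tags": dict(tags),        # e.g. {"Department": "Finance"}
            })
    return records

# Demo on a throwaway directory containing one sample file.
with tempfile.TemporaryDirectory() as share:
    with open(os.path.join(share, "budget.xlsx"), "w") as f:
        f.write("sample")
    scan = scan_share(share, {"Department": "Finance", "Office Location": "NYC"})

print(json.dumps(scan, indent=2))
```

A real scanner would also capture ownership names, encryption/compression flags, and platform details, but the shape of the output, a list of per-file records carrying shared tags, is the key idea.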
Once the data has been identified, StorageX can be used to further classify and analyze it. StorageX lets you query the metadata fields collected and the custom tags applied in the discovery stage. These queries can answer questions like: how many files have not been accessed in over three years and/or no longer have an identifiable owner? It is common for customers to find that 50-70% of their scanned data matches a query like this, and in some cases we’ve seen over 90%!
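A query like that can be sketched in a few lines of Python. The record field names (`last_accessed`, `owner`) and the sample data below are illustrative assumptions, not StorageX’s actual schema:

```python
import time

THREE_YEARS = 3 * 365 * 24 * 3600  # seconds

# Hypothetical analysis query: flag files not accessed in over three years
# or with no resolvable owner.
def find_dark_candidates(records, now=None):
    now = time.time() if now is None else now
    return [
        r for r in records
        if (now - r["last_accessed"]) > THREE_YEARS or r.get("owner") is None
    ]

records = [
    {"path": "/finance/2015_budget.xls", "last_accessed": 0.0, "owner": None},
    {"path": "/finance/q3_report.xlsx", "last_accessed": time.time(), "owner": "alice"},
]
candidates = find_dark_candidates(records)
print(f"{len(candidates)} of {len(records)} files look dark "
      f"({100 * len(candidates) // len(records)}%)")
# → 1 of 2 files look dark (50%)
```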
Once you’ve discovered and analyzed the dark data, you can act on it using StorageX. While StorageX can migrate data to less expensive storage tiers (e.g., from flash drives to magnetic disks), customers benefit most from migrating infrequently accessed data to an object store, given its substantially lower cost. StorageX natively supports the largest on-premises and cloud-based object store providers. Through an easy-to-use interface, customers can take the results of their data analysis and archive that data to an object store, either immediately or on a schedule. The data is stored natively via the S3 and Azure Blob protocols, which means it remains fully owned by the client and can be accessed directly through the S3 and Azure APIs. In addition, all the metadata fields and custom tags created in the analysis stage are stored in the object store.
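Because the data lands in a standard S3-compatible store, the archive step can be approximated with the standard AWS SDK. This is a hedged sketch of the general pattern, not StorageX’s implementation; the record fields and the `archive_file` helper are assumptions for illustration.

```python
import json

# Hypothetical archive step: upload a file to an S3 bucket with its custom
# tags preserved as object metadata, and store the full scan record alongside
# it as JSON so it can be queried later.
def archive_file(record, bucket, s3=None):
    if s3 is None:
        import boto3  # AWS SDK for Python; works with any S3-compatible endpoint
        s3 = boto3.client("s3")  # pass endpoint_url=... for on-premises stores
    key = record["path"].lstrip("/")
    s3.upload_file(
        record["path"], bucket, key,
        ExtraArgs={"Metadata": {k: str(v) for k, v in record["tags"].items()}},
    )
    # Keep the full metadata record next to the object for later queries.
    s3.put_object(
        Bucket=bucket,
        Key=key + ".metadata.json",
        Body=json.dumps(record).encode("utf-8"),
    )
    return key
```

For example, `archive_file(record, "dark-data-archive")` would place both the file and its metadata sidecar in a hypothetical `dark-data-archive` bucket, where any S3 client can retrieve them.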
At any future time, customers can use any S3- or Blob-compliant application to query and retrieve files from the object store. Additionally, since the enhanced metadata is stored in JSON format, customers can use analytics or BI tools to analyze it.
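Because the metadata is plain JSON, even a few lines of standard-library Python can aggregate it. The record fields and tag names below are illustrative assumptions about what such metadata might look like, not an actual StorageX export:

```python
import json
from collections import Counter

# Hypothetical archived metadata records (field names are assumptions).
metadata_json = """
[
  {"path": "/finance/2015_budget.xls", "size_bytes": 204800,
   "tags": {"Department": "Finance", "Office Location": "NYC"}},
  {"path": "/hr/old_reviews.doc", "size_bytes": 51200,
   "tags": {"Department": "HR", "Office Location": "NYC"}},
  {"path": "/finance/q1_2014.xlsx", "size_bytes": 102400,
   "tags": {"Department": "Finance", "Office Location": "NYC"}}
]
"""

# Tally archived bytes per department tag.
bytes_by_dept = Counter()
for record in json.loads(metadata_json):
    bytes_by_dept[record["tags"]["Department"]] += record["size_bytes"]

for dept, total in bytes_by_dept.most_common():
    print(f"{dept}: {total // 1024} KiB archived")
# → Finance: 300 KiB archived
#   HR: 50 KiB archived
```

The same records could just as easily feed a BI dashboard or a SQL engine that reads JSON.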
So with those three easy steps – Discover, Analyze, and Archive – you too can defeat the dark data lurking within your environment!