Data classification: What, why and who provides it

You need to know where your data is, what it is, its governance requirements and its relationship to the rest of your data. We look at data classification and how AI can help.

Stephen Pritchard

Published: 30 Oct 2024

When it comes to managing data, we need to know where it is – but we also need to know what it is.

With the rise in regulatory controls, enterprises now pay more attention to data sovereignty, especially when it comes to data in the cloud, but knowing exactly what information they hold is equally important.

This concept – data classification – is not new. But with the growth of unstructured data in particular, having a clear picture of all data assets is essential. Increasingly, firms look to artificial intelligence (AI) tools to help with this.

What is data classification and why do we need it?

Organisations have long organised data by function or “descriptive classifier”, such as whether it is an HR file or sales records. They then categorise by sensitivity, also known as a control requirement. Then there is context-based information, such as when and where data was created, and technical attributes such as file type or size.
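
As a rough illustration, those four dimensions could sit together in a single record per file. The sketch below is a minimal Python example; the class name, field names, sensitivity levels and sample values are all assumptions made for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Sensitivity(Enum):
    # Hypothetical levels; real schemes vary by organisation and regulator.
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass
class ClassificationRecord:
    path: str
    descriptive_classifier: str  # business function, e.g. "hr" or "sales"
    sensitivity: Sensitivity     # the control requirement
    created_at: datetime         # context: when the data was created
    created_where: str           # context: where the data was created
    file_type: str               # technical attribute
    size_bytes: int              # technical attribute


# Made-up example entry for an HR contract.
record = ClassificationRecord(
    path="/shares/hr/contracts/example.pdf",
    descriptive_classifier="hr",
    sensitivity=Sensitivity.CONFIDENTIAL,
    created_at=datetime(2024, 3, 1, 9, 30),
    created_where="london-office",
    file_type="pdf",
    size_bytes=182_044,
)
```

Keeping sensitivity as a separate field from the descriptive classifier means an HR file and a sales file can share the same control requirement without sharing a category.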

Lower-cost cloud storage lets organisations store more data for longer and use it for business intelligence, which increasingly means training AI models.

But that data must be organised well so that it is not hard to find and use. Protecting that data is also vital. Data governance and data stewardship depend on effective data classification. Data storage is also less efficient unless the business has a solid data classification plan.

Manual data classification, while possible, is inefficient, unreliable and hard to scale. Although organisations can create policies that require users to classify data by adding labels, tags or keywords, this really only works for the broadest classifications – such as sensitivity – and for newly created files.

As organisations bring in more data from external sources such as web applications, customers and the internet of things, effective data classification really needs to be automated. Data classification is a key part of data lifecycle management and is essential for data security.

Data classification tools

As analysts Gartner point out, manual data classification can lead to misclassification due to human error. Also, labels and tags are “one dimensional” and “do not provide sufficient context for increasing regulatory data controls”. They fail to capture context and are usually static. Data can also be used for different purposes during its lifecycle.

Automation solves some of this by adding context, as well as looking at the content of the data, its location and adjacent documents. According to Gartner, standard classification tools work well with standard data types and in organisations that already have well-formatted data. The task becomes harder as organisations make more use of unstructured data.
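
A very simple form of this can be sketched with pattern matching on content plus hints from a file's location. The Python below is an illustration only: the patterns, folder names, tag names and sensitivity labels are assumptions, and commercial tools use far richer detectors than two regular expressions.

```python
import re

# Hypothetical content patterns, deliberately loose.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

# Hypothetical mapping from folder names to a business function.
LOCATION_HINTS = {"hr": "hr", "finance": "finance", "sales": "sales"}


def classify(path: str, text: str) -> dict:
    """Derive tags from content and a business function from location."""
    tags = set()
    if EMAIL.search(text):
        tags.add("contains_email")
    if CARD_LIKE.search(text):
        tags.add("contains_card_like_number")

    # Use the first folder in the path that matches a known hint.
    parts = [p.lower() for p in path.split("/") if p]
    function = next(
        (LOCATION_HINTS[p] for p in parts if p in LOCATION_HINTS), "unknown"
    )

    # Assumption: any content match raises the sensitivity.
    sensitivity = "confidential" if tags else "internal"
    return {"function": function, "sensitivity": sensitivity, "tags": sorted(tags)}


result = classify("/shares/hr/notes.txt", "Contact jane.doe@example.com")
```

Here the folder name supplies the context a static label lacks, while the content scan catches sensitive values regardless of where the file sits.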

Increasingly, vendors are using machine learning to look into datasets and documents, to discover elements they can identify, record and track. But, as Gartner notes, their performance can be limited when it comes to handling proprietary data.

Nonetheless, the market offers a range of data classification tools, from standalone applications to those integrated into databases or enterprise applications, especially business intelligence. These are sometimes described as enterprise data catalogues.

Another approach is to bundle classification and cataloguing as part of wider enterprise data governance and compliance applications. Unsurprisingly, vendors are now looking to integrate AI into their tools, to improve accuracy and reduce the need for manual tagging.

AI input, data outputs

Data classification is a natural application for artificial intelligence. Vendors have used machine learning in data cataloguing tools for a while. It is not a use case that relies on generative AI (GenAI) or large language models (LLMs), although some tools now use them.

Some tool vendors use machine learning techniques such as neural networks, decision trees and logistic regression. These are used to train AI models to find patterns in data, especially unstructured data. The models can then be used to apply automated tagging to the data.
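
To make the idea concrete, the sketch below trains a toy logistic regression model on word counts and uses it to tag new text. It is written from scratch in plain Python for illustration and does not represent any vendor's method; the function names, training documents, labels and tag names are all invented.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace; real tools use richer features."""
    return text.lower().split()


def score(weights, bias, text):
    """Linear score: bias plus the weighted count of each known word."""
    counts = Counter(w for w in tokenize(text) if w in weights)
    return bias + sum(weights[w] * n for w, n in counts.items()), counts


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(docs, labels, epochs=200, lr=0.5):
    """Fit word weights with plain stochastic gradient descent."""
    weights = {w: 0.0 for d in docs for w in tokenize(d)}
    bias = 0.0
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            z, counts = score(weights, bias, doc)
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= lr * error
            for w, n in counts.items():
                weights[w] -= lr * error * n
    return weights, bias


def auto_tag(model, text):
    """Tag text as 'sensitive' when the predicted probability is at least 0.5."""
    weights, bias = model
    z, _ = score(weights, bias, text)
    return "sensitive" if sigmoid(z) >= 0.5 else "general"


# Made-up training set: 1 = sensitive, 0 = not sensitive.
docs = [
    "employee salary review and payroll details",
    "payroll bank account for new employee",
    "quarterly sales brochure for customers",
    "public product brochure and price list",
]
labels = [1, 1, 0, 0]
model = train(docs, labels)
```

With only four training documents the model simply learns which words go with which label, which is why testing and refining against an organisation's own data matters before deployment.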

Customers can then test and refine models before deployment. This is important because customer datasets differ and an out-of-the-box tool might not understand the specifics of that customer’s data or the relationship between different data within the organisation. An effective AI model can be used to enrich the metadata associated with a file or document.

The metadata can then be used to create a catalogue of enterprise data and, in turn, more effective controls. A further advantage of automated and AI-based systems is that they are dynamic. If the enterprise reclassifies data – due to regulatory changes, for example – the data classification tool should be able to update the catalogue on the fly.
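
One way to picture that dynamic behaviour is to store tags in the catalogue and derive sensitivity from a separate policy, so a policy change only requires a recalculation. The following Python is a minimal sketch under that assumption; the level names, tag names, default level and catalogue layout are hypothetical.

```python
# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted"]
DEFAULT_LEVEL = "internal"  # assumption: fallback for unknown or missing tags


def apply_policy(catalogue, policy):
    """Recompute each entry's sensitivity as the highest level among its tags."""
    for entry in catalogue:
        levels = [policy.get(tag, DEFAULT_LEVEL) for tag in entry["tags"]]
        levels = levels or [DEFAULT_LEVEL]
        entry["sensitivity"] = max(levels, key=LEVELS.index)
    return catalogue


catalogue = [
    {"path": "crm/contacts.csv", "tags": ["email_address"]},
    {"path": "web/press-release.txt", "tags": []},
]

# Original policy, then a stricter one after an imagined regulatory change.
policy_v1 = {"email_address": "internal"}
policy_v2 = {"email_address": "confidential"}

apply_policy(catalogue, policy_v1)
before = catalogue[0]["sensitivity"]
apply_policy(catalogue, policy_v2)
after = catalogue[0]["sensitivity"]
```

Because the files themselves are never relabelled, the update touches only the catalogue, which is what makes reclassification cheap compared with manual tagging.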

The metadata and catalogue can then be used for data retention and in security and data loss prevention tools, as well as to meet rules for data residency. This is hard to do with unstructured data, but solid data management is vital for business intelligence and AI development.
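
As an illustration of how catalogue metadata might feed such checks, the Python sketch below flags entries that have outlived a retention period or sit in a region their sensitivity does not allow. The retention periods, region names and entry fields are invented for the example and do not reflect any specific regulation or product.

```python
from datetime import date

# Hypothetical retention periods in days, keyed by business function.
RETENTION_DAYS = {"hr": 365 * 7, "sales": 365 * 3}

# Hypothetical residency rules: regions allowed per sensitivity level.
ALLOWED_REGIONS = {
    "restricted": {"eu-west"},
    "confidential": {"eu-west", "uk-south"},
}


def past_retention(entry, today):
    """True if the entry is older than the retention period for its function."""
    limit = RETENTION_DAYS.get(entry["function"])
    return limit is not None and (today - entry["created"]).days > limit


def residency_violation(entry):
    """True if the entry is stored outside the regions its sensitivity allows."""
    allowed = ALLOWED_REGIONS.get(entry["sensitivity"])
    return allowed is not None and entry["region"] not in allowed


entries = [
    {"path": "hr/old-contract.pdf", "function": "hr", "sensitivity": "confidential",
     "region": "us-east", "created": date(2015, 1, 1)},
    {"path": "sales/leads.csv", "function": "sales", "sensitivity": "internal",
     "region": "us-east", "created": date(2023, 6, 1)},
]

today = date(2024, 10, 30)
to_review = [e["path"] for e in entries if past_retention(e, today)]
misplaced = [e["path"] for e in entries if residency_violation(e)]
```

Both checks rely only on metadata, so they can run across a whole catalogue without opening the underlying files.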

Key data classification providers

Microsoft provides AI-based data classifiers through its Purview product. These, it says, are pre-trained on business data, Microsoft domain knowledge and synthetic data. Purview is a wider data governance, compliance and risk management service that runs on Azure.

IBM offers its Knowledge Catalog for data classification and management using AI and ML. It runs as a SaaS application, or in IBM’s Cloud Pak for Data. IBM uses LLMs for metadata enrichment.

SAP’s Document Classification tool was retired in 2023 and replaced by its generative AI-based Document Information Extraction service.

Oracle Cloud Infrastructure provides “metadata harvesting” from cloud-based sources, and OCI Data Catalog for on-premise and private networks.

Google Cloud’s data classification options include Data Catalog, which builds data asset inventories from Google Cloud sources including BigQuery and its AI offerings, from cloud storage, and from custom data sources through an API.

AWS has the Glue Data Catalog, which includes automated data discovery.

There is also a wide range of specialist data and analytics platforms that provide data classification and management, either directly or as part of business and data intelligence platforms. These include Alation, Ataccama, Atlan, Collibra, Databricks (through its Unity Catalog), Qlik and Tableau, as well as data stalwart Informatica and data security vendor Varonis.