Here’s the dilemma: AI is only as smart as the data on which it is trained, but feeding AI models with real data brings massive security and privacy risks. Yet, this is precisely what many organisations are doing in the race to become as AI-focused as their competitors.
There are anecdotal reports of customer information being fed into AI models without any governance applied, risking breaches, theft, or regulatory penalties. Fortunately, there are other ways to train AI with useful information, but why is real data being used in the first place?
First, some teams are using real data because they want the fastest route possible to insight. They are often under pressure from senior management to connect AI to data repositories, whether that’s to build apps faster, gain better customer insight, or address consumer demand for more personalisation. Yet, those teams may not have been equipped with the necessary tools or guardrails, so they begin incorporating real data into public AI models.
A second driver is the knowledge gap. Many teams do not fully understand the security and compliance risks of feeding real data into generative AI models. Even when the risks are recognised, preparing safe and effective training data requires deep expertise in data relationships, model architectures, and governance controls: skills that still largely reside with specialised data science teams.
The risk is compounded by the fact that once sensitive data is part of a model’s training set, it cannot be selectively removed without retraining. Generative AI systems can reproduce memorised data when prompted in certain ways: the sensitive information may be obscured by vast amounts of other training material, but it still exists within the model and can surface unexpectedly.
The Risks Are Very Real
Nor are the risks just theoretical: there have been reports of a CEO’s salary being exposed, and of code, medical records, secret keys and passwords being injected into AI models, all through the use of real data. The Perforce 2024 State of Data Compliance and Security Report found that 54% of organisations had already experienced a data breach or theft involving sensitive data in non-production environments.
Arguably, it is only a matter of time until there is an incident with profound real-world implications, unless organisations stop training AI models on real data. Instead, organisations need ways to train AI on the genuine nuances and logic of the business without exposing the data behind that logic.
The Solution: Educate, Categorise, Disguise
As organisations dismantle data silos to supply generative AI with datasets, governance must evolve to keep pace. Training AI with personal identifiers or other sensitive data must be explicitly prohibited and enforced through policy and technical controls.
Traditional column-level classifications such as social security numbers or customer names are no longer enough; enterprises must expand their definitions of sensitive data to cover all formats and locations, from structured databases to unstructured files.
Effective governance starts with a comprehensive inventory of where data originates, how it moves, and in what forms it exists, ensuring that no sensitive information slips into AI training pipelines unnoticed.
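To make the idea concrete, below is a minimal Python sketch of what expanding classification beyond fixed columns can look like: a simple regex scan applied to both structured records and free text. The patterns, column names and sample values here are illustrative assumptions only; real discovery and classification tooling covers far more data formats, locales and storage locations.

```python
import re

# Hypothetical patterns: a real classifier would cover far more formats,
# locales, and document types than this sketch does.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_text(text: str) -> list[str]:
    """Return the sensitive-data categories found in a block of text."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]

def classify_records(records: list[dict]) -> dict[str, list[str]]:
    """Flag which columns in structured records contain sensitive values."""
    findings: dict[str, set[str]] = {}
    for row in records:
        for column, value in row.items():
            for category in classify_text(str(value)):
                findings.setdefault(column, set()).add(category)
    return {col: sorted(cats) for col, cats in findings.items()}

if __name__ == "__main__":
    rows = [{"name": "Jane Doe", "contact": "jane@example.com", "ssn": "123-45-6789"}]
    print(classify_records(rows))                       # {'contact': ['email'], 'ssn': ['ssn']}
    print(classify_text("Card: 4111 1111 1111 1111"))   # ['credit_card']
```

A scan like this, run across databases and file shares alike, is the kind of inventory that keeps sensitive information from slipping into AI training pipelines unnoticed.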
Consider using synthetic data and static data masking, as these methods provide teams with the realistic data they need without exposing sensitive information. Synthetic data, which is artificially generated to mimic the structure and patterns of real data, can help AI learn connections and behaviours, and accelerate software development and testing. It can also be created quickly and at scale.
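As a rough illustration of how synthetic data can be produced quickly and at scale, here is a minimal Python sketch using only the standard library. The field names, value ranges and name lists are invented for the example; dedicated synthetic data tools model real distributions and cross-table relationships far more faithfully.

```python
import random
import uuid
from datetime import date, timedelta

random.seed(42)  # reproducible output for testing

FIRST_NAMES = ["Alex", "Sam", "Priya", "Chen", "Maria"]
LAST_NAMES = ["Green", "Okafor", "Novak", "Silva", "Tanaka"]
MERCHANT_CATEGORIES = ["groceries", "travel", "utilities", "dining"]

def synthetic_customer() -> dict:
    """Generate a customer record that looks real but refers to no real person."""
    return {
        "customer_id": str(uuid.uuid4()),
        "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
        "signup_date": str(date(2023, 1, 1) + timedelta(days=random.randint(0, 730))),
    }

def synthetic_transactions(customer: dict, count: int = 5) -> list[dict]:
    """Generate plausible transactions tied to a synthetic customer."""
    return [
        {
            "customer_id": customer["customer_id"],
            "amount": round(random.uniform(2.50, 400.00), 2),
            "category": random.choice(MERCHANT_CATEGORIES),
        }
        for _ in range(count)
    ]

if __name__ == "__main__":
    customer = synthetic_customer()
    for tx in synthetic_transactions(customer, count=3):
        print(tx)
```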
On the other hand, static data masking uses real data, but disguises it in a way that renders it useless if accidentally exposed by the AI model. Additionally, referential integrity is maintained across tables, applications, and even clouds.
For instance, when building a financial app, knowing Steve Karam’s transaction details is useless to everyone except the development team if the account is in the name of John Green. Plus, static data masking is irreversible, so there is no risk of it being rolled back to the original version.
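To illustrate the principle (not any particular vendor’s implementation), here is a minimal Python sketch of deterministic, irreversible masking that preserves referential integrity: the same real value always masks to the same fake value, so joins across tables still work, while a keyed hash means the original cannot be recovered from the masked output. The key, table shapes and fake name list are hypothetical.

```python
import hmac
import hashlib

# Hypothetical masking key: in practice this would live in a secrets store,
# not in source code.
MASKING_KEY = b"rotate-me-outside-of-source-control"

FAKE_NAMES = ["John Green", "Ana Lopez", "Ravi Patel", "Mia Chen"]

def mask_value(value: str) -> str:
    """Irreversibly mask a value with a keyed hash (no way back to the original)."""
    return hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_name(value: str) -> str:
    """Map a real name to a consistent fake name, preserving referential integrity."""
    index = int(mask_value(value), 16) % len(FAKE_NAMES)
    return FAKE_NAMES[index]

if __name__ == "__main__":
    accounts = [{"account_id": "ACC-1001", "holder": "Steve Karam"}]
    transactions = [{"account_id": "ACC-1001", "holder": "Steve Karam", "amount": 42.50}]

    masked_accounts = [
        {**row, "account_id": mask_value(row["account_id"]), "holder": mask_name(row["holder"])}
        for row in accounts
    ]
    masked_transactions = [
        {**row, "account_id": mask_value(row["account_id"]), "holder": mask_name(row["holder"])}
        for row in transactions
    ]

    # The same input always masks to the same output, so joins still work
    # across tables even though the real values are gone.
    assert masked_accounts[0]["account_id"] == masked_transactions[0]["account_id"]
    print(masked_accounts[0]["holder"])  # e.g. "John Green"
```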
Most organisations will likely use a blend of synthetic and masked data, with real data coming into play only at runtime. As is often the case, overcoming resistance to yet another set of tools and processes is crucial, so look for synthetic data and masking solutions that simplify and automate the process.
Everyone is still on an AI learning curve, but regardless of where they are with their exploration and adoption, protecting real data must be the universal priority. So now is the time for security professionals to help their colleagues in other functions understand the risks and put strategies, techniques and tools in place, removing the need or temptation to use real data altogether.
Written by
Steve Karam
Head of Product, AI and SaaS
Perforce