With the rise of LLMs and agents, companies are increasingly leveraging sensitive data to train and test their models. Even when teams have the best intentions to respect confidentiality, sensitive fields can easily slip into training corpora, evaluation sets, or prompt libraries, especially when teams are moving quickly to stand up new AI use cases.
Synthetic data offers a practical solution: generated by algorithms, it is designed to mirror real-world datasets without reproducing actual records. Used correctly, it enables the fine-tuning of AI models, large-scale evaluation, and data curation for agents, while reducing privacy risks.
However, synthetic data is not a silver bullet. Poorly generated, it can still disclose sensitive information, for example when it retains rare attribute combinations or mirrors real records too closely. To be truly effective, synthetic data must be treated as an engineering discipline with explicit controls, not as an afterthought. Organizations must first define the purpose(s) for which they need the data, since that purpose determines how it should be generated. Synthetic data cannot universally replace real data, and it does not eliminate the need for governance.
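The rare-combination risk can be made concrete with a simple frequency check. The sketch below is illustrative, not a production privacy test: the records, quasi-identifier fields, and threshold `K` are all hypothetical. It flags synthetic rows that exactly reproduce a real record whose combination of quasi-identifiers appears fewer than `K` times in the real data, which is the kind of row most likely to re-identify someone.

```python
from collections import Counter

# Hypothetical records: each tuple is a set of quasi-identifiers
# (age band, ZIP prefix, occupation). Purely illustrative data.
real = [
    ("30-39", "750", "nurse"),
    ("30-39", "750", "nurse"),
    ("40-49", "751", "teacher"),
    ("50-59", "752", "surgeon"),
]
synthetic = [
    ("30-39", "750", "nurse"),
    ("50-59", "752", "surgeon"),  # exact copy of a rare real record
    ("40-49", "750", "teacher"),  # differs from every real record
]

K = 2  # assumed threshold: combinations seen fewer than K times are "rare"

def rare_combinations(records, k=K):
    """Return quasi-identifier combinations appearing fewer than k times."""
    counts = Counter(records)
    return {combo for combo, n in counts.items() if n < k}

def leaked_records(synthetic, real):
    """Synthetic rows that exactly reproduce a rare real record."""
    rare_real = rare_combinations(real)
    return [row for row in synthetic if row in rare_real]

print(leaked_records(synthetic, real))
# → [('50-59', '752', 'surgeon')]
```

Real-world checks are more involved (near-matches, continuous attributes, formal metrics such as k-anonymity or differential-privacy guarantees), but even a screen this simple catches verbatim memorization of outliers.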
On Data Privacy Day, I invite companies to consider synthetic data as a lever for secure innovation, provided it is properly generated, supervised, and integrated into robust governance that protects confidentiality throughout the AI lifecycle.
