Fair synthetic data
Fair synthetic data can make machine learning models implicitly fair with respect to whatever fairness definition was enforced during generation. Existing methods, however, often face limitations in the diversity and quality of the synthetic data they produce, which compromises both fairness and overall model accuracy, and synthetic data generated naively will simply reflect the biases of the original data. Generating fair synthetic data from unfair data, while remaining truthful to the underlying data-generating process (DGP), is therefore non-trivial.

One line of work provides a novel model for fair data generation based on probabilistic graphical models and characterizes the desiderata a sampling approach must satisfy to generate justifiably fair data: from a biased dataset D, the goal is to learn a graph structure G that minimizes biased pathways (Footnote 1) while preserving the relationships among the remaining variables. Other work uses GANs for fair data generation, proposing pipelines that turn biased training data into fairer synthetic data so that downstream models trained on it inherit the fairness properties. Our approach additionally demonstrates that applying pre-processing fairness algorithms after the synthetic data is generated is also effective. Interest in the area is growing: leading venues now host workshops on responsible AI and on synthetic data for privacy, MOSTLY AI has presented empirical results on fair US census data, and alongside the submission of a proposal for synthetic data standards, regulatory initiatives seek to advance the concept of fair synthetic data and to support regulators in understanding and evaluating this technology.

Figure 2: Comparison of univariate and bivariate statistics for original and synthetic data.
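To make the "biased pathways" idea concrete, here is a minimal sketch of pruning every directed path from a sensitive attribute to the target in a causal DAG before sampling. The toy graph, variable names, and the rule of cutting the final edge of each biased path are illustrative assumptions, not the procedure of any specific paper.

```python
def find_paths(graph, start, goal, path=None):
    """Return all directed paths from start to goal in an edge-dict DAG."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # DAG, but guard against cycles anyway
            paths.extend(find_paths(graph, nxt, goal, path))
    return paths

def prune_biased_edges(graph, sensitive, target):
    """Drop the edge into `target` on every sensitive -> target path."""
    pruned = {u: list(vs) for u, vs in graph.items()}
    for path in find_paths(pruned, sensitive, target):
        u = path[-2]
        if target in pruned.get(u, []):
            pruned[u].remove(target)
    return pruned

# Toy graph: sex -> occupation -> income, and a direct edge sex -> income.
g = {"sex": ["occupation", "income"], "occupation": ["income"], "income": []}
fair_g = prune_biased_edges(g, "sex", "income")
print(find_paths(fair_g, "sex", "income"))  # -> []: no biased pathway remains
```

Because every sensitive-to-target path ends in some edge into the target, one pass over the paths removes them all while leaving unrelated edges (here, sex -> occupation) intact.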
DECAF is a representative causally-aware approach: it uses a Wasserstein Generative Adversarial Network, conditioned on a causal graph, to produce synthetic data with high fidelity.

@inproceedings{kyono2021decaf,
  title     = {DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks},
  author    = {van Breugel, Boris and Kyono, Trent and Berrevoets, Jeroen and van der Schaar, Mihaela},
  year      = 2021,
  booktitle = {Conference on Neural Information Processing Systems (NeurIPS) 2021}
}

Data that merely looks unbiased is not enough: truly fair synthetic data are carefully constructed so that downstream classifiers trained on them make fair predictions. Previous literature has focused on incorporating fairness constraints directly during the generation of synthetic data [46, 51], but applying pre-processing fairness algorithms after the synthetic data has been generated also works, and we empirically found that models developed on such synthetically generated fair data are fairer and in some cases perform better.

Fairness is not the only concern. Algorithms learn rules and associations from whatever training data they are exposed to; synthetic data can leak information about the data it was derived from and is vulnerable to privacy attacks; and the factuality and fidelity of synthetic data must be ensured as well (Wood et al., 2017). Prior work has addressed these data-sharing concerns separately: differentially private synthetic data [3-8], generating data with reduced bias [9-14], and combinations of the two. PreFair, a system that allows for differentially private (DP) fair synthetic data generation, sits in this last category. Even so, most existing research achieves either privacy-preserving realistic synthetic data generation or fair realistic generation, but not both.
Commercial tools take a pragmatic route. MOSTLY AI's generator supports statistical parity as a target: you pick a specific column for fairness (for example, income) and the generator removes biases based on other sensitive columns in your dataset, such as race, sex, age, or any other attribute you define as sensitive. Adding this fairness constraint expands the objective of the software from generating accurate and private synthetic data to generating accurate, private, and fair synthetic data. (See Part 4, "Tackling AI Bias At Its Source - With Fair Synthetic Data", and Part 5, "Diving Deep Into Fair Synthetic Data Generation", of the accompanying blog series; the latter distinguishes equal treatment - treating everybody exactly the same - from equal access.)

Two caveats are worth stating plainly. First, very good synthetic data should produce a model with almost identical performance characteristics to one trained on the real data it was derived from, so fairness interventions necessarily trade away some fidelity. Second, the common assumption that synthetic data is inherently private does not hold, and synthetic data is not a drop-in replacement for real data.

The reproduction study frames the goal of the original DECAF paper as follows: build a model that takes a biased dataset as input and outputs a debiased synthetic dataset that can be used to train downstream models to make unbiased predictions on both synthetic and real data. A follow-up comparison found that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness among the generators tested.
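The statistical parity target above can be checked with a few lines of code. This sketch measures the gap in positive-outcome rates between groups; the toy records are invented for illustration, and the function name is ours.

```python
def parity_gap(records, sensitive, target, positive):
    """Largest difference in P(target == positive | group) across groups."""
    rates = {}
    for group in {r[sensitive] for r in records}:
        rows = [r for r in records if r[sensitive] == group]
        rates[group] = sum(r[target] == positive for r in rows) / len(rows)
    return max(rates.values()) - min(rates.values())

# Toy biased data: 20% of women vs 50% of men have high income.
biased = [{"sex": "F", "income": "high"}] * 2 + [{"sex": "F", "income": "low"}] * 8 \
       + [{"sex": "M", "income": "high"}] * 5 + [{"sex": "M", "income": "low"}] * 5
gap = parity_gap(biased, "sex", "income", "high")
print(gap)  # ~0.3 on this toy data
```

A statistical-parity generator would drive this gap toward zero in the synthetic output while leaving non-sensitive relationships intact.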
Generating fair synthetic data from unfair data, while remaining truthful to the underlying data-generating process (DGP), is non-trivial, and DECAF's utility suffers for it, as reflected in its predictive accuracy. The mechanism behind its fairness guarantee is instructive: by equalizing the learned target probability distributions of the synthetic data generator across sensitive attributes, a downstream model trained on such synthetic data provides fair predictions across all thresholds - strong fairness that holds even when inferring from biased, original data. The real-world stakes are high: biased pathways through features correlated with gender or ethnicity underlie harms such as the Dutch fraud-detection scandal we discuss in Section 1 (Heikkilä 2021).

Related systems extend the design space. PF-WGAN is a privacy-preserving, fair synthetic tabular data generator based on the WGAN-GP model. FLDGMs (Ramachandranpillai et al., 2023) generate fair synthetic samples via a fair latent space built with GAN and diffusion architectures. More broadly, "A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle" (Suresh & Guttag, 2021) situates these interventions within the ML life cycle, alongside DECAF (van Breugel et al., 2021), which is designed specifically for fair synthetic tabular data.
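"Fair across all thresholds" means the positive-prediction rate is equal for every group no matter where the decision threshold is placed. The sketch below checks this with invented toy scores; the function and values are ours, not from any cited system.

```python
def dp_distance(scores_a, scores_b, threshold):
    """Demographic parity distance between two groups at one threshold."""
    rate_a = sum(s >= threshold for s in scores_a) / len(scores_a)
    rate_b = sum(s >= threshold for s in scores_b) / len(scores_b)
    return abs(rate_a - rate_b)

# Equalized generator: both groups share the same score distribution.
group_a = [0.1, 0.3, 0.5, 0.7, 0.9]
group_b = [0.1, 0.3, 0.5, 0.7, 0.9]
gaps = [dp_distance(group_a, group_b, t) for t in (0.2, 0.4, 0.6, 0.8)]
print(gaps)  # [0.0, 0.0, 0.0, 0.0] - parity holds at every threshold

# A skewed group can look fair at some thresholds yet fail at others.
biased_b = [0.1, 0.2, 0.3, 0.7, 0.9]
print([dp_distance(group_a, biased_b, t) for t in (0.2, 0.4, 0.6, 0.8)])
```

The second print shows why single-threshold audits are insufficient: the biased group matches at thresholds 0.2 and 0.6 but diverges at 0.4.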
The charts comparing the original data with the fair synthetic data show the intended effect, and what the experiment confirmed is that this equalization happens for all subgroups, not just in aggregate. A large body of fairness work, including Kilbertus et al. [2017] and Hardt et al. [2016], has adopted the threshold-based and causal notions this approach builds on. PreFair combines the state-of-the-art synthetic data generation approach that satisfies DP [15] with the definition of justifiable fairness [11], and further studies the problem of generating DP fair data when the two goals conflict. Significant care is required to produce synthetic data that is useful and comes with privacy guarantees: given the lack of clean training data, generative adversarial techniques are preferred, with several state-of-the-art architectures readily available.

For synthetic data to be reproducible, reliable, and transparent, it also requires specific ways to be labelled and described. One public repository, for example, is devoted to generating FAIR (findable, accessible, interoperable, reusable) synthetic data for patient information from COVID-19 case report forms.
The pre-processed data is then utilized to generate synthetic data, and the evaluation pipeline is consistent across this literature (Fig. 4): generate private and fair synthetic data, build models using the synthetic data, then evaluate and deploy on real data. The DECAF authors claim that their method generates unbiased synthetic data from biased real data with minimal loss in data utility compared to other approaches, and they identify five characteristics of fair synthetic data that their method achieves. The reproduction found the paper intuitively written, with clear graphs supplementing the explanation of the method, and set out to verify these claims; Figure 1 of the ICLR 2021 Synthetic Data Workshop version shows the income gap significantly mitigated in the synthetic data. Recent advances in generative models have sparked further research on improving model fairness with AI-generated data, since, as has been widely noted, extrapolating new data from an existing data set otherwise reproduces the biases embedded in the original.
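The three-step pipeline above (generate, train on synthetic, evaluate on real) can be sketched end to end. The generator and downstream model here are deliberately trivial stand-ins of our own design: the stand-in generator makes the label depend on hours worked only, never on sex, mimicking fair synthetic output.

```python
import random

def generate_fair_synthetic(n, rng):
    """Stand-in generator: label depends on hours only, never on sex."""
    rows = []
    for _ in range(n):
        row = {"sex": rng.choice("FM"), "hours": rng.uniform(20, 60)}
        row["high_income"] = row["hours"] > 40
        rows.append(row)
    return rows

def fit_threshold(train):
    """Toy downstream model: split at the midpoint of the class means."""
    pos = [r["hours"] for r in train if r["high_income"]]
    neg = [r["hours"] for r in train if not r["high_income"]]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def group_positive_rates(rows, threshold):
    """Positive-prediction rate per sex group."""
    rates = {}
    for g in ("F", "M"):
        grp = [r for r in rows if r["sex"] == g]
        rates[g] = sum(r["hours"] > threshold for r in grp) / len(grp)
    return rates

rng = random.Random(42)
synthetic = generate_fair_synthetic(2000, rng)  # step 1: generate
model = fit_threshold(synthetic)                # step 2: train on synthetic only
real = generate_fair_synthetic(1000, rng)       # step 3: evaluate on held-out data
rates = group_positive_rates(real, model)
print(model, rates)
```

Because the fairness property lives in the data rather than the learner, any downstream model trained this way should show near-equal positive rates across groups.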
Synthetic data brings its own failure modes. Structured synthetic data can contain intersectional hallucinations - data points that don't make sense - while still showing high fidelity on aggregate patterns, and manipulating datasets to create fair synthetic datasets can introduce inaccuracies. Beyond training, the aggregate amount of indiscernible synthetic data risks systemic damage to the trust placed in online information and how it is exchanged. On the positive side, AI-generated synthetic data protects the privacy of original data sets, allows users and data consumers to tailor data to their needs, and can yield training datasets that better represent underrepresented groups. As a fair generative model, this line of work takes biased data and creates both a fair representation and fair synthetic data, and empirically, out-of-the-box predictive models trained on fair synthetic data treat the classes of the sensitive attribute near equally (e.g., female and male).

Two key references for the privacy-fairness combination:
- Tao, Y., Gilad, A., Machanavajjhala, A., Roy, S. (2023). PreFair: Privately Generating Justifiably Fair Synthetic Data. PVLDB 16(5).
- Liu, Q., Deho, O., Vadiee, F., Khalil, M., Joksimovic, S., Siemens, G. (2025). Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms. Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK 2025), 591-600. https://doi.org/10.1145/3706468.3706546

Neither line of work alone addresses all three aspects - utility, privacy, and fairness - together in synthetic data generation.
Critical reports also situate the work in its broader context: with the rising adoption of machine learning across domains like banking, pharmaceuticals, and ed-tech, it has become essential to adopt responsible AI methods that ensure models do not unfairly discriminate against any group. Generative AI raises data-integrity concerns of its own, because synthetic GenAI data can easily be misrepresented as real: studies have shown that such systems can generate highly realistic microscopy, radiological, and geographical images, as well as clinical-trial and epidemiological data, and can manipulate or alter real data to enhance support for a scientific hypothesis (5-7). While AI has delivered incredible breakthroughs in pattern recognition, issues of bias and privacy persist, and a deeper understanding of these legal and ethical dynamics will help the field advance. On the implementation side, a public repository contains TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks, a synthetic tabular data generator that can produce data with or without a fairness constraint.
Survey treatments explore the methodologies behind synthetic data generation, spanning traditional statistical models to advanced deep learning techniques, and examine their applications across diverse domains; chapter-length overviews cover the types of synthetic data, how to generate fair synthetic data, and the benefits and challenges presented. Analysts can also leverage synthetic data augmentation to confidently analyze niche markets previously deemed too small for statistical significance.

Mechanically, the main GAN-based approaches differ in where they impose fairness. TabFairGAN (Rajabi & Garibay, 2022) generates synthetic fair data in two stages: it first trains a GAN to generate realistic synthetic data, then adds a constraint on the synthetic samples to make them fair. The fairness-aware GAN (FairGAN; Xu et al., 2018) trains a generator to produce fair representations via an additional discriminator that knows the real distribution of the protected features in the training data. DECAF uses structural causal models (SCMs) to create fair synthetic data while maintaining high data utility, achieving fairness in a causally aware manner; crucially, the underlying graphs of the synthetic and downstream data determine whether a model trained on the synthetic data will be fair in practice. PreFair extends the state-of-the-art DP data generation mechanisms by incorporating a causal fairness criterion that ensures fair synthetic data. Governance questions matter too: the FAIR and CARE principles can serve as the foundation for Indigenous synthetic data governance.
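The second-stage constraint in two-stage generators like TabFairGAN is, in spirit, a penalty added to the generator loss that grows with the group gap in positive outcomes in each generated batch. The sketch below shows such a penalty term on a toy batch; the function name, penalty form, and weighting are illustrative assumptions, not TabFairGAN's actual loss.

```python
def fairness_penalty(batch, sensitive="sex", target="y", weight=1.0):
    """Penalty proportional to the gap in positive-outcome rates per group."""
    groups = {}
    for row in batch:
        groups.setdefault(row[sensitive], []).append(row[target])
    rates = [sum(v) / len(v) for v in groups.values()]
    return weight * (max(rates) - min(rates))

# Toy generated batch: women positive 50% of the time, men 100%.
batch = [{"sex": "F", "y": 1}, {"sex": "F", "y": 0},
         {"sex": "M", "y": 1}, {"sex": "M", "y": 1}]
print(fairness_penalty(batch))  # 0.5: the generator is pushed toward parity
```

In a real GAN this term would be computed on differentiable generator outputs and added to the adversarial loss, so gradient descent trades realism against parity via the weight.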
The DECAF abstract (NeurIPS 2021) states the problem directly: machine learning models have been criticized for reflecting unfair biases in the training data, and generating fair synthetic data from unfair data, while remaining truthful to the underlying DGP, is hard. Using biased data in decision-making can bias decisions against particular demographics, and hallucinated or biased synthetic data compounds the problem, since models trained on it may fail to generalize to real-world scenarios (Van Breugel et al., 2023; Guarnera et al., 2021; Heusel et al., 2017). To address historical biases in real-world data, generating synthetic data has thus become a viable option.

The reproduction results are mixed but informative. On the first data set, DECAF could generate fair synthetic data while still maintaining high downstream utility. For models trained with GAN-based COMPAS and COMPAS (fair) synthetic data, however, accuracy for both subgroups was close to 0.5, confirming previous results that models trained on GAN-based data can act like random classifiers.
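The "random classifier" symptom above is easy to detect with a per-subgroup accuracy check: accuracy near 0.5 for every subgroup on a balanced binary task. The data below is a toy stand-in, not COMPAS.

```python
def subgroup_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately within each subgroup."""
    out = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        out[g] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
    return out

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]  # right half the time in each group
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
accs = subgroup_accuracy(y_true, y_pred, groups)
print(accs)  # 0.5 for both subgroups: coin-flip behaviour
```

Equal subgroup accuracies are necessary but not sufficient for usefulness: a degenerate generator can be perfectly "fair" precisely because it carries no signal at all.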
For example, many studies have demonstrated that balanced synthetic datasets generated with GANs improve downstream fairness, and quality-diversity optimization has been proposed in the hope of generating fair synthetic data for medical applications and other AI systems. A recurring design choice is Fairness by Design under the statistical parity definition: instead of introducing fair learning algorithms directly, generate fair synthetic data such that any downstream learner is fair. Previous literature has focused on incorporating fairness constraints directly during generation (Pujol et al., 2020; van Breugel et al., 2021); causal variants remove biased edges in the causal graph so that the generated data adheres to fairness criteria like demographic parity and equal opportunity, while an alternative is to innovate effective methods to debias the data after the fact. In experiments, feeding the Adult data set to the generator and training with the additional parity fairness constraint produces the debiased results discussed below. Policy work complements the research: planned deliverables include an educational AI-generated synthetic data white paper for data protection authorities and other regulatory bodies, a series of workshops and a final report to advance the concept of fair synthetic data, and a report on AI-generated synthetic data for AI auditing and explainable AI.
Results on the Adult dataset (Dua & Graff, 2017) quantify the trade-offs: at the same fairness level, CuTS achieves 2.3% higher downstream accuracy than the prior state of the art in fair synthetic data generation. This is a particularly important problem since unfair data leads to unfair downstream predictions, yet many approaches rely on the availability of demographic group labels, which are often costly to annotate, and FLDGMs (Ramachandranpillai et al., 2023) sidestep some of this by generating fair samples from a latent space built with GAN and diffusion architectures. These trade-offs motivate the comparative study question: which synthetic data generators can best balance privacy and fairness, and are pre-processing fairness algorithms, typically applied to real datasets, effective on synthetic data? Our approach demonstrates that applying pre-processing fairness algorithms after generation can improve the data utility of the resulting fair synthetic data in downstream analysis, while achieving an intermediate level of fairness.
However, it is possible to augment the data to make it fairer via synthetization, and one can likewise develop fair learning algorithms that detect bias in the data and create fair predictors; the synthetic-data route has the advantage that fairness transfers to any downstream model. Concretely, at the same fairness level, CuTS reports 2.3% higher downstream accuracy and a 2x lower demographic parity distance of 0.01. The reproduction abstract summarizes its scope plainly: "We attempt to reproduce the results of 'DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks'," adapting, as PreFair does, the notion of justifiable fairness to the synthetic data generation scenario. As the survey section "Fairness of Synthetic Data" notes, algorithmic fairness is a popular topic (e.g., see [13, 28]), but fair synthetic data has been much less explored, which makes it a key concern here. Chart 4 compares the sex-income bias in the 1994 US Census dataset with the de-biased synthetic data. Ultimately, fair synthetic data generation is not just about compliance; it is about creating AI systems that truly represent and serve all segments of society.
Complementary directions round out the picture. In contrast to black-box methods that require time-consuming training, the FDA framework generates synthetic data directly from predictive distributions, and a public repository contains all code required to replicate the DECAF replication study. Practitioners echo the core difficulty; as one interview put it: "Why is it challenging to eliminate bias? Removing features like ethnicity doesn't eliminate the bias," because proxy variables remain. On the generation side, PF-WGAN modifies the original WGAN-GP by adding privacy and fairness constraints, forcing it to produce privacy-preserving fair data, and a relatively under-explored alternative generates new fair synthetic data from fair graphical models. Despite its promise, synthetic data often misses edge cases and underrepresents minority groups, so challenges remain, but conceptually the goal is clear: bias-corrected, fully anonymous data that is free to use and innovate with.
A concrete proxy example makes the ethnicity point precise: suppose a binary "proxy" variable equals 1 in 90% of all cases for females and 0 for the remaining 10%, with the reverse for males. Even after the sensitive column is removed, a model can recover group membership from the proxy, so dropping sensitive columns alone does not produce fairness. This is why the underlying fairness research aims for predictions that are fair across all thresholds, and why, in the statistical parity synthetic data, the percentages of people with high and low income by sex, by race, and by sex and race combined are equalized to represent the population more fairly. The same concern arises for Indigenous synthetic health data, where governance of how group information is encoded and shared is central. To summarize, the contributions of this line of work are a method for turning unfair real-world data into fair synthetic data, and a methodology for ensuring its responsible and fair application within various sectors.
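The proxy effect can be demonstrated in a few lines. The 90/10 split mirrors the example in the text; everything else (sample size, seed, the decision rule) is an invented toy setup of ours.

```python
import random

rng = random.Random(7)
rows = []
for _ in range(10_000):
    sex = rng.choice("FM")
    # proxy = 1 with probability 0.9 for females, 0.1 for males
    proxy = 1 if rng.random() < (0.9 if sex == "F" else 0.1) else 0
    rows.append((sex, proxy))

# "Predict" sex from the proxy alone, without ever seeing the sex column.
guess = {1: "F", 0: "M"}
acc = sum(guess[p] == s for s, p in rows) / len(rows)
print(round(acc, 2))  # ~0.9: the proxy leaks most of the sensitive attribute
```

Roughly 90% of group membership is recoverable from the proxy, which is why causal approaches prune biased pathways rather than merely deleting the sensitive column.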
Finally, the quality of the model depends on the data source: the quality of synthetic data is highly correlated with the quality of the original data and of the data generation model.