Taking Bias out of AI with Synthetic Data

What is Synthetic Data:

Synthetic Data is a subset of Generative AI (GenAI) used as a substitute for real data in training Machine Learning (ML) models. A generative model is trained on real data, from which it creates untraceable datasets that mimic the statistical properties of real-world data without risking privacy leakage, since synthetic data contains no information about real individuals. This allows data consumers to access high-quality, privacy-preserving data faster and more safely.

An illustration of the difference between real and synthetic data

Importance of Synthetic Data for ML Models:

ML models require massive amounts of data to operate, but data is hard to come by, and good-quality data takes time to organize and clean. Real-world data is protected by data-protection regulations such as the Personal Data Protection Act (PDPA) in Singapore and the General Data Protection Regulation (GDPR) in the EU, which impose massive fines on companies found in breach of data privacy laws. Beyond privacy, real data has another key weakness: it often contains human biases, is generally imbalanced, and is prone to data drift over time.

An analysis of more than 5,000 images created with Stable Diffusion found that it takes racial and gender disparities to extremes — worse than those found in the real world. - Bloomberg

While all of these are valid problems, bias in data can render ML models completely useless. In this article, we will go through what bias is and how synthetic data can help you remove bias in ML.

Read Also: Improving Machine Learning Models with Synthetic Data

Types of Data Bias:

1. Undersampling:

Undersampling occurs when certain classes or groups within the dataset are underrepresented. This can lead to models that perform poorly on these minority classes because they have not been adequately learned during training.

  • Causes: Undersampling can happen due to various reasons such as data collection methods, availability of data, or even intentional/unintentional neglect.

  • Consequences: Models trained on undersampled data can exhibit poor generalization to underrepresented classes, leading to biased predictions. Suppose a bank has a dataset where 99% of the transactions are legitimate (majority class) and 1% are fraudulent (minority class). To balance the dataset, the bank decides to undersample the legitimate transactions. If the bank randomly removes a large portion of the legitimate transactions, it might inadvertently exclude many types of legitimate behavior patterns that are crucial for distinguishing between fraudulent and non-fraudulent transactions. As a result, the trained model might fail to recognize legitimate variations in spending behavior, leading to higher false positive rates where legitimate transactions are incorrectly flagged as fraud.
An illustration of how data is reduced from majority class to match minority class resulting in undersampling.
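The bank scenario above can be sketched in a few lines of plain Python. This is a minimal, hypothetical example (the `undersample` function and the 99:1 "legit"/"fraud" split are ours, not from a specific library): it balances the classes by randomly dropping majority-class rows, discarding most of the legitimate-transaction patterns in the process.

```python
import random
from collections import Counter

def undersample(labels, majority, rng):
    """Randomly drop majority-class indices until classes are balanced.

    `labels` is a list of class labels; returns the indices kept
    after undersampling the majority class.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    minority_size = min(len(ix) for ix in by_class.values())
    kept = []
    for y, ix in by_class.items():
        if y == majority:
            kept.extend(rng.sample(ix, minority_size))
        else:
            kept.extend(ix)
    return sorted(kept)

rng = random.Random(0)
# 99 legitimate transactions for every 1 fraudulent one.
labels = ["legit"] * 990 + ["fraud"] * 10
kept = undersample(labels, majority="legit", rng=rng)
print(Counter(labels[i] for i in kept))  # balanced, but 980 legit rows are gone
```

The resulting dataset is balanced, but 98% of the legitimate behavior patterns have been thrown away, which is exactly how the false-positive problem described above arises.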

2. Labeling Errors:

Labeling errors refer to instances where the data has been incorrectly labeled. This can introduce noise into the dataset and adversely affect model performance.

  • Causes: Labeling errors can arise from human mistakes, automated labeling processes, or ambiguous data.

  • Consequences: Models trained on mislabeled data can learn incorrect patterns, leading to reduced accuracy and reliability. For instance, consider a medical image classification task where the goal is to detect cancerous tumors in radiology scans. If a significant number of images are incorrectly labeled due to human error—such as mislabeling healthy scans as having tumors and vice versa—the model trained on this dataset will learn incorrect associations. This labeling error bias can result in a model that misclassifies healthy patients as having cancer (false positives) or, more critically, misses cancerous tumors (false negatives).
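Labeling errors are often simulated as random label flips. The sketch below (our own toy code, using only the Python standard library) injects a 10% annotator-error rate into a hypothetical medical-imaging label set, showing how quickly noise accumulates in the training signal:

```python
import random

def flip_labels(labels, rate, rng):
    """Flip binary labels ("healthy"/"tumor") with probability `rate`,
    simulating annotator error in a medical-imaging dataset."""
    flipped = []
    for y in labels:
        if rng.random() < rate:
            flipped.append("tumor" if y == "healthy" else "healthy")
        else:
            flipped.append(y)
    return flipped

rng = random.Random(42)
truth = ["healthy"] * 900 + ["tumor"] * 100
noisy = flip_labels(truth, rate=0.10, rng=rng)
errors = sum(t != n for t, n in zip(truth, noisy))
print(f"{errors} of {len(truth)} labels are now wrong")
```

A model trained on `noisy` instead of `truth` learns from roughly one wrong label in ten, which is how the false-positive and false-negative patterns described above creep in.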

3. User-Generated Bias:

User-generated bias occurs when the actions of data analysts or engineers unintentionally introduce bias during data processing and model training.

  • Causes: This type of bias can stem from various sources including selection bias, confirmation bias, and overfitting.

  • Consequences: User-generated bias can lead to models that reflect the biases of the analysts rather than the true patterns in the data. This can result in unfair or inaccurate predictions such as a hiring algorithm that unfairly favors certain demographic groups. For example, consider a movie recommendation system that relies on user ratings and reviews to suggest films to other users. If a certain demographic, such as young adults, is more active in providing ratings and reviews, the recommendation model may become biased toward the preferences of this demographic. This user-generated bias can result in the system predominantly recommending movies that appeal to young adults while underrepresenting or ignoring the preferences of older adults or other less active demographic groups.
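The movie-rating scenario can be made concrete with a toy calculation (the rating numbers and both averaging functions are hypothetical, invented for illustration). A naive mean lets the most active demographic dominate the score, whereas averaging per-group means gives each demographic equal voice:

```python
from collections import defaultdict

# Hypothetical ratings for one movie: (age_group, stars).
ratings = [("young", 5)] * 80 + [("young", 4)] * 60 + [("older", 2)] * 10

def naive_mean(rows):
    """Overall mean: dominated by whichever group rates the most."""
    return sum(stars for _, stars in rows) / len(rows)

def group_balanced_mean(rows):
    """Average the per-group means, so an over-active group
    cannot drown out the preferences of less active ones."""
    by_group = defaultdict(list)
    for group, stars in rows:
        by_group[group].append(stars)
    means = [sum(v) / len(v) for v in by_group.values()]
    return sum(means) / len(means)

print(round(naive_mean(ratings), 2))           # skewed toward "young" raters
print(round(group_balanced_mean(ratings), 2))  # weights each group equally
```

The gap between the two numbers is exactly the user-generated bias described above: the raw data over-weights the preferences of the group that happens to generate the most data.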

An illustration of how data is skewed towards a specific demographic that impacts ML results

4. Skewed or Under-represented Samples:

Skewed samples occur when certain groups or feature values are disproportionately represented in the dataset, leading to biased models.

  • Causes: Skew often arises when one group contributes far less training data than others, so the model sees too few examples of that group and performs worse on it.

  • Consequences: Skewed sample bias can significantly impact the performance and fairness of ML models, particularly in predictive policing. For instance, consider a predictive policing system that uses historical crime data to forecast future crime hotspots and allocate police resources. If the historical data is skewed towards certain neighborhoods due to over-policing in those areas, the training dataset will reflect this bias showing higher crime rates in these over-policed neighborhoods. Consequently, the model will learn to disproportionately predict higher crime rates in these areas, leading to a self-fulfilling prophecy where more police forces are sent to the already over-policed neighborhoods. This can exacerbate existing inequalities, as these neighborhoods continue to receive excessive police attention while other areas may be under-monitored.
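One common mitigation for skewed samples is to reweight records by the inverse of their group's frequency, so an over-sampled group does not dominate training. The sketch below is our own illustrative code (the group labels "A"/"B" and the 90:10 split are hypothetical), not a specific library's API:

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Give each record a weight inversely proportional to its group's
    share of the dataset, so over-sampled groups do not dominate."""
    counts = Counter(groups)
    total = len(groups)
    return [total / (len(counts) * counts[g]) for g in groups]

# 90 records from an over-policed area A, 10 from area B.
groups = ["A"] * 90 + ["B"] * 10
weights = inverse_frequency_weights(groups)
# After reweighting, each area contributes the same total weight.
print(sum(w for g, w in zip(groups, weights) if g == "A"))
print(sum(w for g, w in zip(groups, weights) if g == "B"))
```

Most ML libraries accept such per-record weights (e.g. a `sample_weight` argument), which lets the training objective treat both areas equally even though the raw data is 9:1.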

5. Limited Features in Training Sets:

Limited features in training sets refer to the insufficient or incomplete data attributes used to train ML models. 

  • Causes: The causes of limited features in training sets often include historical data collection practices, where only a subset of potentially relevant attributes were recorded.

  • Consequences: The consequences of having limited features in training sets are significant and can lead to biased and suboptimal ML models. Such models can lead to missed opportunities, as they fail to leverage all relevant information to make informed decisions. In the long term, this undermines trust in the fairness and reliability of automated decision-making systems, particularly in high-stakes applications like finance, healthcare, and employment. For instance, consider a hiring model trained only on education and years of experience: two candidates with similar educational backgrounds and years of experience would be evaluated similarly, despite one having additional certifications, a diverse skill set, and a strong work portfolio that makes them a better fit for the position. As a result, the model could overlook highly qualified candidates who do not fit the limited criteria but possess other valuable attributes.

An illustration of how limited features in a dataset leads to poor predictions and decisions by the ML model
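The hiring example can be sketched with two toy scoring functions (entirely hypothetical, invented for illustration): one restricted to the recorded features, one with the missing attributes restored. The limited model literally cannot tell the two candidates apart:

```python
def score_limited(candidate):
    """Only degree and years of experience were recorded historically."""
    return candidate["degree"] * 2 + candidate["years"]

def score_full(candidate):
    """Certifications and portfolio add signal the limited model discards."""
    return score_limited(candidate) + candidate["certs"] + 2 * candidate["portfolio"]

a = {"degree": 1, "years": 5, "certs": 0, "portfolio": 0}
b = {"degree": 1, "years": 5, "certs": 3, "portfolio": 1}

print(score_limited(a) == score_limited(b))  # True: indistinguishable
print(score_full(b) > score_full(a))         # True: b is the stronger candidate
```

The limited feature set makes candidates `a` and `b` identical to the model even though the fuller picture clearly separates them, which is the missed-opportunity failure mode described above.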

Impact of Bias on Artificial Intelligence (AI): 

The impact of bias on AI is profound and multifaceted, affecting the accuracy, fairness, and societal trust in AI systems. Bias in AI can arise from skewed training data, biased algorithms, or prejudiced design decisions, leading to models that systematically favor certain groups over others. In 2023, the US Equal Employment Opportunity Commission (EEOC) settled a suit for $365,000 against iTutor, which was accused of using AI-powered recruiting systems that automatically rejected female applicants aged 55 or older and male applicants aged 60 or older. Bias can manifest in various ways, such as discriminatory hiring practices, biased credit scoring, and unfair judicial outcomes, where AI systems perpetuate and even exacerbate existing social inequalities. Technically, biased models can exhibit reduced generalization capabilities, performing well on the overrepresented groups while failing on underrepresented ones. This compromises the robustness and reliability of AI applications, leading to poor decision-making and suboptimal outcomes. Additionally, biased AI systems can erode public trust, as users become wary of automated decisions perceived as unfair or discriminatory. A 2019 paper showed that Black patients received lower health emergency risk scores than patients with lighter skin tones. Addressing bias in AI is thus crucial for ensuring ethical, equitable, and effective AI deployment, necessitating comprehensive strategies that include diverse and representative training data, bias detection and mitigation techniques, and continuous monitoring and evaluation.

How Synthetic Data Helps:

Synthetic data can play a crucial role in removing AI bias by providing balanced and representative datasets that mitigate the limitations of real-world data. Synthetic data is artificially generated rather than collected from real-world events, allowing for the creation of datasets that can be controlled and tailored to include diverse and equitable representations of various demographic groups. This helps in addressing issues like underrepresentation and limited features that often lead to biased AI models. By supplementing or replacing biased real-world data, synthetic data can ensure that machine learning models are trained on a more comprehensive and unbiased dataset. Technically, synthetic data generation techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) can be used to simulate realistic and varied data points that reflect the full spectrum of potential scenarios and populations. This enhances the generalization capabilities of AI models, ensuring they perform well across different groups and conditions. Moreover, synthetic data can be used to test and validate AI systems, identifying and correcting biases before deployment. By integrating synthetic data into the training process, developers can create fairer, more robust, and trustworthy AI systems that better serve diverse populations.
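Real systems use deep generative models such as GANs and VAEs for this, but the core idea of learning a minority class's distribution and sampling new points from it can be shown with a deliberately simple stand-in: a Gaussian fitted to a scarce class. The code below is a toy sketch in plain Python, not a production generator:

```python
import random
import statistics

def synthesize(values, n, rng):
    """Fit a Gaussian to the minority class's values and sample `n`
    synthetic points from it -- a toy stand-in for a GAN/VAE generator."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(7)
# Only 30 real examples of the underrepresented class are available.
minority = [rng.gauss(100.0, 5.0) for _ in range(30)]
# Generate 300 synthetic examples that follow the same distribution.
synthetic = synthesize(minority, n=300, rng=rng)
print(round(statistics.mean(synthetic), 1))  # close to the real class mean
```

A GAN or VAE plays the same role for high-dimensional data (images, tabular records): it learns the underrepresented group's distribution and supplies additional, statistically faithful training examples without copying any real individual's record.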

Betterdata’s Bias-Correcting Synthetic Data System:

Many synthetic data companies have tried to address this bias by simply increasing the number of synthetic samples for the underrepresented group. However, Betterdata’s ML team’s findings are consistent with an ICML 2023 paper showing that adding synthetic data without considering the downstream ML model does not improve performance: the ML algorithm will still perform poorly on the underrepresented group. Instead, Betterdata’s programmable synthetic data platform directly improves the areas of weakness in underrepresented classes of the downstream ML task by targeting its wrongly predicted samples and creating synthetic data that improves performance on them. Through this, Betterdata’s technology can consistently improve precision (the accuracy of positive predictions) by up to 5%.
