In October 2024, Betterdata was the only startup in Southeast Asia, and one of only four synthetic data startups globally, to be awarded a contract to build DHS's synthetic data generation capabilities. Under that contract, we enabled ML training on high-quality, high-utility, non-anonymized data for the US Department of Homeland Security (DHS) without compromising data privacy.
What Challenge Did DHS Face?
The Department of Homeland Security faced two critical problems when training AI and ML models to improve cybersecurity:
Restrictive Data Sharing:
Due to privacy, security, and regulatory constraints, operational data is often sensitive and cannot be shared across departments.
Ineffective Data Protection:
Traditional anonymization techniques fail to fully protect against re-identification risks, limiting the ability to conduct effective cybersecurity and infrastructure protection exercises.
Thus, the Department of Homeland Security (DHS) required high-quality data to train ML models, test critical systems, and simulate real-world scenarios.
Our Solution: Large Tabular Model (LTM)
Through our foundational model, ‘Large Tabular Model (LTM)’, DHS can generate high-fidelity synthetic data that mirrors the statistical properties of real datasets, while protecting sensitive or personally identifiable information (PII).
Zero-shot and few-shot adaptability:
Unlike rule-based or deep learning-based synthetic data generators that need retraining for each use case, LTM can adapt to new datasets with minimal input.
Built-in privacy audits:
Every synthetic dataset generated undergoes rigorous privacy assessments, ensuring compliance with DHS’s stringent security and privacy standards.
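To make the idea of a privacy assessment concrete, here is a minimal sketch of one widely used check, distance to closest record (DCR). It is an illustration only and not a description of Betterdata's audit: the function names and the threshold are assumptions, and the rows are assumed to be numeric and already scaled to comparable ranges.

```python
import math


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


def flag_near_copies(synthetic_rows, real_rows, threshold):
    """Indices of synthetic rows lying within `threshold` of some real row.

    The threshold is a hypothetical tuning knob, not a standard value.
    """
    dcr = distance_to_closest_record(synthetic_rows, real_rows)
    return [i for i, d in enumerate(dcr) if d <= threshold]


if __name__ == "__main__":
    real = [(0.0, 0.0), (1.0, 1.0)]
    synthetic = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.05)]
    # Rows 0 and 2 sit almost on top of real records and would be flagged.
    print(flag_near_copies(synthetic, real, threshold=0.1))
```

A synthetic row that nearly duplicates a real record is a warning sign that the generator may have memorized it, so flagged rows would typically be reviewed or dropped before the dataset is shared.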
What Are the Benefits of Synthetic Data?
Synthetic data is increasingly being used as an alternative to real data because of the following advantages:
Advanced Privacy Protection:
- Synthetic data contains no real Personally Identifiable Information (PII), which protects it against re-identification attacks and the exposure of sensitive data.
- At Betterdata, we enhance data privacy by incorporating differential privacy into the entire synthetic data pipeline. This provides quantifiable privacy guarantees while balancing data privacy against data utility.
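As an illustration of what a quantifiable privacy guarantee means, the sketch below applies the textbook Laplace mechanism to a counting query. It is not Betterdata's pipeline: `dp_count`, `laplace_noise`, and the sample ages are hypothetical. The privacy parameter epsilon controls the trade-off, with smaller values adding more noise (more privacy, less utility).

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Epsilon-differentially-private count of values matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 52, 67, 29, 70]  # toy data, true count of age >= 40 is 4
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(dp_count(ages, lambda a: a >= 40, eps, rng), 2))
```

Running the loop shows the noisy answers clustering more tightly around the true count as epsilon grows, which is the privacy-utility balance in miniature.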
Statistically Similar:
- Synthetic data mimics the statistical properties of real data, such as marginal distributions, correlation structure, temporal and sequential patterns (for time-series), child-parent relationships (for relational data), etc.
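A rough sketch of how statistical similarity could be checked for numeric columns follows. It compares per-column means and standard deviations (marginals) and pairwise Pearson correlations (correlation structure) between a real and a synthetic table. The function names and the dict-of-lists table layout are assumptions made for illustration, and real evaluations usually add further metrics.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fidelity_report(real, synthetic):
    """Absolute gaps between real and synthetic summary statistics.

    `real` and `synthetic` map column name -> list of floats and are
    assumed to share the same columns. Smaller gaps mean closer fidelity.
    """
    report = {}
    cols = sorted(real)
    for c in cols:
        report[f"mean_gap[{c}]"] = abs(
            statistics.fmean(real[c]) - statistics.fmean(synthetic[c]))
        report[f"std_gap[{c}]"] = abs(
            statistics.pstdev(real[c]) - statistics.pstdev(synthetic[c]))
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            report[f"corr_gap[{a},{b}]"] = abs(
                pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
    return report


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(1000)]
    real = {"x": x, "y": [2 * v + rng.gauss(0, 0.5) for v in x]}
    for name, gap in fidelity_report(real, real).items():
        print(name, gap)  # a table compared with itself has zero gaps
```

Temporal patterns and parent-child relationships need their own checks (for example autocorrelation or referential-integrity tests), which this sketch does not cover.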
Customizable:
- Synthetic data can be customized depending on the enterprise’s unique data needs.
- Synthetic data can be augmented to increase domain coverage for better model generalizability.
- Synthetic data can be enhanced to improve fairness and reduce imbalance and bias in datasets.
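As a small worked example of reducing imbalance, the sketch below computes how many synthetic rows each class would need so that every class matches the largest one. The helper name and the "benign"/"attack" labels are hypothetical, and generating the rows themselves is left to whichever synthetic data generator is in use.

```python
from collections import Counter


def synthetic_rows_needed(labels):
    """Rows to synthesize per class so each class matches the largest one."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


if __name__ == "__main__":
    labels = ["benign"] * 95 + ["attack"] * 5
    print(synthetic_rows_needed(labels))  # {'benign': 0, 'attack': 90}
```

Balancing up to the majority class is only one possible target; a different ratio may suit a given model or fairness goal better.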
Is Synthetic Data High-Quality?
Yes. Synthetic data is not only high-quality but also:
- High Utility
- High Dimensional
- Highly Private
This makes it ideal for data-intensive tasks such as machine learning, data analysis, data sharing, and data monetization.
The Impact:
Through the adoption of Betterdata’s LTM, DHS is now achieving game-changing outcomes:
Faster Cyber Defense Simulations:
DHS can now simulate sophisticated cyber-attack scenarios using realistic yet risk-free datasets, accelerating training and strategic planning.
Enhanced Anomaly Detection:
ML models trained on synthetic data identify anomalies and threats more accurately, without ever accessing real user information.
Secure Interagency Collaboration:
Agencies can share synthetic datasets freely, breaking down data silos without risking policy violations.
Regulatory Compliance:
DHS remains fully aligned with national cybersecurity mandates while advancing its AI-driven threat intelligence programs.
Synthetic data has the potential to transform industries by enabling government agencies and enterprises to innovate without restrictions, working with accessible, fair, and scalable data, something that was not possible in the not-so-distant past. With data protected and utility maintained (or even improved in some cases), innovation is a question not of how but of when.