Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair worked as a Software Engineer and Business Development Executive at Merkle Science (Series A, $20M+ raised), where he developed taint analysis techniques for blockchain wallets. 

Uzair has a strong academic background in Computer Science/Engineering with a Ph.D. from the National University of Singapore (ranked among the top 10 universities worldwide). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques. 

In one of his Ph.D. projects, he reverse-engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and his work has been recognized with a Best Paper Award and several scholarships. 

In addition to his work at Betterdata AI, Uzair is an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives across Asia. He also pays it forward by volunteering as a peer student support group member at the National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

Case Study: Generate High-Quality Relational Synthetic Databases Using IRG

Dr. Uzair Javaid
May 6, 2025
Summary:
  • IRG generates structurally accurate synthetic data, preserving primary and foreign key constraints with 100% integrity.
  • Statistical similarity between real and synthetic datasets is extremely high, with minimal divergence in metrics like KS Statistic (0.0539) and Wasserstein Distance (0.0077).
  • Synthetic datasets maintain near-identical row counts, with <2% variance, ensuring scalability and schema consistency.
  • All relational dependencies are preserved, with zero orphan or null foreign keys, ensuring referential soundness.
  • IRG’s synthetic data is production-ready, enabling safe, high-fidelity use in ML, analytics, and software testing.

    What Is IRG?

    Incremental Relational Generator (IRG) is our state-of-the-art relational synthetic data generation model. It uses deep learning to generate accurate synthetic data without compromising database integrity. 

    How Does IRG Work?

    Incremental Relational Generator (IRG) generates relational synthetic databases through:

    • Incremental table generation
    • Context-aware deep learning
    • Sequential dependency modeling
    • Composite and nullable key support
    • Ensuring constraint accuracy
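IRG's internals are proprietary, but the incremental idea behind the first two points can be sketched: tables are generated in an order that respects foreign-key dependencies, so each child table is synthesized with its parent tables already available as context. A minimal illustration using Python's standard library (the schema below mirrors the case study's tables but is otherwise hypothetical):

```python
from graphlib import TopologicalSorter

# Table -> parent tables it references via foreign keys (illustrative schema).
# A child table can only be generated after all of its parents exist.
fk_parents = {
    "game": [],                          # no dependencies
    "player": [],
    "appearances": ["game", "player"],   # references both parent tables
    "shots": ["game", "player"],
}

# TopologicalSorter maps node -> predecessors, which matches
# "table -> tables it depends on", so parents always come first.
generation_order = list(TopologicalSorter(fk_parents).static_order())
print(generation_order)
```

Each table in `generation_order` can then be synthesized conditioned on the already-generated parent rows, which is what makes context-aware, dependency-respecting generation possible.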

    Use Cases For IRG:

    Because IRG understands complex relational database schemas, enterprises can generate realistic synthetic data at scale for use cases such as: 

    Pre-Training AI And ML Models:

    Protect Data Privacy:

    • Share data with partners, vendors, or research teams without violating GDPR, HIPAA, CCPA, or PDPC regulations.
    • Enable cross-team collaboration in large enterprises with restricted data access.

    Software Testing And QA:

    • Test relational database systems with realistic but safe synthetic relational datasets.
    • Simulate, augment, and enhance scarce or low-quality datasets to address edge cases, performance limits, and integration scenarios.

    Synthetic Sandboxing for LLMs & Agents:

    • Create safe relational environments for training/testing AI agents, RAG pipelines, or structured retrieval tasks.

    Data Analytics & BI Prototyping:

    • Build and test dashboards or analytics pipelines without relying on sensitive production data.
    • Helps avoid production slowdowns or risk exposure.

    Have a specific use case? Schedule a call to learn more.

    About The Case Study:

    This case study evaluates the synthetic dataset generated from a relational sports database, comprising four tables: game, player, appearances, and shots. The goal is to assess the synthetic data's statistical fidelity, structural integrity, and referential consistency relative to the real data. 

    Through comprehensive comparisons involving primary key uniqueness, foreign key validity, and distributional similarity measures (KS Statistic and Wasserstein Distance), we demonstrate that the synthetic dataset achieves high fidelity with minimal divergence, making it suitable for downstream ML workflows and privacy-compliant analytics.

    Dataset Overview:

    The dataset consists of four relational tables:

    • rdb-200-game: 200 rows
    • rdb-200-player: 4430 rows
    • rdb-200-appearances: ~5600 rows
    • rdb-200-shots: ~5100 rows

    The synthetic dataset maintains nearly identical row counts, with <2% relative difference in any table, ensuring scalability and structural balance across the data schema.
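The <2% row-count claim reduces to a simple relative-difference check. The real counts below come from the overview above; the synthetic counts are illustrative placeholders consistent with the reported variance, not the actual case-study figures:

```python
# Real row counts from the dataset overview; synthetic counts are
# hypothetical stand-ins within the <2% variance reported.
real_counts = {"game": 200, "player": 4430, "appearances": 5600, "shots": 5100}
synthetic_counts = {"game": 200, "player": 4415, "appearances": 5572, "shots": 5140}

for table, real_n in real_counts.items():
    syn_n = synthetic_counts[table]
    rel_diff = abs(syn_n - real_n) / real_n  # relative row-count difference
    print(f"{table}: {rel_diff:.2%} relative difference")
    assert rel_diff < 0.02, f"{table} exceeds the 2% variance budget"
```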

    Validation Methodology:

    Structural Validation:

    We validate primary keys by confirming uniqueness and non-null constraints. All primary keys in the synthetic dataset (e.g., gameID, playerID, composite keys) satisfy these constraints, with a 100% match to the real data.
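The primary-key check described above reduces to two operations, non-nullness and uniqueness, and the same check covers composite keys. A minimal sketch in pandas (column names follow the case study; the rows are toy data):

```python
import pandas as pd

def validate_primary_key(df: pd.DataFrame, key_cols: list[str]) -> bool:
    """Check that key_cols form a valid primary key: non-null in every
    row and unique across rows (works for composite keys too)."""
    no_nulls = bool(df[key_cols].notna().all(axis=None))
    unique = not df.duplicated(subset=key_cols).any()
    return no_nulls and unique

# Toy appearances table with a composite (gameID, playerID) key.
appearances = pd.DataFrame({
    "gameID": [1, 1, 2],
    "playerID": [10, 11, 10],
})
print(validate_primary_key(appearances, ["gameID", "playerID"]))  # True
```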

    Foreign Key Integrity:

    Four foreign key relationships were evaluated based on:

    • Null ratio
    • Orphan ratio (referential violations)
    • Idle parent ratio (parent records not referenced)
    • Degree distributions (fan-out statistics)
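All four metrics can be computed directly from a child foreign-key column and a parent primary-key column. A hedged sketch in pandas (the helper name and sample data are illustrative, not the case study's actual evaluation code):

```python
import pandas as pd

def fk_metrics(child: pd.Series, parent: pd.Series) -> dict:
    """Foreign-key health metrics: null ratio, orphan ratio (child values
    with no matching parent), idle-parent ratio (parents never referenced),
    and fan-out (degree) summary statistics."""
    null_ratio = child.isna().mean()
    non_null = child.dropna()
    orphan_ratio = (~non_null.isin(parent)).mean() if len(non_null) else 0.0
    idle_parent_ratio = (~parent.isin(non_null)).mean()
    degrees = non_null.value_counts()  # children per referenced parent
    return {
        "null_ratio": float(null_ratio),
        "orphan_ratio": float(orphan_ratio),
        "idle_parent_ratio": float(idle_parent_ratio),
        "mean_degree": float(degrees.mean()) if len(degrees) else 0.0,
        "max_degree": int(degrees.max()) if len(degrees) else 0,
    }

# Illustrative data: appearances.gameID referencing game.gameID.
game_ids = pd.Series([1, 2, 3])
appearance_game_ids = pd.Series([1, 1, 2, 2, 2])
print(fk_metrics(appearance_game_ids, game_ids))
```

In this toy example, game 3 is never referenced, so the idle-parent ratio is 1/3, while null and orphan ratios are both zero, matching the "zero orphans, zero nulls" standard the case study reports.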

    Statistical Distance Measures:

    • KS Statistic (0–1; lower is better)
    • Wasserstein Distance (≥0; lower is better)

    Low scores indicate high similarity between real and synthetic distributions.
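Both measures are available in SciPy, and a column-by-column comparison presumably looks something like the sketch below. The two samples here are stand-in data, not the case-study columns:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
# Stand-ins for one real column and its synthetic counterpart.
real = rng.normal(loc=0.0, scale=1.0, size=2000)
synthetic = rng.normal(loc=0.02, scale=1.0, size=2000)

ks_stat = ks_2samp(real, synthetic).statistic   # 0 = identical, 1 = disjoint
w_dist = wasserstein_distance(real, synthetic)  # >= 0, lower is better

print(f"KS statistic:         {ks_stat:.4f}")
print(f"Wasserstein distance: {w_dist:.4f}")
```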

    Distributional Alignment:

    We compared marginal distributions across all columns using histograms and tabular summary statistics. 

    • Distributions of counts per player, shots per game, and player appearances remained consistent.
    • Marginal distributions retained strong alignment across the modeled columns.

    Results:

    Real-World Parity:

    • Synthetic tables (e.g., rdb-200-game) match real data sizes exactly (200 real vs. 200 synthetic).
    • Minor differences in other tables (e.g., a 0.5% size gap in rdb-200-appearances) are statistically negligible.

    Guaranteed Data Integrity: 

    • Primary keys (like playerID) are 100% unique and non-null in both real and synthetic data. 
    • Databases and applications run flawlessly, with no errors from missing or duplicate entries.

    Zero Orphans, Zero Risk: 

    • Foreign keys (e.g., links between rdb-200-game and rdb-200-appearances) have 0% orphans and 0% nulls in synthetic data, mirroring real data. 
    • All relationships (e.g., which players appeared in which games) are preserved precisely, avoiding broken workflows or analytics errors.

    Statistically Accurate: 

    • KS Statistic (0.0539) and Wasserstein Distance (0.0077) confirm that synthetic data distributions (e.g., shot counts, player stats) align almost perfectly with real data. 
    • These values are extremely close to zero, indicating very high synthetic data quality.

    Conclusion:

    Despite minor discrepancies in upper bounds of degree distributions (e.g., max degrees in synthetic data slightly truncated compared to real), the overall alignment is statistically sound. These differences are acceptable trade-offs in favor of privacy and generalization, particularly for non-identifiable insights or ML training. Importantly, all structural constraints are preserved, ensuring referential soundness.

    The slight variance in standard deviation and mean degrees reflects natural stochasticity in generative models and does not impact downstream analytical validity.

    The case study exhibits high fidelity and referential integrity, achieving near-perfect alignment with the original data in both structure and statistical behavior. This confirms IRG’s ability to generate relational synthetic data that supports production-grade applications, particularly in ML, analytics, and software testing, without risking data privacy.
