Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair worked as a Software Engineer and Business Development Executive at Merkle Science (Series A, $20M+ raised), where he developed taint analysis techniques for blockchain wallets. 

Uzair has a strong academic background in Computer Science/Engineering with a Ph.D. from the National University of Singapore (ranked among the top 10 universities worldwide). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques. 

In one of his Ph.D. projects, he reverse-engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and his work has been recognized with a Best Paper Award and several scholarships. 

In addition to his work at Betterdata AI, Uzair is an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives across Asia. He also pays it forward by volunteering as a peer student support group member at the National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

Case Study: Generate High-Quality Relational Synthetic Databases Using IRG

Dr. Uzair Javaid
May 6, 2025
Summary:
  • IRG generates structurally accurate synthetic data, preserving primary and foreign key constraints with 100% integrity.
  • Statistical similarity between real and synthetic datasets is extremely high, with minimal divergence in metrics like KS Statistic (0.0539) and Wasserstein Distance (0.0077).
  • Synthetic datasets maintain near-identical row counts, with <2% variance, ensuring scalability and schema consistency.
  • All relational dependencies are preserved, with zero orphan or null foreign keys, ensuring referential soundness.
  • IRG’s synthetic data is production-ready, enabling safe, high-fidelity use in ML, analytics, and software testing.

    What Is IRG?

    Incremental Relational Generator (IRG) is our state-of-the-art relational synthetic data generation model. It uses deep learning to generate accurate synthetic data without compromising database integrity. 

    How Does IRG Work?

    Incremental Relational Generator (IRG) generates relational synthetic databases through:

    • Incremental table generation
    • Context-aware deep learning
    • Sequential dependency modeling
    • Composite and nullable key support
    • Ensuring constraint accuracy
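IRG's internals are proprietary, but the incremental idea behind the first two points can be sketched: tables are generated in an order that respects foreign-key dependencies, so each child table is synthesized with its parent tables already available as context. A minimal illustration using Python's standard library (the schema below mirrors the case study's tables but is otherwise hypothetical):

```python
from graphlib import TopologicalSorter

# Table -> parent tables it references via foreign keys (illustrative schema).
# A child table can only be generated after all of its parents exist.
fk_parents = {
    "game": [],                          # no dependencies
    "player": [],
    "appearances": ["game", "player"],   # references both parent tables
    "shots": ["game", "player"],
}

# TopologicalSorter maps node -> predecessors, which matches
# "table -> tables it depends on", so parents always come first.
generation_order = list(TopologicalSorter(fk_parents).static_order())
print(generation_order)
```

Each table in `generation_order` can then be synthesized conditioned on the already-generated parent rows, which is what makes context-aware, dependency-respecting generation possible.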

    Use Cases For IRG:

    Because IRG understands complex relational database schemas, enterprises can generate realistic synthetic data at scale for use cases such as: 

    Pre-Training AI And ML Models:

    Protect Data Privacy:

    • Share data with partners, vendors, or research teams without violating GDPR, HIPAA, CCPA, or PDPC regulations.
    • Enable cross-team collaboration in large enterprises with restricted data access.

    Software Testing And QA:

    • Test relational database systems with realistic but safe synthetic relational datasets.
    • Simulate, augment, and enhance scarce or low-quality datasets to address edge cases, performance limits, and integration scenarios.

    Synthetic Sandboxing for LLMs & Agents:

    • Create safe relational environments for training/testing AI agents, RAG pipelines, or structured retrieval tasks.

    Data Analytics & BI Prototyping:

    • Build and test dashboards or analytics pipelines without relying on sensitive production data.
    • Helps avoid production slowdowns or risk exposure.

    Have a specific use case? Schedule a call to learn more.

    About The Case Study:

    This case study evaluates the synthetic dataset generated from a relational sports database, comprising four tables: game, player, appearances, and shots. The goal is to assess the synthetic data's statistical fidelity, structural integrity, and referential consistency relative to the real data. 

    Through comprehensive comparisons involving primary key uniqueness, foreign key validity, and distributional similarity measures (KS Statistic and Wasserstein Distance), we demonstrate that the synthetic dataset achieves high fidelity with minimal divergence, making it suitable for downstream ML workflows and privacy-compliant analytics.

    Dataset Overview:

    The dataset consists of four relational tables:

    • rdb-200-game: 200 rows
    • rdb-200-player: 4430 rows
    • rdb-200-appearances: ~5600 rows
    • rdb-200-shots: ~5100 rows

    The synthetic dataset maintains nearly identical row counts, with <2% relative difference in any table, ensuring scalability and structural balance across the data schema.
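The <2% row-count claim reduces to a simple relative-difference check. The real counts below come from the overview above; the synthetic counts are illustrative placeholders consistent with the reported variance, not the actual case-study figures:

```python
# Real row counts from the dataset overview; synthetic counts are
# hypothetical stand-ins within the <2% variance reported.
real_counts = {"game": 200, "player": 4430, "appearances": 5600, "shots": 5100}
synthetic_counts = {"game": 200, "player": 4415, "appearances": 5572, "shots": 5140}

for table, real_n in real_counts.items():
    syn_n = synthetic_counts[table]
    rel_diff = abs(syn_n - real_n) / real_n  # relative row-count difference
    print(f"{table}: {rel_diff:.2%} relative difference")
    assert rel_diff < 0.02, f"{table} exceeds the 2% variance budget"
```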

    Validation Methodology:

    Structural Validation:

    We validate primary keys by confirming uniqueness and non-null constraints. All primary keys in the synthetic dataset (e.g., gameID, playerID, composite keys) satisfy these constraints, with a 100% match to the real data.
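The primary-key check described above reduces to two operations, non-nullness and uniqueness, and the same check covers composite keys. A minimal sketch in pandas (column names follow the case study; the rows are toy data):

```python
import pandas as pd

def validate_primary_key(df: pd.DataFrame, key_cols: list[str]) -> bool:
    """Check that key_cols form a valid primary key: non-null in every
    row and unique across rows (works for composite keys too)."""
    no_nulls = bool(df[key_cols].notna().all(axis=None))
    unique = not df.duplicated(subset=key_cols).any()
    return no_nulls and unique

# Toy appearances table with a composite (gameID, playerID) key.
appearances = pd.DataFrame({
    "gameID": [1, 1, 2],
    "playerID": [10, 11, 10],
})
print(validate_primary_key(appearances, ["gameID", "playerID"]))  # True
```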

    Foreign Key Integrity:

    Four foreign key relationships were evaluated based on:

    • Null ratio
    • Orphan ratio (referential violations)
    • Idle parent ratio (parent records not referenced)
    • Degree distributions (fan-out statistics)
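All four metrics can be computed directly from a child foreign-key column and a parent primary-key column. A hedged sketch in pandas (the helper name and sample data are illustrative, not the case study's actual evaluation code):

```python
import pandas as pd

def fk_metrics(child: pd.Series, parent: pd.Series) -> dict:
    """Foreign-key health metrics: null ratio, orphan ratio (child values
    with no matching parent), idle-parent ratio (parents never referenced),
    and fan-out (degree) summary statistics."""
    null_ratio = child.isna().mean()
    non_null = child.dropna()
    orphan_ratio = (~non_null.isin(parent)).mean() if len(non_null) else 0.0
    idle_parent_ratio = (~parent.isin(non_null)).mean()
    degrees = non_null.value_counts()  # children per referenced parent
    return {
        "null_ratio": float(null_ratio),
        "orphan_ratio": float(orphan_ratio),
        "idle_parent_ratio": float(idle_parent_ratio),
        "mean_degree": float(degrees.mean()) if len(degrees) else 0.0,
        "max_degree": int(degrees.max()) if len(degrees) else 0,
    }

# Illustrative data: appearances.gameID referencing game.gameID.
game_ids = pd.Series([1, 2, 3])
appearance_game_ids = pd.Series([1, 1, 2, 2, 2])
print(fk_metrics(appearance_game_ids, game_ids))
```

In this toy example, game 3 is never referenced, so the idle-parent ratio is 1/3, while null and orphan ratios are both zero, matching the "zero orphans, zero nulls" standard the case study reports.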

    Statistical Distance Measures:

    • KS Statistic (0–1; lower is better)
    • Wasserstein Distance (≥0; lower is better)

    Low scores indicate high similarity between real and synthetic distributions.
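Both measures are available in SciPy, and a column-by-column comparison presumably looks something like the sketch below. The two samples here are stand-in data, not the case-study columns:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
# Stand-ins for one real column and its synthetic counterpart.
real = rng.normal(loc=0.0, scale=1.0, size=2000)
synthetic = rng.normal(loc=0.02, scale=1.0, size=2000)

ks_stat = ks_2samp(real, synthetic).statistic   # 0 = identical, 1 = disjoint
w_dist = wasserstein_distance(real, synthetic)  # >= 0, lower is better

print(f"KS statistic:         {ks_stat:.4f}")
print(f"Wasserstein distance: {w_dist:.4f}")
```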

    Distributional Alignment:

    We compared marginal distributions across all columns using histograms and tabular summary statistics. 

    • Distributions of counts per player, shots per game, and player appearances remained consistent.
    • Marginal distributions retained strong alignment across the modeled columns.

    Results:

    Real-World Parity:

    • Synthetic tables (e.g., rdb-200-game) match real data sizes exactly (200 real vs. 200 synthetic).
    • Minor differences in other tables (e.g., a 0.5% size gap in rdb-200-appearances) are statistically negligible.

    Guaranteed Data Integrity: 

    • Primary keys (like playerID) are 100% unique and non-null in both real and synthetic data. 
    • Databases and applications run flawlessly, with no errors from missing or duplicate entries.

    Zero Orphans, Zero Risk: 

    • Foreign keys (e.g., links between rdb-200-game and rdb-200-appearances) have 0% orphans and 0% nulls in synthetic data, mirroring real data. 
    • All relationships (e.g., which players appeared in which games) are preserved precisely, avoiding broken workflows or analytics errors.

    Statistically Accurate: 

    • KS Statistic (0.0539) and Wasserstein Distance (0.0077) confirm that synthetic data distributions (e.g., shot counts, player stats) align almost perfectly with real data. 
    • These values are extremely close to zero, indicating very high synthetic data quality.

    Conclusion:

    Despite minor discrepancies in upper bounds of degree distributions (e.g., max degrees in synthetic data slightly truncated compared to real), the overall alignment is statistically sound. These differences are acceptable trade-offs in favor of privacy and generalization, particularly for non-identifiable insights or ML training. Importantly, all structural constraints are preserved, ensuring referential soundness.

    The slight variance in standard deviation and mean degrees reflects natural stochasticity in generative models and does not impact downstream analytical validity.

    The case study exhibits high fidelity and referential integrity, achieving near-perfect alignment with the original data in both structure and statistical behavior. This confirms IRG’s ability to generate relational synthetic data that supports production-grade applications, particularly in ML, analytics, and software testing, without risking data privacy.
