Introducing Incremental Relational Generator (IRG): a deep learning approach to synthetic relational data generation that first learns the structure of a relational database and then produces synthetic data accurately and at scale while preserving database integrity.
IRG is the latest addition to our collection of state-of-the-art synthetic data generation and augmentation models, which also includes ARF, TabTreeFormer, and TAEGAN. To understand how they apply to your specific use cases, request a call with our data team.
What is the Challenge with Relational Synthetic Data?
Data does not work in a silo. To be meaningful, a data point has to be connected to other data points, forming a relation. Relational databases built on such relations are the building blocks of modern analytical and management systems, from e-commerce platforms analyzing orders to healthcare systems tracking patient records. These databases rely on precise relationships between tables, governed by constraints like primary keys (PKs) and foreign keys (FKs) to ensure data consistency. Generating relational synthetic data that accurately mirrors these interconnected structures, however, has long been a challenge.
Traditional data generation methods often fail to handle complex schema features such as composite primary keys, nullable foreign keys, and sequential dependencies. These limitations lead to synthetic datasets that break schema integrity and are unusable in production.
Incremental Relational Generator (IRG) addresses these challenges using advanced deep learning. It generates high-quality relational synthetic data that maintains full schema integrity, accurately supports multi-column keys, handles optional references between tables, and preserves the order and dependencies in time-series or event-based data.
With IRG, enterprises can create realistic, constraint-compliant synthetic datasets ideal for AI/ML training, testing, and safe data sharing, no matter how complex the schema.
How Does Incremental Relational Generator Work?
Incremental Table Generation:
IRG generates tables step-by-step, with each new table informed by previously created ones. This ensures dependencies (like FK constraints) are honored. For example, a "customer orders" table is generated only after its parent "customers" table exists.
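The ordering idea can be illustrated with a minimal sketch. It assumes a hypothetical schema described as a mapping from each table to the parent tables it references, and it uses Python's standard graphlib module to derive an order in which parents always come first. This is an illustration of the dependency ordering only, not IRG's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of parent tables
# it references through foreign keys.
schema = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}

def generation_order(parents_by_table):
    """Return table names so that every parent precedes its children."""
    # TopologicalSorter expects {node: predecessors}; static_order()
    # yields predecessors before the nodes that depend on them.
    return list(TopologicalSorter(parents_by_table).static_order())

order = generation_order(schema)
print(order)
```

In this sketch, graphlib raises a CycleError if the foreign keys form a cycle, so a cyclic schema would need separate treatment that is not shown here.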
Context-Aware Deep Learning:
Using a modified conditional GAN framework, IRG generates synthetic data for each table while conditioning on relevant parent rows. This means a "product reviews" table is generated in context with its linked "products" and "users" tables, preserving relationships.
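To make "conditioning on relevant parent rows" concrete, the sketch below assembles the context that a conditional generator could receive for one "product reviews" row. The tables, column names, and the build_context helper are all hypothetical, and the GAN itself is deliberately left out; only the lookup of parent features is shown.

```python
# Hypothetical parent tables, keyed by primary key.
products = {
    1: {"category": "book", "price": 12.5},
    2: {"category": "toy", "price": 30.0},
}
users = {
    10: {"age": 34, "country": "BR"},
    11: {"age": 27, "country": "SG"},
}

def build_context(product_id, user_id):
    """Combine the features of both parent rows into one conditioning record."""
    context = {f"product_{k}": v for k, v in products[product_id].items()}
    context.update({f"user_{k}": v for k, v in users[user_id].items()})
    return context

# A conditional generator (not shown) would take this record, together with
# random noise, as input when producing the review's own columns.
ctx = build_context(1, 11)
print(ctx)
```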
Sequential Dependency Modeling:
For time-series data (e.g., payment logs), IRG uses a conditional time-series model to capture patterns like order timestamps or user activity sequences.
Composite and Nullable Key Support:
Unlike prior methods, IRG handles overlapping composite keys (e.g., a PK made of two FKs) and nullable FKs (e.g., an "assisted_by" column that can be empty).
Guaranteed Constraint Accuracy:
IRG guarantees 100% validity for PKs and FKs, so there are no duplicate keys or broken links. This is critical for analytics-ready synthetic data.
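What "valid" means for composite and nullable keys can be expressed as two small checks. The sketch below is a hypothetical validator that could be run on any generated tables; the table and column names (other than "assisted_by") are invented for illustration, and it is not part of IRG.

```python
def pk_is_unique(rows, key_cols):
    """True if no two rows share the same (possibly composite) key."""
    keys = [tuple(row[c] for c in key_cols) for row in rows]
    return len(keys) == len(set(keys))

def fk_is_valid(child_rows, fk_col, parent_rows, parent_key, nullable=False):
    """True if every FK value exists in the parent table.

    A value of None is accepted only when the FK is declared nullable.
    """
    parent_keys = {row[parent_key] for row in parent_rows}
    for row in child_rows:
        value = row[fk_col]
        if value is None:
            if not nullable:
                return False
            continue
        if value not in parent_keys:
            return False
    return True

# Hypothetical data: the PK of "enrollments" is made of two FKs, and
# "assisted_by" in "tickets" is a nullable FK to "staff".
staff = [{"staff_id": 1}, {"staff_id": 2}]
tickets = [
    {"ticket_id": 1, "assisted_by": 2},
    {"ticket_id": 2, "assisted_by": None},
]
enrollments = [
    {"student_id": 1, "course_id": 7},
    {"student_id": 1, "course_id": 8},
    {"student_id": 2, "course_id": 7},
]

print(pk_is_unique(enrollments, ["student_id", "course_id"]))
print(fk_is_valid(tickets, "assisted_by", staff, "staff_id", nullable=True))
```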
Advantages Over Existing Methods:
Handles Complex Schemas:
Supports composite keys, cyclic dependencies, and tables with multiple parents (e.g., a "step-sibling" table sharing a parent).
Scalability:
By decomposing tasks into smaller subtasks, IRG efficiently scales to databases with millions of rows.
Sequential Data:
Captures time-series patterns, like purchase histories, without manual tuning.
Privacy Preservation:
Avoids overfitting to exact key distributions, reducing leakage risks.
Experimental Results:
We ran experiments on three relational databases of varying scale and domain, each with a sufficiently complex schema and a mix of categorical and continuous columns in addition to IDs, comparing IRG with leading models such as IND, HMA, and RC-TGAN. In all three experiments, IRG outperformed the other models in relational synthetic data generation.
Football Database:
- Challenge: Composite PKs (e.g., game_id + player_id in appearances) and nullable FKs (e.g., assister_id in shots).
- IRG’s Performance: Enforced 100% uniqueness for composite keys and valid nullable references.
- Outcome: Generated 5,000 games without errors; outperformed competitors in 8/10 analytical queries.
Brazilian E-Commerce Database:
- Challenge: Composite PKs with serial IDs (e.g., order_id + item_number).
- IRG’s Performance: Guaranteed 100% valid serial IDs, avoiding duplicates.
- Outcome: Outperformed IND on metrics like order-item distributions, while the other competitors crashed during training.
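One way a serial ID inside a composite key can be made valid by construction is to number rows within each parent instead of sampling the number. The sketch below illustrates that idea with a hypothetical assign_item_numbers helper; it is an assumption about how such a guarantee could be obtained, not a description of IRG's internals.

```python
def assign_item_numbers(order_ids):
    """Give each row a serial number that restarts at 1 for every order_id."""
    counters = {}
    rows = []
    for order_id in order_ids:
        counters[order_id] = counters.get(order_id, 0) + 1
        rows.append({"order_id": order_id, "item_number": counters[order_id]})
    return rows

# Hypothetical order IDs for five generated order items.
rows = assign_item_numbers(["A", "A", "B", "A", "B"])
for row in rows:
    print(row)
```

Because the counter only ever increases within an order, the pair (order_id, item_number) cannot repeat.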
Beatport Tracks Database:
- Challenge: Million-row tables with composite keys (e.g., artist_id + track_id).
- IRG’s Performance: Scaled seamlessly, avoiding memory bottlenecks.
- Outcome: Achieved 100% constraint compliance; statistical metrics (K-S < 0.2, Wasserstein < 0.05) mirrored real data.
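For readers unfamiliar with the two statistics, the sketch below computes the two-sample Kolmogorov-Smirnov statistic and the one-dimensional Wasserstein distance for a single numeric column, using only the standard library. The sample values are invented, and this article does not state how the reported figures were computed or whether columns were normalized first, so treat this as a definition-level illustration only.

```python
from bisect import bisect_right

def ecdf_gap(a_sorted, b_sorted, x):
    """Absolute difference between the two empirical CDFs at point x."""
    fa = bisect_right(a_sorted, x) / len(a_sorted)
    fb = bisect_right(b_sorted, x) / len(b_sorted)
    return abs(fa - fb)

def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of the two samples."""
    a, b = sorted(real), sorted(synthetic)
    return max(ecdf_gap(a, b, x) for x in a + b)

def wasserstein_1d(real, synthetic):
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    a, b = sorted(real), sorted(synthetic)
    points = sorted(a + b)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so the area is gap * width.
        total += ecdf_gap(a, b, left) * (right - left)
    return total

# Invented sample values: the synthetic column is the real one shifted by 0.5.
real = [1.0, 2.0, 3.0, 4.0]
synthetic = [1.5, 2.5, 3.5, 4.5]
print(ks_statistic(real, synthetic))
print(wasserstein_1d(real, synthetic))
```

Both values are 0 when the two samples are identical. The Wasserstein distance is expressed in the units of the column, so a threshold on it is only meaningful relative to the column's scale.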
Conclusion:
IRG represents a paradigm shift in relational synthetic data generation. By using deep learning to understand complex database schemas, IRG can generate realistic relational synthetic data at scale for enterprises to innovate and grow seamlessly.
To read the complete research paper, click here.