Multi-Table Synthesis: Handling Complex Relational Constraints with IRG-1

Dr. Zilong ZHAO
February 26, 2026

Summary:
  • Relational Integrity Preserved: IRG-1 maintains complex schema constraints—including PKs, FKs, composite keys, nullable fields, and inter-table dependencies—ensuring structurally valid synthetic datasets.
  • Preprocessing Enables Valid DAG: Real-world datasets often violate declared schemas; targeted preprocessing (e.g., resolving cyclic dependencies) is essential to enable generation via a directed acyclic graph.
  • Handles Advanced Constraints: The platform supports overlapping composite foreign keys, non-equality constraints, and sequentially ordered tables (e.g., by date or time).
  • Topology-Aware Generation: Defining a dependency-respecting topological order ensures child tables are generated only after their parent tables.
  • Zero Orphans, Consistent Shapes: Synthetic outputs retain table shapes, enforce PK uniqueness, and satisfy FK constraints without orphan records.
  • Flexible Configuration: UI and JSON-based schema setup supports both interactive use and automated pipelines.

For multi-table synthetic data generation, one of the most challenging tasks is preserving the relational schema constraints of the original dataset in the synthetic output. This includes maintaining primary key (PK), foreign key (FK), composite key, nullable constraints, and inter-table dependencies.

In this blog, we use a complex football dataset to illustrate the practical challenges of multi-table synthesis and how we address them using Betterdata’s IRG-1 platform.

Data Preprocessing

One issue with the raw football dataset is that it does not fully comply with its declared relational schema. Therefore, a preprocessing step is required to clean and standardize the data before synthesis. Beyond basic cleaning, we apply the following structural adjustments:

  1. Only 5 leagues are present, so the league table is removed, and the league ID in the remaining tables is treated as a categorical column.
  2. Our model does not use textual information, so text columns such as names are removed. After this removal, some tables contain only ID columns. To avoid corner-case bugs, we add a placeholder column holding a single constant value to these tables.
  3. Some (game, assist provider) pairs in SHOTS refer to a player who did not appear in that game. These assist provider values are set to N/A.
  4. The cyclic dependency between TEAMSTATS and GAMES is decomposed into the chain GAMES -> TEAMSTATS -> GAMESTATS.

This transformation ensures a valid directed acyclic graph (DAG) structure for generation.
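As a minimal sketch of why step 4 matters, the check below tests whether a table dependency graph is acyclic. It uses Python's standard `graphlib` module (Python 3.9+). The "before" graph is a hypothetical simplification in which GAMES and TEAMSTATS reference each other, and the arrows in the chain are read as "parent -> child"; neither detail is taken from the actual dataset schema or from IRG-1's implementation.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(parents_of):
    """Return True if the table dependency graph contains no cycle."""
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(parents_of).prepare()
        return True
    except CycleError:
        return False


# Hypothetical raw schema: the two tables reference each other (a cycle).
raw_schema = {
    "GAMES": {"TEAMSTATS"},
    "TEAMSTATS": {"GAMES"},
}

# After preprocessing: the chain GAMES -> TEAMSTATS -> GAMESTATS,
# written as "table: set of tables it depends on".
processed_schema = {
    "GAMES": set(),
    "TEAMSTATS": {"GAMES"},
    "GAMESTATS": {"TEAMSTATS"},
}

raw_is_dag = is_dag(raw_schema)
processed_is_dag = is_dag(processed_schema)
```

Only the processed graph passes the check, which is the property a parent-before-child generation order relies on.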

The full preprocessing script is provided at the end of this blog. The table relationships of the processed football dataset are shown in Figure 1, and the primary key (PK) and foreign key (FK) constraints are listed in Table 1.

Why This Dataset Is Difficult

To the best of our knowledge, most existing multi-table synthesizers can handle simple PK–FK relationships. However, this football dataset introduces several advanced challenges:

  1. Composite primary keys (e.g., in APPEARANCES)
  2. Overlapping composite foreign keys (in GAMES and SHOTS)
  3. Nullable foreign keys (in SHOTS)
  4. Non-equality constraints (across GAMES and SHOTS)
  5. Sequential structure:
       • TEAMSTATS sorted by date
       • SHOTS sorted by minute

These structural properties significantly increase modeling complexity.
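To make items 2 and 3 more concrete, here is a small plain-Python sketch of a nullable composite foreign key, together with the cleanup described in preprocessing step 3. The column names (`game_id`, `player_id`, `shooter_id`, `assister_id`) and the toy rows are assumptions for illustration; the real dataset's columns and the actual preprocessing script may differ. The sketch assumes SHOTS references APPEARANCES twice, once through (game, shooter) and once through (game, assist provider), so the two keys overlap on the game column.

```python
# Hypothetical parent rows: which player appeared in which game.
appearances = [
    {"game_id": 1, "player_id": 10},
    {"game_id": 1, "player_id": 11},
    {"game_id": 2, "player_id": 10},
]

# Hypothetical child rows.
shots = [
    {"game_id": 1, "shooter_id": 10, "assister_id": 11},    # valid assist
    {"game_id": 2, "shooter_id": 10, "assister_id": 11},    # 11 did not appear in game 2
    {"game_id": 2, "shooter_id": 10, "assister_id": None},  # unassisted shot
]

# Composite key values that exist in the parent table.
appeared = {(a["game_id"], a["player_id"]) for a in appearances}


def null_invalid_assists(shots, appeared):
    """Return copies of the shots with unmatched assist providers set to None."""
    cleaned = []
    for shot in shots:
        shot = dict(shot)  # copy so the input rows stay untouched
        key = (shot["game_id"], shot["assister_id"])
        if shot["assister_id"] is not None and key not in appeared:
            shot["assister_id"] = None
        cleaned.append(shot)
    return cleaned


cleaned_shots = null_invalid_assists(shots, appeared)
```

Because the game column is shared by both keys, it cannot be changed to repair one reference without affecting the other, so only the nullable assist provider column is cleared.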

Figure 1. Relational Schema of Football Dataset

Table 1. Primary Key, Foreign Key Constraints for Football Dataset

IRG-1 Solution

In IRG-1, we systematically address all of the above constraints. Within the IRG-1 platform, users explicitly configure relational constraints before training.

1. Schema Configuration

Figure 2. Configuration for Primary Key for GAMESTATS Table in IRG-1

Figure 3. Configuration for Foreign Key for TEAMSTATS Table in IRG-1

2. Topological Generation Order

Beyond schema configuration, IRG-1 requires defining a topological order of tables, which determines the generation sequence. This order must respect inter-table dependencies.

For example:

  • APPEARANCES depends on both PLAYERS and GAMES
  • Therefore, those parent tables must be generated first

Note that the valid ordering is not unique, but certain precedence rules must always be satisfied.

Figure 4. Topology Order of Football Dataset
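The precedence rule can be sketched with Python's standard `graphlib` module (Python 3.9+). Only the three tables from the example above are included, and the helper `respects_dependencies` is a hypothetical name introduced here; this illustrates the idea and is not how IRG-1 itself computes or validates the order.

```python
from graphlib import TopologicalSorter

# Child table -> set of parent tables it references (fragment of the schema).
parents_of = {
    "PLAYERS": set(),
    "GAMES": set(),
    "APPEARANCES": {"PLAYERS", "GAMES"},
}


def respects_dependencies(order, parents_of):
    """Return True if every parent table comes before each of its children."""
    position = {table: i for i, table in enumerate(order)}
    return all(
        position[parent] < position[child]
        for child, parents in parents_of.items()
        for parent in parents
    )


# One valid order computed automatically.
one_valid_order = list(TopologicalSorter(parents_of).static_order())

# Two hand-written orders that are both valid, and one that is not.
ok_a = respects_dependencies(["PLAYERS", "GAMES", "APPEARANCES"], parents_of)
ok_b = respects_dependencies(["GAMES", "PLAYERS", "APPEARANCES"], parents_of)
bad = respects_dependencies(["APPEARANCES", "PLAYERS", "GAMES"], parents_of)
```

Both `ok_a` and `ok_b` pass, which shows that more than one ordering is acceptable as long as APPEARANCES comes after both of its parents.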

3. JSON-Based Configuration

The IRG-1 platform allows toggling between UI mode and JSON mode for schema configuration. This provides flexibility for advanced users and automation pipelines.

Figure 5. JSON Configuration of Relational Schema for Football Dataset
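For automation pipelines, a JSON schema can be sanity-checked before it is submitted. The layout below is purely hypothetical and is not IRG-1's actual JSON format (which is the one shown in Figure 5); the field names `tables`, `primary_key`, `foreign_keys`, `columns`, `parent`, and `parent_columns`, as well as the column names, are assumptions made for this sketch.

```python
import json

# HYPOTHETICAL layout for illustration only; not IRG-1's real JSON format.
schema_json = """
{
  "tables": {
    "PLAYERS": {"primary_key": ["player_id"], "foreign_keys": []},
    "GAMES": {"primary_key": ["game_id"], "foreign_keys": []},
    "APPEARANCES": {
      "primary_key": ["game_id", "player_id"],
      "foreign_keys": [
        {"columns": ["player_id"], "parent": "PLAYERS", "parent_columns": ["player_id"]},
        {"columns": ["game_id"], "parent": "GAMES", "parent_columns": ["game_id"]}
      ]
    }
  }
}
"""


def validate_schema(schema):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    tables = schema["tables"]
    for name, spec in tables.items():
        if not spec["primary_key"]:
            problems.append(f"{name}: missing primary key")
        for fk in spec["foreign_keys"]:
            if fk["parent"] not in tables:
                problems.append(f"{name}: unknown parent table {fk['parent']}")
            elif len(fk["columns"]) != len(fk["parent_columns"]):
                problems.append(f"{name}: column count mismatch for FK to {fk['parent']}")
    return problems


schema = json.loads(schema_json)
problems = validate_schema(schema)
```

A check like this catches typos in table names early, before any training time is spent.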

Result Evaluation

After model training and data sampling, we evaluate the quality of the synthetic multi-table data.

1. Table Shape Consistency

Figure 6 shows that each synthetic table has exactly the same number of columns and rows as its real counterpart.

Figure 6. Table shapes of real and synthetic data. #C is the number of columns, and #R is the number of rows. 

2. Primary Key Uniqueness

Figure 7 shows that PKs remain unique across all real and synthetic tables.

Figure 7. Uniqueness of Primary Key
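A uniqueness check of this kind can be sketched in a few lines of plain Python. For a composite key, the whole tuple of key columns must be unique even though individual columns repeat. The column names and rows are illustrative assumptions, and `duplicated_keys` is a hypothetical helper, not part of the IRG-1 evaluation code.

```python
from collections import Counter


def duplicated_keys(rows, key_columns):
    """Return the key tuples that occur more than once in the table."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical APPEARANCES rows with a composite key (game_id, player_id).
appearances = [
    {"game_id": 1, "player_id": 10},
    {"game_id": 1, "player_id": 11},
    {"game_id": 2, "player_id": 10},
]

# The composite key is unique, although game_id alone is not.
composite_dups = duplicated_keys(appearances, ["game_id", "player_id"])
single_column_dups = duplicated_keys(appearances, ["game_id"])

# Adding a repeated (game, player) pair violates the primary key.
violating = appearances + [{"game_id": 1, "player_id": 10}]
violations = duplicated_keys(violating, ["game_id", "player_id"])
```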

3. Foreign Key Satisfaction

Figure 8 introduces the concept of “degree”: the number of child rows that reference the same parent row. For example, a player (a row in PLAYERS, the parent table) can appear in many games (rows in APPEARANCES, the child table). In our report, we compare the degree distributions of the real and synthetic data.

Figure 8. Degree Distribution for Three FKs
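The degree definition translates directly into code. The sketch below counts, for each parent key, how many child rows reference it, and then summarizes how many parents have each degree. The IDs are toy values and `degree_distribution` is a hypothetical helper; parents with no children are counted with degree 0, which is an assumption about how the distribution is defined.

```python
from collections import Counter


def degree_distribution(parent_keys, child_fk_values):
    """Map each degree to the number of parent rows that have that degree."""
    # Number of child rows per referenced parent key (null references ignored).
    per_parent = Counter(v for v in child_fk_values if v is not None)
    # Look up every parent, so parents without children get degree 0.
    degrees = [per_parent.get(key, 0) for key in parent_keys]
    return dict(sorted(Counter(degrees).items()))


# Hypothetical data: three players and the player IDs found in APPEARANCES.
players = [10, 11, 12]
appearance_player_ids = [10, 10, 11, 10]

# Player 10 has degree 3, player 11 has degree 1, player 12 has degree 0.
dist = degree_distribution(players, appearance_player_ids)
```

Running the same function on the real and the synthetic tables gives two distributions that can be compared side by side.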

The relations between tables in a multi-table dataset are defined by FK constraints, and the dataset is considered valid only if all of them are satisfied. An FK constraint is satisfied when there are no "orphans", i.e., values in the child table that are not found in the parent table. No orphans are allowed.

Figure 9. Foreign Key Satisfaction and Statistics of Real and Synthetic Data.
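An orphan check can be sketched as a set lookup on the parent keys. The helper `find_orphans` and the sample rows are hypothetical. The sketch also assumes that a child row with a null value in a nullable FK is not an orphan, which mirrors common SQL behaviour but is not confirmed as IRG-1's exact rule.

```python
def find_orphans(child_rows, fk_columns, parent_rows, parent_key_columns):
    """Return child rows whose FK value has no matching parent key."""
    parent_keys = {
        tuple(row[col] for col in parent_key_columns) for row in parent_rows
    }
    orphans = []
    for row in child_rows:
        key = tuple(row[col] for col in fk_columns)
        if any(value is None for value in key):
            continue  # assumption: a null reference is allowed, not an orphan
        if key not in parent_keys:
            orphans.append(row)
    return orphans


# Hypothetical parent and child tables.
players = [{"player_id": 10}, {"player_id": 11}]
appearances = [
    {"game_id": 1, "player_id": 10},
    {"game_id": 1, "player_id": 99},    # no such player: orphan
    {"game_id": 2, "player_id": None},  # null reference: skipped
]

orphans = find_orphans(appearances, ["player_id"], players, ["player_id"])
```

A valid synthetic dataset yields an empty list for every FK in the schema.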

4. Single-Table Statistical Evaluation

We also run a more detailed single-table evaluation, which includes (1) marginal distributions, (2) pairwise correlations, and (3) privacy evaluation. For brevity, we do not show all of the figures in this blog.
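As one possible illustration of a marginal distribution comparison, the sketch below computes the total variation distance between the real and synthetic frequencies of a categorical column. The choice of this metric is an assumption made for the example; the post does not state which metrics the evaluation report uses, and the league values are toy data.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Return 0.0 for identical category frequencies and 1.0 for disjoint ones."""
    real_freq = Counter(real_values)
    syn_freq = Counter(synthetic_values)
    n_real, n_syn = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(syn_freq)
    # Counter returns 0 for categories missing from one of the two columns.
    return 0.5 * sum(
        abs(real_freq[c] / n_real - syn_freq[c] / n_syn) for c in categories
    )


# Hypothetical categorical column (e.g., a league ID) in real and synthetic data.
real_league = ["a", "a", "b", "b"]
synthetic_league = ["a", "a", "a", "b"]

tvd = total_variation_distance(real_league, synthetic_league)
```

Lower values mean the synthetic column reproduces the real category frequencies more closely.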

The IRG-1 codebase is publicly available at:

👉 https://github.com/li-jiayu-ljy/irg

If you would like to explore multi-table synthesis for your own structured datasets, feel free to reach out — we would be happy to discuss further.

👉 zilong@betterdata.ai

Dr. Zilong ZHAO
Dr. Zilong Zhao is the Head of Research and Development at Betterdata, bringing a wealth of experience in data science and generative models. With a Ph.D. from the University of Grenoble Alpes, Dr. Zhao has pursued postdoctoral research at institutions including TU Delft, TU Munich, and the National University of Singapore. Throughout his academic career, Dr. Zhao has focused on advancing generative models for structured data, contributing to the field with his work on single table, time series, and relational databases. His papers, CTABGAN and CTABGAN+, are among the most cited in the area of single-table data synthesis.