Multi-Table Synthesis: Handling Complex Relational Constraints with IRG-1

In multi-table synthetic data generation, one of the hardest tasks is preserving the relational schema constraints of the original dataset in the synthetic output: primary keys (PKs), foreign keys (FKs), composite keys, nullability constraints, and inter-table dependencies.

In this blog, we use a complex football dataset to illustrate the practical challenges of multi-table synthesis and how we address them using Betterdata’s IRG-1 platform.

Data Preprocessing

One issue with the raw football dataset is that it does not fully comply with its declared relational schema. Therefore, a preprocessing step is required to clean and standardize the data before synthesis. Beyond basic cleaning, we apply the following structural adjustments:

Only five leagues are present, so the league table is removed and the league ID in the remaining tables is treated as a categorical column.
Our model does not use textual information, so names and similar columns are removed. After this removal, some tables contain only ID columns. To avoid corner-case bugs, we insert a placeholder column with a single constant value into those tables.
Some (game, assist provider) pairs in SHOTS refer to a player who did not appear in that game; these assist providers are set to N/A.
The cyclic dependency between TEAMSTATS and GAMES is decomposed into GAMES -> TEAMSTATS -> GAMESTATS.

This transformation ensures a valid directed acyclic graph (DAG) structure for generation.
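The assist-provider adjustment above can be illustrated with a minimal sketch. It is not the preprocessing script referred to in this post; the column names `gameID`, `playerID`, `shooterID`, and `assisterID`, and the helper `mask_invalid_assisters`, are assumptions made for illustration.

```python
def mask_invalid_assisters(shots, appearances):
    """Set assisterID to None when the assister has no appearance row for that game.

    Column names are hypothetical; adapt them to the actual dataset.
    """
    # All (game, player) pairs that exist in the parent table.
    appeared = {(row["gameID"], row["playerID"]) for row in appearances}
    cleaned = []
    for shot in shots:
        shot = dict(shot)  # copy so the input rows are not mutated
        assister = shot.get("assisterID")
        if assister is not None and (shot["gameID"], assister) not in appeared:
            shot["assisterID"] = None  # becomes N/A instead of a dangling reference
        cleaned.append(shot)
    return cleaned


appearances = [
    {"gameID": 1, "playerID": 10},
    {"gameID": 1, "playerID": 11},
    {"gameID": 2, "playerID": 10},
]
shots = [
    {"gameID": 1, "shooterID": 10, "assisterID": 11},    # valid pair
    {"gameID": 2, "shooterID": 10, "assisterID": 11},    # player 11 did not appear in game 2
    {"gameID": 2, "shooterID": 10, "assisterID": None},  # already N/A
]
cleaned_shots = mask_invalid_assisters(shots, appearances)
```

After this step, every non-null (game, assist provider) pair in SHOTS has a matching row in APPEARANCES, so the composite FK can be declared as nullable rather than being violated.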

The full preprocessing script is provided at the end of this blog. The relationships between tables in the processed football dataset are shown in Figure 1, and the primary key (PK) and foreign key (FK) constraints are listed in Table 1.

Why This Dataset Is Difficult

To the best of our knowledge, most existing multi-table synthesizers can handle simple PK–FK relationships. However, this football dataset introduces several advanced challenges:

Composite primary keys (e.g., in APPEARANCES)
Overlapping composite foreign keys (in GAMES and SHOTS)
Nullable foreign keys (in SHOTS)
Non-equality constraints (across GAMES and SHOTS)
Sequential structure (TEAMSTATS sorted by date, SHOTS sorted by minute)

These structural properties significantly increase modeling complexity.

Figure 1. Relational Schema of Football Dataset

Table 1. Primary Key, Foreign Key Constraints for Football Dataset

IRG-1 Solution

In IRG-1, we systematically address all of the above constraints. Within the IRG-1 platform, users explicitly configure relational constraints before training.

1. Schema Configuration

Figure 2. Configuration for Primary Key for GAMESTATS Table in IRG-1

Figure 3. Configuration for Foreign Key for TEAMSTATS Table in IRG-1

2. Topological Generation Order

Beyond schema configuration, IRG-1 requires defining a topological order of tables, which determines the generation sequence. This order must respect inter-table dependencies.

For example:

APPEARANCES depends on both PLAYERS and GAMES
Therefore, those parent tables must be generated first

Note that the valid ordering is not unique, but certain precedence rules must always be satisfied: every parent table must come before the tables that reference it.
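As a rough illustration of what a valid ordering means, the sketch below derives one with Python's standard `graphlib` module and checks an arbitrary ordering against the precedence rules. It does not reflect how IRG-1 computes or stores the order. Only the APPEARANCES dependencies come from the example above; the SHOTS entry and the helper names are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Maps each table to the set of parent tables it references.
parents = {
    "PLAYERS": set(),
    "GAMES": set(),
    "APPEARANCES": {"PLAYERS", "GAMES"},
    "SHOTS": {"APPEARANCES"},  # assumed dependency, for illustration only
}


def generation_order(parents):
    """Return one valid generation order (parents before children)."""
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of tables that form the cycle.
        raise ValueError(f"schema contains a cyclic dependency: {exc.args[1]}") from exc


def respects_dependencies(order, parents):
    """Check that every parent appears before each of its child tables."""
    position = {table: i for i, table in enumerate(order)}
    return all(
        position[parent] < position[child]
        for child, table_parents in parents.items()
        for parent in table_parents
    )


order = generation_order(parents)
```

Both `["PLAYERS", "GAMES", "APPEARANCES", "SHOTS"]` and `["GAMES", "PLAYERS", "APPEARANCES", "SHOTS"]` pass `respects_dependencies`, which is why the ordering is not unique. A cyclic schema is rejected outright, which is why the dependency graph has to be acyclic before generation.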

Figure 4. Topological Order of Football Dataset

3. JSON-Based Configuration

The IRG-1 platform allows toggling between UI mode and JSON mode for schema configuration. This provides flexibility for advanced users and automation pipelines.

Figure 5. JSON Configuration of Relational Schema for Football Dataset

Result Evaluation

After model training and data sampling, we evaluate the quality of the synthetic multi-table data.

1. Table Shape Consistency

As Figure 6 shows, every synthetic table has exactly the same number of columns and rows as its real counterpart.

Figure 6. Table shapes of real and synthetic data. #C is the number of columns, and #R is the number of rows.

2. Primary Key Uniqueness

Figure 7 shows that PKs remain unique across all real and synthetic tables.

Figure 7. Uniqueness of Primary Key
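A uniqueness check of this kind can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code behind the report. The helper `duplicate_keys` and the column names are assumptions; treating (`gameID`, `playerID`) as the composite key of APPEARANCES is likewise assumed for the example.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return every key tuple that occurs more than once, with its count.

    Works for single-column and composite keys alike.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


appearances = [
    {"gameID": 1, "playerID": 10, "goals": 0},
    {"gameID": 1, "playerID": 11, "goals": 1},
    {"gameID": 2, "playerID": 10, "goals": 2},
]

# An empty result means the composite key is unique.
violations = duplicate_keys(appearances, ["gameID", "playerID"])
```

Note that neither `gameID` nor `playerID` is unique on its own in this toy table; only the combination is, which is the defining property of a composite primary key.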

3. Foreign Key Satisfaction

Figure 8 reports the "degree" of each FK: the number of child rows that reference the same parent row. For example, one player (a row in the parent table PLAYERS) can appear in many games (rows in the child table APPEARANCES). In our report, we compare the degree distribution of the real data with that of the synthetic data.

Figure 8. Degree Distribution for Three FKs
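To make the definition of degree concrete, here is a minimal sketch that counts child rows per parent and then summarizes how many parents have each degree. The function name and the toy IDs are hypothetical, and the report may bin or plot the distribution differently.

```python
from collections import Counter


def degree_distribution(parent_keys, child_fk_values):
    """Map each degree to the number of parent rows that have that degree.

    Parents with no children count as degree 0; null FK values are ignored.
    """
    per_parent = Counter(v for v in child_fk_values if v is not None)
    degrees = [per_parent.get(key, 0) for key in parent_keys]
    return dict(sorted(Counter(degrees).items()))


# Toy example: three players and the playerID column of APPEARANCES.
real = degree_distribution([10, 11, 12], [10, 10, 11, 10])
synthetic = degree_distribution([1, 2, 3], [1, 1, 2, 2])
```

Here `real` is `{0: 1, 1: 1, 3: 1}` and `synthetic` is `{0: 1, 2: 2}`. Because synthetic IDs need not match real IDs, the two datasets are compared through these distributions rather than row by row.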

The relations between tables in a multi-table dataset are essentially defined by FK constraints, and a multi-table dataset is valid only if every FK constraint is satisfied. An FK constraint is satisfied when there are no "orphans", i.e., values in the child table that are not found in the parent table.

Figure 9. Foreign Key Satisfaction and Statistics of Real and Synthetic Data.
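An orphan check for a composite, nullable FK can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: the helper `find_orphans` and the column names are hypothetical, and a child row with a null in any FK column is treated as "no reference" rather than as an orphan.

```python
def find_orphans(child_rows, fk_columns, parent_rows, pk_columns):
    """Return child rows whose non-null FK tuple has no match in the parent table."""
    parent_keys = {tuple(row[c] for c in pk_columns) for row in parent_rows}
    orphans = []
    for row in child_rows:
        key = tuple(row[c] for c in fk_columns)
        if any(value is None for value in key):
            continue  # assumption: a null FK means "no reference", not a violation
        if key not in parent_keys:
            orphans.append(row)
    return orphans


appearances = [
    {"gameID": 1, "playerID": 10},
    {"gameID": 1, "playerID": 11},
]
shots = [
    {"gameID": 1, "assisterID": 11},    # matches an appearance
    {"gameID": 2, "assisterID": 11},    # orphan: no such (game, player) pair
    {"gameID": 2, "assisterID": None},  # nullable FK, skipped
]
orphans = find_orphans(
    shots, ["gameID", "assisterID"], appearances, ["gameID", "playerID"]
)
```

A valid synthetic dataset should yield an empty list for every declared FK.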

4. Single-Table Statistical Evaluation

We also run a more detailed single-table evaluation, which includes (1) marginal distributions, (2) pairwise correlations, and (3) privacy evaluation. For brevity, we do not show all of the figures in this blog.
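As one example of how a marginal distribution can be compared, the sketch below computes the total variation distance between the real and synthetic values of a single categorical column. This metric is a common choice picked here for illustration; it is an assumption and not necessarily the one used in the IRG-1 report. The column values are made up.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies.

    0.0 means identical marginals; 1.0 means no shared categories.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both columns must be non-empty")
    real_counts = Counter(real_values)
    synthetic_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synthetic_counts)
    return 0.5 * sum(
        abs(
            real_counts[c] / len(real_values)
            - synthetic_counts[c] / len(synthetic_values)
        )
        for c in categories
    )


real_column = ["Goal", "Miss", "Miss", "Saved"]
synthetic_column = ["Goal", "Miss", "Saved", "Saved"]
distance = total_variation_distance(real_column, synthetic_column)
```

For these toy columns the distance is 0.25: the share of "Miss" drops by 0.25 and the share of "Saved" rises by 0.25, while "Goal" is unchanged.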

The IRG-1 codebase is publicly available at:

👉 https://github.com/li-jiayu-ljy/irg

If you would like to explore multi-table synthesis for your own structured datasets, feel free to reach out — we would be happy to discuss further.

👉 zilong@betterdata.ai

Dr. Uzair Javaid