Peeling the Layers of the Privacy-Utility Onion on Tabular Data

Dr. Zilong ZHAO
June 1, 2026

Summary:
  • Privacy Onion Effect: Once the most vulnerable training records are removed and the model is retrained, a new layer of previously safer records can become vulnerable.
  • The privacy onion effect is not limited to image models. We observe the same pattern across multiple tabular datasets.
  • Layer-by-layer removal works better than one-shot removal for privacy. Re-auditing and re-ranking the remaining data after each removal step pushes the real privacy curve closer to the ideal one.
  • Privacy gains come with utility costs. The records most vulnerable to membership inference are often also the ones carrying strong learning signal, which makes privacy-driven pruning a structural privacy-utility tradeoff.

Machine learning systems trained on sensitive tabular data can leak information through memorization. In practice, that leakage is often exposed through membership inference attacks (MIA), where an attacker with query access to a model attempts to determine whether a specific record was included in its training set.

For teams building or auditing privacy-sensitive systems, MIA is not just a threat, but also a diagnostic tool. A strong MIA can help identify the most exposed records in a training set, which naturally raises a practical question:

If we remove the riskiest records, do we actually solve the privacy problem?

The short answer is no. In tabular learning, privacy risk behaves more like an onion than a checklist. Peel one vulnerable layer away, retrain the model, and another vulnerable layer can appear.

1. Why LiRA Matters for Privacy Auditing:

We use the Likelihood Ratio Attack (LiRA) because it is one of the strongest black-box membership inference attacks for identifying highly exposed records. LiRA is especially informative in the extreme-risk tail, where average privacy risk may look low while a small set of records remains highly vulnerable.
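To make the per-record nature of the attack concrete, here is a minimal sketch of a LiRA-style score, assuming the commonly described variant that fits one Gaussian to shadow-model confidences from models trained with the record (IN) and one to confidences from models trained without it (OUT). The article does not state which LiRA variant was used, and the function names and inputs below are hypothetical.

```python
import math
from statistics import mean, stdev


def logit_scale(p, eps=1e-6):
    """Map a confidence in (0, 1) to the logit scale, clamping extremes."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def lira_score(target_conf, in_confs, out_confs, min_sigma=1e-3):
    """Log-likelihood ratio for one record (higher = more likely a member).

    target_conf: target model's confidence on the record's true label.
    in_confs / out_confs: confidences from shadow models trained with /
    without the record (assumed inputs, at least two values each).
    """
    phi = logit_scale(target_conf)
    phi_in = [logit_scale(p) for p in in_confs]
    phi_out = [logit_scale(p) for p in out_confs]
    sigma_in = max(stdev(phi_in), min_sigma)
    sigma_out = max(stdev(phi_out), min_sigma)
    return (gaussian_logpdf(phi, mean(phi_in), sigma_in)
            - gaussian_logpdf(phi, mean(phi_out), sigma_out))
```

Because the score is computed separately for each record, it can be used to rank the whole training set by exposure instead of reporting a single average.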

Figure 1 shows exactly why privacy auditing cannot stop at average-case metrics. On the ADULT dataset, the average TPR at 0.1% FPR is very small, but the worst-case samples are exposed almost perfectly. In other words, a model can look safe in aggregate while still leaking badly for a narrow but important slice of the training set.
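The metric mentioned here, TPR at a fixed low FPR, can be sketched as follows. This is an illustrative implementation, assuming attack scores are available for known members and known non-members; the function name and the strict-threshold convention are assumptions, not details taken from the study.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """True positive rate at the strictest threshold whose FPR <= target_fpr."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed_fp = int(target_fpr * len(ranked))  # false positives we may tolerate
    if allowed_fp >= len(ranked):
        threshold = float("-inf")
    else:
        # Predict "member" only for scores strictly above this value, so at
        # most `allowed_fp` non-members are flagged.
        threshold = ranked[allowed_fp]
    true_positives = sum(score > threshold for score in member_scores)
    return true_positives / len(member_scores)
```

Applying the same function to a subset of records (for example, only the highest-scoring ones) is one way to contrast average-case and worst-case exposure.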

That observation naturally suggests an intuitive mitigation strategy:

What if we simply remove the extreme-risk tail?

In practice, this approach falls short because of the privacy onion effect.

2. The Privacy Onion Effect:

Prior work, "The Privacy Onion Effect: Memorization is Relative," introduces the concept of the privacy onion effect:

Removing one layer of the most vulnerable samples and retraining reveals a new layer of previously safe samples that become vulnerable.

The experimental procedure is straightforward:

  • Rank training records by vulnerability.
  • Remove the highest-risk records.
  • Retrain the model.
  • Re-evaluate privacy risk on the remaining training set.

The effect is visualized using ROC curves. By varying the LiRA decision threshold τ, multiple TPR-FPR pairs are obtained and used to construct the curve. A lower ROC curve indicates stronger privacy because the attack achieves a lower true positive rate at the same false positive rate.
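The threshold sweep can be sketched in a few lines. This is a minimal illustration, assuming per-record attack scores for members and non-members are already available; `roc_points` is a hypothetical helper name.

```python
def roc_points(member_scores, nonmember_scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold tau."""
    thresholds = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for tau in thresholds:
        tpr = sum(s >= tau for s in member_scores) / len(member_scores)
        fpr = sum(s >= tau for s in nonmember_scores) / len(nonmember_scores)
        points.append((fpr, tpr))
    return points
```

Privacy audits typically read these curves on log-log axes, since the region of interest is the very low FPR range.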

The original study removed the top 5,000 most vulnerable samples from CIFAR-10 and compared three settings:

  • Baseline: Original model evaluated on the original training set.
  • Ideal: Original model and its per-record attack scores are kept, but the removed samples are excluded from the evaluation.
  • Reality: Model retrained after removal and re-evaluated.
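The three settings differ only in which scores are kept and whether the audit is rerun. The sketch below illustrates that bookkeeping with a deliberately simple toy audit: each record's "vulnerability" is its distance from the mean of the current training values, in units of their standard deviation. This stands in for "train a model, then run the attack" and is an assumption for illustration only, not the attack or data used in the study.

```python
from statistics import mean, pstdev


def toy_audit(values, train_idx):
    """Toy stand-in for retraining plus attack: relative outlierness per record."""
    current = [values[i] for i in train_idx]
    center, spread = mean(current), pstdev(current)
    return {i: abs(values[i] - center) / spread for i in train_idx}


def compare_settings(values, k):
    """Return per-record scores under the Baseline, Ideal, and Reality settings."""
    all_idx = list(range(len(values)))
    baseline = toy_audit(values, all_idx)                # original audit
    removed = set(sorted(baseline, key=baseline.get, reverse=True)[:k])
    remaining = [i for i in all_idx if i not in removed]
    ideal = {i: baseline[i] for i in remaining}          # old scores, removed rows ignored
    reality = toy_audit(values, remaining)               # audit rerun after removal
    return baseline, ideal, reality
```

In this toy, removing the largest outlier shrinks the spread of the remaining values, so records that looked ordinary before now stand out: the scores are relative to whatever else is in the training set.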

The key finding is the gap between Ideal and Reality.

If privacy risk were static, retraining would match the ideal outcome. Instead, retraining redistributes memorization pressure across the remaining records, causing some previously safe records to become newly exposed.

While this phenomenon was previously observed on image datasets, an important question remained:

Does the same effect appear in tabular datasets?

3. Why Tabular Data Matters:

Tabular data differs fundamentally from image data.

Compared with image datasets, tabular datasets are more likely to contain:

  • Heterogeneous feature scales
  • Sparse or repeated patterns
  • Structured feature dependencies
  • Rare but highly predictive rows

Privacy leakage in tabular models is often concentrated around atypical records. These records can be identifiable because they are unusual, while simultaneously being highly informative for model learning.

Removing them therefore does more than reduce risk. It reshapes the training landscape and changes which remaining records become memorized next.

4. From One-Shot Removal to Layer-by-Layer Removal:

Most prior discussions focus on one-shot removal:

  • Rank once
  • Remove vulnerable samples once
  • Retrain once

We instead ask:

What happens if privacy auditing itself becomes iterative?

5. Layer-by-Layer Workflow:

Given a total removal budget K:

  • Split the budget into m layers
  • Remove K/m records at a time
  • Retrain after each step
  • Re-audit privacy risk after retraining
  • Re-rank the remaining records

This creates two strategies:

6. One-Shot Removal:

  • Rank once on the original model
  • Remove top-K records in a single step

7. Layer-by-Layer Removal:

  • Audit
  • Remove one layer
  • Retrain
  • Re-rank
  • Repeat until the budget is exhausted
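The two strategies can be sketched side by side. This is a minimal illustration under stated assumptions: `audit` is a hypothetical callback that retrains on the given record indices and returns a vulnerability score per index, and `toy_audit` is a toy stand-in (distance from the current mean), not LiRA.

```python
from statistics import mean


def toy_audit(values, train_idx):
    """Toy stand-in for 'retrain, then attack': distance from the current mean."""
    center = mean([values[i] for i in train_idx])
    return {i: abs(values[i] - center) for i in train_idx}


def top_k(scores, k):
    """Indices of the k highest-scoring records."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def one_shot_removal(audit, train_idx, budget):
    """Rank once on the original data, then remove the top `budget` records."""
    removed = set(top_k(audit(train_idx), budget))
    return [i for i in train_idx if i not in removed]


def layer_by_layer_removal(audit, train_idx, budget, num_layers):
    """Audit, remove one layer, re-audit the remainder, and repeat."""
    remaining = list(train_idx)
    per_layer = budget // num_layers
    for _ in range(num_layers):
        removed = set(top_k(audit(remaining), per_layer))  # re-rank every layer
        remaining = [i for i in remaining if i not in removed]
    return remaining
```

With the toy values `[0, 1, 2, 4, 20]` and a budget of 2, one-shot removal drops the records at indices 4 and 0, while two layers of one record each drop indices 4 and 3, because the ranking shifts once the first outlier is gone.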

The key difference is whether privacy auditing acknowledges that vulnerability is dynamic.

8. Experimental Setup:

Four widely used tabular datasets were evaluated:

Dataset      Training Samples   Test Samples   Classes
ADULT                  30,162         15,060         2
ALOI                   86,400         21,600     1,000
BANKLOAN                4,000          1,000         2
COVERTYPE             464,809        116,203         7

Experimental settings:

  • Removal budget K = 5% of the training set
  • Number of layers m = 20
  • MLP models on ADULT, ALOI, and COVERTYPE
  • CatBoost on BANKLOAN

Consistent results were also observed under alternative architectures.
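For a sense of scale, the settings above translate into the following removal sizes. This is a quick arithmetic sketch; rounding down to whole records, and ignoring any leftover records when K is not divisible by m, are assumptions not stated in the article.

```python
TRAIN_SIZES = {"ADULT": 30_162, "ALOI": 86_400, "BANKLOAN": 4_000, "COVERTYPE": 464_809}
BUDGET_PERCENT = 5   # K = 5% of the training set
NUM_LAYERS = 20      # m


def removal_plan(n_train, budget_percent=BUDGET_PERCENT, num_layers=NUM_LAYERS):
    """Return (K, records per layer), rounding down to whole records."""
    budget = n_train * budget_percent // 100
    per_layer = budget // num_layers
    return budget, per_layer


for name, n_train in TRAIN_SIZES.items():
    budget, per_layer = removal_plan(n_train)
    print(f"{name}: K={budget}, about {per_layer} records per layer")
```

On BANKLOAN, for example, this gives K = 200 and 10 records per layer, so each layer is a very thin slice of the training set.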

9. Pattern 1: Layer-by-Layer Privacy Onion Effect:

The first major finding is:

Vulnerability is dynamic, and layer-by-layer removal achieves stronger privacy than one-shot removal.

Two observations support this conclusion:

a. The Privacy Onion Effect Exists in Tabular Data:

Across all datasets:

  • Reality curves remain higher than ideal curves.
  • One-shot removal improves privacy relative to the baseline.
  • After retraining, privacy remains worse than the ideal curve predicts.

This confirms that the privacy onion effect is not limited to image models.

b. Layer-by-Layer Removal Reduces the Onion Effect:

Across all datasets:

  • Layer-by-layer reality curves are lower than one-shot curves.
  • Layer-by-layer curves are closer to ideal curves.

This indicates that iterative auditing more effectively reduces worst-case privacy leakage.

c. Takeaway:

Vulnerability is dynamic, so privacy auditing should also be dynamic.

If the model or the dataset changes, privacy risk should be re-audited instead of relying on a single ranking generated at the beginning.

10. Pattern 2: The Privacy-Utility Tradeoff Is Structural:

Privacy is only half of the story.

The next question is:

Are the most vulnerable samples also important for model performance?

To answer this, two utility-related signals were examined:

a. Training Loss:

Measures how difficult a sample is for the model to learn.

b. GraNd (Gradient Norm):

Measures how strongly a sample influences training.
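Both signals can be computed per record. The sketch below uses a binary logistic model so the gradient has a closed form; it is a single-model, single-checkpoint simplification, and the exact GraNd protocol used in the study (for example, which checkpoints or how many runs are averaged) is not specified in the article. All names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss_and_grad_norm(weights, bias, x, y, eps=1e-12):
    """Cross-entropy loss and parameter-gradient norm for one record.

    For logistic regression the gradient is (p - y) * x for the weights and
    (p - y) for the bias, so its L2 norm is |p - y| * sqrt(||x||^2 + 1).
    """
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = sigmoid(z)
    loss = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    grad_norm = abs(p - y) * math.sqrt(sum(xi * xi for xi in x) + 1.0)
    return loss, grad_norm
```

A record the model predicts confidently and correctly gets a small loss and a small gradient norm; a record it gets wrong gets large values on both, which is why the two signals tend to move together.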

The results reveal a consistent pattern:

  • High-risk samples tend to have higher loss.
  • High-risk samples tend to have higher gradient norms.
  • These records are often rare, difficult, or highly influential.

After removal, many extreme outliers disappear, reducing worst-case privacy risk.

However, this also alters the training signal.

c. Comparing Three Removal Strategies:

Under identical removal budgets:

  1. Random removal
  2. One-shot privacy-driven removal
  3. Layer-by-layer privacy-driven removal
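The random baseline is the simplest of the three: it spends the same budget without looking at vulnerability at all. A minimal sketch, with a hypothetical function name and a fixed seed for reproducibility:

```python
import random


def random_removal(train_idx, budget, seed=0):
    """Remove `budget` records chosen uniformly at random (privacy-agnostic baseline)."""
    rng = random.Random(seed)
    removed = set(rng.sample(list(train_idx), budget))
    return [i for i in train_idx if i not in removed]
```

Comparing against this baseline separates the effect of having less data from the effect of removing specifically the high-risk records.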

The findings are consistent across all datasets:

  • Random removal preserves utility best
  • Privacy-driven removal reduces utility more
  • Layer-by-layer removal achieves the strongest privacy but lowest utility

d. Why This Happens:

The privacy-utility tradeoff is not accidental.

It occurs because:

Privacy-vulnerable samples often overlap with utility-critical samples.

This is especially true in tabular datasets, where rare feature combinations frequently carry both predictive power and disclosure risk.

e. Takeaway:

Stronger privacy can come at the cost of lower utility because the records most vulnerable to disclosure are often the same records most useful for learning.

11. Implications for Privacy-Sensitive ML Systems:

For teams working in:

  • Finance
  • Healthcare
  • Government analytics
  • Enterprise AI

the lesson is practical:

A single privacy audit is not enough.

Removing today's highest-risk records does not guarantee privacy tomorrow.

Privacy operations must balance three realities simultaneously:

  • Risk is concentrated
  • Risk is dynamic
  • Risk mitigation can reduce model quality

The goal is not to blindly prune every risky record, nor to ignore privacy auditing altogether.

Instead, privacy and utility should be treated as coupled system properties that require continuous evaluation.

12. Future Directions:

One promising direction is reducing tail leakage without removing utility-critical records.

A strong candidate is:

Differentially Private Stochastic Gradient Descent (DP-SGD)

DP-SGD introduces:

  • Gradient clipping
  • Noise injection

These mechanisms reduce the influence of extreme samples and may mitigate tail membership leakage while preserving more utility than aggressive record removal.
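The two mechanisms can be sketched as a single update step. This is an illustrative, framework-free sketch that assumes per-example gradients are already available as plain lists; it omits privacy accounting entirely, and the parameter names (`max_norm`, `noise_multiplier`) are chosen for illustration.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD-style update: clip each gradient, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian noise with standard deviation noise_multiplier * max_norm per coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


# Usage: the first gradient has norm 5 and is clipped down to norm 1.
new_weights = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0),
)
```

Clipping bounds how much any single record, including an extreme outlier, can move the model in one step, which is the property that makes it a candidate for reducing tail leakage without deleting records.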

13. Final Takeaway:

The central finding of this work is simple but important:

Privacy risk in tabular machine learning is layered, relative, and dynamic.

Removing the current risk tail does not eliminate privacy leakage. It reveals the next layer.

Layer-by-layer removal is more effective than one-shot removal for reducing worst-case leakage, but it also intensifies the privacy-utility tradeoff.

For real-world privacy auditing, the question is no longer only:

Which records are vulnerable right now?

It must also be:

How will the vulnerability landscape change after we intervene?

That is the onion we need to peel.

Dr. Zilong ZHAO
Dr. Zilong Zhao is the Head of Research and Development at Betterdata, bringing a wealth of experience in data science and generative models. With a Ph.D. from the University of Grenoble Alpes, Dr. Zhao has pursued postdoctoral research at institutions including TU Delft, TU Munich, and the National University of Singapore. Throughout his academic career, Dr. Zhao has focused on advancing generative models for structured data, contributing to the field with his work on single table, time series, and relational databases. His papers, CTAB-GAN and CTAB-GAN+, are among the most cited in the area of single-table data synthesis.