Synthetic Data in the Era of LLMs

Tutorial at ACL 2025

Vienna, Austria

Vijay Viswanathan
Carnegie Mellon University
Xiang Yue
Carnegie Mellon University
Alisa Liu
University of Washington
Yizhong Wang
University of Washington
Graham Neubig
Carnegie Mellon University

Abstract

Progress in natural language processing has historically been driven by better data, and researchers today are increasingly using ‘synthetic data’ (data generated with the assistance of large language models) to make dataset construction faster and cheaper.

However, most synthetic data generation approaches are ad hoc and ‘reinvent the wheel’ rather than building on prior foundations. This tutorial seeks to build a shared understanding of recent progress in synthetic data generation in NLP and related fields by grouping and describing major methods, applications, and open problems.

Our tutorial is divided into four main sections. First, we will describe algorithms for producing high-quality synthetic data. Second, we will describe how synthetic data can be used to advance the general-purpose development and study of language models. Third, we will demonstrate how to customize synthetic data generation to support scenario-specific applications. Finally, we will discuss open questions about the production and use of synthetic data that must be answered to overcome its current limitations. Our goal is that by unifying recent advances in this emerging research direction, we can build foundations upon which the community can improve the rigor, understanding, and effectiveness of synthetic data moving forward.
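To give a concrete flavor of the methods covered in the first section, many synthetic data pipelines (e.g., Self-Instruct, referenced below) boil down to a generate-then-filter loop: prompt a model with a few existing examples, ask it for a new one, and keep it only if it is sufficiently different from the pool. The sketch below illustrates that loop; `call_llm` is a stub standing in for a real LLM API, and its canned outputs, the overlap threshold, and all function names are illustrative assumptions, not part of the tutorial materials.

```python
import random

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call (illustrative assumption only)."""
    candidates = [
        "Write a haiku about autumn.",
        "Write a haiku about the autumn.",  # near-duplicate; the filter should drop one
        "Explain recursion to a child.",
        "List three uses of synthetic data.",
    ]
    return random.choice(candidates)

def too_similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Word-overlap (Jaccard) filter, in the spirit of Self-Instruct's dedup step."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1) > threshold

def generate_and_filter(seed_pool, target_size, max_calls=200):
    """Grow a task pool: prompt with recent examples, keep novel generations."""
    pool = list(seed_pool)
    for _ in range(max_calls):
        if len(pool) >= target_size:
            break
        prompt = ("Here are example tasks:\n" + "\n".join(pool[-3:]) +
                  "\nWrite one new task:")
        candidate = call_llm(prompt)
        if not any(too_similar(candidate, existing) for existing in pool):
            pool.append(candidate)
    return pool

random.seed(0)
seeds = ["Translate this sentence into French.",
         "Summarize the paragraph in one line."]
grown = generate_and_filter(seeds, target_size=5)
```

Real pipelines replace the stub with an actual model call and use stronger filters (ROUGE overlap, classifiers, reward models), but the generate-filter-grow skeleton is the same.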

Schedule

July 27, 2025 - Hall B

  • 2:00pm: How do we evaluate data quality? (15 minutes)
  • 2:20pm: How do we create high-quality synthetic data? (35 minutes + Q&A)
  • 3:05pm: How do we use synthetic data (Pt 1)? (25 minutes)
  • 3:30pm: Break (30 minutes)
  • 4:00pm: How do we use synthetic data (Pt 2)? (20 minutes + Q&A)
  • 4:25pm: Scenario-specific applications (35 minutes + Q&A)
  • 5:00pm: Limitations and open questions (25 minutes + Q&A)
  • 5:30pm: End

Bibliography

The papers referenced in the tutorial are listed below.

Distilling the Knowledge in a Neural Network
Hinton et al., 2015

Improving Neural Machine Translation Models with Monolingual Data
Sennrich et al., 2016

Sequence-Level Knowledge Distillation
Kim & Rush, 2016

Scalable agent alignment via reward modeling: a research direction
Leike et al., 2018

Are Pretrained Language Models Symbolic Reasoners Over Knowledge?
Kassner et al., 2020

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
Gururangan et al., 2020

Generating Datasets with Pretrained Language Models
Schick & Schütze, 2021

SynthBio: A Case Study in Faster Curation of Text Datasets
Yuan et al., 2021

Constitutional AI: Harmlessness from AI Feedback
Bai et al., 2022

Red Teaming Language Models with Language Models
Perez et al., 2022

Self-Instruct: Aligning Language Models with Self-Generated Instructions
Wang et al., 2022

STaR: Bootstrapping Reasoning With Reasoning
Zelikman et al., 2022

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Honovich et al., 2022

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
Liu et al., 2022

Alpaca: A Strong, Replicable Instruction-Following Model
Taori et al., 2023

LongForm: Effective Instruction Tuning with Reverse Instructions
Köksal et al., 2023

Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Mukherjee et al., 2023

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Kim et al., 2023

Prompt2Model: Generating Deployable Models from Natural Language Instructions
Viswanathan et al., 2023

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Dong et al., 2023

Self-Alignment with Instruction Backtranslation
Li et al., 2023

Self-Refine: Iterative Refinement with Self-Feedback
Madaan et al., 2023

SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Kim et al., 2023

The False Promise of Imitating Proprietary LLMs
Gudibande et al., 2023

Textbooks Are All You Need
Gunasekar et al., 2023

UltraFeedback: Boosting Language Models with Scaled AI Feedback
Cui et al., 2023

WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions
Xu et al., 2023

AI models collapse when trained on recursively generated data
Shumailov et al., 2024

Better Synthetic Data by Retrieving and Transforming Existing Datasets
Gandhi et al., 2024

Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Hu et al., 2024

Checklists Are Better Than Reward Models For Aligning Language Models
Viswanathan et al., 2024

Evaluating Language Models as Synthetic Data Generators
Kim et al., 2024

RewardBench: Evaluating Reward Models for Language Modeling
Lambert et al., 2024

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Miranda et al., 2024

MAmmoTH2: Scaling Instructions from the Web
Yue et al., 2024

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Maini et al., 2024

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning
Zhao et al., 2024

Synthetic continued pretraining
Yang et al., 2024

Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Ziegler et al., 2024

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models
Chen et al., 2024

All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Swamy et al., 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, 2025

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Jung et al., 2025

SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
Ramesh et al., 2025

What Makes a Reward Model a Good Teacher? An Optimization Perspective
Razin et al., 2025

BibTeX

@inproceedings{synth-data-tutorial,
    title = "Synthetic Data in the Era of Large Language Models",
    author = "Viswanathan, Vijay  and
      Yue, Xiang  and
      Liu, Alisa  and
      Wang, Yizhong  and
      Neubig, Graham",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)",
    address = "Vienna, Austria",
    year = "2025",
    publisher = "Association for Computational Linguistics",
}