Progress in natural language processing has historically been driven by better data, and researchers today are increasingly using ‘synthetic data’ (data generated with the assistance of large language models) to make dataset construction faster and cheaper.
However, most synthetic data generation approaches are executed in an ad hoc manner and ‘reinvent the wheel’ rather than build on prior foundations. This tutorial seeks to build a shared understanding of recent progress in synthetic data generation from NLP and related fields by grouping and describing major methods, applications, and open problems.
Our tutorial will be divided into four main sections. First, we will describe algorithms for producing high-quality synthetic data. Second, we will describe how synthetic data can be used to advance the general-purpose development and study of language models. Third, we will demonstrate how to customize synthetic data generation to support scenario-specific applications. Finally, we will discuss open questions about the production and use of synthetic data that must be answered to overcome some of its current limitations. Our goal is that by unifying recent advances in this emerging research direction, we can build foundations upon which the community can improve the rigor, understanding, and effectiveness of synthetic data moving forward.
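To make the first of these sections concrete, below is a minimal sketch of one family of generation algorithms the tutorial covers: Self-Instruct-style bootstrapping (Wang et al., 2022), in which a language model expands a small seed pool of instructions and near-duplicate generations are filtered out. The `generate` wrapper is a hypothetical stand-in for whatever LLM API you use, and the similarity filter is a simplified substitute for the ROUGE-L filtering used in the original method.

```python
import difflib
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a call to an instruction-following LLM;
    # wire this to your own API client.
    raise NotImplementedError("plug in an LLM call here")

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    # Cheap near-duplicate filter; Self-Instruct uses ROUGE-L overlap instead.
    return any(
        difflib.SequenceMatcher(None, candidate.lower(), seen.lower()).ratio() > threshold
        for seen in pool
    )

def bootstrap_instructions(seed_tasks: list[str], target_size: int = 100) -> list[str]:
    # Grow a pool of task instructions by prompting the model with a few
    # randomly sampled demonstrations and keeping sufficiently novel outputs.
    pool = list(seed_tasks)
    while len(pool) < target_size:
        demos = "\n".join(f"- {t}" for t in random.sample(pool, k=min(4, len(pool))))
        prompt = (
            "Here are some example task instructions:\n"
            f"{demos}\n"
            "Write one new, different task instruction:"
        )
        candidate = generate(prompt).strip()
        if candidate and not too_similar(candidate, pool):
            pool.append(candidate)
    return pool
```

The same skeleton of seed data, model-based expansion, and automatic filtering underlies many of the generation papers listed below; the tutorial's first section surveys how each stage can be made more reliable.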
July 27, 2025 - Hall B
The papers referenced in the tutorial can be found below.
Distilling the Knowledge in a Neural Network
Hinton et al., 2015
Improving Neural Machine Translation Models with Monolingual Data
Sennrich et al., 2016
Sequence-Level Knowledge Distillation
Kim & Rush, 2016
Scalable agent alignment via reward modeling: a research direction
Leike et al., 2018
Are Pretrained Language Models Symbolic Reasoners Over Knowledge?
Kassner et al., 2020
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
Gururangan et al., 2020
Generating Datasets with Pretrained Language Models
Schick & Schütze, 2021
SynthBio: A Case Study in Faster Curation of Text Datasets
Yuan et al., 2021
Constitutional AI: Harmlessness from AI Feedback
Bai et al., 2022
Red Teaming Language Models with Language Models
Perez et al., 2022
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Wang et al., 2022
STaR: Bootstrapping Reasoning With Reasoning
Zelikman et al., 2022
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Honovich et al., 2022
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
Liu et al., 2022
Alpaca: A Strong, Replicable Instruction-Following Model
Taori et al., 2023
LongForm: Effective Instruction Tuning with Reverse Instructions
Köksal et al., 2023
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Mukherjee et al., 2023
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Kim et al., 2023
Prompt2Model: Generating Deployable Models from Natural Language Instructions
Viswanathan et al., 2023
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Dong et al., 2023
Self-Alignment with Instruction Backtranslation
Li et al., 2023
Self-Refine: Iterative Refinement with Self-Feedback
Madaan et al., 2023
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Kim et al., 2023
The False Promise of Imitating Proprietary LLMs
Gudibande et al., 2023
Textbooks Are All You Need
Gunasekar et al., 2023
UltraFeedback: Boosting Language Models with Scaled AI Feedback
Cui et al., 2023
WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions
Xu et al., 2023
AI models collapse when trained on recursively generated data
Shumailov et al., 2024
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Gandhi et al., 2024
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Hu et al., 2024
Checklists Are Better Than Reward Models For Aligning Language Models
Viswanathan et al., 2024
Evaluating Language Models as Synthetic Data Generators
Kim et al., 2024
RewardBench: Evaluating Reward Models for Language Modeling
Lambert et al., 2024
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Miranda et al., 2024
MAmmoTH2: Scaling Instructions from the Web
Yue et al., 2024
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Maini et al., 2024
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning
Zhao et al., 2024
Synthetic continued pretraining
Yang et al., 2024
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Ziegler et al., 2024
The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models
Chen et al., 2024
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Swamy et al., 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, 2025
Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Jung et al., 2025
SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
Ramesh et al., 2025
What Makes a Reward Model a Good Teacher? An Optimization Perspective
Razin et al., 2025
@inproceedings{synth-data-tutorial,
    title = "Synthetic Data in the Era of Large Language Models",
    author = "Viswanathan, Vijay and
      Yue, Xiang and
      Liu, Alisa and
      Wang, Yizhong and
      Neubig, Graham",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)",
    year = "2025",
    publisher = "Association for Computational Linguistics",
}