Progress in natural language processing has historically been driven by better data, and researchers today are increasingly using ‘synthetic data’ (data generated with the assistance of large language models) to make dataset construction faster and cheaper.
However, most synthetic data generation approaches are executed in an ad hoc manner and ‘reinvent the wheel’ rather than build on prior foundations. This tutorial seeks to build a shared understanding of recent progress in synthetic data generation from NLP and related fields by grouping and describing major methods, applications, and open problems.
Our tutorial will be divided into four main sections. First, we will describe algorithms for producing high-quality synthetic data. Second, we will describe how synthetic data can be used to advance the general-purpose development and study of language models. Third, we will demonstrate how to customize synthetic data generation to support scenario-specific applications. Finally, we will discuss open questions about the production and use of synthetic data that must be answered to overcome some of the current limitations of synthetic data. Our goal is that, by unifying recent advances in this emerging research direction, we can build foundations upon which the community can improve the rigor, understanding, and effectiveness of synthetic data moving forward.
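As a concrete illustration of the kind of algorithm covered in the first section, the sketch below shows a minimal Self-Instruct-style bootstrapping loop (in the spirit of Wang et al., 2022, listed below): a model is shown a few seed tasks, asked to propose new instructions, and then asked to answer them, producing instruction-response pairs for finetuning. The call_llm stub, seed tasks, and parameters are illustrative assumptions rather than part of the tutorial materials.

# A minimal sketch of a Self-Instruct-style generation loop.
# call_llm, the seed tasks, and all parameters are illustrative placeholders.
import json
import random


def call_llm(prompt: str) -> str:
    """Placeholder: replace with a call to your own model or API client."""
    raise NotImplementedError("plug in an LLM client here")


SEED_TASKS = [
    "Rewrite the following sentence in a formal tone.",
    "Summarize the paragraph below in one sentence.",
    "List three counterarguments to the given claim.",
]


def propose_instructions(pool: list[str], n_new: int = 5) -> list[str]:
    """Show the model a few existing tasks in-context and parse new ones."""
    examples = "\n".join(f"- {t}" for t in random.sample(pool, k=min(3, len(pool))))
    prompt = (
        "Here are some example task instructions:\n"
        f"{examples}\n"
        f"Write {n_new} new, diverse task instructions, one per line."
    )
    lines = (line.strip("- ").strip() for line in call_llm(prompt).splitlines())
    return [line for line in lines if line]


def build_dataset(rounds: int = 2) -> list[dict]:
    """Grow an instruction pool, then generate a response for each new task."""
    pool = list(SEED_TASKS)
    records = []
    for _ in range(rounds):
        for instruction in propose_instructions(pool):
            records.append({"instruction": instruction,
                            "output": call_llm(instruction)})
            pool.append(instruction)  # newly generated tasks seed later rounds
    return records


if __name__ == "__main__":
    print(json.dumps(build_dataset(rounds=1), indent=2))

In practice, pipelines such as Self-Instruct and Alpaca add filtering and deduplication steps on top of this basic loop.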
July 27, 2025 - Hall B
The papers referenced in the tutorial can be found below.
Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
Improving Neural Machine Translation Models with Monolingual Data (Sennrich et al., 2016)
Sequence-Level Knowledge Distillation (Kim & Rush, 2016)
Scalable agent alignment via reward modeling: a research direction (Leike et al., 2018)
Are Pretrained Language Models Symbolic Reasoners Over Knowledge? (Kassner et al., 2020)
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks (Gururangan et al., 2020)
Generating Datasets with Pretrained Language Models (Schick & Schütze, 2021)
SynthBio: A Case Study in Faster Curation of Text Datasets (Yuan et al., 2021)
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
Red Teaming Language Models with Language Models (Perez et al., 2022)
Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
STaR: Bootstrapping Reasoning With Reasoning (Zelikman et al., 2022)
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor (Honovich et al., 2022)
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation (Liu et al., 2022)
Alpaca: A Strong, Replicable Instruction-Following Model (Taori et al., 2023)
LongForm: Effective Instruction Tuning with Reverse Instructions (Köksal et al., 2023)
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Mukherjee et al., 2023)
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (Kim et al., 2023)
Prompt2Model: Generating Deployable Models from Natural Language Instructions (Viswanathan et al., 2023)
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment (Dong et al., 2023)
Self-Alignment with Instruction Backtranslation (Li et al., 2023)
Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization (Kim et al., 2023)
The False Promise of Imitating Proprietary LLMs (Gudibande et al., 2023)
Textbooks Are All You Need (Gunasekar et al., 2023)
UltraFeedback: Boosting Language Models with Scaled AI Feedback (Cui et al., 2023)
WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions (Xu et al., 2023)
AI models collapse when trained on recursively generated data (Shumailov et al., 2024)
Better Synthetic Data by Retrieving and Transforming Existing Datasets (Gandhi et al., 2024)
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (Hu et al., 2024)
Checklists Are Better Than Reward Models For Aligning Language Models (Viswanathan et al., 2024)
Evaluating Language Models as Synthetic Data Generators (Kim et al., 2024)
Evaluating Reward Models for Language Modeling (Lambert et al., 2024)
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback (Miranda et al., 2024)
MAmmoTH2: Scaling Instructions from the Web (Yue et al., 2024)
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (Maini et al., 2024)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning (Zhao et al., 2024)
Synthetic continued pretraining (Yang et al., 2024)
Synthetic Dataset Generation Through Corpus Retrieval and Augmentation (Ziegler et al., 2024)
The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models (Chen et al., 2024)
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning (Swamy et al., 2025)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL (DeepSeek-AI, 2025)
Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning (Jung et al., 2025)
SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains (Ramesh et al., 2025)
What Makes a Reward Model a Good Teacher? An Optimization Perspective (Razin et al., 2025)
            
@inproceedings{synth-data-tutorial,
    title = "Synthetic Data in the Era of Large Language Models",
    author = "Viswanathan, Vijay  and
      Yue, Xiang  and
      Liu, Alisa  and
      Wang, Yizhong  and
      Neubig, Graham",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)",
    publisher = "Association for Computational Linguistics",
}