Limits of Literature-Conditioned Large Language Models for Predicting Behavioral Experiments

Author(s)
Na, Robin
Download
Thesis PDF (1.211 MB)
Advisor
Almaatouq, Abdullah
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Large language models (LLMs) have recently shown potential across a range of capabilities that contribute to scientific progress. Recent work shows that they can predict experimental outcomes as accurately as human forecasters. Other work shows that retrieval-augmented generation (RAG), conditioning LLMs on relevant documents or databases, can improve the quality of model outputs across various research synthesis tasks. Here, we combine these two streams of work and ask: does conditioning LLMs on published research articles improve their predictions of outcomes in new behavioral experiments? We test this using 20 new experiments on peer punishment in cooperation dilemmas, where the prediction task is to determine how much punishment mechanisms increase or decrease group welfare across different settings. Consistent with prior findings, the baseline off-the-shelf GPT-4.1 model matches or exceeds every human forecaster we tested, both laypeople and experts. We then condition the model on 1,398 published papers studying punishment, testing both individual papers and collections constructed by grouping papers in different ways (e.g., theory-focused versus empirical studies, recent versus older publications, high-impact versus lower-impact journals). To our surprise, conditioning on individual papers rarely reduces prediction error, and in many cases it makes predictions worse. Conditioning on collections substantially increases the model's confidence without increasing its accuracy. Simply providing research articles to language models does not seem to improve predictions of outcomes in new experiments, suggesting that more effective systems may require different approaches to representing and processing scientific evidence.
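
The abstract does not include code, but as a rough illustration of the conditioning setup it describes, a minimal sketch might look like the following. It assumes the OpenAI Python SDK's chat completions API; the function name predict_welfare_effect, the placeholders paper_text and experiment_description, and the prompt wording are all hypothetical, not the thesis's actual prompts or data.

```python
# Sketch of literature-conditioned prediction as described in the abstract:
# prompt an off-the-shelf model, optionally preceded by a published paper,
# then ask it to predict the welfare effect of punishment in a new experiment.
# Assumes the OpenAI Python SDK; all names and prompt text are hypothetical.
from openai import OpenAI

client = OpenAI()

def predict_welfare_effect(experiment_description: str,
                           paper_text: str | None = None) -> str:
    """Ask the model how much peer punishment changes group welfare."""
    messages = []
    if paper_text is not None:
        # "Conditioning": place the retrieved article in context before the task.
        messages.append({
            "role": "system",
            "content": "Use the following published study as background:\n"
                       + paper_text,
        })
    messages.append({
        "role": "user",
        "content": ("Predict the effect of the punishment mechanism on group "
                    "welfare in this experiment, as a signed percentage change:\n"
                    + experiment_description),
    })
    response = client.chat.completions.create(model="gpt-4.1", messages=messages)
    return response.choices[0].message.content

# Baseline (no conditioning) versus conditioned on a single paper:
# baseline = predict_welfare_effect(new_experiment)
# conditioned = predict_welfare_effect(new_experiment, paper_text=retrieved_paper)
```

Comparing the two calls' prediction errors against the observed experimental outcomes is, in spirit, the comparison the abstract reports: in the thesis's experiments, the conditioned predictions rarely beat the baseline.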
Date issued
2026-02
URI
https://hdl.handle.net/1721.1/165538
Department
Sloan School of Management
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
