Show simple item record

dc.contributor.advisor    Almaatouq, Abdullah
dc.contributor.author    Na, Robin
dc.date.accessioned    2026-04-21T18:12:41Z
dc.date.available    2026-04-21T18:12:41Z
dc.date.issued    2026-02
dc.date.submitted    2026-02-04T16:28:51.258Z
dc.identifier.uri    https://hdl.handle.net/1721.1/165538
dc.description.abstract    Large language models (LLMs) have recently shown promise across a range of capabilities that contribute to scientific progress. Recent work shows that they can predict experimental outcomes as accurately as human forecasters. Other work shows that retrieval-augmented generation (RAG), conditioning LLMs on relevant documents or databases, can improve the quality of model outputs across various research synthesis tasks. Here, we combine these two streams of work and ask: does conditioning LLMs on published research articles improve their predictions of outcomes in new behavioral experiments? We test this using 20 new experiments on peer punishment in cooperation dilemmas, where the prediction task is to determine how much punishment mechanisms increase or decrease group welfare across different settings. Consistent with prior findings, the baseline off-the-shelf GPT-4.1 model matches or exceeds every human forecaster we tested, both laypeople and experts. We then condition the model on 1,398 published papers studying punishment, testing both individual papers and collections constructed by grouping papers in different ways (e.g., theory-focused versus empirical studies, recent versus older publications, high-impact versus lower-impact journals). To our surprise, conditioning on individual papers rarely reduces prediction error, and in many cases it makes predictions worse. Conditioning on collections substantially increases the model's confidence without increasing its accuracy. Simply providing research articles to language models does not seem to improve predictions of outcomes in new experiments, suggesting that more effective systems may require different approaches to representing and processing scientific evidence.
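For concreteness, below is a minimal sketch of the kind of literature-conditioned prediction the abstract describes, assuming the OpenAI Python client. The prompt wording, the predict_welfare_effect helper, and the plain-text paper format are illustrative assumptions, not the thesis's actual protocol.

    # Hypothetical sketch of paper-conditioned outcome prediction; the prompt
    # template and helper below are illustrative, not the thesis's pipeline.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def predict_welfare_effect(experiment_description: str, papers: list[str]) -> str:
        """Ask GPT-4.1 to predict how a punishment mechanism changes group
        welfare, optionally conditioned on the text of published papers."""
        context = "\n\n".join(papers) if papers else "No papers provided."
        prompt = (
            "You will predict the outcome of a new behavioral experiment on "
            "peer punishment in a cooperation dilemma.\n\n"
            f"Relevant literature:\n{context}\n\n"
            f"Experiment:\n{experiment_description}\n\n"
            "Predict how much the punishment mechanism increases or decreases "
            "group welfare, and state your confidence."
        )
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Baseline (no papers) versus literature-conditioned prediction:
    # baseline = predict_welfare_effect(experiment, papers=[])
    # conditioned = predict_welfare_effect(experiment, papers=[paper_text])

Comparing the baseline call (empty paper list) against the conditioned call is the contrast the abstract reports: in the thesis's experiments, adding papers rarely reduced prediction error.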
dc.publisher    Massachusetts Institute of Technology
dc.rights    In Copyright - Educational Use Permitted
dc.rights    Copyright retained by author(s)
dc.rights.uri    https://rightsstatements.org/page/InC-EDU/1.0/
dc.title    Limits of Literature-Conditioned Large Language Models for Predicting Behavioral Experiments
dc.type    Thesis
dc.description.degree    S.M.
dc.contributor.department    Sloan School of Management
mit.thesis.degree    Master
thesis.degree.name    Master of Science in Management Research

