Show simple item record

dc.contributor.advisor    Almaatouq, Abdullah
dc.contributor.author    Na, Robin
dc.date.accessioned    2026-04-21T18:12:41Z
dc.date.available    2026-04-21T18:12:41Z
dc.date.issued    2026-02
dc.date.submitted    2026-02-04T16:28:51.258Z
dc.identifier.uri    https://hdl.handle.net/1721.1/165538
dc.description.abstract    Large language models (LLMs) have recently shown promise across a range of capabilities that contribute to scientific progress. Recent work shows that they can predict experimental outcomes as accurately as human forecasters. Other work shows that retrieval-augmented generation (RAG), conditioning LLMs on relevant documents or databases, can improve the quality of model outputs across various research synthesis tasks. Here, we combine these two streams of work and ask: does conditioning LLMs on published research articles improve their predictions of outcomes in new behavioral experiments? We test this using 20 new experiments on peer punishment in cooperation dilemmas, where the prediction task is to determine how much punishment mechanisms increase or decrease group welfare across different settings. Consistent with prior findings, the baseline off-the-shelf GPT-4.1 model matches or exceeds every human forecaster we tested, both laypeople and experts. We then condition the model on 1,398 published papers studying punishment, testing both individual papers and collections constructed by grouping papers in different ways (e.g., theory-focused versus empirical studies, recent versus older publications, high-impact versus lower-impact journals). To our surprise, conditioning on individual papers rarely reduces prediction error, and in many cases it makes predictions worse. Conditioning on collections substantially increases the model's confidence without increasing its accuracy. Simply providing research articles to language models does not seem to improve predictions of outcomes in new experiments, suggesting that more effective systems may require different approaches to representing and processing scientific evidence.
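For concreteness, below is a minimal sketch of the kind of literature-conditioned prediction the abstract describes, assuming the OpenAI Python client. The prompt wording, the predict_welfare_effect helper, and the plain-text paper format are illustrative assumptions, not the thesis's actual protocol.

    # Hypothetical sketch of paper-conditioned outcome prediction; the prompt
    # template and helper below are illustrative, not the thesis's pipeline.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def predict_welfare_effect(experiment_description: str, papers: list[str]) -> str:
        """Ask GPT-4.1 to predict how a punishment mechanism changes group
        welfare, optionally conditioned on the text of published papers."""
        context = "\n\n".join(papers) if papers else "No papers provided."
        prompt = (
            "You will predict the outcome of a new behavioral experiment on "
            "peer punishment in a cooperation dilemma.\n\n"
            f"Relevant literature:\n{context}\n\n"
            f"Experiment:\n{experiment_description}\n\n"
            "Predict how much the punishment mechanism increases or decreases "
            "group welfare, and state your confidence."
        )
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Baseline (no papers) versus literature-conditioned prediction:
    # baseline = predict_welfare_effect(experiment, papers=[])
    # conditioned = predict_welfare_effect(experiment, papers=[paper_text])

Comparing the baseline call (empty paper list) against the conditioned call is the contrast the abstract reports: in the thesis's experiments, adding papers rarely reduced prediction error.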
dc.publisher    Massachusetts Institute of Technology
dc.rights    In Copyright - Educational Use Permitted
dc.rights    Copyright retained by author(s)
dc.rights.uri    https://rightsstatements.org/page/InC-EDU/1.0/
dc.title    Limits of Literature-Conditioned Large Language Models for Predicting Behavioral Experiments
dc.type    Thesis
dc.description.degree    S.M.
dc.contributor.department    Sloan School of Management
mit.thesis.degree    Master
thesis.degree.name    Master of Science in Management Research

