Limits of Literature-Conditioned Large Language Models for Predicting Behavioral Experiments

Author(s)
Na, Robin
Download
Thesis PDF (1.211 MB)
Advisor
Almaatouq, Abdullah
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Large language models (LLMs) have recently shown potential across a range of capabilities that contribute to scientific progress. Recent work shows that they can predict experimental outcomes as accurately as human forecasters. Other work shows that retrieval-augmented generation (RAG), conditioning LLMs on relevant documents or databases, can improve the quality of model outputs across various research synthesis tasks. Here, we combine these two streams of work and ask: does conditioning LLMs on published research articles improve their predictions of outcomes in new behavioral experiments? We test this using 20 new experiments on peer punishment in cooperation dilemmas, where the prediction task is to determine how much punishment mechanisms increase or decrease group welfare across different settings. Consistent with prior findings, the baseline off-the-shelf GPT-4.1 model matches or exceeds every human forecaster we tested, both laypeople and experts. We then condition the model on 1,398 published papers studying punishment, testing both individual papers and collections constructed by grouping papers in different ways (e.g., theory-focused versus empirical studies, recent versus older publications, high-impact versus lower-impact journals). To our surprise, conditioning on individual papers rarely reduces prediction error, and in many cases it makes predictions worse. Conditioning on collections substantially increases the model's confidence without increasing its accuracy. Simply providing research articles to language models does not seem to improve predictions of outcomes in new experiments, suggesting that more effective systems may require different approaches to representing and processing scientific evidence.
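
The abstract does not include code, but as a rough illustration of the conditioning setup it describes, a minimal sketch might look like the following. It assumes the OpenAI Python SDK's chat completions API; the function name predict_welfare_effect, the placeholders paper_text and experiment_description, and the prompt wording are all hypothetical, not the thesis's actual prompts or data.

```python
# Sketch of literature-conditioned prediction as described in the abstract:
# prompt an off-the-shelf model, optionally preceded by a published paper,
# then ask it to predict the welfare effect of punishment in a new experiment.
# Assumes the OpenAI Python SDK; all names and prompt text are hypothetical.
from openai import OpenAI

client = OpenAI()

def predict_welfare_effect(experiment_description: str,
                           paper_text: str | None = None) -> str:
    """Ask the model how much peer punishment changes group welfare."""
    messages = []
    if paper_text is not None:
        # "Conditioning": place the retrieved article in context before the task.
        messages.append({
            "role": "system",
            "content": "Use the following published study as background:\n"
                       + paper_text,
        })
    messages.append({
        "role": "user",
        "content": ("Predict the effect of the punishment mechanism on group "
                    "welfare in this experiment, as a signed percentage change:\n"
                    + experiment_description),
    })
    response = client.chat.completions.create(model="gpt-4.1", messages=messages)
    return response.choices[0].message.content

# Baseline (no conditioning) versus conditioned on a single paper:
# baseline = predict_welfare_effect(new_experiment)
# conditioned = predict_welfare_effect(new_experiment, paper_text=retrieved_paper)
```

Comparing the two calls' prediction errors against the observed experimental outcomes is, in spirit, the comparison the abstract reports: in the thesis's experiments, the conditioned predictions rarely beat the baseline.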
Date issued
2026-02
URI
https://hdl.handle.net/1721.1/165538
Department
Sloan School of Management
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
