Simulating LLM Runtime Latency

Author(s)
Wang, Sarah Y.
Download
Thesis PDF (2.116 MB)
Advisor
Madden, Samuel
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Large Language Models (LLMs) are expensive to run and can incur high latencies. Each LLM application has its own cost and latency targets: AI voice assistants, for example, operate under low-latency objectives, while large document batch-processing jobs are typically cost-sensitive. Navigating these trade-offs is not trivial, however, as LLM latency is highly task-specific and depends on factors such as the offered query load, the hardware configuration, request properties, and various model characteristics. To help users configure their deployment according to their application needs, we introduce vLLMSim, an accurate simulator that estimates the latency of a given workload on different hardware configurations. vLLMSim advances two key avenues toward latency-aligned LLM deployments. First, the simulated latency metrics inform the user’s model and hardware choice, so they can select a configuration that is ideal for their workload. Second, the simulator enables researchers to quickly test latency-improving ideas, bypassing the need for time-consuming implementations before validating their effectiveness; vLLMSim is already used in two research projects aimed at reducing the latency and cost of LLM inference. In this thesis, we show how vLLMSim’s design supports the use cases above while providing highly accurate runtime predictions. To support hardware exploration without GPU access, vLLMSim provides precomputed performance profiles that are sufficient to accurately simulate the user’s workload. The simulator code can be found here, and the instrumented vLLM code for creating profiles can be found here.
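To make the profile-driven approach described in the abstract concrete, below is a minimal sketch of how a workload might be replayed against precomputed per-token hardware profiles. All class, field, and function names here are hypothetical illustrations, not the actual vLLMSim API, and the serve-one-request-at-a-time cost model is a deliberate simplification of what a real simulator must capture (batching, KV-cache management, and scheduler behavior).

```python
# Hypothetical sketch of profile-driven latency simulation; names are
# illustrative and do not reflect the real vLLMSim code or interfaces.
from dataclasses import dataclass

@dataclass
class Request:
    arrival_s: float      # arrival time within the offered query load
    prompt_tokens: int    # prefill length
    output_tokens: int    # decode length

@dataclass
class HardwareProfile:
    """Precomputed per-token costs, measured once on real GPUs so the
    simulation itself needs no GPU access."""
    prefill_s_per_token: float
    decode_s_per_token: float

def simulate(requests, profile):
    """Estimate end-to-end latency of each request on one hardware
    configuration, serving requests one at a time in arrival order so
    that queueing delay under load is captured."""
    free_at = 0.0
    latencies = []
    for r in sorted(requests, key=lambda r: r.arrival_s):
        start = max(r.arrival_s, free_at)
        service = (r.prompt_tokens * profile.prefill_s_per_token
                   + r.output_tokens * profile.decode_s_per_token)
        free_at = start + service
        latencies.append(free_at - r.arrival_s)
    return latencies

# Compare two candidate hardware configurations on the same workload
# (per-token costs below are made-up placeholder numbers).
workload = [Request(0.0, 512, 128), Request(0.1, 2048, 256)]
gpu_a = HardwareProfile(prefill_s_per_token=2e-5, decode_s_per_token=2e-3)
gpu_b = HardwareProfile(prefill_s_per_token=1e-5, decode_s_per_token=1e-3)
print(max(simulate(workload, gpu_a)), max(simulate(workload, gpu_b)))
```

As in the comparison above, a user could sweep candidate configurations against their own request trace and pick the cheapest one that meets their latency objective, which is the deployment-tuning use case the abstract describes.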
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/162997
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
