Simulating LLM Runtime Latency
Author(s)
Wang, Sarah Y.
Advisor
Madden, Samuel
Abstract
Large Language Models (LLMs) are expensive to run and can incur high latencies. Each LLM application has its own cost and latency targets. For example, AI voice assistants operate under low-latency objectives, while large document batch-processing jobs are typically cost-sensitive. Navigating these trade-offs is not trivial, however, because LLM latency is highly task-specific and depends on factors such as the offered query load, the hardware configuration, request properties, and various model characteristics. To help users configure their deployment according to their application's needs, we introduce vLLMSim, an accurate simulator that estimates the latency of a given workload on different hardware configurations. vLLMSim advances two key avenues toward latency-aligned LLM deployments. First, the simulated latency metrics inform the user's model and hardware choice, so they can select a configuration that best fits their workload. Second, our simulator enables researchers to quickly test latency-improving ideas, bypassing the need for time-consuming implementations before validating their effectiveness. In fact, vLLMSim is already used in two research projects aimed at reducing the latency and cost of LLM inference. In this thesis, we show how vLLMSim's design allows it to support the use cases above while providing highly accurate runtime predictions. To support hardware exploration without GPU access, vLLMSim provides precomputed performance profiles that are sufficient to accurately simulate the user's workload. The simulator code can be found here, and the instrumented vLLM code for creating profiles can be found here.
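To give a sense of what profile-driven latency estimation can look like in general, the minimal Python sketch below estimates per-request latency from precomputed per-token timing constants. All names, the profile format, and the numeric values are illustrative assumptions for exposition only; they are not vLLMSim's actual API or measurements, which the thesis itself describes.

```python
# Hypothetical sketch of profile-driven latency estimation.
# The class names, profile fields, and constants below are assumptions,
# not vLLMSim's real interface.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # tokens processed during prefill
    output_tokens: int   # tokens generated during decode


@dataclass
class HardwareProfile:
    # Precomputed, per-hardware timing constants (made-up example values).
    prefill_us_per_token: float      # microseconds per prompt token
    decode_us_per_token: float       # microseconds per generated token
    per_request_overhead_us: float   # fixed scheduling/launch overhead


def estimate_request_latency(req: Request, profile: HardwareProfile) -> float:
    """Return an estimated end-to-end latency (seconds) for one request."""
    prefill = req.prompt_tokens * profile.prefill_us_per_token
    decode = req.output_tokens * profile.decode_us_per_token
    return (prefill + decode + profile.per_request_overhead_us) / 1e6


if __name__ == "__main__":
    # Example "precomputed profile" for a hypothetical GPU.
    gpu_profile = HardwareProfile(prefill_us_per_token=50.0,
                                  decode_us_per_token=8000.0,
                                  per_request_overhead_us=2000.0)
    workload = [Request(prompt_tokens=512, output_tokens=128),
                Request(prompt_tokens=2048, output_tokens=64)]
    for i, r in enumerate(workload):
        print(f"request {i}: ~{estimate_request_latency(r, gpu_profile):.2f} s")
```

Because the timing constants are looked up from a stored profile rather than measured live, an estimate like this can be produced without access to the target GPU, which is the general idea behind hardware exploration from precomputed performance profiles.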
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology