Simulating LLM Runtime Latency
Author(s)
Wang, Sarah Y.
Advisor
Madden, Samuel
Abstract
Large Language Models (LLMs) are expensive to run and can incur high latencies. Each LLM application has its own cost and latency targets. For example, AI voice assistants operate under low-latency objectives, while large document batch-processing jobs are typically cost-sensitive. Navigating these trade-offs is not trivial, however, because LLM latency is highly task-specific and depends on factors such as the offered query load, the hardware configuration, request properties, and various model characteristics. To help users configure their deployment according to their application's needs, we introduce vLLMSim, an accurate simulator that estimates the latency of a given workload on different hardware configurations. vLLMSim advances two key avenues toward latency-aligned LLM deployments. First, the simulated latency metrics inform the user's model and hardware choice, so they can select a configuration that best fits their workload. Second, our simulator enables researchers to quickly test latency-improving ideas, bypassing the need for time-consuming implementations before validating their effectiveness. In fact, vLLMSim is already used in two research projects aimed at reducing the latency and cost of LLM inference. In this thesis, we show how vLLMSim's design allows it to support the use cases above while providing highly accurate runtime predictions. To support hardware exploration without GPU access, vLLMSim provides precomputed performance profiles that are sufficient to accurately simulate the user's workload. The simulator code can be found here, and the instrumented vLLM code for creating profiles can be found here.
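To give a sense of what profile-driven latency estimation can look like in general, the minimal Python sketch below estimates per-request latency from precomputed per-token timing constants. All names, the profile format, and the numeric values are illustrative assumptions for exposition only; they are not vLLMSim's actual API or measurements, which the thesis itself describes.

```python
# Hypothetical sketch of profile-driven latency estimation.
# The class names, profile fields, and constants below are assumptions,
# not vLLMSim's real interface.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # tokens processed during prefill
    output_tokens: int   # tokens generated during decode


@dataclass
class HardwareProfile:
    # Precomputed, per-hardware timing constants (made-up example values).
    prefill_us_per_token: float      # microseconds per prompt token
    decode_us_per_token: float       # microseconds per generated token
    per_request_overhead_us: float   # fixed scheduling/launch overhead


def estimate_request_latency(req: Request, profile: HardwareProfile) -> float:
    """Return an estimated end-to-end latency (seconds) for one request."""
    prefill = req.prompt_tokens * profile.prefill_us_per_token
    decode = req.output_tokens * profile.decode_us_per_token
    return (prefill + decode + profile.per_request_overhead_us) / 1e6


if __name__ == "__main__":
    # Example "precomputed profile" for a hypothetical GPU.
    gpu_profile = HardwareProfile(prefill_us_per_token=50.0,
                                  decode_us_per_token=8000.0,
                                  per_request_overhead_us=2000.0)
    workload = [Request(prompt_tokens=512, output_tokens=128),
                Request(prompt_tokens=2048, output_tokens=64)]
    for i, r in enumerate(workload):
        print(f"request {i}: ~{estimate_request_latency(r, gpu_profile):.2f} s")
```

Because the timing constants are looked up from a stored profile rather than measured live, an estimate like this can be produced without access to the target GPU, which is the general idea behind hardware exploration from precomputed performance profiles.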
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology