Show simple item record

dc.contributor.advisor: Madden, Samuel
dc.contributor.author: Wang, Sarah Y.
dc.date.accessioned: 2025-10-06T17:38:47Z
dc.date.available: 2025-10-06T17:38:47Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:04:10.057Z
dc.identifier.uri: https://hdl.handle.net/1721.1/162997
dc.description.abstract: Large Language Models (LLMs) are expensive to run and can incur high latencies. Each LLM application has its own cost and latency targets. For example, AI voice assistants operate under low-latency objectives, while large document batch-processing jobs are typically cost-sensitive. However, navigating these trade-offs is not trivial, as LLM latency is highly task-specific and depends on factors such as the offered query load, the hardware configuration, request properties, and various model characteristics. To help users configure their deployments according to their application needs, we introduce vLLMSim, an accurate simulator that estimates the latency of a given workload on different hardware configurations. vLLMSim advances two key avenues toward latency-aligned LLM deployments. First, the simulated latency metrics inform the user's model and hardware choice, so they can use a configuration that is ideal for their workload. Second, our simulator enables researchers to quickly test latency-improving ideas, bypassing the need for time-consuming implementations before validating their effectiveness. In fact, vLLMSim is already used in two research projects aimed at reducing the latency and cost of LLM inference. In this thesis, we show how vLLMSim's design supports the use cases above while providing highly accurate runtime predictions. To support hardware exploration without GPU access, vLLMSim provides precomputed performance profiles that are sufficient to accurately simulate the user's workload. The simulator code can be found here, and the instrumented vLLM code for creating profiles can be found here.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Simulating LLM Runtime Latency
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

