Show simple item record

dc.contributor.advisor: Madden, Samuel
dc.contributor.author: Wang, Sarah Y.
dc.date.accessioned: 2025-10-06T17:38:47Z
dc.date.available: 2025-10-06T17:38:47Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:04:10.057Z
dc.identifier.uri: https://hdl.handle.net/1721.1/162997
dc.description.abstract: Large Language Models (LLMs) are expensive to run and can incur high latencies. Each LLM application has its own cost and latency targets. For example, AI voice assistants operate under low-latency objectives, while large document batch-processing jobs are typically cost-sensitive. However, navigating these trade-offs is not trivial, as LLM latency is highly task-specific and depends on factors such as the offered query load, the hardware configuration, request properties, and various model characteristics. To help users configure their deployments according to their application needs, we introduce vLLMSim, an accurate simulator that estimates the latency of a given workload on different hardware configurations. vLLMSim advances two key avenues toward latency-aligned LLM deployments. First, the simulated latency metrics inform the user's model and hardware choice, so they can use a configuration that is ideal for their workload. Second, our simulator enables researchers to quickly test latency-improving ideas, bypassing the need for time-consuming implementations before validating their effectiveness. In fact, vLLMSim is already used in two research projects aimed at reducing the latency and cost of LLM inference. In this thesis, we show how vLLMSim's design supports the use cases above while providing highly accurate runtime predictions. To support hardware exploration without GPU access, vLLMSim provides precomputed performance profiles that are sufficient to accurately simulate the user's workload. The simulator code can be found here, and the instrumented vLLM code for creating profiles can be found here.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Simulating LLM Runtime Latency
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

