Quantization Methods for Matrix Multiplication and Efficient Transformers
Author(s)
Savkin, Semyon
Download
Thesis PDF (1.671Mb)
Advisor
Polyanskiy, Yury
Abstract
We study quantization in machine learning. First, we introduce NestQuant, a technique for quantizing matrix products and for post-training quantization of LLMs. Beyond reducing the memory footprint, quantization accelerates inference, since the primary bottleneck during autoregressive generation is often memory bandwidth. NestQuant leverages two nested lattices to construct an efficient vector codebook for quantization, along with practical encoding and decoding algorithms. The approach is grounded in recent theoretical work characterizing the optimal rate–distortion trade-off for matrix products. Empirically, on Llama-3-8B, it reduces the perplexity gap between the full-precision and quantized models by more than 55% relative to the current state-of-the-art technique (SpinQuant). Second, we investigate data-domain quantization for RF signals. We propose a tokenized transformer for source separation that discretizes RF waveforms into learned tokens and operates directly on the resulting token sequences, outperforming strong convolutional baselines. Together, these contributions connect information-theoretic limits with deployable systems: structured vector quantizers accelerate LLM inference and enable competitive discrete representations for RF tasks.
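As a rough illustration of the nested-lattice idea only (a minimal sketch, not the NestQuant construction itself), the Python/NumPy snippet below builds a Voronoi codebook from two nested cubic lattices, q·Z^n inside q·M·Z^n; the function names and the parameters q and M are assumptions chosen for this example. NestQuant's actual codebook is built from different lattices and uses the practical encoding and decoding algorithms developed in the thesis.

import numpy as np

# Toy nested-lattice (Voronoi) quantizer built from two cubic lattices:
# the fine lattice q*Z^n nested inside the coarse lattice q*M*Z^n.
# The codebook is the set of fine-lattice points inside one Voronoi cell
# of the coarse lattice: M^n codewords, i.e. n*log2(M) bits per vector.
# Vectors outside the coarse Voronoi cell wrap around (overload error),
# so q*M must be chosen to cover the data range.

def nested_lattice_encode(x, q=0.1, M=16):
    # Round to the nearest fine-lattice point, then reduce modulo the
    # coarse lattice; the result is an index vector in {0, ..., M-1}^n.
    fine = np.round(x / q)
    return np.mod(fine, M).astype(int)

def nested_lattice_decode(idx, q=0.1, M=16):
    # Shift indices into [-M/2, M/2) so the reconstruction is the codeword
    # lying in the coarse Voronoi cell centered at the origin.
    centered = idx - M * (idx >= M // 2)
    return q * centered

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(scale=0.2, size=8)              # toy "weight" vector
    x_hat = nested_lattice_decode(nested_lattice_encode(x))
    print("max abs error:", np.max(np.abs(x - x_hat)))  # at most q/2 absent overload

With cubic lattices this reduces to uniform scalar quantization; the point of the nested-lattice framework is that replacing the cubic lattices with better-packing ones (while keeping the same encode/reduce-modulo/decode structure) yields vector codebooks with lower distortion at the same rate.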
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology