Probabilistic Programming with Low-Level, High-Performance GPU Programmable Inference

Chung, Karen

dc.contributor.advisor	Mansinghka, Vikash K.
dc.contributor.author	Chung, Karen
dc.date.accessioned	2026-02-12T17:12:42Z
dc.date.available	2026-02-12T17:12:42Z
dc.date.issued	2025-09
dc.date.submitted	2025-09-15T14:56:31.150Z
dc.identifier.uri	https://hdl.handle.net/1721.1/164823
dc.description.abstract	GPU-compatible probabilistic programming languages (PPLs) have enabled high-performance, data-parallel programmable inference. However, these systems face fundamental trade-offs between expressiveness and performance, as their GPU code generation is automated and black-boxed, limiting optimization opportunities and imposing restrictions on program expressivity. This thesis introduces GenCUDA, a probabilistic programming system that addresses this limitation by embedding the CUDA GPU programming language directly into a C++/CUDA frontend, enabling GPU programmable inference with fine-grained control over runtime and memory profiles. GenCUDA extends the Gen probabilistic programming architecture by providing a dynamic modeling language (DML) that allows users to write performance-critical sections of generative functions as CUDA kernels while maintaining automatic trace management and the generative function interface (GFI). The system supports both sequential and parallel execution contexts through specialized effect handlers that seamlessly compose CPU and GPU code paths. Key technical contributions include: (1) a high-performance GPU distributions library achieving 10-100× speedups over TensorFlow-Probability, (2) memory-efficient trace management via template-optimized parallel effect handlers, and (3) vectorized generative functions that enable massive parallelization of inference algorithms. We demonstrate GenCUDA’s capabilities through comprehensive benchmarks on inference algorithms applied to diverse models including factor graphs, mixture models, and Hidden Markov Models. Results show significant performance improvements over JAX-based implementations: up to 3× speedup for importance sampling on a hierarchical model, 5.7× speedup for parallel Gibbs sampling on factor graphs, and memory efficiency improvements for large-scale mixture models supporting up to 6× as many clusters compared to existing frameworks’ limits. The system maintains the composability and expressiveness of probabilistic programming while unlocking GPU performance optimization techniques such as kernel fusion and memory hierarchy exploitation that are inaccessible to higher-level frameworks. GenCUDA demonstrates that embedding low-level GPU programming within automated probabilistic inference workflows can achieve both performance gains and algorithmic expressivity without sacrificing the modularity of probabilistic programming paradigms.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Probabilistic Programming with Low-Level, High-Performance GPU Programmable Inference
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name:: chung-seoyeon-meng-eecs-2025-t ...
Size:: 4.270Mb
Format:: PDF
Description:: Thesis PDF

View/Open

This item appears in the following Collection(s)

Graduate Theses

Show simple item record