DSpace@MIT

Probabilistic Programming with Low-Level, High-Performance GPU Programmable Inference

Author(s)
Chung, Karen
Download
Thesis PDF (4.27 MB)
Advisor
Mansinghka, Vikash K.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
GPU-compatible probabilistic programming languages (PPLs) have enabled high-performance, data-parallel programmable inference. However, these systems face fundamental trade-offs between expressiveness and performance: their GPU code generation is automated and opaque, limiting optimization opportunities and restricting program expressivity. This thesis introduces GenCUDA, a probabilistic programming system that addresses this limitation by embedding the CUDA GPU programming language directly into a C++/CUDA frontend, enabling GPU-programmable inference with fine-grained control over runtime and memory profiles.

GenCUDA extends the Gen probabilistic programming architecture with a dynamic modeling language (DML) that lets users write performance-critical sections of generative functions as CUDA kernels while retaining automatic trace management and the generative function interface (GFI). The system supports both sequential and parallel execution contexts through specialized effect handlers that seamlessly compose CPU and GPU code paths. Key technical contributions include: (1) a high-performance GPU distributions library achieving 10-100× speedups over TensorFlow Probability, (2) memory-efficient trace management via template-optimized parallel effect handlers, and (3) vectorized generative functions that enable massive parallelization of inference algorithms.

We demonstrate GenCUDA's capabilities through comprehensive benchmarks on inference algorithms applied to diverse models, including factor graphs, mixture models, and hidden Markov models. Results show significant performance improvements over JAX-based implementations: up to 3× speedup for importance sampling on a hierarchical model, 5.7× speedup for parallel Gibbs sampling on factor graphs, and memory-efficiency improvements for large-scale mixture models, supporting up to 6× as many clusters as existing frameworks allow.
The system maintains the composability and expressiveness of probabilistic programming while unlocking GPU performance optimization techniques such as kernel fusion and memory hierarchy exploitation that are inaccessible to higher-level frameworks. GenCUDA demonstrates that embedding low-level GPU programming within automated probabilistic inference workflows can achieve both performance gains and algorithmic expressivity without sacrificing the modularity of probabilistic programming paradigms.
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164823
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
