Decentralized Machine Learning over Fragmented Data
Author(s)
Singh, Abhishek
Advisor
Raskar, Ramesh
Abstract
The remarkable scaling of data and computation has unlocked unprecedented capabilities in text and image generation, raising the question: Why hasn’t healthcare seen similar breakthroughs? This disparity stems primarily from healthcare data being fragmented across thousands of institutions, each safeguarding patient records in regulatory-compliant silos. The problem is not limited to healthcare but extends to other industries with fragmented data across institutions and individuals. Instead of centralizing various datasets to solve the fragmentation problem, which raises regulatory and ethical concerns, this thesis proposes systems and algorithms to decentralize the machine learning pipeline. Current approaches in this area have centered around Federated Learning (FL), which enables model training over distributed data. However, FL’s dependence on central coordination and inflexibility with heterogeneous systems limit its applicability in healthcare settings. Motivated by these challenges, I explore the following three core themes:
1) Coordination – Today's coordination algorithms typically rely on static rules or randomized communication, both of which prove suboptimal when data heterogeneity is high. I present a new system and a benchmark framework that enable systematic assessment of different coordination algorithms. Next, I propose an adaptive coordination algorithm that leverages historical performance and learning dynamics to improve coordination (a sketch of this idea appears after this list).
2) Heterogeneity – Data owners can vary significantly in their data distributions, computational resources, and privacy requirements. To address this heterogeneity, I shift the focus from the traditionally protected training phase to securing the equally critical inference process. Next, I develop techniques for distributed training that adapt to heterogeneous computational capabilities across agents (a sketch appears after this list).
3) Scalability – Scaling decentralized ML requires addressing three key challenges: parallelization, synchronization, and self-scaling. While parallelization has advanced significantly, the other two remain challenging. I present a framework for offline collaboration through sanitized, synthetic datasets that eliminates the need for constant synchronization while preserving privacy (a sketch appears after this list).
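
For theme 1, here is a minimal sketch of what history-driven coordination could look like, written as a bandit-style peer-selection rule in Python. This is an illustration under my own assumptions, not the algorithm developed in the thesis; the names (AdaptiveCoordinator, pick_peer, update) are hypothetical.

# Bandit-style adaptive peer selection for decentralized learning (illustrative,
# not the thesis's algorithm). Each node scores peers by the historical
# validation improvement their models provided, then favors useful
# collaborators instead of a static or purely random communication topology.
import random
from collections import defaultdict

class AdaptiveCoordinator:
    def __init__(self, peer_ids, epsilon=0.1):
        self.peers = list(peer_ids)
        self.scores = defaultdict(float)   # running usefulness of each peer
        self.counts = defaultdict(int)
        self.epsilon = epsilon             # exploration rate

    def pick_peer(self):
        # epsilon-greedy: mostly exploit the historically best peer, but
        # keep exploring because learning dynamics are non-stationary
        if random.random() < self.epsilon:
            return random.choice(self.peers)
        return max(self.peers, key=lambda p: self.scores[p])

    def update(self, peer, val_loss_before, val_loss_after):
        # reward = validation-loss improvement after exchanging with this peer
        reward = val_loss_before - val_loss_after
        self.counts[peer] += 1
        self.scores[peer] += (reward - self.scores[peer]) / self.counts[peer]

# usage: c = AdaptiveCoordinator(["A", "B", "C"])
# peer = c.pick_peer(); ...train and average with peer...; c.update(peer, 0.92, 0.85)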
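
For theme 2, a sketch of one way training can adapt to heterogeneous compute, in the spirit of width-slicing schemes such as HeteroFL: each agent trains only a slice of the shared weights sized to its budget, and the server averages each entry over the agents that trained it. This is an assumption-laden illustration, not the method developed in the thesis.

# Width-sliced training across agents with different compute budgets
# (illustrative stand-in for heterogeneity-aware training).
import numpy as np

def slice_for(weights, frac):
    # give a capacity-constrained agent only the top-left slice of the weights
    rows = max(1, int(weights.shape[0] * frac))
    cols = max(1, int(weights.shape[1] * frac))
    return weights[:rows, :cols].copy()

def aggregate(global_w, updates):
    # average each entry over the agents that actually trained it;
    # untouched entries keep their previous global value
    acc = np.zeros_like(global_w)
    cnt = np.zeros_like(global_w)
    for u in updates:
        r, c = u.shape
        acc[:r, :c] += u
        cnt[:r, :c] += 1
    out = global_w.copy()
    mask = cnt > 0
    out[mask] = acc[mask] / cnt[mask]
    return out

global_w = np.random.randn(8, 8)
# a phone-class agent trains a 25%-width slice, a server-class agent the full
# model; the subtractions stand in for local gradient updates
update_a = slice_for(global_w, 0.25) - 0.01
update_b = slice_for(global_w, 1.0) - 0.02
global_w = aggregate(global_w, [update_a, update_b])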
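
For theme 3, a sketch of the offline-collaboration pattern in miniature: a data owner releases a sanitized statistical summary once, and collaborators later synthesize training data from it with no further synchronization. The Gaussian-noise sanitizer here is my illustrative stand-in, not the mechanism proposed in the thesis.

# Offline collaboration via a sanitized synthetic dataset (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def sanitize(X, y, noise_scale=0.5):
    # release noisy per-class means and variances (Gaussian-mechanism style);
    # this small summary is the only artifact the owner ever shares
    stats = {}
    for label in np.unique(y):
        Xc = X[y == label]
        mean = Xc.mean(axis=0) + rng.normal(0, noise_scale, Xc.shape[1])
        var = Xc.var(axis=0) + np.abs(rng.normal(0, noise_scale, Xc.shape[1]))
        stats[int(label)] = (mean, var)
    return stats

def synthesize(stats, n_per_class=100):
    # collaborators sample synthetic points from the released statistics,
    # fully offline and at any later time
    Xs, ys = [], []
    for label, (mean, var) in stats.items():
        Xs.append(rng.normal(mean, np.sqrt(var), size=(n_per_class, len(mean))))
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

# toy end-to-end run with random stand-in data
X_private = rng.normal(size=(200, 4))
y_private = rng.integers(0, 2, 200)
stats = sanitize(X_private, y_private)        # owner side, once
X_syn, y_syn = synthesize(stats)              # collaborator side, later, offline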
This thesis identifies and addresses some of the bottlenecks along these three core themes through a complementary set of solutions: adaptive coordination, heterogeneity-aware training, and scalable collaboration. Together, these building blocks can enable a practical framework for unlocking data silos across institutions.
Date issued
2025-05
Department
Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Publisher
Massachusetts Institute of Technology