Show simple item record

dc.contributor.advisor	Raskar, Ramesh
dc.contributor.author	Singh, Abhishek
dc.date.accessioned	2025-12-03T16:12:26Z
dc.date.available	2025-12-03T16:12:26Z
dc.date.issued	2025-05
dc.date.submitted	2025-09-21T19:39:43.012Z
dc.identifier.uri	https://hdl.handle.net/1721.1/164169
dc.description.abstract	The remarkable scaling of data and computation has unlocked unprecedented capabilities in text and image generation, raising the question: Why hasn’t healthcare seen similar breakthroughs? This disparity stems primarily from healthcare data being fragmented across thousands of institutions, each safeguarding patient records in regulatory-compliant silos. The problem is not limited to healthcare but extends to other industries with data fragmented across institutions and individuals. Instead of centralizing these datasets to solve the fragmentation problem, which raises regulatory and ethical concerns, this thesis proposes systems and algorithms to decentralize the machine learning pipeline. Current approaches in this area have centered around Federated Learning (FL), which enables model training over distributed data. However, FL’s dependence on central coordination and its inflexibility with heterogeneous systems limit its applicability in healthcare settings. Motivated by these challenges, I explore the following three core themes:

1) Coordination – Today’s coordination algorithms typically rely on static rules or randomized communication, approaches that turn out to be sub-optimal when data heterogeneity is high. I present a new system and a benchmark framework that enable systematic assessment of different coordination algorithms. Next, I propose an adaptive coordination algorithm that leverages historical performance and learning dynamics to improve coordination.

2) Heterogeneity – Data owners can vary significantly in their data distributions, computational resources, and privacy requirements. To address this heterogeneity, I turn the focus from the traditionally protected training phase to securing the critical inference process. Next, I develop techniques for distributed training that adapt to heterogeneous computational capabilities across different agents.

3) Scalability – Enabling scaling in decentralized ML requires addressing three key challenges: parallelization, synchronization, and self-scaling. While parallelization has advanced significantly, the other two remain challenging. I present a framework for offline collaboration through sanitized, synthetic datasets that eliminates the need for constant synchronization while preserving privacy.

This thesis identifies and addresses some of the bottlenecks along these three core themes through a complementary set of solutions: adaptive coordination, heterogeneity-aware training, and scalable collaboration. Together, these building blocks can enable a practical framework for unlocking data silos across institutions.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Decentralized Machine Learning over Fragmented Data
dc.type	Thesis
dc.description.degree	Ph.D.
dc.contributor.department	Program in Media Arts and Sciences (Massachusetts Institute of Technology)
dc.identifier.orcid	https://orcid.org/0000-0003-0217-9801
mit.thesis.degree	Doctoral
thesis.degree.name	Doctor of Philosophy
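
The abstract above contrasts FL's static, centrally coordinated model averaging with the adaptive coordination the thesis proposes. As a rough illustration of the coordination step being critiqued, here is a minimal FedAvg-style sketch in Python; the silo data, local update rule, and weighting scheme are all toy assumptions for exposition, not material from the thesis.

```python
# Illustrative sketch only: a toy FedAvg-style round, NOT the
# coordination algorithm developed in this thesis.
import numpy as np

def local_update(weights, local_data, lr=0.1):
    # Hypothetical local step: each silo nudges the scalar model toward
    # the mean of its own data (a stand-in for real gradient descent).
    grad = weights - np.mean(local_data)
    return weights - lr * grad

def fedavg_round(global_weights, silos):
    # The central-coordination step the abstract critiques: one server
    # collects every silo's update and averages them under a static
    # rule (weighting by dataset size).
    updates = [local_update(global_weights, d) for d in silos]
    sizes = np.array([len(d) for d in silos], dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes)

rng = np.random.default_rng(0)
# Three toy silos with heterogeneous data distributions and sizes.
silos = [rng.normal(mu, 1.0, size=n) for mu, n in [(0.0, 50), (2.0, 20), (5.0, 5)]]

w = np.zeros(1)
for _ in range(20):
    w = fedavg_round(w, silos)
print(w)  # settles near the size-weighted mean of the silo means
```

An adaptive coordinator, as the abstract proposes, would replace this fixed size-based weighting and single-server synchronization with decisions informed by historical performance and learning dynamics; the sketch only shows where that static rule sits in the pipeline.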

