dc.contributor.advisor: Zhang, Feng
dc.contributor.author: Altae-Tran, Han
dc.date.accessioned: 2026-04-06T22:06:47Z
dc.date.available: 2026-04-06T22:06:47Z
dc.date.issued: 2023-09
dc.date.submitted: 2023-11-16T01:06:11.807Z
dc.identifier.uri: https://hdl.handle.net/1721.1/165341
dc.description.abstract: This dissertation presents a big data-driven approach for biological and biomedical discovery. The topics covered include the evolution and diversity of CRISPR systems, the identification and analysis of hypervariable protein systems, the identification of ancestral systems, and the development of RNA-guided systems for genome editing and therapeutic applications. Additionally, one of the chapters focuses on population-scale longitudinal mapping of COVID-19 symptoms, behavior, and testing, providing valuable insights for public health officials during the early stages of the pandemic. Through the development of novel methodologies and the utilization of big data-driven methods, this dissertation contributes to the expanding landscape of biomedical research. Chapter II delves into the origins of Cas9 and Cas12, examining the evolutionarily conserved non-coding RNA associated with IscB and the diverse RNA-guided nucleases encoded by IS200/IS605 elements. Through phylogenetic analysis and experimental characterization, we gain insight into the evolutionary history and diversity of IscB systems, and their potential biological functions. In Chapter III, we explore the diversity and function of Obligate Mobile Element Guided Activity (OMEGA) systems, focusing on TnpB and its relationship with Cas12 systems. We examine the taxonomy, genomic features, and evolution of these systems, as well as their mobility and potential exaptation. Chapter IV is dedicated to optimizing OMEGA RNA-guided systems for therapeutic applications. We screen natural IscB variants for efficient genome editing and engineer OrufIscB for enhanced activity, demonstrating its potential as a versatile genome interrogation tool. In Chapter V, we employ deep terascale clustering to discover functionally diverse CRISPR systems.
Using a fast locality-sensitive hashing algorithm, we identify rare CRISPR systems, such as DinG-HNH, Type I Cascade components with HNH domains, and the Type VII CRISPR system, which is a precise RNA-guided RNA endonuclease complex containing a β-CASP nuclease. In Chapter VII, we investigate compact RNA editors, focusing on the discovery and characterization of Cas13bt. We repurpose Cas13bt for base editing and deliver these base editors to human cells using adeno-associated viruses (AAV), demonstrating their potential for therapeutic applications. Chapter VIII focuses on the identification and analysis of hypervariable protein systems with repeat signatures, seeking generalizations of concepts from other repeat systems such as CRISPR and TALENs. A computational pipeline is established to identify hypervariable repeat signatures in proteins, yielding candidate systems that are characterized in further detail. Multiple new mechanisms of modularity (two functions decoupled via an interchangeable domain or structure, such as repeats) are identified, pointing to a broader landscape of hypervariable protein systems than previously thought. These findings have implications for the understanding of protein architectures and may also inform the design of novel protein-based tools and therapeutics. Finally, Chapter IX presents one of the early large-scale studies conducted during the COVID-19 pandemic, focusing on population-scale longitudinal mapping of COVID-19 symptoms, behavior, and testing. The study collected data from the large user base of the How We Feel application. The analysis used techniques such as logistic regression, UMAP, and prediction models to identify factors associated with testing propensity, symptoms associated with COVID-19, and the behavior of patients after contracting COVID-19.
The findings from this study could have provided valuable insights in the early stages of the pandemic, helping policymakers and public health officials, such as those in the state of Connecticut, make data-driven decisions. Overall, this dissertation presents methods for, and results from, applying big data-driven methods to discovery in large biomedical databases. It focuses on the exploration of diverse CRISPR systems, ancestors of CRISPR systems, hypervariable protein systems, and protein engineering for therapeutic and genome editing applications. From examining the evolutionary origins of Cas9 and Cas12 to investigating the diversity of OMEGA systems and optimizing them for therapeutic use, this work deepens our understanding of these complex biological systems. The discovery of rare CRISPR systems and compact RNA editors further broadens the landscape of genetic tools with potential therapeutic applications. Additionally, the identification and analysis of hypervariable protein systems reveal new mechanisms of modularity, with implications for protein architectures and the development of novel protein-based tools. Finally, the large-scale study of COVID-19 symptoms, behavior, and testing during the early stages of the pandemic demonstrates the power of data-driven approaches in informing public health decisions. Collectively, this research contributes to our understanding of complex biological systems and highlights the potential for their application in advancing human health and biotechnology.
dc.publisher: Massachusetts Institute of Technology
dc.rights: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-sa/4.0/
dc.title: Data Driven Discovery of Modular Biological Systems
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Biological Engineering
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

