Show simple item record

dc.contributor.advisorBerger, Bonnie
dc.contributor.advisorAmarasinghe, Saman
dc.contributor.authorShajii, Ariya
dc.date.accessioned2022-02-07T15:22:57Z
dc.date.available2022-02-07T15:22:57Z
dc.date.issued2021-09
dc.date.submitted2021-09-21T19:30:53.561Z
dc.identifier.urihttps://hdl.handle.net/1721.1/140081
dc.description.abstractNext-generation sequencing data is growing at an unprecedented rate, leading to new revelations in biology, healthcare, and medicine. Many researchers use high-level programming languages to navigate and analyze this data, but as gigabytes grow to terabytes or even petabytes, high-level languages become prohibitive and impractical for performance reasons. This thesis introduces Seq, a Python-based, domain-specific language for bioinformatics and genomics that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. Seq is showcased and evaluated by implementing a range of standard, widely-used applications from all stages of the genomics analysis pipeline, including genomic index construction, data pre- and post-processing, read mapping and alignment, and haplotype phasing. We show that the Seq implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. Seq's substantial performance gains are made possible by a host of novel genomics-specific compiler optimizations that are out of reach for general-purpose compilers, coupled with a static type system that avoids all of Python's runtime overhead and object metadata. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq aims to act as a catalyst for scientific discovery and innovation. Finally, we also generalize many of the principles used by Seq to create a domain-configurable compiler called Codon, which can be applied to other domains with similar results.
dc.publisherMassachusetts Institute of Technology
dc.rightsIn Copyright - Educational Use Permitted
dc.rightsCopyright MIT
dc.rights.urihttp://rightsstatements.org/page/InC-EDU/1.0/
dc.titleHigh-Performance Computational Genomics
dc.typeThesis
dc.description.degreePh.D.
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degreeDoctoral
thesis.degree.nameDoctor of Philosophy


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record