Adaptive scheduling in Spark

Mahajan, Rohan

Author(s)

Mahajan, Rohan

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

Matei Zaharia.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Abstract

Because most data processing systems are distributed, data must be transferred between machines. Spark, a prominent example of such a system, currently fixes its strategy for shuffling this data in advance, but in certain situations a different shuffle strategy would improve performance. We add functionality to track metrics about the data during a job and adapt the shuffle strategy accordingly. We show improvements in ShuffledRDD performance, in joins using Spark's RDD interface, and in joins in Spark SQL.
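
The general idea of choosing a strategy from metrics observed at run time can be illustrated with a minimal sketch. It is not the thesis's implementation, and the abstract does not say which metrics or strategies are used. The sketch assumes the tracked metric is the number of bytes seen on each side of a join, and that the choice is between broadcasting the smaller side and shuffling both sides. The names `SideMetrics`, `choose_join_strategy`, and `BROADCAST_THRESHOLD_BYTES`, and the 10 MB threshold, are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold: a side at or below this size is treated as cheap
# to copy to every machine. The value is an assumption for illustration.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class SideMetrics:
    """Size observed for one join input while the job runs (assumed metric)."""
    name: str
    bytes_seen: int


def choose_join_strategy(left: SideMetrics, right: SideMetrics,
                         threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a strategy from runtime metrics instead of fixing it up front."""
    smaller = min(left, right, key=lambda side: side.bytes_seen)
    if smaller.bytes_seen <= threshold:
        # Copy the small side to every machine; the large side is not moved.
        return f"broadcast:{smaller.name}"
    # Both sides are large: repartition both by key across machines.
    return "shuffle"


if __name__ == "__main__":
    orders = SideMetrics("orders", 5_000_000_000)
    countries = SideMetrics("countries", 20_000)
    print(choose_join_strategy(orders, countries))
```

A decision of this kind can only be made after some data has been measured, which is why the metrics have to be tracked during the job and not estimated beforehand.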

Description

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.

This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.

Cataloged from student-submitted PDF version of thesis.

Includes bibliographical references (page 33).

Date issued

2016

URI

http://hdl.handle.net/1721.1/105977

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses