MIT Libraries: DSpace@MIT

Balancing utility and privacy of high-dimensional datasets : mobile phone metadata

Author(s)
Noriega Campero, Alejandro
Other Contributors
Technology and Policy Program.
Advisor
Alex 'Sandy' Pentland.
Terms of use
M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract
Large-scale datasets of human behavior have the potential to fundamentally transform the way we develop cities, fight disease and crime, and respond to natural disasters. However, understanding the privacy of these datasets is key to their broad use and potential impact, because they contain sensitive information such as citizens' geo-location. Moreover, recent research has demonstrated adversarial methods that successfully associate sensitive information in the datasets with individuals, even when all personal identifiers have been pseudonymized. This thesis conceptualizes, relates, and generalizes salient methodologies for disclosure analysis of pseudonymized data developed over the last two decades, such as k-anonymity, t-closeness, and unicity. Data at the core of the so-called "big data" revolution is fundamentally high-dimensional. We show that the implications of high dimensionality are paradigmatic for modern disclosure analysis. Consequently, we propose and analyze a methodological framework that couples information-theoretic concepts from t-closeness and δ-disclosure with the partial adversarial knowledge model introduced by unicity [1] [2], as well as its possible extensions. The various methodologies were applied and compared on a large dataset of mobile phone records (CDRs), where results empirically showed ordinal equivalence among unicity measures and the information distance measures EM-disclosure and KL-disclosure. Advantages of the proposed framework are highlighted, and future research avenues are identified. We also investigate the tradeoff between data privacy and data usefulness for mobile phone metadata (CDRs) and its real-world applications. On the disclosure side, four spatio-temporal points were enough to uniquely identify more than 95% of individuals at a [ZIP code, 1 hour] spatio-temporal granularity, consistent with the main results in the literature.
As the dataset was coarsened in space and time, this ratio (unicity) decreased to values below 0.2% for data specified at [District, 1 week] granularity or coarser. We confirmed the existence of a utility-privacy tradeoff for the 10 experts surveyed for this study, i.e., a positive relationship between reidentification risk and data utility. However, Pareto analysis revealed that several granularity levels (generalization profiles) are Pareto-suboptimal, so the tradeoff is not strict. Non-strictness implies that not all privacy gains entail utility loss and, conversely, not all utility gains entail privacy loss. The results thus suggest that data policy decisions should rest on an understanding of the underlying privacy-utility tradeoff, since inefficient policies can otherwise be implemented, incurring privacy or utility losses unnecessarily. Lastly, we show that, because of the ordinal equivalence observed on the CDR dataset, Pareto properties are preserved, and these results on the utility-privacy tradeoff are therefore invariant to assessing disclosure by information distance measures such as EM-disclosure and KL-disclosure. This work helps shed light on the privacy and utility tradeoff inherent to high-dimensional datasets of large societal systems. Its results and methodology are relevant for actors in both academic and policy domains, and germane as society engages in debate over technological and legal frameworks for potentially ubiquitous data generation and use.
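As a rough illustration of two ideas in the abstract, the sketch below estimates unicity (the share of users whose trace is the only one containing p known spatio-temporal points) before and after coarsening, and then filters a set of generalization profiles down to the Pareto-optimal ones. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the integer grid used for locations and hours, and all toy numbers are hypothetical and are not results from the thesis.

```python
import random


def estimate_unicity(traces, p, seed=0):
    """Estimate unicity: the fraction of users (with at least p points)
    whose trace is the only one containing p points drawn from it.

    traces: dict mapping user id -> set of (cell, hour) tuples.
    """
    rng = random.Random(seed)
    eligible = {u: pts for u, pts in traces.items() if len(pts) >= p}
    if not eligible:
        return 0.0
    unique = 0
    for user, points in eligible.items():
        # Adversary's partial knowledge: p points from this user's trace.
        known = set(rng.sample(sorted(points), p))
        # Count every trace (including the user's own) consistent with it.
        matches = sum(1 for other in traces.values() if known <= other)
        if matches == 1:
            unique += 1
    return unique / len(eligible)


def coarsen(traces, cell_size, window):
    """Generalize each point by merging cell_size cells and window hours.
    Assumes cells and hours are non-negative integers (a toy grid)."""
    return {
        user: {(cell // cell_size, hour // window) for cell, hour in points}
        for user, points in traces.items()
    }


def pareto_optimal(profiles):
    """Return names of profiles not dominated by any other profile.

    profiles: dict mapping name -> (risk, utility); lower risk and
    higher utility are better.
    """
    def dominates(a, b):
        return a[0] <= b[0] and a[1] >= b[1] and a != b

    return sorted(
        name
        for name, value in profiles.items()
        if not any(dominates(other, value) for other in profiles.values())
    )


# Hypothetical toy data, for illustration only.
toy_traces = {
    "a": {(0, 0), (1, 1)},
    "b": {(0, 0), (2, 1)},
    "c": {(5, 8), (6, 9)},
}
toy_profiles = {
    "fine": (0.95, 0.9),      # (reidentification risk, utility)
    "medium": (0.4, 0.7),
    "wasteful": (0.5, 0.6),   # riskier and less useful than "medium"
    "coarse": (0.01, 0.2),
}

if __name__ == "__main__":
    print(estimate_unicity(toy_traces, 2))
    print(estimate_unicity(coarsen(toy_traces, 4, 2), 1))
    print(pareto_optimal(toy_profiles))
```

In the toy profiles, "wasteful" is dominated by "medium", which mirrors the abstract's point that a Pareto-suboptimal granularity level can be replaced by one that improves privacy without reducing utility.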
Description
Thesis: S.M. in Technology and Policy, Massachusetts Institute of Technology, Institute for Data, Systems, and Society, Technology and Policy Program, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 75-76).

Date issued
2015
URI
http://hdl.handle.net/1721.1/103573
Department
Massachusetts Institute of Technology. Engineering Systems Division; Massachusetts Institute of Technology. Institute for Data, Systems, and Society; Technology and Policy Program
Publisher
Massachusetts Institute of Technology
Keywords
Institute for Data, Systems, and Society; Engineering Systems Division; Technology and Policy Program

Collections
  • Graduate Theses
