
dc.contributor.advisor | Alex 'Sandy' Pentland. | en_US
dc.contributor.author | Noriega Campero, Alejandro | en_US
dc.contributor.other | Technology and Policy Program. | en_US
dc.date.accessioned | 2016-07-11T14:44:29Z
dc.date.available | 2016-07-11T14:44:29Z
dc.date.copyright | 2015 | en_US
dc.date.issued | 2015 | en_US
dc.identifier.uri | http://hdl.handle.net/1721.1/103573
dc.description | Thesis: S.M. in Technology and Policy, Massachusetts Institute of Technology, Institute for Data, Systems, and Society, Technology and Policy Program, 2015. | en_US
dc.description | Cataloged from PDF version of thesis. | en_US
dc.description | Includes bibliographical references (pages 75-76). | en_US
dc.description.abstract | Large-scale datasets of human behavior have the potential to fundamentally transform the way we develop cities, fight disease and crime, and respond to natural disasters. However, understanding the privacy of these datasets is key to their broad use and potential impact, because they contain sensitive information such as citizens' geo-location. Moreover, recent research has demonstrated adversarial methods that successfully link sensitive information in the datasets to individuals, even when all personal identifiers are pseudonymized. This thesis conceptualizes, relates, and generalizes salient methodologies for disclosure analysis of pseudonymized data developed over the last two decades, such as k-anonymity, t-closeness, and unicity. Data at the core of the so-called "big data" revolution is fundamentally high-dimensional. We show that the implications of high-dimensionality are paradigmatic to modern disclosure analysis. Consequently, we propose and analyze a methodological framework that couples information-theoretic concepts from t-closeness and δ-disclosure with the partial adversarial knowledge model introduced by unicity [1] [2], as well as its possible extensions. The various methodologies were applied and compared on a large dataset of mobile phone records (CDRs), where results empirically showed ordinal equivalence between unicity measures and the information distance measures EM-disclosure and KL-disclosure. Advantages of the proposed framework are highlighted, and avenues for future research are identified. We also investigate the tradeoff between data privacy and data usefulness for mobile phone metadata (CDRs) and its real-world applications. On the disclosure side, four spatiotemporal points were enough to uniquely identify more than 95% of individuals at a [ZIP code, 1 hour] spatiotemporal granularity, consistent with the main results in the literature.
As the dataset was coarsened in space and time, the share of uniquely identifiable individuals (unicity) decreased to values below 0.2% for data specified at [District, 1 week] granularity or coarser. Among the 10 experts surveyed for this study, we confirmed the existence of a utility-privacy tradeoff, i.e., a positive relationship between reidentification risk and data utility. However, Pareto analysis revealed that several granularity levels (generalization profiles) are Pareto-suboptimal, so the tradeoff is not strict. Non-strictness implies that not all privacy gains entail utility loss and, conversely, not all utility gains entail privacy loss. The results thus suggest that data policy decisions should rest on an understanding of the underlying privacy-utility tradeoff, since inefficient policies may otherwise be implemented, incurring unnecessary privacy or utility losses. Lastly, we show that, because of the ordinal equivalence observed on the CDR dataset, Pareto properties are preserved, and these results on the utility-privacy tradeoff are therefore invariant to assessing disclosure with information distance measures such as EM-disclosure and KL-disclosure. This work helps shed light on the privacy-utility tradeoff inherent to high-dimensional datasets of large societal systems. Its results and methodology are relevant for actors in both academic and policy domains, and germane as society engages in debate over technological and legal frameworks for potentially ubiquitous data generation and use. | en_US
dc.description.statementofresponsibility | by Alejandro Noriega Campero. | en_US
dc.format.extent | 76 pages | en_US
dc.language.iso | eng | en_US
dc.publisher | Massachusetts Institute of Technology | en_US
dc.rights | M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. | en_US
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US
dc.subject | Institute for Data, Systems, and Society. | en_US
dc.subject | Engineering Systems Division. | en_US
dc.subject | Technology and Policy Program. | en_US
dc.title | Balancing utility and privacy of high-dimensional datasets : mobile phone metadata | en_US
dc.type | Thesis | en_US
dc.description.degree | S.M. in Technology and Policy | en_US
dc.contributor.department | Massachusetts Institute of Technology. Engineering Systems Division
dc.contributor.department | Massachusetts Institute of Technology. Institute for Data, Systems, and Society
dc.contributor.department | Technology and Policy Program
dc.identifier.oclc | 938937787 | en_US
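
The unicity figures quoted in the abstract (the share of individuals uniquely identified by a few known spatiotemporal points, and how that share drops as the data is coarsened) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `estimate_unicity` and `coarsen`, the trace representation, and the toy data are all assumptions made here for illustration.

```python
import random
from typing import Dict, FrozenSet, Hashable, Tuple

Point = Tuple[Hashable, int]          # (location id, time bin), hypothetical layout
Traces = Dict[str, FrozenSet[Point]]  # pseudonymous user id -> set of points


def estimate_unicity(traces: Traces, p: int, seed: int = 0) -> float:
    """Fraction of users whose trace is the only one containing p of their points.

    For each user with at least p points, draw p points at random (the
    adversary's partial knowledge) and count how many traces contain them all.
    """
    rng = random.Random(seed)
    eligible = [u for u, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        known = set(rng.sample(sorted(traces[user]), p))
        matches = sum(1 for pts in traces.values() if known <= pts)
        if matches == 1:  # only the user's own trace is compatible
            unique += 1
    return unique / len(eligible)


def coarsen(traces: Traces, region_of: Dict[Hashable, Hashable],
            hours_per_bin: int) -> Traces:
    """Generalize locations to regions and hourly time stamps to wider bins."""
    return {
        user: frozenset((region_of[loc], t // hours_per_bin) for loc, t in pts)
        for user, pts in traces.items()
    }


# Toy data (made up): three users, locations as ZIP-like ids, time in hours.
toy = {
    "a": frozenset({("z1", 0), ("z2", 1), ("z3", 5)}),
    "b": frozenset({("z1", 0), ("z2", 2), ("z4", 5)}),
    "c": frozenset({("z1", 0), ("z5", 3), ("z6", 7)}),
}
one_district = {z: "d1" for z in ("z1", "z2", "z3", "z4", "z5", "z6")}
coarse = coarsen(toy, one_district, hours_per_bin=24)

fine_unicity = estimate_unicity(toy, p=2)       # any two points single out a user
coarse_unicity = estimate_unicity(coarse, p=1)  # all traces collapse to one point
```

In this toy example, every pair of points at the fine granularity is unique to one user, while mapping all locations to a single district and all times to one day makes the traces indistinguishable. The brute-force matching is quadratic in the number of users, so a real CDR dataset would need sampling of users or an index over points.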
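
The Pareto analysis mentioned in the abstract can be sketched in the same spirit. A generalization profile is Pareto-suboptimal when another profile offers no more reidentification risk and no less utility, and is strictly better on at least one of the two. The function name `pareto_suboptimal`, the profile labels, and the numbers below are hypothetical and are not results from the thesis.

```python
from typing import Dict, List, Tuple


def pareto_suboptimal(profiles: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the profiles dominated by some other profile.

    profiles maps a label to (reidentification risk, utility); lower risk
    and higher utility are both preferred.
    """
    dominated = []
    for name, (risk, util) in profiles.items():
        for other, (r2, u2) in profiles.items():
            if other == name:
                continue
            no_worse = r2 <= risk and u2 >= util
            strictly_better = r2 < risk or u2 > util
            if no_worse and strictly_better:
                dominated.append(name)
                break
    return dominated


# Made-up (risk, utility) pairs for four hypothetical granularity profiles.
example = {
    "profile_A": (0.95, 0.9),
    "profile_B": (0.60, 0.7),
    "profile_C": (0.70, 0.5),  # riskier and less useful than profile_B
    "profile_D": (0.05, 0.2),
}
suboptimal = pareto_suboptimal(example)
```

Here only `profile_C` is dominated: moving from it to `profile_B` gains both privacy and utility, which is the sense in which the abstract calls the tradeoff non-strict.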

