Balancing utility and privacy of high-dimensional datasets: mobile phone metadata
Author(s)
Noriega Campero, Alejandro
Other Contributors
Technology and Policy Program.
Advisor
Alex 'Sandy' Pentland.
Abstract
Large-scale datasets of human behavior have the potential to fundamentally transform the way we develop cities, fight disease and crime, and respond to natural disasters. However, understanding the privacy of these datasets is key to their broad use and potential impact, because they contain sensitive information such as citizens' geolocation. Moreover, recent research has demonstrated adversarial methods that successfully link sensitive information in the datasets to individuals, even when all personal identifiers are pseudonymized. This thesis conceptualizes, relates, and generalizes salient methodologies for disclosure analysis of pseudonymized data developed over the last two decades, such as k-anonymity, t-closeness, and unicity. The data at the core of the so-called "big data" revolution are fundamentally high-dimensional, and we show that the implications of high dimensionality are paradigmatic to modern disclosure analysis. Consequently, we propose and analyze a methodological framework that couples information-theoretic concepts from t-closeness and δ-disclosure with the partial adversarial knowledge model introduced by unicity [1] [2], as well as its possible extensions. The various methodologies were applied and compared on a large dataset of mobile phone records (CDRs), where results empirically showed ordinal equivalence between unicity measures and the information distance measures EM-disclosure and KL-disclosure. Advantages of the proposed framework are highlighted, and future research avenues are identified. We also investigate the tradeoff between data privacy and data usefulness for mobile phone metadata (CDRs) and its real-world applications. On the disclosure side, four spatio-temporal points were enough to uniquely identify more than 95% of individuals at a [ZIP code, 1 hour] spatio-temporal granularity, consistent with the main results in the literature.
As the dataset was coarsened in space and time, the ratio of uniquely identified individuals (unicity) decreased to values below 0.2% for data specified at [District, 1 week] granularity or coarser. Among the 10 experts surveyed for this study, we confirmed the existence of a utility-privacy tradeoff, i.e., a positive relationship between reidentification risk and data utility. However, Pareto analysis revealed that several granularity levels (generalization profiles) are Pareto-suboptimal, so the tradeoff is not strict. Non-strictness implies that not all privacy gains entail utility loss and, conversely, that not all utility gains entail privacy loss. The results thus suggest that data policy decisions should rest on an understanding of the underlying privacy-utility tradeoff, since inefficient policies can otherwise be implemented, unnecessarily incurring privacy or utility losses. Lastly, we show that, because of the ordinal equivalence observed on the CDR dataset, Pareto properties are preserved, and thus these results on the utility-privacy tradeoff are invariant to assessing disclosure by information distance measures such as EM-disclosure and KL-disclosure. This work helps shed light on the privacy-utility tradeoff inherent to high-dimensional datasets of large societal systems. Its results and methodology are relevant for actors in both academic and policy domains, and germane as society engages in debate over technological and legal frameworks for potentially ubiquitous data generation and use.
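To make the unicity measure discussed in the abstract concrete, the following is a minimal Python sketch of one way such a ratio could be estimated. It is not the thesis's code: the function names `unicity` and `coarsen`, the sampling procedure, and the toy traces are all hypothetical, and the abstract does not specify the estimation procedure actually used. The sketch assumes each individual's trace is a set of (zone, hour) points and that the adversary knows p points drawn at random from the target's own trace.

```python
import random


def unicity(traces, p, trials_per_user=1, seed=0):
    """Estimate the fraction of users singled out by p known points.

    traces: dict mapping a pseudonym to a set of (zone, time_bin) points.
    Illustrative estimator only; all names and defaults are assumptions.
    """
    rng = random.Random(seed)
    unique = 0
    total = 0
    for points in traces.values():
        if len(points) < p:
            continue  # too few points to sample p of them
        for _ in range(trials_per_user):
            # Adversary's partial knowledge: p points from the target's trace.
            known = set(rng.sample(sorted(points), p))
            # Count every trace in the dataset that contains all known points.
            matches = sum(1 for other in traces.values() if known <= other)
            unique += matches == 1
            total += 1
    return unique / total if total else 0.0


def coarsen(traces, hours_per_bin):
    """Generalize the time dimension by merging hours into wider bins."""
    return {
        user: {(zone, hour // hours_per_bin) for zone, hour in points}
        for user, points in traces.items()
    }


# Hypothetical toy traces: (zone, hour-of-day) points for three pseudonyms.
traces = {
    "a": {("Z1", 0), ("Z2", 5), ("Z3", 9)},
    "b": {("Z1", 1), ("Z2", 6), ("Z3", 10)},
    "c": {("Z1", 2), ("Z4", 7), ("Z3", 11)},
}

print("unicity at 1-hour bins: ", unicity(traces, p=2))
print("unicity at 24-hour bins:", unicity(coarsen(traces, 24), p=2))
```

In this toy example, coarsening the time bins makes several traces share the same points, so fewer individuals are singled out. This mirrors, in miniature, the decrease in unicity that the abstract reports as the dataset is coarsened in space and time.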
Description
Thesis: S.M. in Technology and Policy, Massachusetts Institute of Technology, Institute for Data, Systems, and Society, Technology and Policy Program, 2015. Cataloged from PDF version of thesis. Includes bibliographical references (pages 75-76).
Date issued
2015
Department
Massachusetts Institute of Technology. Engineering Systems Division; Massachusetts Institute of Technology. Institute for Data, Systems, and Society; Technology and Policy Program
Publisher
Massachusetts Institute of Technology
Keywords
Institute for Data, Systems, and Society; Engineering Systems Division; Technology and Policy Program