Nowhere to hide

Privacy-sensitive data (social relationships, mobility traces, medical records) are increasingly being made public to provide open access to application developers or to facilitate data-mining research. To protect users' privacy, data anonymization techniques have been the focus of extensive investigation. However, even when equipped with sophisticated anonymization techniques, the privacy of the data still faces a significant challenge: adversaries can also obtain rich auxiliary information (also called background information) from other channels such as the Internet or public records. The era of big data further increases this threat, since more correlated data containing individual records are exposed on the Internet, which can serve as auxiliary information for inferring users' private data.

Most privacy-sensitive data are closely related to individual behavior, and thus contain rich graph structural characteristics. For example, social networks can be modeled as graphs in a straightforward manner, and mobility traces can also be modeled as graph topologies. Due to the vulnerabilities of existing anonymization schemes, it has been experimentally demonstrated that structure-based de-anonymization can break the data's privacy using only the structural information of the data. One concrete example is the revelation of users' social relationships. Many people nowadays have accounts on multiple social networks such as Facebook, Twitter, Google+, Myspace and Flickr. Based on the inherent cross-site correlations, a user's information on one site can be effectively de-anonymized by utilizing auxiliary information from another site. Furthermore, other public datasets may also contain individual behavior information, e.g., purchasing history, movie-watching history and Wi-Fi access records. Despite these empirical de-anonymization approaches, there is still no rigorous theoretical analysis of the vulnerabilities of existing anonymization techniques.
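To make the idea of structure-based de-anonymization concrete, the following is a minimal toy sketch (not any published algorithm): it matches pseudonymous nodes in an anonymized graph to named nodes in a structurally correlated auxiliary graph by comparing a simple structural signature (a node's degree together with the sorted degrees of its neighbors), breaking ties greedily. The graphs, pseudonyms, and names are entirely hypothetical.

```python
def signature(graph, node):
    """Structural signature of a node: (degree, sorted degrees of its neighbors).

    `graph` is an adjacency mapping {node: set of neighbors}.
    """
    nbrs = graph[node]
    return (len(nbrs), tuple(sorted(len(graph[n]) for n in nbrs)))


def deanonymize(anon, aux):
    """Greedily map each anonymized node to an unused auxiliary node
    whose structural signature coincides with it.

    This is a toy illustration: real attacks use richer features,
    seed mappings, and iterative propagation to resolve ambiguities.
    """
    mapping, used = {}, set()
    for a in anon:
        sig_a = signature(anon, a)
        for x in aux:
            if x not in used and signature(aux, x) == sig_a:
                mapping[a] = x
                used.add(x)
                break
    return mapping


# Anonymized release: pseudonyms and edges only.
anon = {
    "u1": {"u2", "u3"},
    "u2": {"u1"},
    "u3": {"u1", "u4"},
    "u4": {"u3"},
}
# Auxiliary graph from another site: same structure, real identities
# (hypothetical names).
aux = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice", "dave"},
    "dave": {"carol"},
}

print(deanonymize(anon, aux))
```

Even this crude signature recovers the full mapping on a perfectly correlated toy pair of graphs; the interesting (and, per the discussion above, theoretically unresolved) question is how much structural noise and perturbation such matching can tolerate.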
An anonymization approach must not only protect users' sensitive information; the anonymized data must also remain useful for applications, i.e., the utility of the anonymized data should be guaranteed. Then, under what utility conditions is it possible for an individual's privacy to be broken? We aim to quantify the vulnerabilities of existing anonymization techniques and to establish the inherent relationship between anonymized utility and de-anonymization capability. Furthermore, previous de-anonymization techniques suffer from imprecision in the adversary's background knowledge and from perturbations that may have been applied to the data prior to release. We aim to address these problems by proposing a novel de-anonymization technique.