Graph data are ubiquitous, but their combinatorial structure makes them hard to use efficiently. This motivates graph embedding methods, which learn a continuous vector representation of a graph that traditional machine learning algorithms, which favor vector inputs, can readily consume. Graph embedding methods build an important bridge between social network analysis and data analytics, as social networks continuously generate an unprecedented volume of graph data. Publishing social network data not only benefits public health, disaster response, commercial promotion, and many other applications, but also gives rise to threats that jeopardize each individual’s privacy and security. Unfortunately, most existing work on publishing social graph embedding data focuses only on preserving social graph structure and pays less attention to the privacy issues inherited from social networks. Specifically, attackers can infer the presence of a sensitive relationship between two individuals by training a predictive model on the exposed social network embedding. In this paper, we propose a novel link-privacy preserved graph embedding framework using adversarial learning, which can reduce an adversary’s prediction accuracy on sensitive links while preserving sufficient non-sensitive information, such as graph topology and node attributes, in the graph embedding. Extensive experiments are conducted to evaluate the proposed framework using ground-truth social network datasets.
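The link inference attack described above can be illustrated with a minimal sketch. It is not the paper's attacker model: it assumes the adversary holds the published embedding matrix and a few labeled node pairs, and trains a plain logistic regression on the element-wise product of the two node embeddings. The function names `pair_features`, `train_link_attacker`, and `predict_link_proba`, as well as the learning rate and epoch count, are hypothetical choices made for illustration.

```python
import numpy as np

def pair_features(emb, pairs):
    """Element-wise (Hadamard) product of the two node embeddings per pair."""
    pairs = np.asarray(pairs)
    return emb[pairs[:, 0]] * emb[pairs[:, 1]]

def train_link_attacker(emb, pairs, labels, lr=0.1, epochs=200):
    """Fit logistic regression by gradient descent (illustrative attacker)."""
    X = pair_features(emb, pairs)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted link probability
        grad = p - y                             # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_link_proba(emb, pairs, w, b):
    """Probability that each candidate pair is linked, per the attacker."""
    X = pair_features(emb, pairs)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Toy usage with random embeddings (assumed data, not from the paper).
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
train_pairs = [(0, 1), (2, 3), (4, 5), (0, 5)]
train_labels = [1, 1, 0, 0]
w, b = train_link_attacker(emb, train_pairs, train_labels)
probs = predict_link_proba(emb, [(1, 2), (3, 4)], w, b)
```

A privacy-preserving embedding in the sense of the abstract would aim to keep such an attacker's accuracy on sensitive links low while leaving the embedding useful for other tasks.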
Graph data publication is considered an important step for data analysis and mining. Graph data, which provide knowledge about interactions among entities, can be locally generated and held by distributed data owners. These data are usually sensitive and private, because they may relate to the owners’ personal activities and can be exploited by adversaries to conduct inference attacks. Current solutions either treat private graph data as centralized content or disregard the overlap among distributed graphs. Therefore, this work proposes a novel framework for distributed graph publication, in which differential privacy is applied to ensure the safety of the published content. The framework includes four phases: graph combination, plan construction sharing, data perturbation, and graph reconstruction. The selection of graphs to publish is guided by one data coordinator, and each graph is carefully perturbed with the Laplace mechanism. The graph selection problem is formulated and proven to be NP-complete, and a heuristic algorithm is then proposed for it. The correctness of the combined graph and the differential privacy of all edges are analyzed. This study also discusses a scenario without a data coordinator and offers some insights into graph publication.
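The data perturbation phase mentioned above relies on the Laplace mechanism, which can be sketched as follows. This is a minimal illustration, not the framework's actual procedure: it assumes an undirected graph stored as a symmetric adjacency (or edge-weight) matrix, a per-edge sensitivity of 1, and a hypothetical function name `laplace_perturb_edges`. The noise scale is the standard `sensitivity / epsilon`.

```python
import numpy as np

def laplace_perturb_edges(adj, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace(sensitivity / epsilon) noise to each undirected edge value.

    Assumptions (not from the source): symmetric matrix, no self-loops,
    and a sensitivity of 1 per edge.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    iu = np.triu_indices(n, k=1)  # upper triangle, excluding the diagonal
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=iu[0].size)
    noisy = np.zeros_like(adj)
    noisy[iu] = adj[iu] + noise   # perturb each edge once
    return noisy + noisy.T        # mirror to keep the matrix symmetric

# Toy usage on a 3-node path graph (assumed data).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
perturbed = laplace_perturb_edges(adj, epsilon=1.0, rng=np.random.default_rng(0))
```

Noise is drawn once per node pair and mirrored, so the perturbed matrix still describes an undirected graph. A smaller `epsilon` gives stronger privacy at the cost of larger noise.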
Privacy-preserving data release is an important problem for reconciling data openness with individual privacy. The state-of-the-art approach for privacy-preserving data release is differential privacy, which offers a strong privacy guarantee without restrictive assumptions about attackers’ background knowledge. For genomic data with very high-dimensional attributes, however, current approaches based on differential privacy are ineffective. Specifically, a large amount of noise must be injected into genomic data with tens of millions of SNPs (Single Nucleotide Polymorphisms), which significantly degrades the utility of the released data. To address this problem, this paper proposes a genomic data release method with a differential privacy guarantee. By executing belief propagation on a factor graph, our method factorizes the distribution of sensitive genomic data into a set of local distributions. After differential-privacy noise is injected into these local distributions, synthetic sensitive data can be obtained by sampling from the noisy distributions. The synthetic sensitive data and the factor graph are then used to construct an approximate distribution of the non-sensitive data. Finally, non-sensitive genomic data are sampled from the approximate distribution to construct a synthetic genomic dataset.
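The step of injecting noise into a local distribution and then sampling synthetic data can be sketched as follows. This is only an illustration of that single step under stated assumptions, not the paper's method: it does not perform belief propagation or build a factor graph. It assumes one local distribution is given as a histogram of genotype counts (values 0, 1, 2 for a single SNP), that the counts have sensitivity 1, and that negative noisy counts are clipped before normalizing. The names `noisy_distribution` and `sample_synthetic` are hypothetical.

```python
import numpy as np

def noisy_distribution(counts, epsilon, rng):
    """Add Laplace(1 / epsilon) noise to histogram counts, then normalize.

    Assumes sensitivity 1 for the counts (an assumption of this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)  # a probability cannot be negative
    total = noisy.sum()
    if total == 0:
        # Fall back to a uniform distribution if all mass was clipped away.
        return np.full(counts.shape, 1.0 / counts.size)
    return noisy / total

def sample_synthetic(probs, n_samples, rng):
    """Draw synthetic category indices from the noisy distribution."""
    return rng.choice(len(probs), size=n_samples, p=probs)

# Toy usage: genotype counts for one SNP (assumed data).
rng = np.random.default_rng(0)
probs = noisy_distribution([50, 30, 20], epsilon=1.0, rng=rng)
synthetic = sample_synthetic(probs, 10, rng)
```

Clipping and renormalizing are post-processing of the noisy counts, so they do not weaken the differential privacy guarantee of the noise injection itself.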