Data61, Commonwealth Scientific and Industrial Research Organization, Melbourne 3168, Australia
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541010, China
Abstract
With the development of information technology, massive amounts of data are generated every day. Collecting and analysing these data helps service providers improve their services and gain an advantage in fierce market competition. k-means clustering has been widely used for cluster analysis in real life. However, these analyses are based on users’ data, which disclose users’ privacy. Local differential privacy has attracted much attention recently due to its strong privacy guarantee and has been applied to clustering analysis. However, existing k-means clustering methods with local differential privacy protection cannot achieve an ideal clustering result due to the large amount of noise introduced into the whole dataset to ensure the privacy guarantee. To solve this problem, we propose a novel method that provides local distance privacy for users who participate in the clustering analysis. Instead of making the users’ records indistinguishable from each other in high-dimensional space, we map the user’s record into a one-dimensional distance space and make the records in that distance space indistinguishable from each other. Specifically, we generate a noisy distance first and then synthesize the high-dimensional data record. We propose a Bounded Laplace Method (BLM) and a Cluster Indistinguishable Method (CIM) to sample such a noisy distance, which satisfy the local differential privacy guarantee and the local dχ-privacy guarantee, respectively. Furthermore, we introduce a way to generate synthetic data records in high-dimensional space. Our experimental evaluation results show that our methods outperform the traditional methods significantly.
References
G. Guo and C. Altrjman, E-commerce customer segmentation method under improved k-means algorithm, in Proc. 4th International Conference on Multi-Modal Information Analytics, Hohhot, China, 2022, pp. 1083–1089.
M. Jebakumari, T. Palaniraja, K. A. Patrick, and Ashwini, Blocking of spam mail using k-means clustering algorithm, International Journal of Information Technology and Computer Engineering (IJITC), vol. 2, no. 3, pp. 19–24, 2022.
M. Dhalaria and E. Gandotra, Android malware detection techniques: A literature review, Recent Patents on Engineering, vol. 15, no. 2, pp. 225–245, 2021.
Y. Qu, S. Yu, W. Zhou, S. Chen, and J. Wu, Customizable reliable privacy-preserving data sharing in cyber-physical social networks, IEEE Transactions on Network Science and Engineering, vol. 8, no. 1, pp. 269–281, 2020.
U. Erlingsson, V. Pihur, and A. Korolova, RAPPOR: Randomized aggregatable privacy-preserving ordinal response, in Proc. 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 2014, pp. 1054–1067.
B. Ding, J. Kulkarni, and S. Yekhanin, Collecting telemetry data privately, in Proc. 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 2017, pp. 3574–3583.
C. Dwork and A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science, vol. 9, nos. 3&4, pp. 211–407, 2014.
S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, What can we learn privately? SIAM Journal on Computing, vol. 40, no. 3, pp. 793–826, 2011.
M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi, Geo-indistinguishability: Differential privacy for location-based systems, in Proc. 2013 ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 2013, pp. 901–914.
N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu, Collecting and analyzing multidimensional data with local differential privacy, in Proc. 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 2019, pp. 638–649.
J. C. Duchi, M. I. Jordan, and M. J. Wainwright, Minimax optimal procedures for locally private estimation, Journal of the American Statistical Association, vol. 113, no. 521, pp. 182–201, 2018.
A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical privacy: The SuLQ framework, in Proc. Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, MD, USA, 2005, pp. 128–138.
A. Smith, Privacy-preserving statistical estimation with optimal convergence rates, in Proc. 43rd Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 2011, pp. 813–822.
J. Zhang, X. Xiao, Y. Yang, Z. Zhang, and M. Winslett, PrivGene: Differentially private model fitting using genetic algorithms, in Proc. 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2013, pp. 665–676.
Z. Lu and H. Shen, A convergent differentially private k-means clustering algorithm, in Proc. 23rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, Macao, China, 2019, pp. 612–624.
D. Su, J. Cao, N. Li, E. Bertino, M. Lyu, and H. Jin, Differentially private k-means clustering and a hybrid approach to private optimization, ACM Transactions on Privacy and Security, vol. 20, no. 4, pp. 1–33, 2017.
M. Yang, I. Tjuawinata, and K. Y. Lam, K-means clustering with local dχ-privacy for privacy-preserving data analysis, IEEE Transactions on Information Forensics and Security, vol. 17, pp. 2524–2537, 2022.
M. Yang, L. Lyu, J. Zhao, T. Zhu, and K. Y. Lam, Local differential privacy and its applications: A comprehensive survey, arXiv preprint arXiv:2008.03686, 2020.
K. Nissim and U. Stemmer, Clustering algorithms for the centralized and local models, Proceedings of Algorithmic Learning Theory, vol. 83, pp. 619–653, 2018.
H. Kaplan and U. Stemmer, Differentially private k-means with constant multiplicative error, in Proc. 32nd International Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018, pp. 5436–5446.
U. Stemmer, Locally private k-means clustering, in Proc. Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Salt Lake City, UT, USA, 2020, pp. 548–559.
Yang M, Huang L, Tang C. K-Means Clustering with Local Distance Privacy. Big Data Mining and Analytics, 2023, 6(4): 433-442. https://doi.org/10.26599/BDMA.2022.9020050
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Traditional method
The traditional method to handle the k-means clustering algorithm under the distributed non-interactive setting is to utilize a traditional differentially private perturbation mechanism to perturb each attribute value independently. As shown in Algorithm 1, the privacy budget is split according to the number of attributes (Line 2). Then, each value is perturbed by a randomized algorithm with a split privacy budget (Lines 3–5). The perturbed data record is sent to the server (Line 6). The k-means clustering algorithm is performed on the perturbed data records collected from all users on the server side (Line 8).
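A minimal sketch of this baseline, assuming the budget is split evenly across attributes and each attribute is perturbed with a Laplace mechanism (the sensitivity value here is an illustrative placeholder, not the paper's exact setting):

```python
import numpy as np

def perturb_record_traditional(record, epsilon, sensitivity=1.0, rng=None):
    """Perturb each attribute independently with an even split of the
    privacy budget. Note how the Laplace scale grows with the number
    of attributes, which inflates the statistical variance."""
    rng = np.random.default_rng() if rng is None else rng
    record = np.asarray(record, dtype=float)
    d = len(record)
    eps_split = epsilon / d           # budget split across attributes
    scale = sensitivity / eps_split   # scale grows linearly with d
    noise = rng.laplace(loc=0.0, scale=scale, size=d)
    return record + noise
```

The per-attribute noise scale is d/ε times the sensitivity, which is why the variance blow-up worsens as the dimensionality increases.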
The traditional method perturbs the user’s data record using a split privacy budget, which increases the statistical variance significantly. Moreover, because the perturbation is performed independently on each attribute, it may not preserve the distance property well, and distance plays a critical role in the k-means clustering algorithm.
Our proposed method
Instead of perturbing each data value separately, we propose a novel solution that treats the data record as a whole and perturbs the distance property of the data record. We then introduce a method to synthesize a data record that preserves the noisy distance to the original true data record. Algorithm 2 shows the details of the proposed method.
Let d_max be the farthest distance between any two data records; then the distance from the perturbed data record to the true data record lies in the range [0, d_max]. As shown in Algorithm 2, the user produces the perturbed data record by first sampling a noisy distance following a specific distribution (Line 3), and then generating a synthetic data record whose distance to the user’s true data record equals the noisy distance (Line 5). The report is then sent to the server (Line 5), and the server performs the k-means algorithm after collecting reports from all users (Line 7).
The advantage of the proposed solution is that it perturbs the data record as one value. Thus, it does not need to split the privacy budget, which reduces the perturbation variance significantly. There are two key components in the proposed method: noisy distance sampling and synthetic data record generation. We discuss these two components in detail in the following sections.
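As an end-to-end sketch of the client-side pipeline, the following combines the two components with illustrative stand-ins: an exponential-decay distance sampler in place of BLM/CIM, and a closed-form synthesis (random direction scaled to the noisy distance) in place of the paper's iterative Algorithm 3:

```python
import numpy as np

def make_report(true_record, epsilon, d_max, rng=None):
    """Client-side sketch: (1) sample a noisy distance in [0, d_max],
    (2) synthesize a record at exactly that distance from the true
    record, in a uniformly random direction."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(true_record, dtype=float)
    # Step 1: noisy distance with density ~ exp(-epsilon * t / d_max),
    # sampled via the inverse CDF of the truncated exponential.
    lam = epsilon / d_max
    u = rng.uniform()
    delta = -np.log(1.0 - u * (1.0 - np.exp(-lam * d_max))) / lam
    # Step 2: place a point at distance delta in a random direction.
    direction = rng.normal(size=v.shape)
    direction /= np.linalg.norm(direction)
    return v + delta * direction
```

Because the whole record is perturbed through a single one-dimensional quantity, no budget splitting is needed.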
Noisy distance sampling
We utilize the one-dimensional distance property to describe the high-dimensional data space. Then all the data points in the domain can be described by their distance to the user’s true data record, as shown in Fig. 1.
Our purpose is to produce a data record whose distance to the user’s real data record cannot be distinguished from the distances of all other data records to the real data record. We say the perturbation method achieves distance privacy if the adversary cannot learn how far the perturbed data record is from the user’s real data record by observing the report. We propose two methods to achieve distance privacy: BLM and CIM.
Bounded Laplace method. BLM is similar to the traditional Laplace mechanism, but it is bounded and operates on the distance property instead of the attribute values. Specifically, given the privacy budget ε and the user’s real data record, the probability density of generating a data record at distance t from the real record is proportional to exp(−εt/d_max), normalized over t ∈ [0, d_max].
Since any possible data record has a distance in [0, d_max], the normalization factor ensures that the noisy distance is sampled within the range [0, d_max], and the whole sampling process satisfies the local differential privacy definition.
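A sketch of such a bounded sampler, assuming the density is proportional to exp(−εt/d_max) on [0, d_max], implemented by inverse-CDF sampling of the truncated exponential:

```python
import numpy as np

def sample_bounded_laplace_distance(epsilon, d_max, rng=None):
    """Sample a noisy distance in [0, d_max] whose density decays
    exponentially with distance. The truncation to [0, d_max] is what
    the normalization factor in the text accounts for."""
    rng = np.random.default_rng() if rng is None else rng
    lam = epsilon / d_max
    u = rng.uniform()
    # Inverse CDF of the exponential distribution truncated to [0, d_max]
    return -np.log(1.0 - u * (1.0 - np.exp(-lam * d_max))) / lam
```

Smaller ε flattens the density toward uniform on [0, d_max]; larger ε concentrates samples near zero, i.e., near the true record.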
Theorem 3 The proposed BLM satisfies ε-local differential privacy.
Proof
For any two real data records, the ratio of the probabilities of producing the same output is bounded by e^ε. Therefore, the proposed method provides an ε-local differential privacy guarantee.
■
Cluster indistinguishable method. Instead of treating all data records the same, CIM incorporates the distance metric into the privacy evaluation, a notion named dχ-privacy [12]. Specifically, it provides stronger privacy for data records that are closer to the user’s real data record and weaker privacy for records that are far away from it. Accordingly, the sampling density decays exponentially with the metric distance to the real record.
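A hedged sketch of a dχ-style sampler, assuming the density is proportional to exp(−εt) truncated to [0, d_max] (the paper's exact distribution may differ; the point is that the noise scale depends on ε alone rather than on ε/d_max):

```python
import numpy as np

def sample_cim_distance(epsilon, d_max, rng=None):
    """Sample a noisy distance whose density decays as exp(-epsilon * t):
    distances near the true record are strongly protected, while
    far-away distances receive progressively weaker protection."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform()
    # Inverse CDF of exp(-epsilon * t) truncated to [0, d_max]
    return -np.log(1.0 - u * (1.0 - np.exp(-epsilon * d_max))) / epsilon
```

Compared with the BLM sketch, the decay rate here is ε rather than ε/d_max, so the probability ratio between two candidate distances scales with their separation, matching the metric-based guarantee.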
Theorem 4 The proposed CIM satisfies local dχ-privacy.
Proof
For any two real data records x and x′, the ratio of the probabilities of producing the same output is bounded by exp(ε·d(x, x′)). Therefore, the proposed method provides a local dχ-privacy guarantee.
■
Synthetic data record generation
To recover the feature values of the perturbed data record, we model the data record generation process as an optimization process. Specifically, we force the generated synthetic data record to satisfy the restriction that its distance to the user’s real record equals the sampled noisy distance.
Specifically, we use the distance between the synthetic data record and the real data record to define a loss function, and keep updating the synthetic data record until the stop condition is satisfied. The details can be found in Algorithm 3.
We initialize a data record randomly (Line 1) and use the distance between the synthetic data record and the real data record as the loss function (error) (Line 3). We define the gap between this distance and the target noisy distance as the distance difference (Line 4). We continuously update the synthetic data record until the distance difference reaches the threshold (Lines 5–12).
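The iterative update described above can be sketched as follows, where the loss is the squared gap between the current distance and the target noisy distance (the learning rate, tolerance, and iteration cap are illustrative choices, not the paper's parameters):

```python
import numpy as np

def synthesize_record(true_record, target_distance, lr=0.1, tol=1e-6,
                      max_iters=10000, rng=None):
    """Start from a random point and take gradient steps on
    0.5 * (||x - v|| - target_distance)**2 until the distance
    difference falls below the tolerance."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(true_record, dtype=float)
    x = v + rng.normal(size=v.shape)       # random initialization
    for _ in range(max_iters):
        diff = x - v
        dist = np.linalg.norm(diff)
        err = dist - target_distance       # the distance difference
        if abs(err) < tol:
            break
        # Gradient of the loss w.r.t. x is err * diff / ||diff||
        x = x - lr * err * diff / max(dist, 1e-12)
    return x
```

Each step shrinks the distance difference by a factor of (1 − lr), so convergence is geometric for any starting point other than the true record itself.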
Remark We would like to remark that the proposed method ensures distance privacy: the attacker cannot tell how close the observed data record is to the user’s real data record. That is, the distances, rather than the data records themselves, are indistinguishable. Since the synthetic data record is generated completely at random and is only constrained by the distance property, some of the generated attribute values may fall outside of the data domain. The attacker can then infer that such a value is not the true value, but even so, the attacker cannot infer the real value.
[Fig. 2: Performance in terms of compactness.]
[Fig. 3: Performance in terms of NMI.]
[Fig. 4: Performance in terms of RE.]