Data61, Commonwealth Scientific and Industrial Research Organization, Melbourne 3168, Australia
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541010, China
Abstract
With the development of information technology, massive amounts of data are generated every day. Collecting and analysing these data helps service providers improve their services and gain an advantage in fierce market competition. k-means clustering has been widely used for cluster analysis in real life. However, these analyses are based on users’ data, which disclose users’ privacy. Local differential privacy has attracted much attention recently due to its strong privacy guarantee and has been applied to clustering analysis. However, existing k-means clustering methods with local differential privacy protection cannot achieve an ideal clustering result due to the large amount of noise introduced into the whole dataset to ensure the privacy guarantee. To solve this problem, we propose a novel method that provides local distance privacy for users who participate in the clustering analysis. Instead of making the users’ records indistinguishable from each other in high-dimensional space, we map the user’s record into a one-dimensional distance space and make the records in that distance space indistinguishable from each other. Specifically, we generate a noisy distance first and then synthesize the high-dimensional data record. We propose a Bounded Laplace Method (BLM) and a Cluster Indistinguishable Method (CIM) to sample such a noisy distance, which satisfy the local differential privacy guarantee and the local dχ-privacy guarantee, respectively. Furthermore, we introduce a way to generate synthetic data records in high-dimensional space. Our experimental evaluation results show that our methods outperform the traditional methods significantly.
References
G. Guo and C. Altrjman, E-commerce customer segmentation method under improved k-means algorithm, in Proc. 4th International Conference on Multi-Modal Information Analytics, Hohhot, China, 2022, pp. 1083–1089.
M. Jebakumari, T. Palaniraja, K. A. Patrick, and Ashwini, Blocking of spam mail using k-means clustering algorithm, International Journal of Information Technology and Computer Engineering (IJITC), vol. 2, no. 3, pp. 19–24, 2022.
M. Dhalaria and E. Gandotra, Android malware detection techniques: A literature review, Recent Patents on Engineering, vol. 15, no. 2, pp. 225–245, 2021.
Y. Qu, S. Yu, W. Zhou, S. Chen, and J. Wu, Customizable reliable privacy-preserving data sharing in cyber-physical social networks, IEEE Transactions on Network Science and Engineering, vol. 8, no. 1, pp. 269–281, 2020.
U. Erlingsson, V. Pihur, and A. Korolova, RAPPOR: Randomized aggregatable privacy-preserving ordinal response, in Proc. 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 2014, pp. 1054–1067.
B. Ding, J. Kulkarni, and S. Yekhanin, Collecting telemetry data privately, in Proc. 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 2017, pp. 3574–3583.
C. Dwork and A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science, vol. 9, nos. 3&4, pp. 211–407, 2014.
S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, What can we learn privately? SIAM Journal on Computing, vol. 40, no. 3, pp. 793–826, 2011.
M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi, Geo-indistinguishability: Differential privacy for location-based systems, in Proc. 2013 ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 2013, pp. 901–914.
N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu, Collecting and analyzing multidimensional data with local differential privacy, in Proc. 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 2019, pp. 638–649.
J. C. Duchi, M. I. Jordan, and M. J. Wainwright, Minimax optimal procedures for locally private estimation, Journal of the American Statistical Association, vol. 113, no. 521, pp. 182–201, 2018.
A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical privacy: The SuLQ framework, in Proc. Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, MD, USA, 2005, pp. 128–138.
A. Smith, Privacy-preserving statistical estimation with optimal convergence rates, in Proc. 43rd Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 2011, pp. 813–822.
J. Zhang, X. Xiao, Y. Yang, Z. Zhang, and M. Winslett, PrivGene: Differentially private model fitting using genetic algorithms, in Proc. 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2013, pp. 665–676.
Z. Lu and H. Shen, A convergent differentially private k-means clustering algorithm, in Proc. 23rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, Macao, China, 2019, pp. 612–624.
D. Su, J. Cao, N. Li, E. Bertino, M. Lyu, and H. Jin, Differentially private k-means clustering and a hybrid approach to private optimization, ACM Transactions on Privacy and Security, vol. 20, no. 4, pp. 1–33, 2017.
M. Yang, I. Tjuawinata, and K. Y. Lam, K-means clustering with local dχ-privacy for privacy-preserving data analysis, IEEE Transactions on Information Forensics and Security, vol. 17, pp. 2524–2537, 2022.
M. Yang, L. Lyu, J. Zhao, T. Zhu, and K. Y. Lam, Local differential privacy and its applications: A comprehensive survey, arXiv preprint arXiv:2008.03686, 2020.
K. Nissim and U. Stemmer, Clustering algorithms for the centralized and local models, Proceedings of Algorithmic Learning Theory, vol. 83, pp. 619–653, 2018.
H. Kaplan and U. Stemmer, Differentially private k-means with constant multiplicative error, in Proc. 32nd International Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018, pp. 5436–5446.
U. Stemmer, Locally private k-means clustering, in Proc. Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Salt Lake City, UT, USA, 2020, pp. 548–559.
Yang M, Huang L, Tang C. K-Means Clustering with Local Distance Privacy. Big Data Mining and Analytics, 2023, 6(4): 433-442. https://doi.org/10.26599/BDMA.2022.9020050
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Traditional method
The traditional method to handle the k-means clustering algorithm under the distributed non-interactive setting is to utilize a traditional differentially private perturbation mechanism to perturb each attribute value independently. As shown in Algorithm 1, the privacy budget is split according to the number of attributes (Line 2). Then, each value is perturbed by a randomized algorithm with a split privacy budget (Lines 3–5). The perturbed data record is sent to the server (Line 6). The k-means clustering algorithm is performed on the perturbed data records collected from all users on the server side (Line 8).
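A minimal sketch of this baseline, assuming the budget is split evenly across attributes and each attribute is perturbed with a Laplace mechanism (the sensitivity value here is an illustrative placeholder, not the paper's exact setting):

```python
import numpy as np

def perturb_record_traditional(record, epsilon, sensitivity=1.0, rng=None):
    """Perturb each attribute independently with an even split of the
    privacy budget. Note how the Laplace scale grows with the number
    of attributes, which inflates the statistical variance."""
    rng = np.random.default_rng() if rng is None else rng
    record = np.asarray(record, dtype=float)
    d = len(record)
    eps_split = epsilon / d           # budget split across attributes
    scale = sensitivity / eps_split   # scale grows linearly with d
    noise = rng.laplace(loc=0.0, scale=scale, size=d)
    return record + noise
```

The per-attribute noise scale is d/ε times the sensitivity, which is why the variance blow-up worsens as the dimensionality increases.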
The traditional method perturbs the user’s data record using a split privacy budget, which increases the statistical variance significantly. Moreover, because the perturbation is performed independently on each attribute, it may not preserve the distance property well, and distance plays a critical role in the k-means clustering algorithm.
Our proposed method
Instead of perturbing each data value separately, we propose a novel solution that treats the data record as a whole and perturbs the distance property of the data record. We then introduce a method to synthesize a data record that preserves the noisy distance to the original true data record. Algorithm 2 shows the details of the proposed method.
Let d_max be the farthest distance between any two data records; then the distance from the perturbed data record to the true data record lies in the range [0, d_max]. As shown in Algorithm 2, the user produces the perturbed data record by first sampling a noisy distance following a specific distribution (Line 3), and then generating a synthetic data record whose distance to the user’s true data record equals the noisy distance (Line 5). The report is then sent to the server (Line 5), and the server performs the k-means algorithm after collecting reports from all users (Line 7).
The advantage of the proposed solution is that it perturbs the data record as one value. Thus, it does not need to split the privacy budget, which reduces the perturbation variance significantly. There are two key components in the proposed method: noisy distance sampling and synthetic data record generation. We discuss these two components in detail in the following sections.
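As an end-to-end sketch of the client-side pipeline, the following combines the two components with illustrative stand-ins: an exponential-decay distance sampler in place of BLM/CIM, and a closed-form synthesis (random direction scaled to the noisy distance) in place of the paper's iterative Algorithm 3:

```python
import numpy as np

def make_report(true_record, epsilon, d_max, rng=None):
    """Client-side sketch: (1) sample a noisy distance in [0, d_max],
    (2) synthesize a record at exactly that distance from the true
    record, in a uniformly random direction."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(true_record, dtype=float)
    # Step 1: noisy distance with density ~ exp(-epsilon * t / d_max),
    # sampled via the inverse CDF of the truncated exponential.
    lam = epsilon / d_max
    u = rng.uniform()
    delta = -np.log(1.0 - u * (1.0 - np.exp(-lam * d_max))) / lam
    # Step 2: place a point at distance delta in a random direction.
    direction = rng.normal(size=v.shape)
    direction /= np.linalg.norm(direction)
    return v + delta * direction
```

Because the whole record is perturbed through a single one-dimensional quantity, no budget splitting is needed.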
Noisy distance sampling
We utilize the one-dimensional distance property to describe the high-dimensional data space. Then all the data points in the domain can be described by their distance to the user’s true data record, as shown in Fig. 1.
Our purpose is to produce a data record whose distance to the user’s real data record cannot be distinguished from the distances of all other data records to the real data record. We say the perturbation method achieves distance privacy if the adversary cannot learn how far the perturbed data record is from the user’s real data record by observing the report. We propose two methods to achieve distance privacy: BLM and CIM.
Bounded Laplace method. BLM is similar to the traditional Laplace mechanism, but it is bounded and operates on the distance property instead of the attribute values. Specifically, given the privacy budget ε and the user’s real data record, the probability density of generating a data record at distance t from the real record is proportional to exp(−εt/d_max), normalized over t ∈ [0, d_max].
Since any possible data record has a distance in [0, d_max], the normalization factor ensures that the noisy distance is sampled within the range [0, d_max], and the whole sampling process satisfies the local differential privacy definition.
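A sketch of such a bounded sampler, assuming the density is proportional to exp(−εt/d_max) on [0, d_max], implemented by inverse-CDF sampling of the truncated exponential:

```python
import numpy as np

def sample_bounded_laplace_distance(epsilon, d_max, rng=None):
    """Sample a noisy distance in [0, d_max] whose density decays
    exponentially with distance. The truncation to [0, d_max] is what
    the normalization factor in the text accounts for."""
    rng = np.random.default_rng() if rng is None else rng
    lam = epsilon / d_max
    u = rng.uniform()
    # Inverse CDF of the exponential distribution truncated to [0, d_max]
    return -np.log(1.0 - u * (1.0 - np.exp(-lam * d_max))) / lam
```

Smaller ε flattens the density toward uniform on [0, d_max]; larger ε concentrates samples near zero, i.e., near the true record.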
Theorem 3 The proposed BLM satisfies ε-local differential privacy.
Proof
For any two real data records, the ratio of the probabilities of producing the same output is bounded by e^ε. Therefore, the proposed method provides an ε-local differential privacy guarantee.
■
Cluster indistinguishable method. Instead of treating all data records the same, CIM incorporates the distance metric into the privacy evaluation, a notion named dχ-privacy [12]. Specifically, it provides stronger privacy for data records that are closer to the user’s real data record and weaker privacy for records that are far away from it. Accordingly, the sampling density decays exponentially with the metric distance to the real record.
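A hedged sketch of a dχ-style sampler, assuming the density is proportional to exp(−εt) truncated to [0, d_max] (the paper's exact distribution may differ; the point is that the noise scale depends on ε alone rather than on ε/d_max):

```python
import numpy as np

def sample_cim_distance(epsilon, d_max, rng=None):
    """Sample a noisy distance whose density decays as exp(-epsilon * t):
    distances near the true record are strongly protected, while
    far-away distances receive progressively weaker protection."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform()
    # Inverse CDF of exp(-epsilon * t) truncated to [0, d_max]
    return -np.log(1.0 - u * (1.0 - np.exp(-epsilon * d_max))) / epsilon
```

Compared with the BLM sketch, the decay rate here is ε rather than ε/d_max, so the probability ratio between two candidate distances scales with their separation, matching the metric-based guarantee.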
Theorem 4 The proposed CIM satisfies local dχ-privacy.
Proof
For any two real data records x and x′, the ratio of the probabilities of producing the same output is bounded by exp(ε·d(x, x′)). Therefore, the proposed method provides a local dχ-privacy guarantee.
■
Synthetic data record generation
To recover the feature values of the perturbed data record, we model the data record generation process as an optimization process. Specifically, we force the generated synthetic data record to satisfy the restriction that its distance to the user’s real record equals the sampled noisy distance.
Specifically, we use the distance between the synthetic data record and the real data record to define a loss function, and keep updating the synthetic data record until the stop condition is satisfied. The details can be found in Algorithm 3.
We initialize a data record randomly (Line 1) and use the distance between the synthetic data record and the real data record as the loss function (error) (Line 3). We define the gap between this distance and the target noisy distance as the distance difference (Line 4). We continuously update the synthetic data record until the distance difference reaches the threshold (Lines 5–12).
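The iterative update described above can be sketched as follows, where the loss is the squared gap between the current distance and the target noisy distance (the learning rate, tolerance, and iteration cap are illustrative choices, not the paper's parameters):

```python
import numpy as np

def synthesize_record(true_record, target_distance, lr=0.1, tol=1e-6,
                      max_iters=10000, rng=None):
    """Start from a random point and take gradient steps on
    0.5 * (||x - v|| - target_distance)**2 until the distance
    difference falls below the tolerance."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(true_record, dtype=float)
    x = v + rng.normal(size=v.shape)       # random initialization
    for _ in range(max_iters):
        diff = x - v
        dist = np.linalg.norm(diff)
        err = dist - target_distance       # the distance difference
        if abs(err) < tol:
            break
        # Gradient of the loss w.r.t. x is err * diff / ||diff||
        x = x - lr * err * diff / max(dist, 1e-12)
    return x
```

Each step shrinks the distance difference by a factor of (1 − lr), so convergence is geometric for any starting point other than the true record itself.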
Remark We would like to remark that the proposed method ensures distance privacy: the attacker cannot tell how close the observed data record is to the user’s real data record. That is, the distances, rather than the data records themselves, are indistinguishable. Since the synthetic data record is generated completely at random and is only constrained by the distance property, some of the generated attribute values may fall outside of the data domain. The attacker can then infer that such a value is not the true value, but even so, the attacker cannot infer the real value.
[Fig. 2: Performance in terms of compactness.]
[Fig. 3: Performance in terms of NMI.]
[Fig. 4: Performance in terms of RE.]