College of Computer Science, Chongqing University, Chongqing 400044, China
Show Author Information
Hide Author Information
Abstract
Minable data publication is ubiquitous since it is beneficial to sharing/trading data among commercial companies and further facilitates the development of data-driven tasks. Unfortunately, the minable data publication is often implemented by publishers with limited privacy concerns such that the published dataset is minable by malicious entities. It prohibits minable data publication since the published data may contain sensitive information. Thus, it is urgently demanded to present some approaches and technologies for reducing the privacy leakage risks. To this end, in this paper, we propose an optimized sanitization approach for minable data publication (named as SA-MDP). SA-MDP supports association rules mining function while providing privacy protection for specific rules. In SA-MDP, we consider the trade-off between the data utility and the data privacy in the minable data publication problem. To address this problem, SA-MDP designs a customized particle swarm optimization (PSO) algorithm, where the optimization objective is determined by both the data utility and the data privacy. Specifically, we take advantage of PSO to produce new particles, which is achieved by random mutation or learning from the best particle. Hence, SA-MDP can avoid the solutions being trapped into local optima. Besides, we design a proper fitness function to guide the particles to run towards the optimal solution. Additionally, we present a preprocessing method before the evolution process of the customized PSO algorithm to improve the convergence rate. Finally, the proposed SA-MDP approach is performed and verified over several datasets. The experimental results have demonstrated the effectiveness and efficiency of SA-MDP.
I.Viktoratos, A.Tsadiras, and N.Bassiliades, Combining community-based knowledge with association rule mining to alleviate the cold start problem in context-aware recommender systems, Expert Syst. Appl., vol. 101, pp. 78-90, 2018.
X.Zheng, G.Luo, and Z.Cai, A fair mechanism for private data publication in online social networks, IEEE Trans. Netw. Sci. Eng., vol. 7, no. 2, pp. 880-891, 2020.
K.Zhang, Z.Tian, Z.Cai, and D.Seo, Link-privacy preserving graph embedding data publication with adversarial learning, Tsinghua Science and Technology, vol. 27, no. 2, pp. 244-256, 2022.
F.Yang, X.Lei, J.Le, N.Mu, and X.Liao, Minable data publication based on sensitive association rule hiding, IEEE Trans. Emerg. Top. Comput. Intell., vol. 14, no. 8. pp. 1-11, 2021.
X.Wu, V.Kumar, J. R.Quinlan, J.Ghosh, Q.Yang, H.Motoda, G. J.Mclachlan, A.Ng, B.Liu, P. S.Yu, et al., Top 10 algorithms in data mining, Knowl. Inf. Syst., vol. 14, no. 1, pp. 1-37, 2008.
F. N.Motlagh and H.Sajedi, MOSAR: A multi-objective strategy for hiding sensitive association rules using genetic algorithm, Appl. Artif. Intell., vol. 30, no. 9, pp. 823-843, 2016.
B.Talebi and N. M.Dehkordi, Sensitive association rules hiding using electromagnetic field optimization algorithm, Expert Syst. Appl., vol. 114, pp. 155-172, 2018.
M. H.Afshari, M. N.Dehkordi, and M.Akbari, Association rule hiding using cuckoo optimization algorithm, Expert Syst. Appl., vol. 64, pp. 340-351, 2016.
H.Pang and B.Wang, Privacy-preserving association rule mining using homomorphic encryption in a multikey environment, IEEE Syst. J., vol. 15, no. 2, pp. 3131-3141, 2021.
Z. H.Zhan, S. H.Wu, and J.Zhang, A new evolutionary computation framework for privacy-preserving optimization, in Proc. 2021 13 Int. Conf. on Advanced Computational Intelligence, Wanzhou, China, 2021, pp. 220-226.
G. M.Fan and H. J.Huang, A novel binary differential evolution algorithm for a class of fuzzy-stochastic resource allocation problems, in Proc. 13 IEEE Int. Conf. on Control and Automation, Ohrid, Macedonia, 2017, pp. 548-553.
I.Dinur and K.Nissim, Revealing information while preserving privacy, in Proc. of the 22 ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, San Diego, CA, USA, 2003, pp. 202-210.
S.Li, N.Mu, J.Le, and X.Liao, Privacy preserving frequent itemset mining: Maximizing data utility based on database reconstruction, Comput. Secur., vol. 84, pp. 17-34, 2019.
Z.Cai, Z.He, X.Guan, and Y.Li, Collective data-sanitization for preventing sensitive information inference attacks in social networks, IEEE Trans. Dependable Secure Comput., vol. 15, no. 4, pp. 577-590, 2018.
P.Huang, Y.Wang, K.Wang, and K.Yang, Differential evolution with a variable population size for deployment optimization in a UAV-assisted IoT data collection system, IEEE Trans. Emerg. Top. Comput. Intell., vol. 4, no. 3, pp. 324-335, 2020.
A.Khan, M. S.Qureshi, and A.Hussain, Improved genetic algorithm approach for sensitive association rules hiding, World Appl. Sci. J., vol. 31, no. 12, pp. 2087-2092, 2014.
U.Ahmed, J. C. W.Lin, G.Srivastava, R.Yasin, and Y.Djenouri, An evolutionary model to mine high expected utility patterns from uncertain databases, IEEE Trans. Emerg. Top. Comput. Intell., vol. 5, no. 1, pp. 19-28, 2021.
E. V.Altay and B.Alatas, Differential evolution and sine cosine algorithm based novel hybrid multi-objective approaches for numerical association rule mining, Inf. Sci., vol. 554, pp. 198-221, 2021.
L.Zhang, S.Yang, X.Wu, F.Cheng, Y.Xie, and Z.Lin, An indexed set representation based multi-objective evolutionary approach for mining diversified top-k high utility patterns, Eng. Appl. Artif. Intell., vol. 77, pp. 9-20, 2019.
P.Cheng, I.Lee, C. W.Lin, and J. S.Pan, Association rule hiding based on evolutionary multi-objective optimization, Intell. Data Anal., vol. 20, no. 3, pp. 495-514, 2016.
Yang F, Liao X. An Optimized Sanitization Approach for Minable Data Publication. Big Data Mining and Analytics, 2022, 5(3): 257-269. https://doi.org/10.26599/BDMA.2022.9020007
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
10.26599/BDMA.2022.9020007.F001
The flowchart of SA-MDP.
3 SA-MDP Design
In this section, we describe SA-MDP approach in detail. SA-MDP consists of four phases, i.e., preprocessing, initialization, evolution, and termination. The flowchart of SA-MDP is shown in
Fig. 1
, and the pseudo-code of SA-MDP is given as Algorithm 1.
10.26599/BDMA.2022.9020007.F001
The flowchart of SA-MDP.
3.1 Preprocessing
In this section, we introduce the preprocessing strategy. Technically, the data publisher can protect an SAR by reducing its Support or Confidence to drop below the related mining thresholds[
5
]. To this end, the data publisher can delete/add some items from/into the original dataset. Unfortunately, the mentioned methods cause the following three side-effects.
· Hiding-Failure. Some SARs still can be discovered from the original and the sanitized datasets.
· Lost-Rule. Some NARs can be discovered from the original dataset, but they are lost in the sanitized datasets.
· Fake-Rule. Some fake association rules (FARs) cannot be mined from the original dataset, but they are generated in the sanitized datasets.
These side-effects can be measured by the following three metrics.
· Hiding-Failure Rate (HFR): ,
· Lost-Rule Rate (LRR): ,
· Fake-Rule Rate (FRR):
where and are the sets of SARs and NARs, respectively. Meanwhile, and are the sets of SARs and NARs that obtained after sanitization process, respectively. In addition, denotes the number of elements in the related set, and means the complement of the set relative to set .
It is worth pointing out that the Support of any FARs will not be increased if we remove the items[
5
]. Hence, we prefer to hide SARs by removing items than randomly modifying the dataset. In addition, it will cause more serious side-effects if the data publisher changes the LHS of SARs other than the RHS of SARs. Thus, SA-MDP sanitizes the dataset by deleting some items in the RHS of SARs. Next, we give three definitions for preprocessing.
Definition 4. (Critical transaction). A transaction is a critical transaction () if it supports one or more SARs.
Specifically, “ supports ” means contains all the items of . Then, by Definition 4, critical transactions are composed of those SAR-related transactions, and the rest transactions are filtered out, so the solution space can be reduced.
Definition 5. (Sensitive item). An item is a sensitive item () if it is the only item in the RHS of SAR or it has the highest frequency in the RHS of SAR and the lowest frequency in NAR.
SA-MDP tends to remove the to hide SARs. That is to say, only the -related NARs will be affected. Therefore, we present the third definition.
Definition 6. (Critical rule). A critical rule () is the NAR which contains any .
According to Definition 6, SA-MDP can evaluate the LRR by evaluating the loss of . To sum up, the preprocessing can be summarized in three steps.
(1) SA-MDP removes useless information and leaves .
(2) SA-MDP computes the frequency of the RHS of SARs and finds .
(3) SA-MDP finds all NARs containing and labels them as .
After the preprocessing, SA-MDP can reduce the search space to improve the convergence rate.
Example. We continue to illustrate the preprocessing, step by step. We select the following two rules as SARs that need to be protected.
Then, the preprocessing is performed as follows.
First, based on Definition 4, SA-MDP removes 4 transactions, i.e., . The filtered out transactions are given in
Table 4
. Then, according to Definition 5, SA-MDP determines the item is the for the two SARs. Next, according to Definition 6, NARs containing are identified as the . Thus, we have 8 , which are shown in
Table 5
.
10.26599/BDMA.2022.9020007.T004
The critical transactions .
TID
Items
TID
Items
10.26599/BDMA.2022.9020007.T005
The critical rules . (%)
Rules
Support
Confidence
Rules
Support
Confidence
40
100.00
40
80.00
40
100.00
40
80.00
30
100.00
30
75.00
50
83.33
50
71.43
The data publisher can obtain that and . In other words, the presented preprocessing can reduce the amount of data and further reduce the search space.
3.2 Fitness function
In this section, we show how to design the fitness function based on the preprocessing strategy. Specifically, we design the fitness function based on the above-mentioned three side-effects, in which denotes HFR and denotes LRR. Especially, increases by 1 when one SAR in is failed to be protected and increases by 1 when one NAR in is lost after the hiding process. They are two coarse-grained evaluation metrics since and are changed by modifying multiple transactions. Therefore, to provide a fine-grained evaluation for the solution quality, we present a new evaluation metric, optimal sanitation distance (OSD) , which is calculated as follows:
where is calculated as
can reflect how much at least the Confidence or the Support needs to be decreased for hiding the -th SAR. Besides, is calculated as
shows how much at most the Confidence or the Support needs to be increased for avoiding the loss of the -th NAR. Recall that SA-MDP hides SAR by removing its RHS, no FAR will be generated. Thus, the FRR of SA-MDP identically equals 0. Considering three side-effects, the fitness function can be represented as
Especially, in this paper, we set . The data publisher can adjust the weights according to the practical demands. The fitness function provides a precise quantify of the solutions’ quality.
3.3 Initialization
In this section, we introduce the initialization method, which can be divided into the following three steps.
(1) Represent the critical transaction with a binary matrix . In the preprocessing, the data publisher has filtered out the critical transaction from according to Definition 4. Let denote the set of the critical transaction, which will be transformed into a binary matrix. In such matrix, the ()-th element equals 1, which means that the item exists in transaction .
(2) Generate an initial population . According to , the data publisher generates an initial particle . Then, to generate the initial population , the data publisher randomly selects the elements in , and those selected elements compromise a new particle. This process will perform times, where is the population size and can be empirically set. are constituted by the particles.
(3) Generate the initial best-previous position and the initial best-so-far position . SA-MDP calculates the fitness of each particle in . The particle having the smallest fitness is the initial best previous position , which also is the best-so-far position .
Before the evolution is run, SA-MDP can further reduce the dimension of solution space by controlling the modifying numbers. Specifically, according to the Support and Confidence of SARs, SA-MDP can determine the upper and the lower bound of the number of transactions as and , respectively. Let denote the maximum Support which needs to be reduced, and denotes the maximum Confidence which needs to be reduced. Suppose is the SAR that needs to be protected, we have to remove the -related items to hold that
or
To sum up, and can be calculated as follows.
where is the ceiling function.
Then, is obtained as
Based on , SA-MDP decreases the number of the modified transactions such that the data utility can be guaranteed to a certain degree. In SA-MDP, is set as the number of , i.e., .
Example. According to the above discussion, the initialization is achieved in the following three steps. The first step is to represent the obtained with a binary matrix. Note that contains 4 items (i.e., , , , and ), so can be represented by 011101. Besides, , , , , and are also denoted by a binary vector, respectively. The transformed binary matrix is obtained and is shown in
Table 6
. The second step is to generate an initial population . From the obtained matrix, we pick the -related columns as the data that needs to be removed. Let the transaction identifier of the removed items as the position of the particle. Thus, we can have an initial particle , which means not to remove any items. Then, we randomly select the transactions in to sanitize. For instance, and are selected. Then, the identifier of those transactions constitutes a new position of particle . We perform times and then obtain the initial population. The third step is to calculate the fitness of these initial particles. The particle having the smallest fitness is the initial best previous position , which also is the best-so-far position .
10.26599/BDMA.2022.9020007.T006
Represent with a binary matrix.
TID
0
1
1
1
0
1
0
1
0
1
0
1
1
1
1
1
1
1
0
0
1
1
1
1
1
0
1
0
0
1
0
1
0
1
0
1
3.4 Evolution
This section briefly introduces the evolution process of SA-MDP. The evolution includes: mother particle splitting, updating the velocity and the position, and updating the best positions. Suppose the current particle that needs to be updated is , the main steps are described as follows.
(1) Mother particle splitting. Suppose each mother particle can be split into several child particles owning to the same property with . Especially, let denote the maximum number of the child particles and denote the minimum number of the child particles. Thus, the number of the child particles of each mother particle can be computed as follows.
where and are set by the data publisher, is the floor function, and is a random number.
(2) Updating the velocity and the position. SA-MDP adopts two updating methods: (a) the directional learning and (b) the random splitting.
(a) The directional learning method. For the particle , , the specific particle updating process is performed as Eqs. (
16
) and (
17
), respectively.
The velocity of particles is updated as Eq. (
16
), where is the best-previous position and is the best-so-far solution. Accordingly, the position of particles is updated as Eq. (
17
), where null is an empty dimension and means no transactions will be modified, and is a randomly generated number.
(b) The random splitting method. For the particle , , during the specific particle updating process, SA-MDP first calculates the modifying distance as Eq. (
18
).
where is the total number of child particles that can be generated at most. Then, SA-MDP selects elements from the and changes them randomly.
(3) Updating the best positions ( and ). SA-MDP calculates the fitness and generates the new best-previous position . Then, if , SA-MDP updates the best-so-far solution by . Otherwise, there is no updating during this generation, then the value of is incremented by 1.
If the mother particle can be split into multiple child particles, it will improve the diversity of the solution and the exploration ability of the algorithm. Therefore, there are more opportunities to escape from the local optimal solution and find the optimal particle. However, any particles certainly cannot split infinitely. Hence, SA-MDP has and , which denote the maximum and minimum number that a particle can be split into, respectively. The number of child particles split from the mother particle is randomly generated as Eq. (
15
).
In Step 2, the velocity and the position are updated. Specifically, the SA-MDP approach proposed in this paper adopts two updating methods: one is the directional learning of particles, which can improve the exploitation ability of the algorithm; the other is random splitting of particles, which can improve the exploration ability of the algorithm. Besides, in order to adapt to the evolutionary environment, the position of the particles will be randomly deflected during the splitting process. The number of deflection dimensions of the particle position will be determined according to and .
Example. Suppose , and, . First, we let and be the values of and , respectively. Second, we update particle. By the directional learning method, the result of is and the result of is . Then, . On the other hand, a target set . If , we randomly select one dimension from the target set, i.e., . Then, we can obtain the updated position Moreover, by the randomly splitting method, SA-MDP calculates the number of dimensions that need to be changed for each child particle and . The data publisher can set and . Suppose is 2 for , SA-MDP can obtain . Third, after the user obtains new child particles, SA-MDP calculates the relevant fitness and updates the best positions and .
3.5 Termination
The termination is determined by two parameters, i.e., and . Namely, if or , SA-MDP terminates. Specifically, counts the number of times the best-so-far solution has not been updated and denotes the number of generations. SA-MDP will repeat until at least one of these two termination conditions is satisfied.
10.26599/BDMA.2022.9020007.F002
Evolution processes of SA-MDP and COA4ARH on (a) Chess, (b) Mushroom, and (c) BMS1 datasets.
10.26599/BDMA.2022.9020007.F003
HFR comparison of SA-MDP and COA4ARH.
10.26599/BDMA.2022.9020007.F004
LRR comparison of SA-MDP and COA4ARH.
10.26599/BDMA.2022.9020007.F005
Running time comparison of SA-MDP and COA4ARH.