College of Computer Science and Engineering, Northeastern University, Shenyang 110000, China.
Abstract
Many scholars have focused on developing effective techniques for package queries, and many excellent approaches have been proposed. Unfortunately, most existing methods are designed for small volumes of data, and as data volumes grow rapidly, traditional package query methods struggle to meet the increasing requirements. To solve this problem, a novel optimization method for package queries (HPPQ) is proposed in this paper. First, the data is preprocessed into regions: preprocessing segments the dataset into multiple subsets, and the centroids of the subsets are used for package queries, which effectively reduces the volume of candidate results. Second, an efficient heuristic algorithm (namely IPOL-HS) is proposed based on the preprocessing results; it improves the quality of the candidate results in the iterative stage and the convergence rate of the heuristic algorithm. Finally, a strategy called HPR is proposed, which relies on a greedy algorithm and parallel processing to accelerate query processing. The experimental results show that our method significantly reduces time consumption compared with existing methods.
M. Shi, D. Shen, T. Nie, et al., HPPQ: A Parallel Package Queries Processing Approach for Large-Scale Data, Big Data Mining and Analytics, vol. 1, no. 2, pp. 146-159, 2018. https://doi.org/10.26599/BDMA.2018.9020014
Figure 2: The solution of improved partial opposition-based learning.
Figure 3: Harmony search based on improved partial opposition-based learning.
4.2.2 IPOL-HS algorithm
Figure 3 illustrates the HS execution procedure based on improved partial opposition-based learning, which obtains the optimal solution over the centroids. The dashed box indicates the phase in which a new candidate harmony is generated. Unlike basic HS, an improved partial opposition solution is constructed from the current best harmony, which accelerates the convergence of the algorithm.
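As a concrete illustration, the opposition step can be grafted onto a basic HS loop as in the sketch below. This is not the paper's IPOL-HS: the parameter values (`hms`, `hmcr`, `par`), the mirroring rate, and the reading of "partial" opposition as mirroring only a random subset of dimensions of the best harmony are all assumptions made for illustration.

```python
import random

def partial_opposition(best, lo, hi, rate=0.5):
    """Build a partial opposition solution from the current best harmony:
    each dimension is mirrored inside its bounds with probability `rate`,
    otherwise kept as-is (an illustrative reading of 'partial' OBL)."""
    return [lo[d] + hi[d] - best[d] if random.random() < rate else best[d]
            for d in range(len(best))]

def harmony_search(fitness, lo, hi, hms=10, hmcr=0.9, par=0.3, iters=200):
    """Basic harmony search (maximization) with an added opposition step."""
    dim = len(lo)
    # initialize harmony memory with random harmonies
    hm = [[random.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(hms)]
    for _ in range(iters):
        # improvise a new harmony: memory consideration, pitch adjustment,
        # or random selection per dimension
        new = []
        for d in range(dim):
            if random.random() < hmcr:
                v = random.choice(hm)[d]
                if random.random() < par:
                    v += random.uniform(-1, 1) * 0.01 * (hi[d] - lo[d])
            else:
                v = random.uniform(lo[d], hi[d])
            new.append(min(max(v, lo[d]), hi[d]))
        # build a second candidate by partial opposition of the current best
        best = max(hm, key=fitness)
        opp = partial_opposition(best, lo, hi)
        cand = max((new, opp), key=fitness)
        # replace the worst harmony if the candidate improves on it
        worst = min(hm, key=fitness)
        if fitness(cand) > fitness(worst):
            hm[hm.index(worst)] = cand
    return max(hm, key=fitness)
```

The opposition candidate only enters the memory when it beats the worst stored harmony, so the step can accelerate convergence without destabilizing the search.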
The framework of our method is shown in Algorithm 1.
Using the HS based on the improved partial opposition-based learning algorithm, the optimal solution over the centroids of the sub-regions is obtained. However, this is not yet the final solution to the package query. Therefore, the centroids of the sub-regions must be replaced with real data objects from the corresponding sub-regions to obtain the final solution.
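To make the overall pipeline concrete, here is a toy two-stage sketch of the centroid-then-replace idea. It is not the paper's HPPQ: the partitioning (a crude stride split), the stage-2 search (a greedy value/cost heuristic standing in for IPOL-HS), the item model (`(value, cost)` pairs under a single budget), and all names are illustrative assumptions.

```python
def hppq_sketch(items, k, budget):
    """Two-stage sketch of the centroid-then-replace idea: (1) partition the
    items into k groups, (2) pick per-group counts greedily on the group
    centroids, (3) replace each centroid pick with real items from its group.
    Assumes k <= len(items)."""
    items = sorted(items, key=lambda it: it[1])          # sort by cost
    groups = [items[i::k] for i in range(k)]             # crude partition
    cents = [(sum(v for v, _ in g) / len(g),             # centroid value
              sum(c for _, c in g) / len(g))             # centroid cost
             for g in groups]
    # stage 2: greedy counts on centroids (surrogate for the heuristic search)
    counts, spent = [0] * k, 0.0
    order = sorted(range(k), key=lambda i: cents[i][0] / cents[i][1], reverse=True)
    for i in order:
        while counts[i] < len(groups[i]) and spent + cents[i][1] <= budget:
            counts[i] += 1
            spent += cents[i][1]
    # stage 3: replace centroid picks with the best real items in each group
    picked, total_cost = [], 0.0
    for i in range(k):
        best = sorted(groups[i], key=lambda it: it[0] / it[1], reverse=True)[:counts[i]]
        for v, c in best:
            if total_cost + c <= budget:
                picked.append((v, c))
                total_cost += c
    return picked, total_cost
```

The point of the two stages is that stage 2 searches over k centroids instead of all items, which is what shrinks the candidate space; stage 3 then restores real data objects.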
4.3.1 Group allocation
Each sub-region whose centroid needs to be replaced is called a group. The set of groups produced by the HS based on the improved partial opposition-based learning algorithm is recorded as G = {g_1, g_2, ..., g_m}. The number of nodes in the cluster is n, and evidently m > n. Therefore, the groups need to be assigned to the nodes in the cluster. The allocation algorithm is based on the number of combinations generated in each group, i.e., C(|g_i|, c_i), the number of combinations when c_i data objects are selected from the set of |g_i| data objects in group g_i. To achieve load balancing, the calculation load of each node should be approximately the same.
The specific process of the group allocation algorithm is shown in Algorithm 2. The major steps are as follows.
First, eliminate the sub-regions that do not require a replacement operation, that is, those whose centroid components have a value of zero. All sub-regions whose centroids need to be replaced form the set of groups to be allocated across the nodes. The maximum number of combinations for each group to be replaced is calculated, and the groups are arranged in descending order of this number (line 1). The groups with the largest numbers of combinations are first allocated to the nodes in the cluster in turn. Then, each time, the unassigned group with the current maximum number of combinations is assigned to the node with the least calculation load (lines 2-7).
Figure 4 shows an example of how to allocate groups. Assume the optimal solution (2, 2, 2, 0, 0, 1, 1, 2) is obtained using the partitioning results shown in Fig. 1 and the HS based on the improved partial opposition-based learning algorithm, and that there are four nodes in the cluster. If the component corresponding to the centroid of a sub-region is zero, the data objects in that sub-region will not appear in the result set, so no replacement operation is required. The remaining six groups are sorted in descending order of their numbers of combinations, yielding the sequence group1, group2, group3, group6, group4, group5. First, the first four groups are assigned to the nodes in turn. Next, group4 is assigned to the node with the least number of combinations in the cluster at that point. Similarly, group5 is then assigned to the node with the least load.
Figure 4: The allocation process of groups.
4.3.2 Parallel processing
Inferior replacement programs for sub-problems are less likely to appear in the global optimal solution, while better replacement programs are more likely to appear in it. Thus, a set of better replacement programs can be obtained by heuristic methods, and inferior replacement programs can be eliminated directly. Although this reduces accuracy to a certain extent, it significantly reduces the number of replacement programs that need to be adjusted, and thereby the run time required for the query. By considering the interaction between the replacement programs of sub-problems using only some of the better replacement programs, the run time is further reduced.
The parallel scheme can be used to generate the optimal harmony memory for each group. However, the replacement program for each group will affect the replacement programs of the remaining groups. Therefore, before presenting the parallel processing algorithm, we first introduce a number of metrics that are used in the algorithm to coordinate the replacement programs for each group.
(1) Relative change in total value:
In Eq. (7), v(i, j) represents the objective function value of the j-th replacement program in the i-th group, and v(i, j') represents the objective function value of the j'-th replacement program in the i-th group. The metric represents the relative change in the objective function when the replacement program of the i-th group changes from j to j'.
(2) Relative change in resource consumption:
In Eq. (8), w(i, j, k) represents the k-th resource consumption of the j-th replacement program in the i-th group, and w(i, j', k) represents the k-th resource consumption of the j'-th replacement program in the i-th group. C_k represents the current consumption of the k-th resource, and C represents the current consumption of all resources.
(3) Scaled change of resource consumption for repairing unfeasible solutions:
In Eq. (9), e_k represents the portion by which the original replacement program exceeds the k-th resource constraint.
(4) Scaled change of resource consumption for optimizing feasible solutions:
(5) Proportionality coefficient for repairing unfeasible solutions:
(6) Proportionality coefficient for optimizing feasible solutions:
(7) Unfeasible coefficient:
In Eq. (13), if the unfeasible coefficient satisfies u_k ≤ 1 for every resource constraint, the current solution is feasible.
A parallel algorithm is proposed to efficiently replace the centroids of the sub-regions with the real data objects in the corresponding sub-regions. The main idea is as follows: each node uses the HS based on the improved partial opposition-based learning algorithm to search the optimal harmony memory of each group in parallel. The candidate solutions in each harmony memory are sorted in descending order of value, so the solution with the highest value in each group is in first place. The first-ranked solution of each group is taken as the initial solution based on a greedy strategy. Then, the solution is upgraded or downgraded according to the different metrics to find the optimal solution that satisfies the constraints.
The framework of our algorithm is shown in Algorithm 3. The major steps of the parallel processing algorithm are as follows.
Step 1. Greedily form the initial solution (lines 3-7). The solution in the first place of each group is taken as the initial solution. If the initial solution is feasible, the optimal result set is obtained and returned. Otherwise, Step 2 is executed.
Step 2. Calculate the metrics in parallel and reduce the value of the candidate solution to get a feasible solution (lines 8-17). For each group, calculate the proportionality coefficient for repairing unfeasible solutions of every replacement program that has less value than the group's current replacement program. Select the replacement program that is less valuable than the current one but has the largest global coefficient, and replace the current replacement program in the corresponding group with it. If the modified solution is feasible, Step 3 is executed; otherwise, Step 2 is repeated.
Step 3. Get a better solution than the current solution (lines 19-28). For any replacement program that has more value than the current replacement program, the relative change in total resource consumption cannot be negative; that is, there is no replacement program that has a greater value and consumes less total resources than the current one. Therefore, we try to improve the value of the current replacement program whenever the new solution obtained by the replacement operation remains feasible. If changing the current replacement program to a more valuable one keeps every unfeasible coefficient for the resources within its feasible range, calculate its proportionality coefficient for optimizing feasible solutions. Select the replacement program with the lowest global coefficient and replace the current replacement program in the corresponding group with it. If there is still a more valuable replacement program that does not make the feasible solution unfeasible, Step 3 is repeated; otherwise, stop and return the current optimal solution.
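The three steps can be sketched as the following greedy coordination loop. The metrics here are simplified ratios standing in for Eqs. (7)-(13), each group's program list is assumed to be its value-sorted harmony memory as described above, and all names are illustrative.

```python
def coordinate(groups, budgets):
    """Greedy sketch of Steps 1-3: each entry of `groups` is a list of
    replacement programs (value, [resource usages]) sorted by value,
    descending. Start from the best program of every group, downgrade while
    any resource budget is exceeded, then upgrade where budgets allow."""
    cur = [0] * len(groups)                      # chosen program index per group

    def usage(k):
        return sum(groups[i][cur[i]][1][k] for i in range(len(groups)))

    def feasible():
        return all(usage(k) <= budgets[k] for k in range(len(budgets)))

    # Step 2: repair -- move one group to its next (lower-value) program,
    # preferring the move that frees the most resource per unit of value lost.
    while not feasible():
        best, move = None, None
        for i, j in enumerate(cur):
            if j + 1 >= len(groups[i]):
                continue
            dv = groups[i][j][0] - groups[i][j + 1][0]            # value lost
            dr = sum(a - b for a, b in zip(groups[i][j][1], groups[i][j + 1][1]))
            score = dr / (dv + 1e-9)                              # freed per value
            if best is None or score > best:
                best, move = score, i
        if move is None:
            break                                                 # cannot repair
        cur[move] += 1

    # Step 3: improve -- undo a downgrade whenever the result stays feasible.
    improved = True
    while improved and feasible():
        improved = False
        for i in range(len(groups)):
            if cur[i] > 0:
                cur[i] -= 1                                       # tentative upgrade
                if feasible():
                    improved = True
                else:
                    cur[i] += 1                                   # roll back
    total = sum(groups[i][cur[i]][0] for i in range(len(groups)))
    return cur, total
```

The downgrade phase mirrors Step 2's "largest global coefficient" choice, and the upgrade phase mirrors Step 3's feasibility-preserving improvement, though both use simplified scores.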
Figure 5: Accuracy rate on different sizes of the real-world dataset.
Figure 6: Run time on different sizes of the real-world dataset.
Figure 7: Accuracy rate on different sizes of the synthetic dataset.
Figure 8: Run time on different sizes of the synthetic dataset.
Figure 9: Comparison of convergence on R1K.
Figure 10: Comparison of convergence on R10K.
Figure 11: Impact of threshold on the real-world dataset.
Figure 12: Impact of threshold on the synthetic dataset.