College of Computer Science and Engineering, Northeastern University, Shenyang 110000, China.
Abstract
Many scholars have focused on developing effective techniques for package queries, and many excellent approaches have been proposed. Unfortunately, most existing methods are designed for small volumes of data, and as data volumes grow rapidly, traditional package query methods struggle to meet the increasing requirements. To solve this problem, a novel optimization method for package queries (HPPQ) is proposed in this paper. First, the data is preprocessed into regions: preprocessing segments the dataset into multiple subsets, and the centroids of the subsets are used for package queries, which effectively reduces the volume of candidate results. Second, an efficient heuristic algorithm (namely IPOL-HS) is proposed based on the preprocessing results; it improves the quality of the candidate results in the iterative stage and the convergence rate of the heuristic algorithm. Finally, a strategy called HPR is proposed, which relies on a greedy algorithm and parallel processing to accelerate query processing. The experimental results show that our method significantly reduces time consumption compared with existing methods.
M. Shi, D. Shen, T. Nie, et al., HPPQ: A Parallel Package Queries Processing Approach for Large-Scale Data, Big Data Mining and Analytics, vol. 1, no. 2, pp. 146-159, 2018. https://doi.org/10.26599/BDMA.2018.9020014
Figure 2: The solution of improved partial opposition-based learning.
Figure 3: Harmony search based on improved partial opposition-based learning.
4.2.2 IPOL-HS algorithm
Figure 3 illustrates the HS execution procedure based on improved partial opposition-based learning, which obtains the optimal solution over the centroids. The dashed box indicates the phase in which a new candidate harmony is generated. Unlike basic HS, an improved partial opposition solution is constructed from the current best harmony, which accelerates the convergence of the algorithm.
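As a concrete illustration, the opposition step can be grafted onto a basic HS loop as in the sketch below. This is not the paper's IPOL-HS: the parameter values (`hms`, `hmcr`, `par`), the mirroring rate, and the reading of "partial" opposition as mirroring only a random subset of dimensions of the best harmony are all assumptions made for illustration.

```python
import random

def partial_opposition(best, lo, hi, rate=0.5):
    """Build a partial opposition solution from the current best harmony:
    each dimension is mirrored inside its bounds with probability `rate`,
    otherwise kept as-is (an illustrative reading of 'partial' OBL)."""
    return [lo[d] + hi[d] - best[d] if random.random() < rate else best[d]
            for d in range(len(best))]

def harmony_search(fitness, lo, hi, hms=10, hmcr=0.9, par=0.3, iters=200):
    """Basic harmony search (maximization) with an added opposition step."""
    dim = len(lo)
    # initialize harmony memory with random harmonies
    hm = [[random.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(hms)]
    for _ in range(iters):
        # improvise a new harmony: memory consideration, pitch adjustment,
        # or random selection per dimension
        new = []
        for d in range(dim):
            if random.random() < hmcr:
                v = random.choice(hm)[d]
                if random.random() < par:
                    v += random.uniform(-1, 1) * 0.01 * (hi[d] - lo[d])
            else:
                v = random.uniform(lo[d], hi[d])
            new.append(min(max(v, lo[d]), hi[d]))
        # build a second candidate by partial opposition of the current best
        best = max(hm, key=fitness)
        opp = partial_opposition(best, lo, hi)
        cand = max((new, opp), key=fitness)
        # replace the worst harmony if the candidate improves on it
        worst = min(hm, key=fitness)
        if fitness(cand) > fitness(worst):
            hm[hm.index(worst)] = cand
    return max(hm, key=fitness)
```

The opposition candidate only enters the memory when it beats the worst stored harmony, so the step can accelerate convergence without destabilizing the search.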
The framework of our method is shown in Algorithm 1.
Using the HS based on the improved partial opposition-based learning algorithm, the optimal solution over the centroids of the sub-regions is obtained. However, this is not yet the final solution to the package query. Therefore, the centroids of the sub-regions must be replaced with real data objects from the corresponding sub-regions to obtain the final solution.
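To make the overall pipeline concrete, here is a toy two-stage sketch of the centroid-then-replace idea. It is not the paper's HPPQ: the partitioning (a crude stride split), the stage-2 search (a greedy value/cost heuristic standing in for IPOL-HS), the item model (`(value, cost)` pairs under a single budget), and all names are illustrative assumptions.

```python
def hppq_sketch(items, k, budget):
    """Two-stage sketch of the centroid-then-replace idea: (1) partition the
    items into k groups, (2) pick per-group counts greedily on the group
    centroids, (3) replace each centroid pick with real items from its group.
    Assumes k <= len(items)."""
    items = sorted(items, key=lambda it: it[1])          # sort by cost
    groups = [items[i::k] for i in range(k)]             # crude partition
    cents = [(sum(v for v, _ in g) / len(g),             # centroid value
              sum(c for _, c in g) / len(g))             # centroid cost
             for g in groups]
    # stage 2: greedy counts on centroids (surrogate for the heuristic search)
    counts, spent = [0] * k, 0.0
    order = sorted(range(k), key=lambda i: cents[i][0] / cents[i][1], reverse=True)
    for i in order:
        while counts[i] < len(groups[i]) and spent + cents[i][1] <= budget:
            counts[i] += 1
            spent += cents[i][1]
    # stage 3: replace centroid picks with the best real items in each group
    picked, total_cost = [], 0.0
    for i in range(k):
        best = sorted(groups[i], key=lambda it: it[0] / it[1], reverse=True)[:counts[i]]
        for v, c in best:
            if total_cost + c <= budget:
                picked.append((v, c))
                total_cost += c
    return picked, total_cost
```

The point of the two stages is that stage 2 searches over k centroids instead of all items, which is what shrinks the candidate space; stage 3 then restores real data objects.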
4.3.1 Group allocation
Each sub-region whose centroid needs to be replaced is called a group. The set of groups produced by the HS based on the improved partial opposition-based learning algorithm is recorded as G = {g_1, g_2, ..., g_m}. The number of nodes in the cluster is n, and evidently m > n. Therefore, the groups need to be assigned to the nodes in the cluster. The allocation algorithm is based on the number of combinations generated in each group, i.e., C(|g_i|, c_i), the number of combinations when c_i data objects are selected from the set of |g_i| data objects in group g_i. To achieve load balancing, the calculation load of each node should be approximately the same.
The specific process of the group allocation algorithm is shown in Algorithm 2. The major steps are as follows.
First, eliminate the sub-regions that do not require a replacement operation, that is, those whose centroid components have a value of zero. All sub-regions whose centroids need to be replaced form the set of groups to be allocated across the nodes. The maximum number of combinations for each group to be replaced is calculated, and the groups are arranged in descending order of this number (line 1). The groups with the largest numbers of combinations are first allocated to the nodes in the cluster in turn. Then, each time, the unassigned group with the current maximum number of combinations is assigned to the node with the least calculation load (lines 2-7).
Figure 4 shows an example of how to allocate groups. Assume the optimal solution (2, 2, 2, 0, 0, 1, 1, 2) is obtained using the partitioning results shown in Fig. 1 and the HS based on the improved partial opposition-based learning algorithm, and that there are four nodes in the cluster. If the component corresponding to the centroid of a sub-region is zero, the data objects in that sub-region will not appear in the result set, so no replacement operation is required. The remaining six groups are sorted in descending order of their numbers of combinations, yielding the sequence group1, group2, group3, group6, group4, group5. First, the first four groups are assigned to the nodes in turn. Next, group4 is assigned to the node with the least number of combinations in the cluster at that point. Similarly, group5 is then assigned to the node with the least load.
Figure 4: The allocation process of groups.
4.3.2 Parallel processing
Inferior replacement programs for sub-problems are less likely to appear in the global optimal solution, while better replacement programs are more likely to appear in it. Thus, a set of better replacement programs can be obtained by heuristic methods, and inferior replacement programs can be eliminated directly. Although this reduces accuracy to a certain extent, it significantly reduces the number of replacement programs that need to be adjusted, and thereby the run time required for the query. By considering the interaction between the replacement programs of sub-problems using only some of the better replacement programs, the run time is further reduced.
The parallel scheme can be used to generate the optimal harmony memory for each group. However, the replacement program for each group will affect the replacement programs of the remaining groups. Therefore, before presenting the parallel processing algorithm, we first introduce a number of metrics that are used in the algorithm to coordinate the replacement programs for each group.
(1) Relative change in total value:
In Eq. (7), v(i, j) represents the objective function value of the j-th replacement program in the i-th group, and v(i, j') represents the objective function value of the j'-th replacement program in the i-th group. The metric represents the relative change in the objective function when the replacement program of the i-th group changes from j to j'.
(2) Relative change in resource consumption:
In Eq. (8), w(i, j, k) represents the k-th resource consumption of the j-th replacement program in the i-th group, and w(i, j', k) represents the k-th resource consumption of the j'-th replacement program in the i-th group. C_k represents the current consumption of the k-th resource, and C represents the current consumption of all resources.
(3) Scaled change of resource consumption for repairing unfeasible solutions:
In Eq. (9), e_k represents the portion by which the original replacement program exceeds the k-th resource constraint.
(4) Scaled change of resource consumption for optimizing feasible solutions:
(5) Proportionality coefficient for repairing unfeasible solutions:
(6) Proportionality coefficient for optimizing feasible solutions:
(7) Unfeasible coefficient:
In Eq. (13), if the unfeasible coefficient satisfies u_k ≤ 1 for every resource constraint, the current solution is feasible.
A parallel algorithm is proposed to efficiently replace the centroids of the sub-regions with the real data objects in the corresponding sub-regions. The main idea is as follows: each node uses the HS based on the improved partial opposition-based learning algorithm to search the optimal harmony memory of each group in parallel. The candidate solutions in each harmony memory are sorted in descending order of value, so the solution with the highest value in each group is in first place. The first-ranked solution of each group is taken as the initial solution based on a greedy strategy. Then, the solution is upgraded or downgraded according to the different metrics to find the optimal solution that satisfies the constraints.
The framework of our algorithm is shown in Algorithm 3. The major steps of the parallel processing algorithm are as follows.
Step 1. Greedily form the initial solution (lines 3-7). The solution in the first place of each group is taken as the initial solution. If the initial solution is feasible, the optimal result set is obtained and returned. Otherwise, Step 2 is executed.
Step 2. Calculate the metrics in parallel and reduce the value of the candidate solution to get a feasible solution (lines 8-17). For each group, calculate the proportionality coefficient for repairing unfeasible solutions of every replacement program that has less value than the group's current replacement program. Select the replacement program that is less valuable than the current one but has the largest global coefficient, and replace the current replacement program in the corresponding group with it. If the modified solution is feasible, Step 3 is executed; otherwise, Step 2 is repeated.
Step 3. Get a better solution than the current solution (lines 19-28). For any replacement program that has more value than the current replacement program, the relative change in total resource consumption cannot be negative; that is, there is no replacement program that has a greater value and consumes less total resources than the current one. Therefore, we try to improve the value of the current replacement program whenever the new solution obtained by the replacement operation remains feasible. If changing the current replacement program to a more valuable one keeps every unfeasible coefficient for the resources within its feasible range, calculate its proportionality coefficient for optimizing feasible solutions. Select the replacement program with the lowest global coefficient and replace the current replacement program in the corresponding group with it. If there is still a more valuable replacement program that does not make the feasible solution unfeasible, Step 3 is repeated; otherwise, stop and return the current optimal solution.
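The three steps can be sketched as the following greedy coordination loop. The metrics here are simplified ratios standing in for Eqs. (7)-(13), each group's program list is assumed to be its value-sorted harmony memory as described above, and all names are illustrative.

```python
def coordinate(groups, budgets):
    """Greedy sketch of Steps 1-3: each entry of `groups` is a list of
    replacement programs (value, [resource usages]) sorted by value,
    descending. Start from the best program of every group, downgrade while
    any resource budget is exceeded, then upgrade where budgets allow."""
    cur = [0] * len(groups)                      # chosen program index per group

    def usage(k):
        return sum(groups[i][cur[i]][1][k] for i in range(len(groups)))

    def feasible():
        return all(usage(k) <= budgets[k] for k in range(len(budgets)))

    # Step 2: repair -- move one group to its next (lower-value) program,
    # preferring the move that frees the most resource per unit of value lost.
    while not feasible():
        best, move = None, None
        for i, j in enumerate(cur):
            if j + 1 >= len(groups[i]):
                continue
            dv = groups[i][j][0] - groups[i][j + 1][0]            # value lost
            dr = sum(a - b for a, b in zip(groups[i][j][1], groups[i][j + 1][1]))
            score = dr / (dv + 1e-9)                              # freed per value
            if best is None or score > best:
                best, move = score, i
        if move is None:
            break                                                 # cannot repair
        cur[move] += 1

    # Step 3: improve -- undo a downgrade whenever the result stays feasible.
    improved = True
    while improved and feasible():
        improved = False
        for i in range(len(groups)):
            if cur[i] > 0:
                cur[i] -= 1                                       # tentative upgrade
                if feasible():
                    improved = True
                else:
                    cur[i] += 1                                   # roll back
    total = sum(groups[i][cur[i]][0] for i in range(len(groups)))
    return cur, total
```

The downgrade phase mirrors Step 2's "largest global coefficient" choice, and the upgrade phase mirrors Step 3's feasibility-preserving improvement, though both use simplified scores.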
Figure 5: Accuracy rate on different sizes of the real-world dataset.
Figure 6: Run time on different sizes of the real-world dataset.
Figure 7: Accuracy rate on different sizes of the synthetic dataset.
Figure 8: Run time on different sizes of the synthetic dataset.
Figure 9: Comparison of convergence on R1K.
Figure 10: Comparison of convergence on R10K.
Figure 11: Impact of threshold on the real-world dataset.
Figure 12: Impact of threshold on the synthetic dataset.