Article | Open Access

Bridging Reinforcement Learning and Planning to Solve Combinatorial Optimization Problems with Nested Sub-Tasks

Xiaohan Shan1, Pengjiu Wang1,2, Mingda Wan1, Dong Yan1, Jialian Li1, and Jun Zhu1,3 (corresponding author)
1 Qiyuan Lab, Beijing 100095, China
2 China Ship Development and Design Center, Wuhan 430064, China
3 School of Life Sciences, Tsinghua University, Beijing 100084, China

Abstract

Combinatorial Optimization (CO) problems have been intensively studied for decades and have a wide range of applications. For some classic CO problems, e.g., the Traveling Salesman Problem (TSP), both traditional planning algorithms and emerging reinforcement learning methods have made solid progress in recent years. However, for CO problems with nested sub-tasks, neither end-to-end reinforcement learning algorithms nor traditional evolutionary methods can obtain satisfactory strategies with limited time and computational resources. In this paper, we propose an algorithmic framework for solving CO problems with nested sub-tasks, in which learning and planning algorithms can be combined in a modular way. We validate our framework on the Job-Shop Scheduling Problem (JSSP), and the experimental results show that our algorithm performs well in both solution quality and model generalization.
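Since the abstract only describes the framework at a high level, the following is a minimal, hypothetical Python sketch of how a modular "learning + planning" loop for the JSSP could be composed: an outer policy module chooses which job to advance next (the kind of decision an RL agent would learn), and an inner planning step places the chosen operation at its earliest feasible time. All names here (SubTaskPolicy, plan_operation, solve_jssp) and the placeholder heuristic are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: one way to wire a "learning" module and a "planning"
# module together for a toy JSSP instance. Not the authors' algorithm.
import random

# A JSSP instance: JOBS[j] is the ordered list of (machine, duration) operations of job j.
JOBS = [
    [(0, 3), (1, 2), (2, 2)],
    [(0, 2), (2, 1), (1, 4)],
    [(1, 4), (2, 3), (0, 1)],
]

class SubTaskPolicy:
    """Stands in for a learned policy on the outer sub-task: which job to advance next.
    A trained RL agent would score jobs from state features; here a randomly
    tie-broken 'most work remaining' heuristic acts as a placeholder."""
    def select_job(self, jobs, next_op_idx):
        candidates = [j for j, ops in enumerate(jobs) if next_op_idx[j] < len(ops)]
        remaining = lambda j: sum(d for _, d in jobs[j][next_op_idx[j]:])
        return max(candidates, key=lambda j: (remaining(j), random.random()))

def plan_operation(job, op, machine_free, job_free):
    """Inner planning step for the nested sub-task: place the chosen operation at the
    earliest time feasible for both its machine and its job."""
    machine, duration = op
    start = max(machine_free[machine], job_free[job])
    return machine, start, start + duration

def solve_jssp(jobs, policy):
    n_machines = 1 + max(m for ops in jobs for m, _ in ops)
    machine_free = [0] * n_machines   # earliest free time per machine
    job_free = [0] * len(jobs)        # earliest free time per job
    next_op_idx = [0] * len(jobs)     # next unscheduled operation of each job
    schedule = []
    while any(i < len(ops) for i, ops in zip(next_op_idx, jobs)):
        j = policy.select_job(jobs, next_op_idx)                        # learning module
        op = jobs[j][next_op_idx[j]]
        m, start, end = plan_operation(j, op, machine_free, job_free)   # planning module
        machine_free[m], job_free[j] = end, end
        next_op_idx[j] += 1
        schedule.append((j, m, start, end))
    return schedule, max(end for *_, end in schedule)

if __name__ == "__main__":
    schedule, makespan = solve_jssp(JOBS, SubTaskPolicy())
    print(f"makespan = {makespan}")
    for j, m, s, e in schedule:
        print(f"job {j} on machine {m}: [{s}, {e})")
```

Because the job-selection and operation-placement decisions sit behind separate interfaces, the learned component and the planning component can be swapped or retrained independently, which illustrates the kind of modular combination the abstract describes.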

CAAI Artificial Intelligence Research
Article number: 9150025
Cite this article:
Shan X, Wang P, Wan M, et al. Bridging Reinforcement Learning and Planning to Solve Combinatorial Optimization Problems with Nested Sub-Tasks. CAAI Artificial Intelligence Research, 2023, 2: 9150025. https://doi.org/10.26599/AIR.2023.9150025
Received: 12 July 2023
Revised: 01 September 2023
Accepted: 03 November 2023
Published: 31 December 2023
© The author(s) 2023.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
