Optimizing Risk-Aware Task Migration Algorithm Among Multiplex UAV Groups Through Hybrid Attention Multi-Agent Reinforcement Learning

Yuanshuang Jiang; Kai Di; Ruiyi Qian; Xingyu Wu; Fulin Chen; Pan Li; Xiping Fu; Yichuan Jiang

doi:10.26599/TST.2024.9010013

Open Access

Yuanshuang Jiang^{1,Y}, Kai Di^{1,Y}, Ruiyi Qian^{1}, Xingyu Wu^{2}, Fulin Chen^{3}, Pan Li^{3}, Xiping Fu^{4}, Yichuan Jiang^{1}

^{1} School of Computer Science and Engineering, Southeast University, Nanjing 211189

^{2} School of Software Engineering, Southeast University, Nanjing 211189

^{3} School of Cyber Science and Engineering, Southeast University, Nanjing 211189

^{4} PredictHQ, Auckland 1010, New Zealand

Abstract

As multiplex Unmanned Aerial Vehicle (multi-UAV) collaboration in dynamic task environments grows more complex, multi-UAV systems exhibit two new characteristics: inter-coupling among multiplex groups and intra-correlation within groups. However, previous studies have often overlooked the structural impact of dynamic risks on agents among multiplex UAV groups, a critical issue for modern multi-UAV communication. To address this problem, we integrate the influence of dynamic risks on agents among multiplex UAV group structures into a multi-UAV task migration problem and formulate it as a partially observable Markov game. We then propose a Hybrid Attention Multi-agent Reinforcement Learning (HAMRL) algorithm, which uses attention structures to learn the dynamic characteristics of the task environment and integrates hybrid attention mechanisms to aggregate intra- and inter-group communication efficiently for information extraction and group collaboration. Experimental results show that, in this comprehensive and challenging model, our algorithm significantly outperforms state-of-the-art algorithms in convergence speed and overall performance, which we attribute to the rational design of its communication mechanisms.

Keywords

Unmanned Aerial Vehicle (UAV); multiplex UAV group structures; task migration; multi-agent reinforcement learning

Algorithm 1　Hybrid Attention Multi-Agent Reinforcement Learning
Initialize replay buffer $D$
Initialize the UAVs into $m$ groups according to Eq. (12)
for episode $= 1$ to max-iters do
　Reset environments, and get initial $o_{i}$ for each agent $i$
　Initialize a random process $\mathcal{N}$ for action exploration
　Receive initial state $x = \{o_{1}, o_{2}, \ldots, o_{N}\}$
　for $t = 1$ to max-episode-length do
　　For each agent $i$, select action $a_{i} = \mu_{\theta_{i}}(o_{i}) + \mathcal{N}_{t}$ w.r.t. the current policy and exploration
　　Execute task migration actions $a = (a_{1}, a_{2}, \ldots, a_{N})$ and observe reward $r$ and new state $x' = \{o_{1}', o_{2}', \ldots, o_{N}'\}$
　　Store $(x, a, r, x')$ in replay buffer $D$
　　Set $x \leftarrow x'$
　　Each agent samples the $j$-th minibatch $(o_{i}, r, o_{i}')$ from $D$
　　for each leader agent $i = 1$ to $M$ do
　　　Compute intra-group information $\hat{M}_{i}$ by Eq. (14) and send it to the other agents
　　for agent $i = 1$ to $N$ do
　　　Compute inter-group communication information $\check{M}_{i}$ by Eq. (18)
　　　Update critic by minimizing the loss by Eq. (19)
　　　Update actor using the sampled policy gradient by Eq. (20)
　　Update target network parameters for each agent $i$: $\theta_{i}' \leftarrow \tau \theta_{i} + (1 - \tau) \theta_{i}'$
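
The generic bookkeeping steps of Algorithm 1 (noisy action selection, replay storage and sampling, and the soft target update) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `soft_update`, `noisy_action`, and `ReplayBuffer` are hypothetical, parameters are represented as plain lists of floats, and Gaussian noise is assumed for the exploration process, which the algorithm leaves unspecified. The attention-based communication steps (Eqs. (14) and (18)) and the actor and critic updates (Eqs. (19) and (20)) are deliberately omitted.

```python
import random
from collections import deque


def soft_update(target, online, tau):
    """Soft target update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def noisy_action(policy_output, sigma, rng):
    """Add zero-mean Gaussian noise (an assumed exploration process) to mu(o_i)."""
    return [a + rng.gauss(0.0, sigma) for a in policy_output]


class ReplayBuffer:
    """Fixed-capacity buffer of (x, a, r, x_next) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def store(self, x, a, r, x_next):
        self.storage.append((x, a, r, x_next))

    def sample(self, batch_size, rng):
        # Sample without replacement, capped at the number of stored transitions.
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)


if __name__ == "__main__":
    rng = random.Random(0)
    buffer = ReplayBuffer(capacity=1000)
    action = noisy_action([0.2, -0.1], sigma=0.05, rng=rng)
    buffer.store(x=[0.0, 0.0], a=action, r=1.0, x_next=[0.1, 0.0])
    minibatch = buffer.sample(batch_size=32, rng=rng)
    target_params = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.01)
    print(len(minibatch), target_params)
```

A small $\tau$ makes the target parameters track the online parameters slowly, which is the usual reason for preferring a soft update over copying the online network outright.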