Institute of Intelligent Computing, School of Computer Science and Technology, Shandong University, Qingdao 266237, China
Department of Electrical and Computer Engineering, George Washington University, Washington, DC 20052, USA
Abstract
Most blockchain systems currently adopt resource-consuming protocols to achieve consensus among miners, such as Proof-of-Work (PoW) and Practical Byzantine Fault Tolerance (PBFT), which consume substantial computing/communication resources and usually require reliable communications with bounded delay. However, these protocols may be unsuitable for Internet of Things (IoT) networks because IoT devices are usually lightweight, battery-operated, and deployed in unreliable wireless environments. Therefore, this paper studies an efficient consensus protocol for blockchain in IoT networks via reinforcement learning. Specifically, the consensus protocol is designed on the basis of the Proof-of-Communication (PoC) scheme directly in a single-hop wireless network with unreliable communications. A distributed Multi-Agent Reinforcement Learning (MARL) algorithm is proposed to improve the efficiency and fairness of consensus for miners in the blockchain system. In this algorithm, each agent uses a matrix to depict the efficiency and fairness of the recent consensus and carefully tunes its actions and rewards in an actor-critic framework to seek effective performance. Empirical simulation results show that the fairness of consensus in the proposed algorithm is guaranteed, and the efficiency nearly reaches that of a centralized optimal solution.
References
[1] W. P. Wang, Z. R. Wang, Z. F. Zhou, H. X. Deng, W. L. Zhao, C. Y. Wang, and Y. Z. Guo, Anomaly detection of industrial control systems based on transfer learning, Tsinghua Science and Technology, vol. 26, no. 6, pp. 821–832, 2021.
[2] Z. N. Mohammad, F. Farha, A. O. M. Abuassba, S. K. Yang, and F. Zhou, Access control and authorization in smart homes: A survey, Tsinghua Science and Technology, vol. 26, no. 6, pp. 906–917, 2021.
[3] X. L. Xu, H. Y. Li, W. J. Xu, Z. J. Liu, L. Yao, and F. Dai, Artificial intelligence for edge service optimization in Internet of Vehicles: A survey, Tsinghua Science and Technology, vol. 27, no. 2, pp. 270–287, 2022.
[4] M. S. Ali, M. Vecchio, M. Pincheira, K. Dolui, F. Antonelli, and M. H. Rehmani, Applications of blockchains in the internet of things: A comprehensive survey, IEEE Commun. Surv. Tutorials, vol. 21, no. 2, pp. 1676–1717, 2019.
[5] K. Biswas and V. Muthukkumarasamy, Securing smart cities using blockchain technology, in Proc. 18th Int. Conf. on High Performance Computing and Communications; IEEE 14th Int. Conf. on Smart City; IEEE 2nd Int. Conf. on Data Science and Systems, Sydney, Australia, 2016, pp. 1392–1393.
[6] P. T. S. Liu, Medical record system using blockchain, big data and tokenization, in Proc. 18th Int. Conf. on Information and Communications Security, Singapore, 2016, pp. 254–261.
[7] X. Yue, H. J. Wang, D. W. Jin, M. Q. Li, and W. Jiang, Healthcare data gateways: Found healthcare intelligence on blockchain with novel privacy risk control, J. Med. Syst., vol. 40, no. 10, p. 218, 2016.
[8] I. Bentov, C. Lee, A. Mizrahi, and M. Rosenfeld, Proof of activity: Extending bitcoin's proof of work via proof of stake, SIGMETRICS Perform. Eval. Rev., vol. 42, no. 3, pp. 34–37, 2014.
[9] M. Castro and B. Liskov, Practical Byzantine fault tolerance, in Proc. 3rd Symp. on Operating Systems Design and Implementation, New Orleans, LA, USA, 1999, pp. 173–186.
[10] M. F. Yin, D. Malkhi, M. K. Reiter, G. G. Gueta, and I. Abraham, HotStuff: BFT consensus with linearity and responsiveness, in Proc. 2019 ACM Symp. on Principles of Distributed Computing, Toronto, Canada, 2019, pp. 347–356.
[11] M. H. Xu, C. C. Liu, Y. F. Zou, F. Zhao, J. G. Yu, and X. Z. Cheng, wChain: A fast fault-tolerant blockchain protocol for multihop wireless networks, IEEE Trans. Wirel. Commun., vol. 20, no. 10, pp. 6915–6926, 2021.
[12] L. Yang, Y. F. Zou, M. H. Xu, Y. C. Xu, D. X. Yu, and X. Z. Cheng, Distributed consensus for blockchains in internet-of-things networks, Tsinghua Science and Technology, vol. 27, no. 5, pp. 817–831, 2022.
[13] M. H. Xu, F. Zhao, Y. F. Zou, C. C. Liu, X. Z. Cheng, and F. Dressler, BLOWN: A blockchain protocol for single-hop wireless networks under adversarial SINR, IEEE Trans. Mob. Comput.
[17] R. S. Sutton, Temporal credit assignment in reinforcement learning, PhD dissertation, Univ. Mass. Amherst, Amherst, MA, USA, 1984.
[18] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in Proc. 12th Int. Conf. on Neural Information Processing Systems, Denver, CO, USA, 2000, pp. 1057–1063.
[19] A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Trans. Syst. Man Cybern., vol. SMC-13, no. 5, pp. 834–846, 1983.
[20] S. Dziembowski, S. Faust, V. Kolmogorov, and K. Pietrzak, Proofs of space, in Proc. 35th Annu. Cryptology Conf., Santa Barbara, CA, USA, 2015, pp. 585–605.
[21] A. Miller, A. Juels, E. Shi, B. Parno, and J. Katz, Permacoin: Repurposing bitcoin work for data preservation, in Proc. 2014 IEEE Symp. on Security and Privacy, Berkeley, CA, USA, 2014, pp. 475–490.
[22] M. Tan, Multi-agent reinforcement learning: Independent vs. cooperative agents, in Machine Learning Proceedings 1993. Amsterdam, the Netherlands: Elsevier, 1993, pp. 330–337.
[23] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al., Value-decomposition networks for cooperative multi-agent learning based on team reward, in Proc. 17th Int. Conf. on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden, 2018, pp. 2085–2087.
[24] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson, QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning, in Proc. 35th Int. Conf. on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 2018, pp. 4292–4301.
[25] K. Q. Zhang, Z. R. Yang, and T. Başar, Multi-agent reinforcement learning: A selective overview of theories and algorithms, in Handbook of Reinforcement Learning and Control, K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, eds. Cham, Switzerland: Springer, 2021, pp. 321–384.
[26] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, Counterfactual multi-agent policy gradients, in Proc. 32nd AAAI Conf. on Artificial Intelligence, Palo Alto, CA, USA, 2018, pp. 2974–2982.
[27] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, Multi-agent actor-critic for mixed cooperative-competitive environments, in Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6382–6393.
[28] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson, Stabilising experience replay for deep multi-agent reinforcement learning, in Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, 2017, pp. 1146–1155.
[29] I. Mordatch and P. Abbeel, Emergence of grounded compositional language in multi-agent populations, in Proc. 32nd AAAI Conf. on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conf. and 8th AAAI Symp. on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2018, p. 183.
[30] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch, Emergent complexity via multi-agent competition, presented at Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[31] M. Raghu, A. Irpan, J. Andreas, R. Kleinberg, Q. V. Le, and J. M. Kleinberg, Can deep reinforcement learning solve Erdos-Selfridge-Spencer games? in Proc. 35th Int. Conf. on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 2018, pp. 4235–4243.
[32] J. Z. Leibo, V. Zambaldi, M. Lanctot, and J. Marecki, Multi-agent reinforcement learning in sequential social dilemmas, in Proc. 16th Conf. on Autonomous Agents and MultiAgent Systems, São Paulo, Brazil, 2017, pp. 464–473.
[33] A. Lerer and A. Peysakhovich, Maintaining cooperation in complex social dilemmas using deep reinforcement learning, arXiv preprint arXiv: 1707.01068, 2018.
[34] J. Z. Leibo, J. Perolat, E. Hughes, S. Wheelwright, A. H. Marblestone, E. Duéñez-Guzmán, P. Sunehag, I. Dunning, and T. Graepel, Malthusian reinforcement learning, in Proc. 18th Int. Conf. on Autonomous Agents and MultiAgent Systems, Montreal, Canada, 2019, pp. 1099–1107.
[35] X. F. Wang and T. Sandholm, Reinforcement learning to play an optimal Nash equilibrium in team Markov games, in Proc. 15th Int. Conf. on Neural Information Processing Systems, Cambridge, MA, USA, 2002, pp. 1603–1610.
[36] M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Machine Learning Proceedings 1994, W. W. Cohen and H. Hirsh, eds. Amsterdam, the Netherlands: Elsevier, 1994, pp. 157–163.
[37] M. G. Lagoudakis and R. Parr, Learning in zero-sum team Markov games using factored value functions, in Proc. 15th Int. Conf. on Neural Information Processing Systems, Cambridge, MA, USA, 2002, pp. 1659–1666.
[38] C. Dwork, N. Lynch, and L. Stockmeyer, Consensus in the presence of partial synchrony (Preliminary version), in Proc. 3rd Annu. ACM Symp. on Principles of Distributed Computing, Vancouver, British Columbia, Canada, 1984, pp. 103–118.
[39] P. Marbach and J. N. Tsitsiklis, Simulation-based optimization of Markov reward processes: Implementation issues, in Proc. 38th IEEE Conf. on Decision and Control, Phoenix, AZ, USA, 1999, pp. 1769–1774.
[40] R. S. Sutton, Generalization in reinforcement learning: Successful examples using sparse coarse coding, in Proc. 8th Int. Conf. on Neural Information Processing Systems, Denver, CO, USA, 1995, pp. 1038–1044.
Zou Y, Jin Z, Zheng Y, et al. Optimized Consensus for Blockchain in Internet of Things Networks via Reinforcement Learning. Tsinghua Science and Technology, 2023, 28(6): 1009-1022. https://doi.org/10.26599/TST.2022.9010045
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Fig. 1 Example of how the index changes.
Fig. 2 Actor-critic framework of our MAAC algorithm.
Simulation Results
This section presents the experimental results of the simulation, which evaluate how efficiently the proposed algorithm achieves consensus. As mentioned in the framework of the consensus protocol, the Leader Election (LE) process directly determines the efficiency of consensus: once a leader is elected, completing the BP, BC, and CU stages takes only three additional rounds. Thus, in the following, the efficiency of the LE algorithm is observed in each episode, which contains main and auxiliary rounds.
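The round accounting above can be sketched as follows. The helper name is illustrative only; the constant of three post-election rounds comes from the statement that the BP, BC, and CU stages each take one round once a leader is elected.

```python
def rounds_per_episode(le_rounds: int) -> int:
    """Total rounds in one consensus episode: leader election (LE) takes a
    variable number of rounds, after which the three remaining stages
    (BP, BC, and CU) each take one additional round."""
    POST_ELECTION_STAGES = 3  # BP, BC, and CU, one round each
    return le_rounds + POST_ELECTION_STAGES

print(rounds_per_episode(1))  # best case: leader elected in one round
```

This makes explicit why LE is the bottleneck: the post-election cost is constant, so only the LE stage is worth optimizing.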
Parameter setting. In this simulation, agents are randomly and uniformly distributed in a circular area with a radius of m, and the minimum distance between agents is 1 m. Ambient noise is normalized to 1 dB, the transmission range of each device is set to 200 m, and the maximum transmission power is fixed. Unless otherwise specified, the agent parameters are set as follows: the policy network learning rate is 0.0003 and the value network learning rate is 0.0005. All results are produced with TensorFlow and Python. All simulations run on a computer equipped with an Intel Core i7-8565U @ 1.80 GHz.
Simulated results. First, as shown in Fig. 3, the convergence of the Deep Reinforcement Learning (DRL) algorithm with different numbers of agents is evaluated through the total reward obtained by a single agent in each episode during training. The results reveal that the agent always gradually attains the maximum total reward and converges after sufficiently many episodes. As the number of agents increases, more training episodes (i.e., more training time) are needed for the total reward to converge to its maximum. This scalability allows the proposed algorithm to be deployed in large-scale distributed network scenarios without degrading performance.
Figure 3 demonstrates that the proposed algorithm converges well, which directly supports the "termination" property. Combined with the four aforementioned consensus stages that guarantee "validity" and "agreement", the proposed algorithm thus satisfies all three properties of consensus algorithms.
Fig. 3 Total reward obtained from a single agent in each episode.
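The actor-critic training loop behind these reward curves can be illustrated with a minimal single-agent sketch. This is not the paper's MAAC setup: it is a hypothetical one-state, two-action task (action 0 always pays reward 1, action 1 pays 0) with a tabular softmax actor and a TD critic, chosen only to show how the actor's policy is nudged by the critic's TD error.

```python
import math
import random

random.seed(0)

# Hypothetical toy task: one state, two actions; action 0 yields reward 1,
# action 1 yields reward 0. The actor should learn to prefer action 0.
theta = [0.0, 0.0]   # actor: action preferences (softmax policy)
value = 0.0          # critic: estimated value of the single state
ALPHA_ACTOR, ALPHA_CRITIC = 0.05, 0.1

def policy():
    """Softmax over the action preferences."""
    exps = [math.exp(t) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

for episode in range(2000):
    probs = policy()
    action = 0 if random.random() < probs[0] else 1
    reward = 1.0 if action == 0 else 0.0
    td_error = reward - value          # critic's surprise (single state, no bootstrap)
    value += ALPHA_CRITIC * td_error   # critic update
    # Actor update: policy-gradient step scaled by the TD error.
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        theta[a] += ALPHA_ACTOR * td_error * grad

print(round(policy()[0], 2))  # learned probability of the rewarding action
```

In the paper's multi-agent setting each agent runs its own actor and critic with neural networks instead of tables, but the update structure is the same.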
Second, as shown in Fig. 4, the success rate (the ratio of successful rounds to total rounds) of the proposed learning algorithm in each episode is compared with that of the previous randomized algorithm [13]. The results show that although the randomized algorithm maintains a stable success rate, the learning algorithm outperforms it after sufficient training. Notably, once the algorithm has sufficiently converged, its success rate reaches a near-maximal value, markedly higher than that of the randomized algorithm. Moreover, the success rate of the randomized algorithm decreases as the number of agents increases; it has been proven in Ref. [13] to decrease to the lower bound of 1/e as the number of agents tends to infinity. By contrast, the performance of the proposed algorithm is insensitive to the number of agents: when it increases, only the number of training episodes must be increased to reach a comparable success rate. Therefore, the consensus efficiency of the proposed algorithm is optimized.
Fig. 4 Comparison of the success rate in each episode between the learning and randomized algorithms.
Finally, as shown in Fig. 5, the number of times each agent is elected leader once the proposed learning algorithm reaches the maximum reward is compared with the corresponding counts of the randomized algorithm in Ref. [13]. The results show that the randomized algorithm only guarantees fairness in probability; due to randomization, considerable uncertainty always remains in practice, as illustrated by the only approximately equal blue bars in the figure. After convergence, by contrast, the proposed algorithm ensures that each agent is elected leader exactly the same number of times, as shown by the exactly equal red bars in the figure, thus guaranteeing fairness in fact.
Fig. 5 Comparison of the number of times that each agent is elected between the learning algorithm and the randomized algorithm.
Overall, the experimental results show that once our algorithm is trained to convergence, the agents are elected leader in turn, i.e., each agent is deterministically re-elected as leader after a fixed interval of successful rounds since its last election. Thus, the consensus efficiency of the proposed algorithm is optimized.
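The fairness gap between the two schemes can be sketched by comparing election-count spreads. Modeling the randomized baseline as a uniformly random leader choice per round is a simplification assumed here (Ref. [13]'s protocol is more involved); the round-robin function mirrors the turn-taking behavior the converged learning policy exhibits.

```python
import random

random.seed(1)

def election_counts_random(n_agents: int, rounds: int) -> list:
    """Uniformly random leader per round: fair in expectation only."""
    counts = [0] * n_agents
    for _ in range(rounds):
        counts[random.randrange(n_agents)] += 1
    return counts

def election_counts_round_robin(n_agents: int, rounds: int) -> list:
    """Turn-taking election, as the converged learning policy behaves."""
    counts = [0] * n_agents
    for r in range(rounds):
        counts[r % n_agents] += 1
    return counts

def spread(counts: list) -> int:
    """Max-minus-min election count: 0 means exact fairness."""
    return max(counts) - min(counts)

n, rounds = 10, 1000
print("random spread:", spread(election_counts_random(n, rounds)))
print("round-robin spread:", spread(election_counts_round_robin(n, rounds)))
```

The round-robin spread is always exactly zero when the round count divides evenly, while the random spread fluctuates from run to run, matching the uneven blue bars versus the flat red bars in Fig. 5.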