School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
Department of Computer Science and Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract
Road pricing is an urban traffic management mechanism for reducing traffic congestion. Currently, most road pricing systems rely on predefined charging tolls and fail to consider the dynamics of urban traffic flows and travelers' demands on arrival time. In this paper, we propose a method that dynamically adjusts road tolls online according to traffic conditions and travelers' demands to resolve the above-mentioned problems. Based on deep reinforcement learning, the method automatically allocates the optimal toll for each road during peak hours and guides vehicles to roads with lower toll charges. Moreover, it further considers travelers' demands to ensure that more vehicles arrive at their destinations before their expected arrival time. Compared with existing static mechanisms, our method increases the traffic volume effectively.
References
[1] S. Liu, W. T. Zhang, X. J. Wu, S. Feng, X. Pei, and D. Y. Yao, A simulation system and speed guidance algorithms for intersection traffic control using connected vehicle technology, Tsinghua Science and Technology, vol. 24, no. 2, pp. 160-170, 2019.
[2] N. Wang, G. Guo, B. Wang, and C. Wang, Traffic clustering algorithm of urban data brain based on a hybrid-augmented architecture of quantum annealing and brain-inspired cognitive computing, Tsinghua Science and Technology, vol. 25, no. 6, pp. 813-825, 2020.
[3] J. J. Li, H. H. Jiao, J. Wang, Z. G. Liu, and J. Wu, Online real-time trajectory analysis based on adaptive time interval clustering algorithm, Big Data Mining and Analytics, vol. 3, no. 2, pp. 131-142, 2020.
[4] J. R. Gan, B. An, H. Z. Wang, X. M. Sun, and Z. Z. Shi, Optimal pricing for improving efficiency of taxi systems, in Proc. 23rd Int. Joint Conf. Artificial Intelligence, Beijing, China, 2013, pp. 2811-2818.
[5] M. J. Hausknecht and P. Stone, Deep reinforcement learning in parameterized action space, arXiv preprint arXiv:1511.04143, 2016.
[6] D. Joksimovic, M. C. J. Bliemer, P. H. L. Bovy, and Z. Verwater-Lukszo, Dynamic road pricing for optimizing network performance with heterogeneous users, in Proc. Networking, Sensing and Control, Tucson, AZ, USA, 2005, pp. 407-412.
[7] H. P. Chen, B. An, G. Sharon, J. P. Hanna, P. Stone, C. Y. Miao, and Y. C. Soh, DyETC: Dynamic electronic toll collection for traffic congestion alleviation, in Proc. 32nd AAAI Conf. Artificial Intelligence, New Orleans, LA, USA, 2018, pp. 757-765.
[8] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in Proc. 12th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 1999, pp. 1057-1063.
[9] D. Joksimovic, M. C. J. Bliemer, and P. H. L. Bovy, Optimal toll design problem in dynamic traffic networks with joint route and departure time choice, Transportation Research Record, vol. 1923, no. 1, pp. 61-72, 2005.
[10] D. Y. Lin, A. Unnikrishnan, and S. T. Waller, A dual variable approximation based heuristic for dynamic congestion pricing, Networks and Spatial Economics, vol. 11, no. 2, pp. 271-293, 2011.
[11] B. J. Zhou, M. Bliemer, H. Yang, and J. He, A trial-and-error congestion pricing scheme for networks with elastic demand and link capacity constraints, Transportation Research Part B: Methodological, vol. 72, pp. 77-92, 2015.
[12] G. Sharon, J. Hanna, T. Rambha, M. Albert, P. Stone, and S. D. Boyles, Delta-tolling: Adaptive tolling for optimizing traffic throughput, in Proc. 9th Int. Workshop on Agents in Traffic and Transportation, New York, NY, USA, 2016.
[13] G. Sharon, J. P. Hanna, T. Rambha, M. W. Levin, M. Albert, S. D. Boyles, and P. Stone, Real-time adaptive tolling scheme for optimized social welfare in traffic networks, in Proc. 16th Conf. Autonomous Agents and MultiAgent Systems, Richland, SC, USA, 2017, pp. 828-836.
[14] K. T. Bui, V. A. Huynh, and E. Frazzoli, Dynamic traffic congestion pricing mechanism with user-centric considerations, in Proc. 2012 15th Int. IEEE Conf. Intelligent Transportation Systems, Anchorage, AK, USA, 2012, pp. 147-154.
[15] H. Mirzaei, G. Sharon, S. D. Boyles, T. Givargis, and P. Stone, Link-based parameterized micro-tolling scheme for optimal traffic management, in Proc. 17th Int. Conf. Autonomous Agents and MultiAgent Systems, Richland, SC, USA, 2018, pp. 2013-2015.
[16] F. Soylemezgiller, M. Kuscu, and D. Kilinc, A traffic congestion avoidance algorithm with dynamic road pricing for smart cities, in Proc. 2013 IEEE 24th Annual Int. Symp. Personal, Indoor, and Mobile Radio Communications, London, UK, 2013, pp. 2571-2575.
[17] W. Qiu, H. P. Chen, and B. An, Dynamic electronic toll collection via multi-agent deep reinforcement learning with edge-based graph convolutional networks, in Proc. 28th Int. Joint Conf. Artificial Intelligence, Macao, China, 2019, pp. 4568-4574.
[18] N. Aung, W. D. Zhang, S. Dhelim, and Y. B. Ai, T-Coin: Dynamic traffic congestion pricing system for the internet of vehicles in smart cities, Information, vol. 11, no. 3, p. 149, 2020.
[19] H. K. Lo and W. Y. Szeto, A methodology for sustainable traveler information services, Transportation Research Part B: Methodological, vol. 36, no. 2, pp. 113-130, 2002.
[20] K. L. Hong, C. W. Yip, and K. H. Wan, Modeling transfer and non-linear fare structure in multi-modal network, Transportation Research Part B: Methodological, vol. 37, no. 2, pp. 149-170, 2003.
[21] H. J. Huang and Z. C. Li, A multiclass, multicriteria logit-based traffic equilibrium assignment model under ATIS, Eur. J. Oper. Res., vol. 176, no. 3, pp. 1464-1477, 2007.
[22] K. Zhu and T. Zhang, Deep reinforcement learning based mobile robot navigation: A review, Tsinghua Science and Technology, vol. 26, no. 5, pp. 674-691, 2021.
[23] B. J. Zhou, M. Bliemer, H. Yang, and J. He, A trial-and-error congestion pricing scheme for networks with elastic demand and link capacity constraints, Transportation Research Part B: Methodological, vol. 72, pp. 77-92, 2015.
[24] O. Anschel, N. Baram, and N. Shimkin, Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning, in Proc. 34th Int. Conf. Machine Learning, Sydney, Australia, 2017, pp. 176-185.
[25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, Continuous control with deep reinforcement learning, in Proc. 4th Int. Conf. Learning Representations, San Juan, Puerto Rico, 2016.
[26] P. A. Barter, A vehicle quota integrated with road usage pricing: A mechanism to complete the phase-out of high fixed vehicle taxes in Singapore, Transport Policy, vol. 12, no. 6, pp. 525-536, 2005.
Jin J, Zhu X, Wu B, et al. A Dynamic and Deadline-Oriented Road Pricing Mechanism for Urban Traffic Management. Tsinghua Science and Technology, 2022, 27(1): 91-102. https://doi.org/10.26599/TST.2020.9010062
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Fig. 1 An example of road pricing.
4 DADO Algorithm
Due to the huge scale of the problem, finding the optimal policy function is highly challenging. Traditional machine learning methods require a comprehensive model of the environment; however, in our problem the traffic environment is complicated and constantly changing, and cannot be captured by closed-form mathematical formulas. In addition, the features used by traditional machine learning methods usually need to be designed by human experts. Features have a crucial influence on the generalization ability of a model, yet hand-crafted traffic features become invalid as the environment changes over time. Moreover, the complexity of reinforcement learning grows exponentially with the size of the state and action spaces. Traditional reinforcement learning maintains a two-dimensional Q-table that stores the value of every state-action pair, whereas our problem has multidimensional and continuous state and action spaces [22].
Hence, function approximators are generally adopted. There are many kinds of approximators, such as linear and nonlinear function approximations. Deep neural networks, as nonlinear function approximators, have been widely used for large-scale reinforcement learning tasks. Because deep neural networks can extract features automatically, deep learning is well suited to turning the agent's observation into an abstract representation for learning an optimal control policy. However, when faced with a high-dimensional state space and a bounded, continuous action space [23], deep learning approaches may suffer from convergence problems. Policy gradient methods have advantages in this situation [24, 25]: they apply gradient-based updates to search for the optimal policy and learn the policy directly rather than estimating state-action value functions. Based on the above considerations, we present our solution, the DADO policy gradient algorithm.
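For illustration, the following sketch (not the paper's code; the network sizes, the toll bound max_toll, and the placeholder advantage value of 1.0 are assumptions) shows how a policy gradient method can learn a stochastic policy directly over a bounded, continuous toll vector, without maintaining a Q-table.

```python
# Minimal sketch: a Gaussian policy over a bounded, continuous toll vector,
# updated directly with the policy gradient. All sizes/bounds are illustrative.
import torch
import torch.nn as nn

class GaussianTollPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, max_toll=5.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, action_dim)        # mean toll per road
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.max_toll = max_toll

    def forward(self, state):
        mean = self.mean_head(self.body(state))
        return torch.distributions.Normal(mean, self.log_std.exp())

policy = GaussianTollPolicy(state_dim=16, action_dim=4)
state = torch.randn(1, 16)
dist = policy(state)
action = dist.sample().clamp(0.0, policy.max_toll)        # bounded toll vector
# Policy-gradient update direction: log-probability weighted by an advantage;
# 1.0 below is a placeholder for the advantage estimate.
loss = -(dist.log_prob(action).sum() * 1.0)
loss.backward()
```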
DADO differs from Monte Carlo based policy gradient algorithms, which observe an entire episode, accumulate the reward until the episode ends, and use the average reward over episodes to estimate the reward of the current policy. Because the total reward of an episode is a random variable, summing the rewards in this way gives the Monte Carlo estimate of the cumulative reward a large variance.
DADO adopts temporal difference learning, which has lower variance than the Monte Carlo approach. More specifically, DADO adopts an "actor-critic" structure. The actor is the policy function that aims to approximate the optimal policy and choose the optimal action, while the critic is the value function that evaluates the actor's choice and guides the next one. As the evaluation indicator of the critic, we use the advantage function, which has a smaller variance than the accumulated reward. The advantage function is defined as
$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t),$$
where $V(s_t)$ is the state value function, i.e., the expected value of all possible actions taken in state $s_t$; $Q(s_t, a_t)$ is the action value function of action $a_t$ in state $s_t$; and $A(s_t, a_t)$ measures the advantage of the action value over the current state value.
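As a concrete illustration (a minimal sketch with made-up numbers, not the authors' code), the advantage can be estimated by replacing $Q(s_t, a_t)$ with the discounted return actually observed after taking $a_t$ and subtracting the critic's value estimate $V(s_t)$; the discount factor 0.9 is an assumption.

```python
# Sketch of A(s_t, a_t) = Q(s_t, a_t) - V(s_t), with Q approximated by the
# observed discounted return. Inputs below are made up for illustration.
def advantage(rewards, values, bootstrap_value, gamma=0.9):
    """values[i] is the critic's V(s_i); returns A(s_i, a_i) for each step."""
    advantages, ret = [], bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret          # discounted return ~ Q(s_t, a_t)
        advantages.append(ret - v)     # advantage over the state value V(s_t)
    return list(reversed(advantages))

print(advantage(rewards=[1.0, 0.5, 2.0], values=[2.0, 1.5, 1.0], bootstrap_value=0.0))
```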
The actor aims to maximize the sum of discounted rewards $J(\theta)$, whose gradient can be represented as
$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q(s_t, a_t) - V(s_t)\big)\big],$$
where $\log \pi_\theta(a_t \mid s_t)$ is the logarithm of the probability of taking action $a_t$ given the state $s_t$, $Q(s_t, a_t)$ is the action value when performing action $a_t$ given the state $s_t$, and $V(s_t)$ denotes the state value, which is the output of the critic. The critic learns the state value, and we use Eq. (9) as its loss function; that is, the critic learns to minimize the difference between the real action value and the estimated value:
$$L(w) = \big(R_t - V_w(s_t)\big)^2. \quad (9)$$
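A hedged PyTorch sketch of the two objectives above follows; it assumes the advantage is approximated by $R_t - V(s_t)$ as described, and the example tensors, shapes, and the 0.5 critic weight are illustrative choices rather than the authors' settings.

```python
# Sketch of the actor and critic losses: the actor maximizes the log-probability
# weighted by the advantage, and the critic minimizes the squared error of Eq. (9).
import torch

def actor_critic_losses(log_probs, returns, values):
    """log_probs: log pi(a_t|s_t); returns: discounted returns R_t; values: V(s_t)."""
    advantages = returns - values.detach()          # A(s_t, a_t) = R_t - V(s_t)
    actor_loss = -(log_probs * advantages).mean()   # gradient ascent on J(theta)
    critic_loss = (returns - values).pow(2).mean()  # Eq. (9): (R_t - V(s_t))^2
    return actor_loss, critic_loss

log_probs = torch.tensor([-1.2, -0.8, -1.5], requires_grad=True)
values = torch.tensor([1.0, 1.4, 0.9], requires_grad=True)
returns = torch.tensor([1.5, 1.2, 1.1])
a_loss, c_loss = actor_critic_losses(log_probs, returns=returns, values=values)
(a_loss + 0.5 * c_loss).backward()                  # 0.5 critic weight is an assumption
```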
In a traditional MDP, the state value does not change with time; however, in the problem we study, the number of vehicles arriving at their destinations depends on the specific time step and on the OD demand of future time steps. The action also depends on the exact time period. Hence, we make the value function and the policy function time-dependent and update them for each time period $t$:
$$V_t(s_t) = \mathbb{E}_{a_t \sim \pi_t(\cdot \mid s_t)}\big[Q_t(s_t, a_t)\big], \quad \nabla_\theta J_t(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_{\theta, t}(a_t \mid s_t)\,\big(Q_t(s_t, a_t) - V_t(s_t)\big)\big],$$
where $Q_t(s_t, a_t)$ is the action value of executing action $a_t$ under a given state $s_t$ in time period $t$.
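One simple way to realize such a time-dependent value function, shown purely as an assumed sketch (the number of periods and the hidden size are placeholders, not the paper's design), is to give the critic a separate output head per decision period so that $V_t(s)$ can differ across the pricing horizon.

```python
# Illustrative sketch: one critic head per decision period t, giving V_t(s).
import torch.nn as nn

NUM_PERIODS = 12   # e.g., decision periods in a peak hour; value is an assumption

class TimeIndexedCritic(nn.Module):
    def __init__(self, state_dim, num_periods=NUM_PERIODS):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        # one value head per time period t
        self.heads = nn.ModuleList([nn.Linear(64, 1) for _ in range(num_periods)])

    def forward(self, state, t):
        return self.heads[t](self.shared(state))
```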
The actor network of DADO is a deep neural network with three fully connected layers that approximates the policy function. The critic network consists of two fully connected layers. The parameters of the two networks are trained separately and updated with the stochastic gradient descent method.
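The layer counts below mirror this description (three fully connected layers for the actor, two for the critic); the hidden sizes 128 and 64 and the output interpretations are assumptions rather than the paper's reported configuration.

```python
# Sketch of the described network shapes; hidden sizes are illustrative.
import torch.nn as nn

def build_actor(state_dim, action_dim):
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, action_dim),        # toll adjustment per charged road
    )

def build_critic(state_dim):
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 1),                 # scalar state value V(s)
    )
```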
The data required to train the networks should be independent and identically distributed. We therefore adopt an asynchronous method in which multiple workers do not produce data at the same time, which breaks the correlation between samples and makes convergence easier. Each worker directly pulls parameters from the global network and interacts with the environment to output actions, and the gradient computed by each worker is used to update the parameters of the global network.
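The pull/push interaction with the global network can be sketched as follows; this is an assumed simplification in PyTorch, and a real asynchronous implementation would additionally handle locking or shared-memory details.

```python
# Sketch of the asynchronous pull/push scheme: a worker copies the global
# parameters, computes gradients locally, and applies them to the global network.
import torch

def sync_local_with_global(local_net, global_net):
    local_net.load_state_dict(global_net.state_dict())      # "pull"

def push_gradients(local_net, global_net, global_optimizer):
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp.grad = lp.grad.clone() if lp.grad is not None else None
    global_optimizer.step()                                  # "push": update global net
    global_optimizer.zero_grad()
```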
Figure 3 depicts the architecture of the proposed DADO.
Fig. 3 Architecture of the proposed DADO.
At each time step, the local agent extracts a state representation from the traffic dynamics and feeds the state to the actor, and the actor performs an action based on the input state and its policy. After that, the critic evaluates the action and updates the local parameters. After the update, the local agent pushes its parameters to the global agent and pulls the latest parameters from the global agent.
Algorithm 1 illustrates the processing of a single actor. In Algorithm 1, we denote the parameter vectors of the globally shared networks by $\theta$ and $w$, and the thread-specific parameter vectors by $\theta'$ and $w'$. The input consists of the decision-making horizon and the global shared counter $T$. We first initialize the step counter $t$ so that an advanced (look-ahead) update can be performed. The algorithm then enters the main loop. In each iteration, the worker resets the gradients, pulls the global parameters into its local networks, and starts a new episode. The worker then interacts with the environment, selecting each action according to the policy function $\pi(a_t \mid s_t; \theta')$. It receives the immediate rewards computed with Eqs. (3)-(5), and the environment transits to the next state. Once an episode ends or the predefined number of look-ahead steps is reached (Line 15), the algorithm computes the total discounted reward $R$ and updates the parameters for each time period with the stochastic gradient descent method. Here $R$ is the sum of rewards from time $t$ onward obtained from the interactions, representing the "real" value under the current policy, whereas $V(s_t; w')$ is the "estimated" sum of rewards approximated by the critic. After the local parameters are updated, they are pushed to the global network to update the global networks. The whole process iterates until the number of episodes reaches the predefined bound $T_{\max}$ of the global shared counter.
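A condensed Python rendering of this single-worker loop is given below as an illustration only; the environment interface (env.reset, env.step), the assumption that the actor returns an action distribution, the helper functions sync_local_with_global and push_gradients from the earlier sketch, and all hyperparameter values are assumptions rather than the authors' implementation.

```python
# Illustrative single-worker loop in the spirit of Algorithm 1.
import torch

GAMMA, N_STEPS, T_MAX = 0.9, 20, 100_000      # assumed hyperparameters
T_global = 0                                  # global shared counter

def run_worker(env, actor, critic, global_actor, global_critic, opt_actor, opt_critic):
    """One asynchronous worker, following the structure described for Algorithm 1."""
    global T_global
    while T_global < T_MAX:
        # Pull the latest global parameters into the local (thread-specific) networks.
        sync_local_with_global(actor, global_actor)
        sync_local_with_global(critic, global_critic)
        state, done = env.reset(), False
        log_probs, values, rewards = [], [], []
        for _ in range(N_STEPS):              # roll out up to the look-ahead limit
            dist = actor(state)               # actor assumed to return a distribution
            action = dist.sample()
            log_probs.append(dist.log_prob(action).sum())
            values.append(critic(state))
            state, reward, done = env.step(action)   # reward as in Eqs. (3)-(5)
            rewards.append(reward)
            if done:
                break
        # Bootstrap from the critic if the episode was truncated.
        R = torch.zeros(1) if done else critic(state).detach()
        actor_loss, critic_loss = torch.zeros(1), torch.zeros(1)
        for r, lp, v in zip(reversed(rewards), reversed(log_probs), reversed(values)):
            R = r + GAMMA * R                            # "real" discounted return R_t
            actor_loss = actor_loss - lp * (R - v.detach())
            critic_loss = critic_loss + (R - v).pow(2)   # critic loss of Eq. (9)
        (actor_loss + critic_loss).sum().backward()
        # Push local gradients to the global networks, then continue.
        push_gradients(actor, global_actor, opt_actor)
        push_gradients(critic, global_critic, opt_critic)
        T_global += 1
```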
Fig. 4 Road network used by simulations.
Fig. 5 Different mechanisms under different measuring standards.
Fig. 6 Number of vehicles arriving at destinations before deadlines.
Fig. 7 Different policies trained by different rewards.