Journal articles on the topic "Soft Actor-Critic"

To see the other types of publications on this topic, follow the link: Soft Actor-Critic.

Format your source in APA, MLA, Chicago, Harvard, and other citation styles

Consult the top 50 journal articles for your research on the topic "Soft Actor-Critic."

Next to every work in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate a bibliographic reference to the selected work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scientific publication in .pdf format and read its abstract online, if the relevant data are available in the metadata.

Browse journal articles on a wide variety of disciplines and compile your bibliography correctly.

1

Hyeon, Soo-Jong, Tae-Young Kang, and Chang-Kyung Ryoo. "A Path Planning for Unmanned Aerial Vehicles Using SAC (Soft Actor Critic) Algorithm." Journal of Institute of Control, Robotics and Systems 28, no. 2 (February 28, 2022): 138–45. http://dx.doi.org/10.5302/j.icros.2022.21.0220.

2

Ding, Feng, Guanfeng Ma, Zhikui Chen, Jing Gao, and Peng Li. "Averaged Soft Actor-Critic for Deep Reinforcement Learning." Complexity 2021 (April 1, 2021): 1–16. http://dx.doi.org/10.1155/2021/6658724.

Abstract:
With the advent of the era of artificial intelligence, deep reinforcement learning (DRL) has achieved unprecedented success in high-dimensional and large-scale artificial intelligence tasks. However, the insecurity and instability of the DRL algorithm have an important impact on its performance. The Soft Actor-Critic (SAC) algorithm uses advanced functions to update the policy and value network to alleviate some of these problems. However, SAC still has some problems. In order to reduce the error caused by the overestimation of SAC, we propose a new SAC algorithm called Averaged-SAC. By averaging the previously learned action-state estimates, it reduces the overestimation problem of soft Q-learning, thereby contributing to a more stable training process and improving performance. We evaluate the performance of Averaged-SAC through some games in the MuJoCo environment. The experimental results show that the Averaged-SAC algorithm effectively improves the performance of the SAC algorithm and the stability of the training process.
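
The averaging mechanism described above can be sketched in a few lines. This is only an illustration of the idea under common SAC conventions; the function name, the number of retained critic snapshots, and the default coefficients are assumptions, not details taken from the paper.

    import numpy as np

    def averaged_soft_q_target(q_snapshots, next_log_prob, reward,
                               alpha=0.2, gamma=0.99, done=False):
        """Illustrative averaged soft Q-learning target.

        q_snapshots   : Q(s', a') values from the K most recent critic snapshots,
                        all evaluated at the same next state-action pair.
        next_log_prob : log pi(a' | s') for the sampled next action.
        Averaging over snapshots dampens the noise that drives overestimation.
        """
        q_avg = float(np.mean(q_snapshots))             # mean over K snapshots
        soft_value = q_avg - alpha * next_log_prob      # soft value: Q - alpha * log pi
        return reward + gamma * (1.0 - float(done)) * soft_value

    # Three disagreeing snapshots give a less optimistic target than
    # bootstrapping from the single largest estimate would.
    target = averaged_soft_q_target([1.2, 0.9, 1.5], next_log_prob=-0.7, reward=0.1)
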
3

Qin, Chenjie, Lijun Zhang, Dawei Yin, Dezhong Peng, and Yongzhong Zhuang. "Some effective tricks are used to improve Soft Actor Critic." Journal of Physics: Conference Series 2010, no. 1 (September 1, 2021): 012061. http://dx.doi.org/10.1088/1742-6596/2010/1/012061.

4

Yang, Qisong, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. "WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10639–46. http://dx.doi.org/10.1609/aaai.v35i12.17272.

Abstract:
Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
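
As a rough, hedged illustration of the CVaR-based safety signal described above (the sample-based CVaR estimator and the Lagrangian-style weight update below are generic textbook constructions, not the authors' exact update rules):

    import numpy as np

    def cvar(costs, alpha=0.1):
        """Conditional Value-at-Risk: mean of the worst alpha-fraction of sampled costs."""
        costs = np.sort(np.asarray(costs, dtype=float))
        k = max(1, int(np.ceil(alpha * len(costs))))
        return costs[-k:].mean()

    def update_safety_weight(weight, episode_costs, budget, lr=0.01, alpha=0.1):
        """Grow the safety weight when the cost CVaR exceeds the budget, shrink it
        otherwise, so the actor trades reward against worst-case safety."""
        violation = cvar(episode_costs, alpha) - budget
        return max(0.0, weight + lr * violation)

    # A heavy-tailed cost sample pushes CVaR above a budget of 1.0, so the weight grows.
    w = update_safety_weight(0.5, episode_costs=[0.1, 0.2, 0.1, 3.0, 0.3], budget=1.0)
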
5

Wong, Ching-Chang, Shao-Yu Chien, Hsuan-Ming Feng, and Hisasuki Aoyama. "Motion Planning for Dual-Arm Robot Based on Soft Actor-Critic." IEEE Access 9 (2021): 26871–85. http://dx.doi.org/10.1109/access.2021.3056903.

6

Wu, Xiongwei, Xiuhua Li, Jun Li, P. C. Ching, Victor C. M. Leung, and H. Vincent Poor. "Caching Transient Content for IoT Sensing: Multi-Agent Soft Actor-Critic." IEEE Transactions on Communications 69, no. 9 (September 2021): 5886–901. http://dx.doi.org/10.1109/tcomm.2021.3086535.

7

Ali, Hamid, Hammad Majeed, Imran Usman, and Khaled A. Almejalli. "Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network." Wireless Communications and Mobile Computing 2021 (June 10, 2021): 1–13. http://dx.doi.org/10.1155/2021/9920591.

Abstract:
In reinforcement learning (RL), an agent learns an environment through hit and trail. This behavior allows the agent to learn in complex and difficult environments. In RL, the agent normally learns the given environment by exploring or exploiting. Most of the algorithms suffer from under exploration in the latter stage of the episodes. Recently, an off-policy algorithm called soft actor critic (SAC) is proposed that overcomes this problem by maximizing entropy as it learns the environment. In it, the agent tries to maximize entropy along with the expected discounted rewards. In SAC, the agent tries to be as random as possible while moving towards the maximum reward. This randomness allows the agent to explore the environment and stops it from getting stuck into local optima. We believe that maximizing the entropy causes the overestimation of entropy term which results in slow policy learning. This is because of the drastic change in action distribution whenever agent revisits the similar states. To overcome this problem, we propose a dual policy optimization framework, in which two independent policies are trained. Both the policies try to maximize entropy by choosing actions against the minimum entropy to reduce the overestimation. The use of two policies result in better and faster convergence. We demonstrate our approach on different well known continuous control simulated environments. Results show that our proposed technique achieves better results against state of the art SAC algorithm and learns better policies.
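
For reference, the entropy term discussed in this abstract enters through the standard maximum-entropy objective that SAC optimizes (stated here in its usual form, with temperature α weighting the policy entropy against the return):

    \[
    J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
    \Big[\, r(s_t, a_t) \;+\; \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big]
    \]

Overestimating the entropy contribution therefore directly inflates the soft value targets, which is the effect the dual-policy scheme above aims to reduce.
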
8

Sola, Yoann, Gilles Le Chenadec, and Benoit Clement. "Simultaneous Control and Guidance of an AUV Based on Soft Actor–Critic." Sensors 22, no. 16 (August 14, 2022): 6072. http://dx.doi.org/10.3390/s22166072.

Abstract:
The marine environment is a hostile setting for robotics. It is strongly unstructured, uncertain, and includes many external disturbances that cannot be easily predicted or modeled. In this work, we attempt to control an autonomous underwater vehicle (AUV) to perform a waypoint tracking task, using a machine learning-based controller. There has been great progress in machine learning (in many different domains) in recent years; in the subfield of deep reinforcement learning, several algorithms suitable for the continuous control of dynamical systems have been designed. We implemented the soft actor–critic (SAC) algorithm, an entropy-regularized deep reinforcement learning algorithm that allows fulfilling a learning task and encourages the exploration of the environment simultaneously. We compared a SAC-based controller with a proportional integral derivative (PID) controller on a waypoint tracking task using specific performance metrics. All tests were simulated via the UUV simulator. We applied these two controllers to the RexROV 2, a six degrees of freedom cube-shaped remotely operated underwater Vehicle (ROV) converted in an AUV. We propose several interesting contributions as a result of these tests, such as making the SAC control and guiding the AUV simultaneously, outperforming the PID controller in terms of energy saving, and reducing the amount of information needed by the SAC algorithm inputs. Moreover, our implementation of this controller allows facilitating the transfer towards real-world robots. The code corresponding to this work is available on GitHub.
9

Yu, Xin, Yushan Sun, Xiangbin Wang, and Guocheng Zhang. "End-to-End AUV Motion Planning Method Based on Soft Actor-Critic." Sensors 21, no. 17 (September 1, 2021): 5893. http://dx.doi.org/10.3390/s21175893.

Abstract:
This study aims to solve the problems of poor exploration ability, single strategy, and high training cost in autonomous underwater vehicle (AUV) motion planning tasks and to overcome certain difficulties, such as multiple constraints and a sparse reward environment. In this research, an end-to-end motion planning system based on deep reinforcement learning is proposed to solve the motion planning problem of an underactuated AUV. The system directly maps the state information of the AUV and the environment into the control instructions of the AUV. The system is based on the soft actor–critic (SAC) algorithm, which enhances the exploration ability and robustness to the AUV environment. We also use the method of generative adversarial imitation learning (GAIL) to assist its training to overcome the problem that learning a policy for the first time is difficult and time-consuming in reinforcement learning. A comprehensive external reward function is then designed to help the AUV smoothly reach the target point, and the distance and time are optimized as much as possible. Finally, the end-to-end motion planning algorithm proposed in this research is tested and compared on the basis of the Unity simulation platform. Results show that the algorithm has an optimal decision-making ability during navigation, a shorter route, less time consumption, and a smoother trajectory. Moreover, GAIL can speed up the AUV training speed and minimize the training time without affecting the planning effect of the SAC algorithm.
10

Al Younes, Younes Al, and Martin Barczyk. "Adaptive Nonlinear Model Predictive Horizon Using Deep Reinforcement Learning for Optimal Trajectory Planning." Drones 6, no. 11 (October 27, 2022): 323. http://dx.doi.org/10.3390/drones6110323.

Abstract:
This paper presents an adaptive trajectory planning approach for nonlinear dynamical systems based on deep reinforcement learning (DRL). This methodology is applied to the authors’ recently published optimization-based trajectory planning approach named nonlinear model predictive horizon (NMPH). The resulting design, which we call ‘adaptive NMPH’, generates optimal trajectories for an autonomous vehicle based on the system’s states and its environment. This is done by tuning the NMPH’s parameters online using two different actor-critic DRL-based algorithms, deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). Both adaptive NMPH variants are trained and evaluated on an aerial drone inside a high-fidelity simulation environment. The results demonstrate the learning curves, sample complexity, and stability of the DRL-based adaptation scheme and show the superior performance of adaptive NMPH relative to our earlier designs.
11

刘, 雨. "Coordinated Optimization of Integrated Electricity-Heat Energy System Based on Soft Actor-Critic." Smart Grid 11, no. 02 (2021): 107–17. http://dx.doi.org/10.12677/sg.2021.112011.

12

Tang, Hengliang, Anqi Wang, Fei Xue, Jiaxin Yang, and Yang Cao. "A Novel Hierarchical Soft Actor-Critic Algorithm for Multi-Logistics Robots Task Allocation." IEEE Access 9 (2021): 42568–82. http://dx.doi.org/10.1109/access.2021.3062457.

13

Li, Tao, Wei Cui, and Naxin Cui. "Soft Actor-Critic Algorithm-Based Energy Management Strategy for Plug-In Hybrid Electric Vehicle." World Electric Vehicle Journal 13, no. 10 (October 18, 2022): 193. http://dx.doi.org/10.3390/wevj13100193.

Abstract:
Plug-in hybrid electric vehicles (PHEVs) are equipped with more than one power source, providing additional degrees of freedom to meet the driver’s power demand. Therefore, the reasonable allocation of the power demand of each power source by the energy management strategy (EMS) to keep each power source operating in the efficiency zone is essential for improving fuel economy. This paper proposes a novel model-free EMS based on the soft actor-critic (SAC) algorithm with automatic entropy tuning to balance the optimization of energy efficiency with the adaptability of driving cycles. The maximum entropy framework is introduced into deep reinforcement learning-based energy management to improve the performance of exploring the internal combustion engine (ICE) as well as the electric motor (EM) efficiency interval. Specifically, the automatic entropy adjustment framework improves the adaptability to driving cycles. In addition, the simulation is verified by the data collected from the real vehicle. The results show that the introduction of automatic entropy adjustment can effectively improve vehicle equivalent fuel economy. Compared with traditional EMS, the proposed EMS can save energy by 4.37%. Moreover, it is able to adapt to different driving cycles and can keep the state of charge to the reference value.
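
The automatic entropy (temperature) tuning mentioned above follows the standard SAC-with-learned-temperature formulation; a minimal numerical sketch is given below. The function name and the example target entropy are assumptions for illustration only.

    import numpy as np

    def temperature_gradient_step(log_alpha, log_probs, target_entropy, lr=3e-4):
        """One descent step on J(alpha) = E[ -alpha * (log pi(a|s) + target_entropy) ].

        log_alpha is optimized instead of alpha so that alpha = exp(log_alpha) stays
        positive. If the policy's entropy drops below target_entropy, alpha grows and
        exploration is encouraged; otherwise alpha decays.
        """
        alpha = np.exp(log_alpha)
        grad = -alpha * (np.mean(np.asarray(log_probs)) + target_entropy)  # dJ/d(log_alpha)
        return log_alpha - lr * grad

    # For a continuous action space, the target entropy is commonly set to -dim(A).
    log_alpha = 0.0
    log_alpha = temperature_gradient_step(log_alpha, [-1.3, -0.9, -1.1], target_entropy=-2.0)
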
14

Chen, Rusi, Haiguang Liu, Chengquan Liu, Guangzheng Yu, Xuan Yang, and Yue Zhou. "System Frequency Control Method Driven by Deep Reinforcement Learning and Customer Satisfaction for Thermostatically Controlled Load." Energies 15, no. 21 (October 24, 2022): 7866. http://dx.doi.org/10.3390/en15217866.

Abstract:
The intermittence and fluctuation of renewable energy aggravate the power fluctuation of the power grid and pose a severe challenge to the frequency stability of the power system. Thermostatically controlled loads can participate in the frequency regulation of the power grid due to their flexibility. Aiming to solve the problem of the traditional control methods, which have limited adjustment ability, and to have a positive influence on customers, a deep reinforcement learning control strategy based on the framework of soft actor–critic is proposed, considering customer satisfaction. Firstly, the energy storage index and the discomfort index of different users are defined. Secondly, the fuzzy comprehensive evaluation method is applied to evaluate customer satisfaction. Then, the multi-agent models of thermostatically controlled loads are established based on the soft actor–critic algorithm. The models are trained by using the local information of thermostatically controlled loads, and the comprehensive evaluation index fed back by users and the frequency deviation. After training, each agent can realize the cooperative response of thermostatically controlled loads to the system frequency only by relying on the local information. The simulation results show that the proposed strategy can not only reduce the frequency fluctuation, but also improve customer satisfaction.
15

Chen, Shaotao, Xihe Qiu, Xiaoyu Tan, Zhijun Fang, and Yaochu Jin. "A model-based hybrid soft actor-critic deep reinforcement learning algorithm for optimal ventilator settings." Information Sciences 611 (September 2022): 47–64. http://dx.doi.org/10.1016/j.ins.2022.08.028.

16

Wu, Tao, Jianhui Wang, Xiaonan Lu, and Yuhua Du. "AC/DC hybrid distribution network reconfiguration with microgrid formation using multi-agent soft actor-critic." Applied Energy 307 (February 2022): 118189. http://dx.doi.org/10.1016/j.apenergy.2021.118189.

17

Tang, Hengliang, Anqi Wang, Fei Xue, Jiaxin Yang, and Yang Cao. "Corrections to “A Novel Hierarchical Soft Actor-Critic Algorithm for Multi-Logistics Robots Task Allocation”." IEEE Access 9 (2021): 71090. http://dx.doi.org/10.1109/access.2021.3078911.

18

Haklidir, Mehmet, and Hakan Temeltas. "Guided Soft Actor Critic: A Guided Deep Reinforcement Learning Approach for Partially Observable Markov Decision Processes." IEEE Access 9 (2021): 159672–83. http://dx.doi.org/10.1109/access.2021.3131772.

19

Zheng, Yuemin, Jin Tao, Hao Sun, Qinglin Sun, Zengqiang Chen, Matthias Dehmer, and Quan Zhou. "Load Frequency Active Disturbance Rejection Control for Multi-Source Power System Based on Soft Actor-Critic." Energies 14, no. 16 (August 6, 2021): 4804. http://dx.doi.org/10.3390/en14164804.

Abstract:
To ensure the safe operation of an interconnected power system, it is necessary to maintain the stability of the frequency and the tie-line exchanged power. This is one of the hottest issues in the power system field and is usually called load frequency control. To overcome the influences of load disturbances on multi-source power systems containing thermal power plants, hydropower plants, and gas turbine plants, we design a linear active disturbance rejection control (LADRC) based on the tie-line bias control mode. For LADRC, the parameter selection of the controller directly affects the response performance of the entire system, and it is usually not feasible to manually adjust parameters. Therefore, to obtain the optimal controller parameters, we use the Soft Actor-Critic algorithm in reinforcement learning to obtain the controller parameters in real time, and we design the reward function according to the needs of the power system. We carry out simulation experiments to verify the effectiveness of the proposed method. Compared with the results of other proportional–integral–derivative control techniques using optimization algorithms and LADRC with constant parameters, the proposed method shows significant advantages in terms of overshoot, undershoot, and settling time. In addition, by adding different disturbances to different areas of the multi-source power system, we demonstrate the robustness of the proposed control strategy.
20

Xu, Dezhou, Yunduan Cui, Jiaye Ye, Suk Won Cha, Aimin Li, and Chunhua Zheng. "A soft actor-critic-based energy management strategy for electric vehicles with hybrid energy storage systems." Journal of Power Sources 524 (March 2022): 231099. http://dx.doi.org/10.1016/j.jpowsour.2022.231099.

21

Gamolped, Prem, Sakmongkon Chumkamon, Chanapol Piyavichyanon, Eiji Hayashi, and Abbe Mowshowitz. "Online Deep Reinforcement Learning on Assigned Weight Spaghetti Grasping in One Time using Soft Actor-Critic." Proceedings of International Conference on Artificial Life and Robotics 27 (January 20, 2022): 554–58. http://dx.doi.org/10.5954/icarob.2022.os19-1.

22

Prianto, Evan, MyeongSeop Kim, Jae-Han Park, Ji-Hun Bae, and Jung-Su Kim. "Path Planning for Multi-Arm Manipulators Using Deep Reinforcement Learning: Soft Actor–Critic with Hindsight Experience Replay." Sensors 20, no. 20 (October 19, 2020): 5911. http://dx.doi.org/10.3390/s20205911.

Abstract:
Since path planning for multi-arm manipulators is a complicated high-dimensional problem, effective and fast path generation is not easy for the arbitrarily given start and goal locations of the end effector. Especially, when it comes to deep reinforcement learning-based path planning, high-dimensionality makes it difficult for existing reinforcement learning-based methods to have efficient exploration which is crucial for successful training. The recently proposed soft actor–critic (SAC) is well known to have good exploration ability due to the use of the entropy term in the objective function. Motivated by this, in this paper, a SAC-based path planning algorithm is proposed. The hindsight experience replay (HER) is also employed for sample efficiency and configuration space augmentation is used in order to deal with complicated configuration space of the multi-arms. To show the effectiveness of the proposed algorithm, both simulation and experiment results are given. By comparing with existing results, it is demonstrated that the proposed method outperforms the existing results.
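
To make the hindsight experience replay (HER) step concrete, here is a generic sketch of the "final-goal" relabeling it performs; the field names and the sparse reward below are illustrative assumptions, not the paper's implementation.

    def sparse_reward(achieved_goal, goal, tol=0.05):
        """0 when the end effector is within tol of the goal, otherwise -1."""
        dist = sum((a - g) ** 2 for a, g in zip(achieved_goal, goal)) ** 0.5
        return 0.0 if dist < tol else -1.0

    def hindsight_relabel(episode):
        """Duplicate an episode with the goal replaced by the goal actually achieved
        at its final step, so a failed rollout still yields successful transitions."""
        achieved = episode[-1]["achieved_goal"]
        relabeled = []
        for tr in episode:
            new_tr = dict(tr)
            new_tr["goal"] = achieved
            new_tr["reward"] = sparse_reward(tr["achieved_goal"], achieved)
            relabeled.append(new_tr)
        return relabeled
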
23

Gupta, Abhishek, Ahmed Shaharyar Khwaja, Alagan Anpalagan, Ling Guan, and Bala Venkatesh. "Policy-Gradient and Actor-Critic Based State Representation Learning for Safe Driving of Autonomous Vehicles." Sensors 20, no. 21 (October 22, 2020): 5991. http://dx.doi.org/10.3390/s20215991.

Abstract:
In this paper, we propose an environment perception framework for autonomous driving using state representation learning (SRL). Unlike existing Q-learning based methods for efficient environment perception and object detection, our proposed method takes the learning loss into account under deterministic as well as stochastic policy gradient. Through a combination of variational autoencoder (VAE), deep deterministic policy gradient (DDPG), and soft actor-critic (SAC), we focus on uninterrupted and reasonably safe autonomous driving without steering off the track for a considerable driving distance. Our proposed technique exhibits learning in autonomous vehicles under complex interactions with the environment, without being explicitly trained on driving datasets. To ensure the effectiveness of the scheme over a sustained period of time, we employ a reward-penalty based system where a negative reward is associated with an unfavourable action and a positive reward is awarded for favourable actions. The results obtained through simulations on DonKey simulator show the effectiveness of our proposed method by examining the variations in policy loss, value loss, reward function, and cumulative reward for ‘VAE+DDPG’ and ‘VAE+SAC’ over the learning process.
24

Xu, Xibao, Yushen Chen, and Chengchao Bai. "Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing." Sensors 21, no. 23 (December 6, 2021): 8161. http://dx.doi.org/10.3390/s21238161.

Abstract:
Planetary soft landing has been studied extensively due to its promising application prospects. In this paper, a soft landing control algorithm based on deep reinforcement learning (DRL) with good convergence property is proposed. First, the soft landing problem of the powered descent phase is formulated and the theoretical basis of Reinforcement Learning (RL) used in this paper is introduced. Second, to make it easier to converge, a reward function is designed to include process rewards like velocity tracking reward, solving the problem of sparse reward. Then, by including the fuel consumption penalty and constraints violation penalty, the lander can learn to achieve velocity tracking goal while saving fuel and keeping attitude angle within safe ranges. Then, simulations of training are carried out under the frameworks of Deep deterministic policy gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor Critic (SAC), respectively, which are of the classical RL frameworks, and all converged. Finally, the trained policy is deployed into velocity tracking and soft landing experiments, results of which demonstrate the validity of the algorithm proposed.
25

Mollahasani, Shahram, Turgay Pamuklu, Rodney Wilson, and Melike Erol-Kantarci. "Energy-Aware Dynamic DU Selection and NF Relocation in O-RAN Using Actor–Critic Learning." Sensors 22, no. 13 (July 3, 2022): 5029. http://dx.doi.org/10.3390/s22135029.

Abstract:
Open radio access network (O-RAN) is one of the promising candidates for fulfilling flexible and cost-effective goals by considering openness and intelligence in its architecture. In the O-RAN architecture, a central unit (O-CU) and a distributed unit (O-DU) are virtualized and executed on processing pools of general-purpose processors that can be placed at different locations. Therefore, it is challenging to choose a proper location for executing network functions (NFs) over these entities by considering propagation delay and computational capacity. In this paper, we propose a Soft Actor–Critic Energy-Aware Dynamic DU Selection algorithm (SA2C-EADDUS) by integrating two nested actor–critic agents in the O-RAN architecture. In addition, we formulate an optimization model that minimizes delay and energy consumption. Then, we solve that problem with an MILP solver and use that solution as a lower bound comparison for our SA2C-EADDUS algorithm. Moreover, we compare that algorithm with recent works, including RL- and DRL-based resource allocation algorithms and a heuristic method. We show that by collaborating A2C agents in different layers and by dynamic relocation of NFs, based on service requirements, our schemes improve the energy efficiency by 50% with respect to other schemes. Moreover, we reduce the mean delay by a significant amount with our novel SA2C-EADDUS approach.
26

Litwynenko, Karina, and Małgorzata Plechawska-Wójcik. "Analysis of the possibilities for using machine learning algorithms in the Unity environment." Journal of Computer Sciences Institute 20 (September 30, 2021): 197–204. http://dx.doi.org/10.35784/jcsi.2680.

Abstract:
Reinforcement learning algorithms are gaining popularity, and their advancement is made possible by the presence of tools to evaluate them. This paper concerns the applicability of machine learning algorithms on the Unity platform using the Unity ML-Agents Toolkit library. The purpose of the study was to compare two algorithms: Proximal Policy Optimization and Soft Actor-Critic. The possibility of improving the learning results by combining these algorithms with Generative Adversarial Imitation Learning was also verified. The results of the study showed that the PPO algorithm can perform better in uncomplicated environments with non-immediate rewards, while the additional use of GAIL can improve learning performance.
27

Coraci, Davide, Silvio Brandi, Marco Savino Piscitelli, and Alfonso Capozzoli. "Online Implementation of a Soft Actor-Critic Agent to Enhance Indoor Temperature Control and Energy Efficiency in Buildings." Energies 14, no. 4 (February 14, 2021): 997. http://dx.doi.org/10.3390/en14040997.

Abstract:
Recently, a growing interest has been observed in HVAC control systems based on Artificial Intelligence, to improve comfort conditions while avoiding unnecessary energy consumption. In this work, a model-free algorithm belonging to the Deep Reinforcement Learning (DRL) class, Soft Actor-Critic, was implemented to control the supply water temperature to radiant terminal units of a heating system serving an office building. The controller was trained online, and a preliminary sensitivity analysis on hyperparameters was performed to assess their influence on the agent performance. The DRL agent with the best performance was compared to a rule-based controller assumed as a baseline during a three-month heating season. The DRL controller outperformed the baseline after two weeks of deployment, with an overall performance improvement related to control of indoor temperature conditions. Moreover, the adaptability of the DRL agent was tested for various control scenarios, simulating changes of external weather conditions, indoor temperature setpoint, building envelope features and occupancy patterns. The agent dynamically deployed, despite a slight increase in energy consumption, led to an improvement of indoor temperature control, reducing the cumulative sum of temperature violations on average for all scenarios by 75% and 48% compared to the baseline and statically deployed agent respectively.
28

Park, Kwan-Woo, MyeongSeop Kim, Jung-Su Kim, and Jae-Han Park. "Path Planning for Multi-Arm Manipulators Using Soft Actor-Critic Algorithm with Position Prediction of Moving Obstacles via LSTM." Applied Sciences 12, no. 19 (September 29, 2022): 9837. http://dx.doi.org/10.3390/app12199837.

Abstract:
This paper presents a deep reinforcement learning-based path planning algorithm for the multi-arm robot manipulator when there are both fixed and moving obstacles in the workspace. Considering the problem properties such as high dimensionality and continuous action, the proposed algorithm employs the SAC (soft actor-critic). Moreover, in order to predict explicitly the future position of the moving obstacle, LSTM (long short-term memory) is used. The SAC-based path planning algorithm is developed using the LSTM. In order to show the performance of the proposed algorithm, simulation results using GAZEBO and experimental results using real manipulators are presented. The simulation and experiment results show that the success ratio of path generation for arbitrary starting and goal points converges to 100%. It is also confirmed that the LSTM successfully predicts the future position of the obstacle.
29

Kathirgamanathan, Anjukan, Eleni Mangina, and Donal P. Finn. "Development of a Soft Actor Critic deep reinforcement learning approach for harnessing energy flexibility in a Large Office building." Energy and AI 5 (September 2021): 100101. http://dx.doi.org/10.1016/j.egyai.2021.100101.

30

Zhang, Bin, Weihao Hu, Di Cao, Tao Li, Zhenyuan Zhang, Zhe Chen, and Frede Blaabjerg. "Soft actor-critic –based multi-objective optimized energy conversion and management strategy for integrated energy systems with renewable energy." Energy Conversion and Management 243 (September 2021): 114381. http://dx.doi.org/10.1016/j.enconman.2021.114381.

31

Zheng, Yuemin, Jin Tao, Qinglin Sun, Hao Sun, Zengqiang Chen, Mingwei Sun, and Guangming Xie. "Soft Actor–Critic based active disturbance rejection path following control for unmanned surface vessel under wind and wave disturbances." Ocean Engineering 247 (March 2022): 110631. http://dx.doi.org/10.1016/j.oceaneng.2022.110631.

32

Zhao, Xiaohu, Hanli Jiang, Chenyang An, Ruocheng Wu, Yijun Guo, and Daquan Yang. "A Method of Multi-UAV Cooperative Task Assignment Based on Reinforcement Learning." Mobile Information Systems 2022 (August 12, 2022): 1–9. http://dx.doi.org/10.1155/2022/1147819.

Abstract:
With the increasing complexity of UAV application scenarios, the performance of a single UAV cannot meet the mission requirements. Many complex tasks need the cooperation of multiple UAVs. How to coordinate UAV resources becomes the key to mission completion. In this paper, a task model including multiple UAVs and unknown obstacles is constructed, and the model is transformed into a Markov decision process (MDP). In addition, considering the influence of strategies among UAVs, a multiagent reinforcement learning algorithm based on SAC algorithm and centralized training and decentralized execution framework, MA-SAC (Multi-Agent Soft Actor-Critic), is proposed to solve the MDP. Simulation results show that the algorithm can effectively deal with the task allocation problem of multiple UAVs in this scenario, and its performance is better than other multiagent reinforcement learning algorithms.
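
The centralized-training, decentralized-execution structure used by MA-SAC can be illustrated with two small networks: a critic that sees the joint observations and actions of all UAVs during training, and per-agent actors that act on local observations only. This is a generic PyTorch sketch with assumed layer sizes, not the authors' architecture.

    import torch
    import torch.nn as nn

    class CentralizedCritic(nn.Module):
        """Q(o_1..o_N, a_1..a_N): conditions on all agents, used only during training."""
        def __init__(self, n_agents, obs_dim, act_dim, hidden=128):
            super().__init__()
            in_dim = n_agents * (obs_dim + act_dim)
            self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, obs_all, act_all):
            # obs_all: (batch, n_agents, obs_dim), act_all: (batch, n_agents, act_dim)
            x = torch.cat([obs_all.flatten(1), act_all.flatten(1)], dim=-1)
            return self.net(x)

    class DecentralizedActor(nn.Module):
        """pi_i(a_i | o_i): each UAV acts from its own observation at execution time."""
        def __init__(self, obs_dim, act_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, act_dim), nn.Tanh())

        def forward(self, obs_i):
            return self.net(obs_i)
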
33

Jurj, Sorin Liviu, Dominik Grundt, Tino Werner, Philipp Borchers, Karina Rothemann, and Eike Möhlmann. "Increasing the Safety of Adaptive Cruise Control Using Physics-Guided Reinforcement Learning." Energies 14, no. 22 (November 12, 2021): 7572. http://dx.doi.org/10.3390/en14227572.

Abstract:
This paper presents a novel approach for improving the safety of vehicles equipped with Adaptive Cruise Control (ACC) by making use of Machine Learning (ML) and physical knowledge. More exactly, we train a Soft Actor-Critic (SAC) Reinforcement Learning (RL) algorithm that makes use of physical knowledge such as the jam-avoiding distance in order to automatically adjust the ideal longitudinal distance between the ego- and leading-vehicle, resulting in a safer solution. In our use case, the experimental results indicate that the physics-guided (PG) RL approach is better at avoiding collisions at any selected deceleration level and any fleet size when compared to a pure RL approach, proving that a physics-informed ML approach is more reliable when developing safe and efficient Artificial Intelligence (AI) components in autonomous vehicles (AVs).
34

Backman, Sofi, Daniel Lindmark, Kenneth Bodin, Martin Servin, Joakim Mörk, and Håkan Löfgren. "Continuous Control of an Underground Loader Using Deep Reinforcement Learning." Machines 9, no. 10 (September 27, 2021): 216. http://dx.doi.org/10.3390/machines9100216.

Abstract:
The reinforcement learning control of an underground loader was investigated in a simulated environment by using a multi-agent deep neural network approach. At the start of each loading cycle, one agent selects the dig position from a depth camera image of a pile of fragmented rock. A second agent is responsible for continuous control of the vehicle, with the goal of filling the bucket at the selected loading point while avoiding collisions, getting stuck, or losing ground traction. This relies on motion and force sensors, as well as on a camera and lidar. Using a soft actor–critic algorithm, the agents learn policies for efficient bucket filling over many subsequent loading cycles, with a clear ability to adapt to the changing environment. The best results—on average, 75% of the max capacity—were obtained when including a penalty for energy usage in the reward.
35

Phan Bui, Khoi, Giang Nguyen Truong, and Dat Nguyen Ngoc. "GCTD3: Modeling of Bipedal Locomotion by Combination of TD3 Algorithms and Graph Convolutional Network." Applied Sciences 12, no. 6 (March 14, 2022): 2948. http://dx.doi.org/10.3390/app12062948.

Abstract:
In recent years, there has been a lot of research using reinforcement learning algorithms to train 2-legged robots to move, but there are still many challenges. The authors propose the GCTD3 method, which takes the idea of using Graph Convolutional Networks to represent the kinematic link features of the robot, and combines this with the Twin-Delayed Deep Deterministic Policy Gradient algorithm to train the robot to move. Graph Convolutional Networks are very effective in graph-structured problems such as the connection of the joints of the human-like robots. The GCTD3 method shows better results on the motion trajectories of the bipedal robot joints compared with other reinforcement learning algorithms such as Twin-Delayed Deep Deterministic Policy Gradient, Deep Deterministic Policy Gradient and Soft Actor Critic. This research is implemented on a 2-legged robot model with six independent joint coordinates through the Robot Operating System and Gazebo simulator.
36

Qi, Qi, Wenbin Lin, Boyang Guo, Jinshan Chen, Chaoping Deng, Guodong Lin, Xin Sun, and Youjia Chen. "Augmented Lagrangian-Based Reinforcement Learning for Network Slicing in IIoT." Electronics 11, no. 20 (October 19, 2022): 3385. http://dx.doi.org/10.3390/electronics11203385.

Abstract:
Network slicing enables the multiplexing of independent logical networks on the same physical network infrastructure to provide different network services for different applications. The resource allocation problem involved in network slicing is typically a decision-making problem, falling within the scope of reinforcement learning. The advantage of adapting to dynamic wireless environments makes reinforcement learning a good candidate for problem solving. In this paper, to tackle the constrained mixed integer nonlinear programming problem in network slicing, we propose an augmented Lagrangian-based soft actor–critic (AL-SAC) algorithm. In this algorithm, a hierarchical action selection network is designed to handle the hybrid action space. More importantly, inspired by the augmented Lagrangian method, both neural networks for Lagrange multipliers and a penalty item are introduced to deal with the constraints. Experiment results show that the proposed AL-SAC algorithm can strictly satisfy the constraints, and achieve better performance than other benchmark algorithms.
37

Prianto, Evan, Jae-Han Park, Ji-Hun Bae, and Jung-Su Kim. "Deep Reinforcement Learning-Based Path Planning for Multi-Arm Manipulators with Periodically Moving Obstacles." Applied Sciences 11, no. 6 (March 14, 2021): 2587. http://dx.doi.org/10.3390/app11062587.

Abstract:
In the workspace of robot manipulators in practice, it is common that there are both static and periodic moving obstacles. Existing results in the literature have been focusing mainly on the static obstacles. This paper is concerned with multi-arm manipulators with periodically moving obstacles. Due to the high-dimensional property and the moving obstacles, existing results suffer from finding the optimal path for given arbitrary starting and goal points. To solve the path planning problem, this paper presents a SAC-based (Soft actor–critic) path planning algorithm for multi-arm manipulators with periodically moving obstacles. In particular, the deep neural networks in the SAC are designed such that they utilize the position information of the moving obstacles over the past finite time horizon. In addition, the hindsight experience replay (HER) technique is employed to use the training data efficiently. In order to show the performance of the proposed SAC-based path planning, both simulation and experiment results using open manipulators are given.
38

Wen, Wen, Yuyu Yuan, and Jincui Yang. "Reinforcement Learning for Options Trading." Applied Sciences 11, no. 23 (November 25, 2021): 11208. http://dx.doi.org/10.3390/app112311208.

Abstract:
Reinforcement learning has been applied to various types of financial assets trading, such as stocks, futures, and cryptocurrencies. Options, as a novel kind of derivative, have their characteristics. Because there are too many option contracts for one underlying asset and their price behavior is different. Besides, the validity period of an option contract is relatively short. To apply reinforcement learning to options trading, we propose the options trading reinforcement learning (OTRL) framework. We use options’ underlying asset data to train the reinforcement learning model. Candle data in different time intervals are utilized, respectively. The protective closing strategy is added to the model to prevent unbearable losses. Our experiments demonstrate that the most stable algorithm for obtaining high returns is proximal policy optimization (PPO) with the protective closing strategy. The deep Q network (DQN) can exceed the buy and hold strategy in options trading, as can soft actor critic (SAC). The OTRL framework is verified effectively.
39

Yuan, Yuyu, Wen Wen, and Jincui Yang. "Using Data Augmentation Based Reinforcement Learning for Daily Stock Trading." Electronics 9, no. 9 (August 27, 2020): 1384. http://dx.doi.org/10.3390/electronics9091384.

Abstract:
In algorithmic trading, adequate training data set is key to making profits. However, stock trading data in units of a day can not meet the great demand for reinforcement learning. To address this problem, we proposed a framework named data augmentation based reinforcement learning (DARL) which uses minute-candle data (open, high, low, close) to train the agent. The agent is then used to guide daily stock trading. In this way, we can increase the instances of data available for training in hundreds of folds, which can substantially improve the reinforcement learning effect. But not all stocks are suitable for this kind of trading. Therefore, we propose an access mechanism based on skewness and kurtosis to select stocks that can be traded properly using this algorithm. In our experiment, we find proximal policy optimization (PPO) is the most stable algorithm to achieve high risk-adjusted returns. Deep Q-learning (DQN) and soft actor critic (SAC) can beat the market in Sharp Ratio.
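
The skewness/kurtosis access mechanism is only described at a high level; a hedged sketch of such a screen is shown below. The thresholds are arbitrary placeholders, not values from the paper.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def is_tradable(returns, max_abs_skew=1.0, max_excess_kurtosis=5.0):
        """Admit a stock for minute-candle training only if its return distribution
        is not too asymmetric or heavy-tailed (illustrative thresholds)."""
        r = np.asarray(returns, dtype=float)
        return abs(skew(r)) <= max_abs_skew and kurtosis(r) <= max_excess_kurtosis

    # Example on synthetic daily returns.
    rng = np.random.default_rng(0)
    print(is_tradable(rng.normal(0.0, 0.01, size=250)))
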
40

Tovarnov, M. S., and N. V. Bykov. "Reinforcement learning reward function in unmanned aerial vehicle control tasks." Journal of Physics: Conference Series 2308, no. 1 (July 1, 2022): 012004. http://dx.doi.org/10.1088/1742-6596/2308/1/012004.

Abstract:
This paper presents a new reward function that can be used for deep reinforcement learning in unmanned aerial vehicle (UAV) control and navigation problems. The reward function is based on the construction and estimation of the time of simplified trajectories to the target, which are third-order Bezier curves. This reward function can be applied unchanged to solve problems in both two-dimensional and three-dimensional virtual environments. The effectiveness of the reward function was tested in a newly developed virtual environment, namely, a simplified two-dimensional environment describing the dynamics of UAV control and flight, taking into account the forces of thrust, inertia, gravity, and aerodynamic drag. In this formulation, three tasks of UAV control and navigation were successfully solved: UAV flight to a given point in space, avoidance of interception by another UAV, and organization of interception of one UAV by another. The three most relevant modern deep reinforcement learning algorithms, Soft actor-critic, Deep Deterministic Policy Gradient, and Twin Delayed Deep Deterministic Policy Gradient were used. All three algorithms performed well, indicating the effectiveness of the selected reward function.
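
The time estimate behind this reward can be reproduced with a few lines: sample a third-order (cubic) Bezier curve from the vehicle to the target and divide its arc length by a nominal speed. This is a generic sketch of the idea; the control-point choice and speed normalization are assumptions, not the authors' exact construction.

    import numpy as np

    def cubic_bezier(p0, p1, p2, p3, n=100):
        """Sample a third-order Bezier curve B(t) for t in [0, 1]."""
        t = np.linspace(0.0, 1.0, n)[:, None]
        return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
                + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

    def bezier_time_estimate(p0, p1, p2, p3, speed=1.0):
        """Approximate traversal time as the curve's arc length over a nominal speed."""
        pts = cubic_bezier(*(np.asarray(p, dtype=float) for p in (p0, p1, p2, p3)))
        length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
        return length / speed

    # Collinear control points reduce to straight-line distance over speed (here 2.0).
    print(bezier_time_estimate([0, 0], [1, 0], [2, 0], [3, 0], speed=1.5))
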
41

Xu, Yuting, Chao Wang, Jiakai Liang, Keqiang Yue, Wenjun Li, Shilian Zheng, and Zhijin Zhao. "Deep Reinforcement Learning Based Decision Making for Complex Jamming Waveforms." Entropy 24, no. 10 (October 10, 2022): 1441. http://dx.doi.org/10.3390/e24101441.

Abstract:
With the development of artificial intelligence, intelligent communication jamming decision making is an important research direction of cognitive electronic warfare. In this paper, we consider a complex intelligent jamming decision scenario in which both communication parties choose to adjust physical layer parameters to avoid jamming in a non-cooperative scenario and the jammer achieves accurate jamming by interacting with the environment. However, when the situation becomes complex and large in number, traditional reinforcement learning suffers from the problems of failure to converge and a high number of interactions, which are fatal and unrealistic in a real warfare environment. To solve this problem, we propose a deep reinforcement learning based and maximum-entropy-based soft actor-critic (SAC) algorithm. In the proposed algorithm, we add an improved Wolpertinger architecture to the original SAC algorithm in order to reduce the number of interactions and improve the accuracy of the algorithm. The results show that the proposed algorithm shows excellent performance in various scenarios of jamming and achieves accurate, fast, and continuous jamming for both sides of the communication.
42

Choi, Hongrok, and Sangheon Pack. "Cooperative Downloading for LEO Satellite Networks: A DRL-Based Approach." Sensors 22, no. 18 (September 10, 2022): 6853. http://dx.doi.org/10.3390/s22186853.

Abstract:
In low earth orbit (LEO) satellite-based applications (e.g., remote sensing and surveillance), it is important to efficiently transmit collected data to ground stations (GS). However, LEO satellites’ high mobility and resultant insufficient time for downloading make this challenging. In this paper, we propose a deep-reinforcement-learning (DRL)-based cooperative downloading scheme, which utilizes inter-satellite communication links (ISLs) to fully utilize satellites’ downloading capabilities. To this end, we formulate a Markov decision problem (MDP) with the objective to maximize the amount of downloaded data. To learn the optimal approach to the formulated problem, we adopt a soft-actor-critic (SAC)-based DRL algorithm in discretized action spaces. Moreover, we design a novel neural network consisting of a graph attention network (GAT) layer to extract latent features from the satellite network and parallel fully connected (FC) layers to control individual satellites of the network. Evaluation results demonstrate that the proposed DRL-based cooperative downloading scheme can enhance the average utilization of contact time by up to 17.8% compared with independent downloading and randomly offloading schemes.
43

Zhang, Jian, and Fengge Wu. "A Novel Model-Based Reinforcement Learning Attitude Control Method for Virtual Reality Satellite." Wireless Communications and Mobile Computing 2021 (July 1, 2021): 1–11. http://dx.doi.org/10.1155/2021/7331894.

Abstract:
Observing the universe with virtual reality satellite is an amazing experience. An intelligent method of attitude control is the core object of research to achieve this goal. Attitude control is essentially one of the goal-state reaching tasks under constraints. Using reinforcement learning methods in real-world systems faces many challenges, such as insufficient samples, exploration safety issues, unknown actuator delays, and noise in the raw sensor data. In this work, a mixed model with different input sizes was proposed to represent the environmental dynamics model. The predication accuracy of the environmental dynamics model and the performance of the policy trained in this paper were gradually improved. Our method reduces the impact of noisy data on the model’s accuracy and improves the sampling efficiency. The experiments showed that the agent trained with our method completed a goal-state reaching task in a real-world system under wireless circumstances whose actuators were reaction wheels, whereas the soft actor-critic method failed in the same training process. The method’s effectiveness is ensured theoretically under given conditions.
44

Shahid, Asad Ali, Dario Piga, Francesco Braghin, and Loris Roveda. "Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning." Autonomous Robots 46, no. 3 (February 9, 2022): 483–98. http://dx.doi.org/10.1007/s10514-022-10034-z.

Abstract:
This paper presents a learning-based method that uses simulation data to learn an object manipulation task using two model-free reinforcement learning (RL) algorithms. The learning performance is compared across on-policy and off-policy algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). In order to accelerate the learning process, the fine-tuning procedure is proposed that demonstrates the continuous adaptation of on-policy RL to new environments, allowing the learned policy to adapt and execute the (partially) modified task. A dense reward function is designed for the task to enable an efficient learning of the agent. A grasping task involving a Franka Emika Panda manipulator is considered as the reference task to be learned. The learned control policy is demonstrated to be generalizable across multiple object geometries and initial robot/parts configurations. The approach is finally tested on a real Franka Emika Panda robot, showing the possibility to transfer the learned behavior from simulation. Experimental results show 100% of successful grasping tasks, making the proposed approach applicable to real applications.
45

Yatawatta, Sarod, and Ian M. Avruch. "Deep reinforcement learning for smart calibration of radio telescopes." Monthly Notices of the Royal Astronomical Society 505, no. 2 (May 17, 2021): 2141–50. http://dx.doi.org/10.1093/mnras/stab1401.

Abstract:
Modern radio telescopes produce unprecedented amounts of data, which are passed through many processing pipelines before the delivery of scientific results. Hyperparameters of these pipelines need to be tuned by hand to produce optimal results. Because many thousands of observations are taken during a lifetime of a telescope and because each observation will have its unique settings, the fine tuning of pipelines is a tedious task. In order to automate this process of hyperparameter selection in data calibration pipelines, we introduce the use of reinforcement learning. We test two reinforcement learning techniques, twin delayed deep deterministic policy gradient (TD3), and soft actor-critic, to train an autonomous agent to perform this fine tuning. For the sake of generalization, we consider the pipeline to be a black-box system where the summarized state of the performance of the pipeline is used by the autonomous agent. The autonomous agent trained in this manner is able to determine optimal settings for diverse observations and is therefore able to perform smart calibration, minimizing the need for human intervention.
46

Sun, Haoran, Tingting Fu, Yuanhuai Ling, and Chaoming He. "Adaptive Quadruped Balance Control for Dynamic Environments Using Maximum-Entropy Reinforcement Learning." Sensors 21, no. 17 (September 2, 2021): 5907. http://dx.doi.org/10.3390/s21175907.

Abstract:
External disturbance poses the primary threat to robot balance in dynamic environments. This paper provides a learning-based control architecture for quadrupedal self-balancing, which is adaptable to multiple unpredictable scenes of external continuous disturbance. Different from conventional methods which construct analytical models which explicitly reason the balancing process, our work utilized reinforcement learning and artificial neural network to avoid incomprehensible mathematical modeling. The control policy is composed of a neural network and a Tanh Gaussian policy, which implicitly establishes the fuzzy mapping from proprioceptive signals to action commands. During the training process, the maximum-entropy method (soft actor-critic algorithm) is employed to endow the policy with powerful exploration and generalization ability. The trained policy is validated in both simulations and realistic experiments with a customized quadruped robot. The results demonstrate that the policy can be easily transferred to the real world without elaborate configurations. Moreover, although this policy is trained in merely one specific vibration condition, it demonstrates robustness under conditions that were never encountered during training.
47

Huang, Jianbin, Longji Huang, Meijuan Liu, He Li, Qinglin Tan, Xiaoke Ma, Jiangtao Cui, and De-Shuang Huang. "Deep Reinforcement Learning-based Trajectory Pricing on Ride-hailing Platforms." ACM Transactions on Intelligent Systems and Technology 13, no. 3 (June 30, 2022): 1–19. http://dx.doi.org/10.1145/3474841.

Abstract:
Dynamic pricing plays an important role in solving the problems such as traffic load reduction, congestion control, and revenue improvement. Efficient dynamic pricing strategies can increase capacity utilization, total revenue of service providers, and the satisfaction of both passengers and drivers. Many proposed dynamic pricing technologies focus on short-term optimization and face poor scalability in modeling long-term goals for the limitations of solution optimality and prohibitive computation. In this article, a deep reinforcement learning framework is proposed to tackle the dynamic pricing problem for ride-hailing platforms. A soft actor-critic (SAC) algorithm is adopted in the reinforcement learning framework. First, the dynamic pricing problem is translated into a Markov Decision Process (MDP) and is set up in continuous action spaces, which is no need for the discretization of action space. Then, a new reward function is obtained by the order response rate and the KL-divergence between supply distribution and demand distribution. Experiments and case studies demonstrate that the proposed method outperforms the baselines in terms of order response rate and total revenue.
48

Xu, Haotian, Qi Fang, Cong Hu, Yue Hu, and Quanjun Yin. "MIRA: Model-Based Imagined Rollouts Augmentation for Non-Stationarity in Multi-Agent Systems." Mathematics 10, no. 17 (August 25, 2022): 3059. http://dx.doi.org/10.3390/math10173059.

Abstract:
One of the challenges in multi-agent systems comes from the environmental non-stationarity that policies of all agents are evolving individually over time. Many existing multi-agent reinforcement learning (MARL) methods have been proposed to address this problem. However, these methods rely on a large amount of training data and some of them require agents to intensely communicate, which is often impractical in real-world applications. To better tackle the non-stationarity problem, this article combines model-based reinforcement learning (MBRL) and meta-learning and proposes a method called Model-based Imagined Rollouts Augmentation (MIRA). Based on an environment dynamics model, distributed agents can independently perform multi-agent rollouts with opponent models during exploitation and learn to infer the environmental non-stationarity as a latent variable using the rollouts. Based on the world model and latent-variable inference module, we perform multi-agent soft actor-critic implementation for centralized training and decentralized decision making. Empirical results on the Multi-agent Particle Environment (MPE) have proved that the algorithm has a very considerable improvement in sample efficiency as well as better convergent rewards than state-of-the-art MARL methods, including COMA, MAAC, MADDPG, and VDN.
49

Kim, MyeongSeop, Jung-Su Kim, Myoung-Su Choi, and Jae-Han Park. "Adaptive Discount Factor for Deep Reinforcement Learning in Continuing Tasks with Uncertainty." Sensors 22, no. 19 (September 25, 2022): 7266. http://dx.doi.org/10.3390/s22197266.

Abstract:
Reinforcement learning (RL) trains an agent by maximizing the sum of a discounted reward. Since the discount factor has a critical effect on the learning performance of the RL agent, it is important to choose the discount factor properly. When uncertainties are involved in the training, the learning performance with a constant discount factor can be limited. For the purpose of obtaining acceptable learning performance consistently, this paper proposes an adaptive rule for the discount factor based on the advantage function. Additionally, how to use the advantage function in both on-policy and off-policy algorithms is presented. To demonstrate the performance of the proposed adaptive rule, it is applied to PPO (Proximal Policy Optimization) for Tetris in order to validate the on-policy case, and to SAC (Soft Actor-Critic) for the motion planning of a robot manipulator to validate the off-policy case. In both cases, the proposed method results in a better or similar performance compared with cases using the best constant discount factors found by exhaustive search. Hence, the proposed adaptive discount factor automatically finds a discount factor that leads to comparable training performance, and that can be applied to representative deep reinforcement learning problems.
50

Singh, Arambam James, Akshat Kumar, and Hoong Chuin Lau. "Learning and Exploiting Shaped Reward Models for Large Scale Multiagent RL." Proceedings of the International Conference on Automated Planning and Scheduling 31 (May 17, 2021): 588–96. http://dx.doi.org/10.1609/icaps.v31i1.16007.

Abstract:
Many real world systems involve interaction among large number of agents to achieve a common goal, for example, air traffic control. Several model-free RL algorithms have been proposed for such settings. A key limitation is that the empirical reward signal in model-free case is not very effective in addressing the multiagent credit assignment problem, which determines an agent's contribution to the team's success. This results in lower solution quality and high sample complexity. To address this, we contribute (a) an approach to learn a differentiable reward model for both continuous and discrete action setting by exploiting the collective nature of interactions among agents, a feature commonly present in large scale multiagent applications; (b) a shaped reward model analytically derived from the learned reward model to address the key challenge of credit assignment; (c) a model-based multiagent RL approach that integrates shaped rewards into well known RL algorithms such as policy gradient, soft-actor critic. Compared to previous methods, our learned reward models are more accurate, and our approaches achieve better solution quality on synthetic and real world instances of air traffic control, and cooperative navigation with large agent population.
