To see the other types of publications on this topic, follow the link: Policy gradient methods.

Journal articles on the topic 'Policy gradient methods'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Policy gradient methods.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Peters, Jan. "Policy gradient methods." Scholarpedia 5, no. 11 (2010): 3698. http://dx.doi.org/10.4249/scholarpedia.3698.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Cai, Qingpeng, Ling Pan, and Pingzhong Tang. "Deterministic Value-Policy Gradients." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 3316–23. http://dx.doi.org/10.1609/aaai.v34i04.5732.

Full text
Abstract:
Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
APA, Harvard, Vancouver, ISO, and other styles
3

Zhang, Matthew S., Murat A. Erdogdu, and Animesh Garg. "Convergence and Optimality of Policy Gradient Methods in Weakly Smooth Settings." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9066–73. http://dx.doi.org/10.1609/aaai.v36i8.20891.

Full text
Abstract:
Policy gradient methods have been frequently applied to problems in control and reinforcement learning with great success, yet existing convergence analysis still relies on non-intuitive, impractical and often opaque conditions. In particular, existing rates are achieved in limited settings, under strict regularity conditions. In this work, we establish explicit convergence rates of policy gradient methods, extending the convergence regime to weakly smooth policy classes with L2 integrable gradient. We provide intuitive examples to illustrate the insight behind these new conditions. Notably, our analysis also shows that convergence rates are achievable for both the standard policy gradient and the natural policy gradient algorithms under these assumptions. Lastly we provide performance guarantees for the converged policies.
APA, Harvard, Vancouver, ISO, and other styles
4

Akella, Ravi Tej, Kamyar Azizzadenesheli, Mohammad Ghavamzadeh, Animashree Anandkumar, and Yisong Yue. "Deep Bayesian Quadrature Policy Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 6600–6608. http://dx.doi.org/10.1609/aaai.v35i8.16817.

Full text
Abstract:
We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample-efficient alternatives like Bayesian quadrature methods have received little attention due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with significantly lower variance, (ii) a consistent improvement in sample complexity and average return for several deep policy gradient algorithms, and (iii) uncertainty estimates for the gradient that can be incorporated to further improve performance.
APA, Harvard, Vancouver, ISO, and other styles
5

Wang, Lin, Xingang Xu, Xuhui Zhao, Baozhu Li, Ruijuan Zheng, and Qingtao Wu. "A randomized block policy gradient algorithm with differential privacy in Content Centric Networks." International Journal of Distributed Sensor Networks 17, no. 12 (December 2021): 155014772110599. http://dx.doi.org/10.1177/15501477211059934.

Full text
Abstract:
Policy gradient methods are effective means to solve the problems of mobile multimedia data transmission in Content Centric Networks. Current policy gradient algorithms impose high computational cost in processing high-dimensional data. Meanwhile, the issue of privacy disclosure has not been taken into account, even though privacy protection is important in data training. Therefore, we propose a randomized block policy gradient algorithm with differential privacy. In order to reduce computational complexity when processing high-dimensional data, we randomly select a block coordinate to update the gradients at each round. To solve the privacy protection problem, we add a differential privacy mechanism to the algorithm, and we prove that it preserves the [Formula: see text]-privacy level. We conduct extensive simulations in four environments: CartPole, Walker, HalfCheetah, and Hopper. Compared with methods such as importance-sampling momentum-based policy gradient, Hessian-aided momentum-based policy gradient, and REINFORCE, our algorithm shows a faster convergence rate in the same environments.
APA, Harvard, Vancouver, ISO, and other styles
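
To make the randomized block-coordinate idea above concrete, here is a minimal sketch of one update step; this is an illustration rather than the authors' algorithm, and `estimate_block_grad`, the block count, and the noise scale are placeholder assumptions.

```python
import numpy as np

def dp_block_policy_gradient_step(theta, estimate_block_grad, lr=0.01,
                                  n_blocks=4, noise_std=0.1, rng=None):
    """One randomized-block policy gradient update with Gaussian noise.

    theta: flat parameter vector, conceptually split into n_blocks blocks.
    estimate_block_grad(theta, idx): policy-gradient estimate restricted to
        the coordinates in idx (e.g. computed from sampled trajectories).
    noise_std: scale of the perturbation standing in for the privacy mechanism.
    """
    rng = rng or np.random.default_rng()
    blocks = np.array_split(np.arange(theta.size), n_blocks)
    idx = blocks[rng.integers(n_blocks)]                     # pick one block at random
    g = estimate_block_grad(theta, idx)                      # gradient on that block only
    noisy_g = g + rng.normal(0.0, noise_std, size=g.shape)   # add privacy noise
    new_theta = theta.copy()
    new_theta[idx] += lr * noisy_g                           # ascend on the chosen block
    return new_theta
```
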
6

Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (June 28, 2022): 7317–25. http://dx.doi.org/10.1609/aaai.v36i7.20694.

Full text
Abstract:
We introduce a novel training procedure for policy gradient methods wherein episodic memory is used to optimize the hyperparameters of reinforcement learning algorithms on the fly. Unlike other hyperparameter searches, we formulate hyperparameter scheduling as a standard Markov Decision Process and use episodic memory to store the outcomes of the hyperparameters used and their training contexts. At any policy update step, the policy learner refers to the stored experiences and adaptively reconfigures its learning algorithm with the new hyperparameters determined by the memory. This mechanism, dubbed Episodic Policy Gradient Training (EPGT), enables an episodic learning process and jointly learns the policy and the learning algorithm's hyperparameters within a single run. Experimental results on both continuous and discrete environments demonstrate the advantage of the proposed method in boosting the performance of various policy gradient algorithms.
APA, Harvard, Vancouver, ISO, and other styles
7

Cohen, Andrew, Xingye Qiao, Lei Yu, Elliot Way, and Xiangrong Tong. "Diverse Exploration via Conjugate Policies for Policy Gradient Methods." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3404–11. http://dx.doi.org/10.1609/aaai.v33i01.33013404.

Full text
Abstract:
We address the challenge of effective exploration while maintaining good performance in policy gradient methods. As a solution, we propose diverse exploration (DE) via conjugate policies. DE learns and deploys a set of conjugate policies which can be conveniently generated as a byproduct of conjugate gradient descent. We provide both theoretical and empirical results showing the effectiveness of DE at achieving exploration, improving policy performance, and the advantage of DE over exploration by random policy perturbations.
APA, Harvard, Vancouver, ISO, and other styles
8

Zhang, Junzi, Jongho Kim, Brendan O'Donoghue, and Stephen Boyd. "Sample Efficient Reinforcement Learning with REINFORCE." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10887–95. http://dx.doi.org/10.1609/aaai.v35i12.17300.

Full text
Abstract:
Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of "bad" episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.
APA, Harvard, Vancouver, ISO, and other styles
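
As background for the analysis above, the following is a minimal single-trajectory sketch of the REINFORCE estimator under a tabular soft-max parametrization; the episode format is an assumption, and the log-barrier regularization studied in the paper is omitted.

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular soft-max policy; theta has shape (n_states, n_actions)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Single-trajectory REINFORCE estimate of the policy gradient.

    trajectory: list of (state, action, reward) tuples from one episode.
    Returns an ascent direction with the same shape as theta.
    """
    grad = np.zeros_like(theta)
    G = np.zeros(len(trajectory))          # discounted return-to-go
    running = 0.0
    for t in reversed(range(len(trajectory))):
        running = trajectory[t][2] + gamma * running
        G[t] = running
    for t, (s, a, _) in enumerate(trajectory):
        p = softmax_policy(theta, s)
        score = -p                         # d log pi(a|s) / d logits = e_a - pi(.|s)
        score[a] += 1.0
        grad[s] += (gamma ** t) * G[t] * score
    return grad
```
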
9

Yu, Hai-Tao, Degen Huang, Fuji Ren, and Lishuang Li. "Diagnostic Evaluation of Policy-Gradient-Based Ranking." Electronics 11, no. 1 (December 23, 2021): 37. http://dx.doi.org/10.3390/electronics11010037.

Full text
Abstract:
Learning-to-rank has been intensively studied and has shown significantly increasing values in a wide range of domains, such as web search, recommender systems, dialogue systems, machine translation, and even computational biology, to name a few. In light of recent advances in neural networks, there has been a strong and continuing interest in exploring how to deploy popular techniques, such as reinforcement learning and adversarial learning, to solve ranking problems. However, armed with the aforesaid popular techniques, most studies tend to show how effective a new method is. A comprehensive comparison between techniques and an in-depth analysis of their deficiencies are somehow overlooked. This paper is motivated by the observation that recent ranking methods based on either reinforcement learning or adversarial learning boil down to policy-gradient-based optimization. Based on the widely used benchmark collections with complete information (where relevance labels are known for all items), such as MSLRWEB30K and Yahoo-Set1, we thoroughly investigate the extent to which policy-gradient-based ranking methods are effective. On one hand, we analytically identify the pitfalls of policy-gradient-based ranking. On the other hand, we experimentally compare a wide range of representative methods. The experimental results echo our analysis and show that policy-gradient-based ranking methods are, by a large margin, inferior to many conventional ranking methods. Regardless of whether we use reinforcement learning or adversarial learning, the failures are largely attributable to the gradient estimation based on sampled rankings, which significantly diverge from ideal rankings. In particular, the larger the number of documents per query and the more fine-grained the ground-truth labels, the greater the impact policy-gradient-based ranking suffers. Careful examination of this weakness is highly recommended for developing enhanced methods based on policy gradient.
APA, Harvard, Vancouver, ISO, and other styles
10

Baxter, J., and P. L. Bartlett. "Infinite-Horizon Policy-Gradient Estimation." Journal of Artificial Intelligence Research 15 (November 1, 2001): 319–50. http://dx.doi.org/10.1613/jair.806.

Full text
Abstract:
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura et al. (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter beta (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter beta is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
APA, Harvard, Vancouver, ISO, and other styles
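
To illustrate the storage claim in the abstract (only two vectors of the size of the parameter vector are needed), here is a rough sketch of a GPOMDP-style estimator; the rollout format and the `grad_log_pi` interface are assumptions, and details may differ from the paper.

```python
import numpy as np

def gpomdp_estimate(rollout, grad_log_pi, n_params, beta=0.9):
    """GPOMDP-style biased estimate of the average-reward gradient.

    rollout: iterable of (observation, action, reward) from one long run.
    grad_log_pi(obs, action): gradient of log pi(action | obs), shape (n_params,).
    beta in [0, 1) trades bias against variance, as discussed in the abstract.
    """
    z = np.zeros(n_params)       # eligibility trace
    delta = np.zeros(n_params)   # running gradient estimate
    for t, (obs, a, r) in enumerate(rollout, start=1):
        z = beta * z + grad_log_pi(obs, a)
        delta += (r * z - delta) / t        # incremental average of r_t * z_t
    return delta
```
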
11

Zhang, Kaiqing, Alec Koppel, Hao Zhu, and Tamer Başar. "Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies." SIAM Journal on Control and Optimization 58, no. 6 (January 2020): 3586–612. http://dx.doi.org/10.1137/19m1288012.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Zhang, Chuheng, Yuanqi Li, and Jian Li. "Policy Search by Target Distribution Learning for Continuous Control." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 6770–77. http://dx.doi.org/10.1609/aaai.v34i04.6156.

Full text
Abstract:
It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
APA, Harvard, Vancouver, ISO, and other styles
13

Yang, Long, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, and Gang Pan. "Policy Optimization with Stochastic Mirror Descent." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8823–31. http://dx.doi.org/10.1609/aaai.v36i8.20863.

Full text
Abstract:
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε^-3) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms the state-of-the-art policy gradient methods in various settings.
APA, Harvard, Vancouver, ISO, and other styles
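
For context, a generic stochastic mirror descent step on the policy parameters has the form below (schematic notation of mine): g_t is a, possibly variance-reduced, policy gradient estimate and B_ψ a Bregman divergence; the specific estimator used in VRMPO is the paper's contribution and is not reproduced here.

```latex
\theta_{t+1} \;=\; \operatorname*{arg\,max}_{\theta}\;
\Big\{ \langle g_t,\, \theta \rangle \;-\; \tfrac{1}{\eta}\, B_{\psi}(\theta, \theta_t) \Big\},
\qquad
B_{\psi}(\theta, \theta') \;=\; \psi(\theta) - \psi(\theta') - \langle \nabla\psi(\theta'),\, \theta - \theta' \rangle .
```
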
14

Ying, Donghao, Mengzi Amy Guo, Yuhao Ding, Javad Lavaei, and Zuo-Jun Shen. "Policy-Based Primal-Dual Methods for Convex Constrained Markov Decision Processes." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 10963–71. http://dx.doi.org/10.1609/aaai.v37i9.26299.

Full text
Abstract:
We study convex Constrained Markov Decision Processes (CMDPs) in which the objective is concave and the constraints are convex in the state-action occupancy measure. We propose a policy-based primal-dual algorithm that updates the primal variable via policy gradient ascent and updates the dual variable via projected sub-gradient descent. Despite the loss of additivity structure and the nonconvex nature of the problem, we establish the global convergence of the proposed algorithm by leveraging a hidden convexity in the problem, and prove the O(T^-1/3) convergence rate in terms of both optimality gap and constraint violation. When the objective is strongly concave in the occupancy measure, we prove an improved convergence rate of O(T^-1/2). By introducing a pessimistic term to the constraint, we further show that a zero constraint violation can be achieved while preserving the same convergence rate for the optimality gap. This work is the first in the literature to establish non-asymptotic convergence guarantees for policy-based primal-dual methods for solving infinite-horizon discounted convex CMDPs.
APA, Harvard, Vancouver, ISO, and other styles
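
To make the primal-dual structure explicit, a schematic iteration (notation is mine, not the paper's) is shown below: the Lagrangian couples the return V_r with the constraint value V_g and threshold b, the policy parameters take a gradient ascent step on it, and the multipliers take a projected descent step.

```latex
L(\theta, \lambda) \;=\; V_r(\pi_\theta) \;+\; \lambda^{\top}\!\big(V_g(\pi_\theta) - b\big),
\qquad
\theta_{k+1} \;=\; \theta_k + \eta_\theta\, \nabla_\theta L(\theta_k, \lambda_k),
\qquad
\lambda_{k+1} \;=\; \Pi_{\Lambda}\!\big(\lambda_k - \eta_\lambda\, \nabla_\lambda L(\theta_k, \lambda_k)\big).
```
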
15

Jiang, Zhanhong, Xian Yeow Lee, Sin Yong Tan, Kai Liang Tan, Aditya Balu, Young M. Lee, Chinmay Hegde, and Soumik Sarkar. "MDPGT: Momentum-Based Decentralized Policy Gradient Tracking." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 9377–85. http://dx.doi.org/10.1609/aaai.v36i9.21169.

Full text
Abstract:
We propose a novel policy gradient method for multi-agent reinforcement learning, which leverages two different variance-reduction techniques and does not require large batches over iterations. Specifically, we propose a momentum-based decentralized policy gradient tracking (MDPGT) method, where a new momentum-based variance reduction technique is used to approximate the local policy gradient surrogate with importance sampling, and an intermediate parameter is adopted to track two consecutive policy gradient surrogates. MDPGT provably achieves the best available sample complexity of O(N^-1 ε^-3) for converging to an ε-stationary point of the global average of N local performance functions (possibly nonconcave). This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning, and when initialized with a single trajectory, the sample complexity matches those obtained by the existing decentralized policy gradient methods. We further validate the theoretical claim for the Gaussian policy function. When the required error tolerance ε is small enough, MDPGT leads to a linear speed-up, which has been previously established in decentralized stochastic optimization, but not for reinforcement learning. Lastly, we provide empirical results on a multi-agent reinforcement learning benchmark environment to support our theoretical findings.
APA, Harvard, Vancouver, ISO, and other styles
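
The momentum-based variance-reduction idea mentioned above can be written schematically as follows (my notation, in the spirit of STORM-type estimators rather than the paper's exact statement), where g(θ; τ) is a policy gradient estimate from trajectory τ and ω is an importance weight that corrects the old-parameter gradient for the trajectory having been sampled under the new policy:

```latex
u_t \;=\; \beta\, g(\theta_t; \tau_t)
\;+\; (1-\beta)\Big[ u_{t-1} \;+\; g(\theta_t; \tau_t)
\;-\; \omega(\tau_t \mid \theta_{t-1}, \theta_t)\, g(\theta_{t-1}; \tau_t) \Big].
```
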
16

Melo, Francisco. "Differential Eligibility Vectors for Advantage Updating and Gradient Methods." Proceedings of the AAAI Conference on Artificial Intelligence 25, no. 1 (August 4, 2011): 441–46. http://dx.doi.org/10.1609/aaai.v25i1.7938.

Full text
Abstract:
In this paper we propose differential eligibility vectors (DEV) for temporal-difference (TD) learning, a new class of eligibility vectors designed to bring out the contribution of each action to the TD-error at each state. Specifically, we use DEV in TD-Q(lambda) to more accurately learn the relative value of the actions, rather than their absolute value. We identify conditions that ensure convergence w.p.1 of TD-Q(lambda) with DEV and show that this algorithm can also be used to directly approximate the advantage function associated with a given policy, without the need to compute an auxiliary function - something that, to the best of our knowledge, was not previously known to be possible. Finally, we discuss the integration of DEV in LSTDQ and actor-critic algorithms.
APA, Harvard, Vancouver, ISO, and other styles
17

Herrera-Martí, David A. "Policy Gradient Approach to Compilation of Variational Quantum Circuits." Quantum 6 (September 8, 2022): 797. http://dx.doi.org/10.22331/q-2022-09-08-797.

Full text
Abstract:
We propose a method for finding approximate compilations of quantum unitary transformations, based on techniques from policy gradient reinforcement learning. The choice of a stochastic policy allows us to rephrase the optimization problem in terms of probability distributions, rather than variational gates. In this framework, the optimal configuration is found by optimizing over distribution parameters, rather than over free angles. We show numerically that this approach can be more competitive than gradient-free methods, for a comparable amount of resources, both for noiseless and noisy circuits. Another interesting feature of this approach to variational compilation is that it does not need a separate register and long-range interactions to estimate the end-point fidelity, which is an improvement over methods which rely on the Hilbert-Schmidt test. We expect these techniques to be relevant for training variational circuits in other contexts.
APA, Harvard, Vancouver, ISO, and other styles
18

Hambly, Ben, Renyuan Xu, and Huining Yang. "Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a Finite Horizon." SIAM Journal on Control and Optimization 59, no. 5 (January 2021): 3359–91. http://dx.doi.org/10.1137/20m1382386.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Zhao, Feiran, Xingyun Fu, and Keyou You. "Globally Convergent Policy Gradient Methods for Linear Quadratic Control of Partially Observed Systems." IFAC-PapersOnLine 56, no. 2 (2023): 5506–11. http://dx.doi.org/10.1016/j.ifacol.2023.10.208.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Chen, Yan, and Tao Li. "Convergence of Policy Gradient Methods for Nash Equilibria in General-sum Stochastic Games." IFAC-PapersOnLine 56, no. 2 (2023): 3435–40. http://dx.doi.org/10.1016/j.ifacol.2023.10.1494.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Giegrich, Michael, Christoph Reisinger, and Yufei Zhang. "Convergence of Policy Gradient Methods for Finite-Horizon Exploratory Linear-Quadratic Control Problems." SIAM Journal on Control and Optimization 62, no. 2 (March 22, 2024): 1060–92. http://dx.doi.org/10.1137/22m1533517.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Ecoffet, Paul, Nicolas Fontbonne, Jean-Baptiste André, and Nicolas Bredeche. "Policy search with rare significant events: Choosing the right partner to cooperate with." PLOS ONE 17, no. 4 (April 26, 2022): e0266841. http://dx.doi.org/10.1371/journal.pone.0266841.

Full text
Abstract:
This paper focuses on a class of reinforcement learning problems where significant events are rare and limited to a single positive reward per episode. A typical example is that of an agent who has to choose a partner to cooperate with, while a large number of partners are simply not interested in cooperating, regardless of what the agent has to offer. We address this problem in a continuous state and action space with two different kinds of search methods: a policy gradient search method and a direct policy search method using an evolution strategy. We show that when significant events are rare, gradient information is also scarce, making it difficult for policy gradient search methods to find an optimal policy, with or without a deep neural architecture. On the other hand, we show that direct policy search methods are invariant to the rarity of significant events, which is yet another confirmation of the unique role evolutionary algorithms have to play as reinforcement learning methods.
APA, Harvard, Vancouver, ISO, and other styles
23

Li, Shilei, Meng Li, Jiongming Su, Shaofei Chen, Zhimin Yuan, and Qing Ye. "PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning." ACM Transactions on Intelligent Systems and Technology 12, no. 3 (May 16, 2021): 1–21. http://dx.doi.org/10.1145/3452008.

Full text
Abstract:
Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a promising approach that combines exploration in the action space with exploration in the parameter space has been proposed to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines the evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with the actor-critic deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that these two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) can evolve in a guided manner by utilizing the gradient information provided by the DDPG, and the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve sample efficiency. In particular, we propose a criterion to determine the training steps required for the DDPG, ensuring that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL-Actor and fine-tuning a new one generated by the EA according to different situations, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
APA, Harvard, Vancouver, ISO, and other styles
24

Chen, Haokun, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. "Large-Scale Interactive Recommendation with Tree-Structured Policy Gradient." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3312–20. http://dx.doi.org/10.1609/aaai.v33i01.33013312.

Full text
Abstract:
Reinforcement learning (RL) has recently been introduced to interactive recommender systems (IRS) because of its nature of learning from dynamic interactions and planning for long-run performance. As an IRS typically has thousands of items to recommend (i.e., thousands of actions), most existing RL-based methods fail to handle such a large discrete action space and thus become inefficient. Existing work that tries to deal with the large discrete action space by utilizing the deep deterministic policy gradient framework suffers from the inconsistency between the continuous action representation (the output of the actor network) and the real discrete action. To avoid such inconsistency and achieve high efficiency and recommendation effectiveness, in this paper we propose a Tree-structured Policy Gradient Recommendation (TPGR) framework, where a balanced hierarchical clustering tree is built over the items and picking an item is formulated as seeking a path from the root to a certain leaf of the tree. Extensive experiments on carefully designed environments based on two real-world datasets demonstrate that our model provides superior recommendation performance and significant efficiency improvement over state-of-the-art methods.
APA, Harvard, Vancouver, ISO, and other styles
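
As an illustration of the path-picking formulation in TPGR, an item can be chosen by walking the balanced clustering tree from the root and sampling one child at each internal node from a small local policy; the node interface below (`children`, `child_probs`, `item_id`) is hypothetical.

```python
import numpy as np

def sample_item(node, state, rng=None):
    """Pick an item by descending a balanced clustering tree from root to leaf.

    Each internal node holds a small policy over its children (stubbed here by
    node.child_probs(state)); leaves hold item ids. The per-node choices keep
    the effective action space at each step down to the branching factor.
    """
    rng = rng or np.random.default_rng()
    path = []
    while node.children:                       # descend until a leaf is reached
        probs = node.child_probs(state)        # local policy over children
        i = rng.choice(len(node.children), p=probs)
        path.append(i)
        node = node.children[i]
    return node.item_id, path                  # the path is what the gradient flows through
```
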
25

Li, Chengzhengxu, Xiaoming Liu, Yichen Wang, Duyi Li, Yu Lan, and Chao Shen. "Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 18481–89. http://dx.doi.org/10.1609/aaai.v38i16.29809.

Full text
Abstract:
Prompt-based pre-trained language models (PLMs) paradigm has succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve the performance by learning the ideal prompts through the gradient information of PLMs, whose high computational cost, and low readability and generalizability are often concerning. To address the research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization (DP_2O) method. We first design a multi-round dialogue alignment strategy for readability prompt set generation based on GPT-4. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to match the prompts to inputs optimally. By training a policy network with only 0.62M parameters on the tasks in the few-shot setting, DP_2O outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that DP_2O has good universality, robustness and generalization ability.
APA, Harvard, Vancouver, ISO, and other styles
26

Chung, Hoon, Sung Joo Lee, Hyeong Bae Jeon, and Jeon Gue Park. "Semi-Supervised Speech Recognition Acoustic Model Training Using Policy Gradient." Applied Sciences 10, no. 10 (May 20, 2020): 3542. http://dx.doi.org/10.3390/app10103542.

Full text
Abstract:
In this paper, we propose policy gradient-based semi-supervised training for speech recognition acoustic models. In practice, self-training and teacher/student learning are among the most widely used semi-supervised training methods due to their scalability and effectiveness. These methods are based on generating pseudo labels for unlabeled samples using a pre-trained model and selecting reliable samples using a confidence measure. However, there are some issues with this approach: the generated pseudo labels can be biased depending on which pre-trained model is used, and the training process can be complicated because the confidence measure is usually applied in post-processing using external knowledge. To address these issues, we propose a policy gradient-based approach. Policy gradient is a reinforcement learning algorithm for finding an optimal behavior strategy for an agent to obtain optimal rewards. The policy gradient-based approach provides a framework for exploring unlabeled data as well as exploiting labeled data, and it also provides a way to incorporate external knowledge in the same training cycle. The proposed approach was evaluated on an in-house non-native Korean recognition domain. The experimental results show that the method is effective for semi-supervised acoustic model training.
APA, Harvard, Vancouver, ISO, and other styles
27

Hu, Bin, Kaiqing Zhang, Na Li, Mehran Mesbahi, Maryam Fazel, and Tamer Başar. "Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies." Annual Review of Control, Robotics, and Autonomous Systems 6, no. 1 (May 3, 2023): 123–58. http://dx.doi.org/10.1146/annurev-control-042920-020021.

Full text
Abstract:
Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis that has been popularized by successes of reinforcement learning. We take an interdisciplinary perspective in our exposition that connects control theory, reinforcement learning, and large-scale optimization. We review a number of recently developed theoretical results on the optimization landscape, global convergence, and sample complexity of gradient-based methods for various continuous control problems, such as the linear quadratic regulator (LQR), [Formula: see text] control, risk-sensitive control, linear quadratic Gaussian (LQG) control, and output feedback synthesis. In conjunction with these optimization results, we also discuss how direct policy optimization handles stability and robustness concerns in learning-based control, two main desiderata in control engineering. We conclude the survey by pointing out several challenges and opportunities at the intersection of learning and control.
APA, Harvard, Vancouver, ISO, and other styles
28

Guo, Xin, Anran Hu, and Junzi Zhang. "Theoretical Guarantees of Fictitious Discount Algorithms for Episodic Reinforcement Learning and Global Convergence of Policy Gradient Methods." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (June 28, 2022): 6774–82. http://dx.doi.org/10.1609/aaai.v36i6.20633.

Full text
Abstract:
When designing algorithms for finite-time-horizon episodic reinforcement learning problems, a common approach is to introduce a fictitious discount factor and use stationary policies for approximations. Empirically, it has been shown that the fictitious discount factor helps reduce variance, and stationary policies serve to save the per-iteration computational cost. Theoretically, however, there is no existing work on convergence analysis for algorithms with this fictitious discount recipe. This paper takes the first step towards analyzing these algorithms. It focuses on two vanilla policy gradient (VPG) variants: the first being a widely used variant with discounted advantage estimations (DAE), the second with an additional fictitious discount factor in the score functions of the policy gradient estimators. Non-asymptotic convergence guarantees are established for both algorithms, and the additional discount factor is shown to reduce the bias introduced in DAE and thus improve the algorithm convergence asymptotically. A key ingredient of our analysis is to connect three settings of Markov decision processes (MDPs): the finite-time-horizon, the average reward and the discounted settings. To our best knowledge, this is the first theoretical guarantee on fictitious discount algorithms for the episodic reinforcement learning of finite-time-horizon MDPs, which also leads to the (first) global convergence of policy gradient methods for finite-time-horizon episodic reinforcement learning.
APA, Harvard, Vancouver, ISO, and other styles
29

Lou, Xingzhou, Junge Zhang, Timothy J. Norman, Kaiqi Huang, and Yali Du. "TAPE: Leveraging Agent Topology for Cooperative Multi-Agent Policy Gradient." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 17496–504. http://dx.doi.org/10.1609/aaai.v38i16.29699.

Full text
Abstract:
Multi-Agent Policy Gradient (MAPG) has made significant progress in recent years. However, centralized critics in state-of-the-art MAPG methods still face the centralized-decentralized mismatch (CDM) issue, which means sub-optimal actions by some agents will affect other agents' policy learning. While using individual critics for policy updates can avoid this issue, it severely limits cooperation among agents. To address this issue, we propose an agent topology framework, which decides whether other agents should be considered in the policy gradient and achieves a compromise between facilitating cooperation and alleviating the CDM issue. The agent topology allows agents to use coalition utility as the learning objective, instead of the global utility used by centralized critics or the local utility used by individual critics. To constitute the agent topology, various models are studied. We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experimental results on several benchmarks show that the agent topology is able to facilitate agent cooperation and alleviate the CDM issue, improving the performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology.
APA, Harvard, Vancouver, ISO, and other styles
30

Zeng, Fanyu, and Chen Wang. "Visual Navigation with Asynchronous Proximal Policy Optimization in Artificial Agents." Journal of Robotics 2020 (October 14, 2020): 1–7. http://dx.doi.org/10.1155/2020/8702962.

Full text
Abstract:
Vanilla policy gradient methods suffer from high variance, leading to unstable policies during training, where the policy's performance fluctuates drastically between iterations. To address this issue, we analyze the policy optimization process of a navigation method based on deep reinforcement learning (DRL) that uses asynchronous gradient descent for optimization. A navigation variant (asynchronous proximal policy optimization navigation, appoNav) is presented that can guarantee monotonic policy improvement during policy optimization. Our experiments are conducted in DeepMind Lab, and the results show that artificial agents with appoNav perform better than the compared algorithm.
APA, Harvard, Vancouver, ISO, and other styles
31

Doya, Kenji. "Reinforcement Learning in Continuous Time and Space." Neural Computation 12, no. 1 (January 1, 2000): 219–45. http://dx.doi.org/10.1162/089976600300015961.

Full text
Abstract:
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging up a pendulum with limited torque. The simulations show that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than one based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
APA, Harvard, Vancouver, ISO, and other styles
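
For reference, the continuous-time TD error in this line of work takes roughly the form below, where τ is the time constant of the discounted objective; the exact notation may differ from the article.

```latex
\delta(t) \;=\; r(t) \;-\; \frac{1}{\tau}\, V\big(x(t)\big) \;+\; \frac{d}{dt} V\big(x(t)\big).
```
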
32

Morimura, Tetsuro, Eiji Uchibe, Junichiro Yoshimoto, Jan Peters, and Kenji Doya. "Derivatives of Logarithmic Stationary Distributions for Policy Gradient Reinforcement Learning." Neural Computation 22, no. 2 (February 2010): 342–76. http://dx.doi.org/10.1162/neco.2009.12-08-922.

Full text
Abstract:
Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution that corresponds to the sensitivity of its distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate γ for the value functions close to 1, these algorithms do not permit γ to be set exactly at γ = 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD) as a useful form of the derivative of the stationary state distribution through backward Markov chain formulation and a temporal difference learning framework. A new policy gradient (PG) framework with an LSD is also proposed, in which the average reward gradient can be estimated by setting γ = 0, so it becomes unnecessary to learn the value functions. We also test the performance of the proposed algorithms using simple benchmark tasks and show that these can improve the performances of existing PG methods.
APA, Harvard, Vancouver, ISO, and other styles
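
In schematic form (my notation), the average-reward gradient decomposition that motivates estimating the log stationary distribution derivative can be written as:

```latex
\nabla_\theta\, \eta(\theta)
\;=\; \sum_{s,a} d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\,
\Big[ \nabla_\theta \log d^{\pi_\theta}(s) \;+\; \nabla_\theta \log \pi_\theta(a \mid s) \Big]\, r(s,a),
```

so that an estimate of the log stationary distribution derivative removes the need to learn value functions in the estimator.
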
33

Zhou, Zixian, Mengda Huang, Feiyang Pan, Jia He, Xiang Ao, Dandan Tu, and Qing He. "Gradient-Adaptive Pareto Optimization for Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 11443–51. http://dx.doi.org/10.1609/aaai.v37i9.26353.

Full text
Abstract:
Constrained Reinforcement Learning (CRL) burgeons broad interest in recent years, which pursues maximizing long-term returns while constraining costs. Although CRL can be cast as a multi-objective optimization problem, it is still facing the key challenge that gradient-based Pareto optimization methods tend to stick to known Pareto-optimal solutions even when they yield poor returns (e.g., the safest self-driving car that never moves) or violate the constraints (e.g., the record-breaking racer that crashes the car). In this paper, we propose Gradient-adaptive Constrained Policy Optimization (GCPO for short), a novel Pareto optimization method for CRL with two adaptive gradient recalibration techniques. First, to find Pareto-optimal solutions with balanced performance over all targets, we propose gradient rebalancing which forces the agent to improve more on under-optimized objectives at every policy iteration. Second, to guarantee that the cost constraints are satisfied, we propose gradient perturbation that can temporarily sacrifice the returns for costs. Experiments on the SafetyGym benchmarks show that our method consistently outperforms previous CRL methods in reward while satisfying the constraints.
APA, Harvard, Vancouver, ISO, and other styles
34

Kong, Rui, Chenyang Wu, and Zongzhang Zhang. "Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (March 24, 2024): 23546–47. http://dx.doi.org/10.1609/aaai.v38i21.30466.

Full text
Abstract:
Current policy gradient techniques excel in refining policies over sampled states but falter when generalizing to unseen states. To address this, we introduce Reinforcement Sampling (RS), a novel method leveraging a generalizable action value function to sample improved decisions. RS is able to improve the decision quality whenever the action value estimation is accurate. It works by improving the agent's decision on the fly on the states the agent is visiting. Compared with the historically experienced states in which conventional policy gradient methods improve the policy, the currently visited states are more relevant to the agent. Our method sufficiently exploits the generalizability of the value function on unseen states and sheds new light on the future development of generalizable reinforcement learning.
APA, Harvard, Vancouver, ISO, and other styles
35

Chen, Tianjian, Zhanpeng He, and Matei Ciocarlie. "Co-designing hardware and control for robot hands." Science Robotics 6, no. 54 (May 12, 2021): eabg2133. http://dx.doi.org/10.1126/scirobotics.abg2133.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Vasilaki, Eleni, Nicolas Frémaux, Robert Urbanczik, Walter Senn, and Wulfram Gerstner. "Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail." PLoS Computational Biology 5, no. 12 (December 4, 2009): e1000586. http://dx.doi.org/10.1371/journal.pcbi.1000586.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Lincoln, Richard, Stuart Galloway, Bruce Stephen, and Graeme Burt. "Comparing Policy Gradient and Value Function Based Reinforcement Learning Methods in Simulated Electrical Power Trade." IEEE Transactions on Power Systems 27, no. 1 (February 2012): 373–80. http://dx.doi.org/10.1109/tpwrs.2011.2166091.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Zhang, Haifei, Jian Xu, and Jianlin Qiu. "An Automatic Driving Control Method Based on Deep Deterministic Policy Gradient." Wireless Communications and Mobile Computing 2022 (January 24, 2022): 1–9. http://dx.doi.org/10.1155/2022/7739440.

Full text
Abstract:
Traditional automatic driving behavior decision algorithms require manually designed, complex rules, resulting in long vehicle decision-making times, poor decision quality, and no adaptability to new environments. As one of the main methods in the field of machine learning and intelligent control in recent years, reinforcement learning can learn reasonable and effective policies merely by interacting with the environment. Firstly, this paper introduces the current research status of automatic driving technology and the current mainstream automatic driving control methods. Then, it analyzes the characteristics of convolutional neural networks, Q-learning, the deep Q-network (DQN), and the deep deterministic policy gradient (DDPG). Compared with the value-function-based DQN algorithm, the policy-based DDPG algorithm can better handle continuous action spaces. Finally, the DDPG algorithm is used to solve the automatic driving control problem. By designing a reasonable reward function, a deep convolutional network, and an exploration policy, the intelligent vehicle learns to avoid obstacles and complete the whole course in a 2D environment.
APA, Harvard, Vancouver, ISO, and other styles
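
For readers new to DDPG, the actor update applies the deterministic policy gradient, i.e. the chain rule through the learned critic; the sketch below is framework-free, with the networks and their derivatives passed in as stub functions (assumptions, not the paper's code).

```python
import numpy as np

def ddpg_actor_step(theta, states, mu, jac_theta_mu, grad_a_q, lr=1e-3):
    """One deterministic-policy-gradient (DDPG-style) actor update.

    mu(theta, s):           deterministic action of the actor at state s
    jac_theta_mu(theta, s): Jacobian d mu / d theta, shape (act_dim, n_params)
    grad_a_q(s, a):         gradient of the critic Q(s, a) w.r.t. the action a
    """
    grad = np.zeros_like(theta)
    for s in states:
        a = mu(theta, s)
        # chain rule: dQ/dtheta = (dQ/da) @ (d mu / d theta)
        grad += grad_a_q(s, a) @ jac_theta_mu(theta, s)
    return theta + lr * grad / len(states)
```
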
39

Yang, Long, Qian Zheng, and Gang Pan. "Sample Complexity of Policy Gradient Finding Second-Order Stationary Points." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10630–38. http://dx.doi.org/10.1609/aaai.v35i12.17271.

Full text
Abstract:
Policy-based reinforcement learning (RL) can be viewed as maximization of its objective. However, due to the inherent non-concavity of this objective, convergence of the policy gradient method to a first-order stationary point (FOSP) cannot guarantee a maximum: a FOSP can be a minimum or even a saddle point, which is undesirable for RL. It has been found that if all saddle points are strict, the second-order stationary points (SOSP) are exactly equivalent to local maxima. Instead of FOSP, we therefore consider SOSP as the convergence criterion to characterize the sample complexity of policy gradient. Our result shows that policy gradient converges to an (ε, √(εχ))-SOSP with probability at least 1 − O(δ) at a total cost of O(ε^-9/2), which significantly improves on the state-of-the-art cost of O(ε^-9). Our analysis is based on the key idea of decomposing the parameter space R^p into three non-intersecting regions: a non-stationary-point region, a saddle-point region, and a locally optimal region, and then making a local improvement of the RL objective in each region. This technique can potentially be generalized to a wide range of policy gradient methods. For the complete proof, please refer to https://arxiv.org/pdf/2012.01491.pdf.
APA, Harvard, Vancouver, ISO, and other styles
40

Wu, Xiaoxia, Yuege Xie, Simon Shaolei Du, and Rachel Ward. "AdaLoss: A Computationally-Efficient and Provably Convergent Adaptive Gradient Method." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8691–99. http://dx.doi.org/10.1609/aaai.v36i8.20848.

Full text
Abstract:
We propose a computationally friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods. We prove that this schedule enjoys linear convergence in linear regression. Moreover, we extend the analysis to the non-convex regime, in the context of two-layer over-parameterized neural networks. If the width is sufficiently large (polynomially), then AdaLoss converges robustly to the global minimum in polynomial time. We numerically verify the theoretical results and extend the scope of the numerical experiments by considering applications in LSTM models for text classification and policy gradients for control problems.
APA, Harvard, Vancouver, ISO, and other styles
41

Sanghvi, Navyata, Shinnosuke Usami, Mohit Sharma, Joachim Groeger, and Kris Kitani. "Inverse Reinforcement Learning with Explicit Policy Estimates." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9472–80. http://dx.doi.org/10.1609/aaai.v35i11.17141.

Full text
Abstract:
Various methods for solving the inverse reinforcement learning (IRL) problem have been developed independently in machine learning and economics. In particular, the method of Maximum Causal Entropy IRL is based on the perspective of entropy maximization, while related advances in the field of economics instead assume the existence of unobserved action shocks to explain expert behavior (Nested Fixed Point Algorithm, Conditional Choice Probability method, Nested Pseudo-Likelihood Algorithm). In this work, we make previously unknown connections between these related methods from both fields. We achieve this by showing that they all belong to a class of optimization problems, characterized by a common form of the objective, the associated policy and the objective gradient. We demonstrate key computational and algorithmic differences which arise between the methods due to an approximation of the optimal soft value function, and describe how this leads to more efficient algorithms. Using insights which emerge from our study of this class of optimization problems, we identify various problem scenarios and investigate each method's suitability for these problems.
APA, Harvard, Vancouver, ISO, and other styles
42

Farsang, Mónika, and Luca Szegletes. "Controlling Agents by Constrained Policy Updates." SYSTEM THEORY, CONTROL AND COMPUTING JOURNAL 1, no. 2 (December 31, 2021): 33–39. http://dx.doi.org/10.52846/stccj.2021.1.2.24.

Full text
Abstract:
Learning the optimal behavior is the ultimate goal in reinforcement learning. This can be achieved by many different approaches, the most successful of which are policy gradient methods. However, these can suffer from undesirably large policy updates, leading to poor performance. In recent years there has been a clear trend toward designing more reliable algorithms. This paper examines different restriction strategies applied to the widely used Proximal Policy Optimization (PPO-Clip) technique. We also question whether the analyzed methods are able to adapt not only to low-dimensional tasks but also to complex, high-dimensional problems in control and robotic domains. The analysis of the learned behavior shows that these methods can lead to better performance compared to the original PPO-Clip algorithm; moreover, they are also able to achieve complex behavior and policies in high-dimensional environments.
APA, Harvard, Vancouver, ISO, and other styles
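
Since the paper builds on PPO-Clip, it may help to recall the clipped surrogate objective it restricts; a minimal sketch of the quantity being maximized, assuming NumPy arrays of per-sample log-probabilities and advantage estimates:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO-Clip (to be maximized).

    logp_new, logp_old: log-probabilities of the taken actions under the
    current and the behavior policy; advantages: estimated advantages.
    """
    ratio = np.exp(logp_new - logp_old)              # importance ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()     # pessimistic of the two
```
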
43

Mutti, Mirco, Lorenzo Pratissoli, and Marcello Restelli. "Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9028–36. http://dx.doi.org/10.1609/aaai.v35i10.17091.

Full text
Abstract:
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, k-nearest-neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning meaningful reward-based tasks downstream.
APA, Harvard, Vancouver, ISO, and other styles
46

Fosse, E., M. K. Helgesen, S. Hagen, and S. Torp. "Addressing the social determinants of health at the local level: Opportunities and challenges." Scandinavian Journal of Public Health 46, no. 20_suppl (February 2018): 47–52. http://dx.doi.org/10.1177/1403494817743896.

Full text
Abstract:
Aims: The gradient in health inequalities reflects the relationship between health and social circumstance, demonstrating that health worsens as one moves down the socio-economic scale. For more than a decade, the Norwegian national government has developed policies to reduce social inequalities in health by levelling the social gradient. The adoption of the Public Health Act in 2012 was a further step towards a comprehensive policy. The main aim of the act is to reduce social health inequalities by adopting a Health in All Policies approach, and the municipalities are regarded as key to its implementation. The SODEMIFA project aimed to study the development of the new public health policy, with a particular emphasis on its implementation in municipalities. Methods: The SODEMIFA project applied a mixed-methods approach; the data consisted of surveys as well as qualitative interviews, and the informants were policymakers at the national and local levels. Results: Our findings indicate that the municipalities had a rather vague understanding of the concept of health inequalities, and even more so of the concept of the social gradient in health. The most common understanding was that policy to reduce social inequalities concerned disadvantaged groups; accordingly, policies and measures would be directed at these groups rather than addressing the social gradient. Conclusions: A movement towards an increased understanding and adoption of the new, comprehensive public health policy was observed. However, to continue this process, both the local and national levels must stay committed to the principles of the act.
APA, Harvard, Vancouver, ISO, and other styles
47

Wu, Runjia, Fangqing Gu, Hai-lin Liu, and Hongjian Shi. "UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient." Wireless Communications and Mobile Computing 2022 (March 14, 2022): 1–12. http://dx.doi.org/10.1155/2022/9017079.

Full text
Abstract:
The deep deterministic policy gradient (DDPG) algorithm is a reinforcement learning method that has been widely used in UAV path planning. However, the critic network of DDPG is updated frequently during training, which leads to an inevitable overestimation problem and increases the computational complexity of training. This paper therefore presents a multicritic-delayed DDPG method for UAV path planning. It uses multiple critic networks and delayed learning to reduce the overestimation problem of DDPG, and it adds noise to improve robustness in real environments. Moreover, a UAV mission platform is built to train the method and to evaluate its effectiveness and robustness. Simulation results show that the proposed algorithm converges faster, to a better solution, and more stably, indicating that the UAV can learn more effectively from complex environments.
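
A hedged sketch of how a multi-critic, noise-smoothed bootstrap target of this general kind can be formed; the cited paper's exact network layout, noise schedule, and update delays may differ, and all names and defaults below are assumptions.

    import torch

    def multicritic_target(reward, done, next_state, target_actor, target_critics,
                           gamma=0.99, noise_std=0.1, noise_clip=0.3):
        # Bootstrap target with target-policy smoothing and a minimum over several critics.
        with torch.no_grad():
            next_action = target_actor(next_state)
            noise = torch.clamp(torch.randn_like(next_action) * noise_std,
                                -noise_clip, noise_clip)
            next_action = torch.clamp(next_action + noise, -1.0, 1.0)
            # The minimum over the critic ensemble counteracts Q-value overestimation.
            qs = torch.stack([c(next_state, next_action) for c in target_critics])
            q_min = qs.min(dim=0).values
            return reward + gamma * (1.0 - done) * q_min

Delayed learning then amounts to updating the actor and the target networks only every few critic updates, so the policy follows less noisy value estimates.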
APA, Harvard, Vancouver, ISO, and other styles
48

Zhou, Conghang, Jianxing Li, Yujing Shi, and Zhirui Lin. "Research on Multi-Robot Formation Control Based on MATD3 Algorithm." Applied Sciences 13, no. 3 (January 31, 2023): 1874. http://dx.doi.org/10.3390/app13031874.

Full text
Abstract:
This paper investigates multi-robot formation control strategies in environments with obstacles, based on deep reinforcement learning methods. To address the value-function overestimation problem of the deep deterministic policy gradient (DDPG) algorithm, this paper proposes an improved multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm that combines the twin delayed deep deterministic policy gradient (TD3) algorithm with the centralized-training, decentralized-execution (CTDE) framework and adopts a prioritized experience replay strategy to improve learning efficiency. To handle the difficulty of obstacle avoidance for a robot formation, a hybrid reward mechanism is designed that uses different formation-maintenance strategies in obstacle and obstacle-free areas, achieving obstacle avoidance by changing the formation as needed. Simulation experiments verified the effectiveness of the proposed multi-robot formation control strategy, and comparative simulations verified that the algorithm converges faster and performs more stably.
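
For illustration, a minimal proportional prioritized experience replay buffer of the kind the abstract mentions, kept as plain lists rather than a sum-tree for clarity. This is a generic sketch, not the authors' implementation, and all names and defaults are assumptions.

    import numpy as np

    class PrioritizedReplay:
        # Proportional prioritized experience replay.
        def __init__(self, capacity, alpha=0.6):
            self.capacity, self.alpha = capacity, alpha
            self.data, self.priorities = [], []

        def add(self, transition):
            # New transitions get the current maximum priority so they are seen at least once.
            max_p = max(self.priorities, default=1.0)
            if len(self.data) >= self.capacity:
                self.data.pop(0)
                self.priorities.pop(0)
            self.data.append(transition)
            self.priorities.append(max_p)

        def sample(self, batch_size, beta=0.4):
            p = np.asarray(self.priorities) ** self.alpha
            p /= p.sum()
            idx = np.random.choice(len(self.data), batch_size, p=p)
            # Importance-sampling weights correct the bias of non-uniform sampling.
            w = (len(self.data) * p[idx]) ** (-beta)
            w /= w.max()
            return idx, [self.data[i] for i in idx], w

        def update_priorities(self, idx, td_errors, eps=1e-6):
            for i, e in zip(idx, td_errors):
                self.priorities[i] = abs(float(e)) + eps

In a TD3/MATD3-style learner, the sampled importance weights would scale the critic loss and the absolute TD errors would be fed back through update_priorities after each update.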
APA, Harvard, Vancouver, ISO, and other styles
49

Gao, Tianhan, Shen Gao, Jun Xu, and Qihui Zhao. "DDRCN: Deep Deterministic Policy Gradient Recommendation Framework Fused with Deep Cross Networks." Applied Sciences 13, no. 4 (February 16, 2023): 2555. http://dx.doi.org/10.3390/app13042555.

Full text
Abstract:
Recommendation systems, an essential branch of artificial intelligence, have gradually become part of people's daily lives: they actively recommend goods or services of potential interest to users based on their preferences. Many recommendation methods have been proposed in both industry and academia, but previous methods have some limitations: (1) most do not consider the cross-correlations between data, and (2) many treat recommendation as a one-time act and do not consider the continuity of the recommendation process. To overcome these limitations, we propose a recommendation framework based on deep reinforcement learning, DDRCN: a deep deterministic policy gradient recommendation framework incorporating deep cross networks. We use a deep network and a cross network to fit the cross-relationships in the data and obtain a representation of the user interaction data. An actor-critic network is designed to simulate the continuous interaction behavior of users through a greedy strategy, and a deep deterministic policy gradient network is used to train the recommendation model. Finally, we conduct experiments on two publicly available datasets and find that the proposed framework outperforms the baseline approaches in the recall and ranking phases of recommendation.
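
A small sketch of the cross component typically used in deep & cross networks, which is the kind of explicit feature-interaction modeling the abstract describes; how DDRCN fuses this with its actor-critic is specific to the paper and not reproduced here, and all class and parameter names are assumptions.

    import torch
    import torch.nn as nn

    class CrossLayer(nn.Module):
        # One cross layer: x_{l+1} = x_0 * (w^T x_l) + b + x_l,
        # which adds one explicit order of feature interaction per layer.
        def __init__(self, dim):
            super().__init__()
            self.w = nn.Parameter(torch.randn(dim) * 0.01)
            self.b = nn.Parameter(torch.zeros(dim))

        def forward(self, x0, xl):
            scalar = (xl * self.w).sum(dim=-1, keepdim=True)   # w^T x_l, shape (batch, 1)
            return x0 * scalar + self.b + xl

    class CrossNetwork(nn.Module):
        def __init__(self, dim, depth=3):
            super().__init__()
            self.layers = nn.ModuleList([CrossLayer(dim) for _ in range(depth)])

        def forward(self, x0):
            x = x0
            for layer in self.layers:
                x = layer(x0, x)
            return x

The output of such a cross network, concatenated with a deep (MLP) branch, would give the state representation that the actor and critic networks consume.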
APA, Harvard, Vancouver, ISO, and other styles
50

Long, Yun, Youfei Lu, Hongwei Zhao, Renbo Wu, Tao Bao, and Jun Liu. "Multilayer Deep Deterministic Policy Gradient for Static Safety and Stability Analysis of Novel Power Systems." International Transactions on Electrical Energy Systems 2023 (April 21, 2023): 1–14. http://dx.doi.org/10.1155/2023/4295384.

Full text
Abstract:
More and more renewable energy sources are being integrated into novel power systems. The randomness and fluctuation of these sources pose challenges for the static stability and safety analysis of such systems. In this work, a multilayer deep deterministic policy gradient method is proposed to address the fluctuation of renewable energy sources. The method stacks multiple deep reinforcement learning layers that can be continuously updated online, and it is compared with other deep learning algorithms. The feasibility, effectiveness, and superiority of the proposed method are verified by numerical simulations.
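
For context, a minimal single-agent DDPG update of the kind each layer of such a stack would presumably run; the multilayer stacking and online-update logic are specific to the paper and not reproduced here, and all names and hyperparameter defaults below are assumptions.

    import torch
    import torch.nn.functional as F

    def ddpg_update(batch, actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, gamma=0.99, tau=0.005):
        s, a, r, s2, done = batch
        with torch.no_grad():
            y = r + gamma * (1.0 - done) * target_critic(s2, target_actor(s2))
        critic_loss = F.mse_loss(critic(s, a), y)       # fit the critic to the bootstrap target
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        actor_loss = -critic(s, actor(s)).mean()        # deterministic policy gradient ascent
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Polyak-averaged target networks keep the bootstrap targets slowly moving.
        for net, tgt in ((actor, target_actor), (critic, target_critic)):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)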
APA, Harvard, Vancouver, ISO, and other styles