Academic literature on the topic 'Policy gradient methods'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Policy gradient methods.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Policy gradient methods"

1

Peters, Jan. "Policy gradient methods." Scholarpedia 5, no. 11 (2010): 3698. http://dx.doi.org/10.4249/scholarpedia.3698.

Full text
APA, Harvard, Vancouver, ISO, and other styles
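
As background for the entries that follow, the likelihood-ratio (score-function) form of the policy gradient that most of these works build on can be written, in standard notation, as

\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \left( \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - b(s_t) \right) \right]

where b is a baseline that reduces the variance of the estimate without changing its expectation.
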
2

Cai, Qingpeng, Ling Pan, and Pingzhong Tang. "Deterministic Value-Policy Gradients." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 3316–23. http://dx.doi.org/10.1609/aaai.v34i04.5732.

Full text
Abstract:
Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
APA, Harvard, Vancouver, ISO, and other styles
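
DVPG in the entry above combines model-based value gradients with the model-free deterministic policy gradient used in DDPG. For reference, here is a minimal PyTorch-style sketch of the generic DDPG actor update; the network sizes and names are placeholders, not the authors' code.

import torch
import torch.nn as nn

# Toy actor and critic for a continuous-control task (hypothetical sizes).
obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def ddpg_actor_update(states):
    """One deterministic policy gradient step: ascend Q(s, pi(s)) with respect to the actor."""
    actions = actor(states)                                  # deterministic action a = pi(s)
    q_values = critic(torch.cat([states, actions], dim=-1))  # critic evaluates (s, pi(s))
    loss = -q_values.mean()                                  # maximize Q  <=>  minimize -Q
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()

# Example call on a random batch of states.
ddpg_actor_update(torch.randn(32, obs_dim))
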
3

Zhang, Matthew S., Murat A. Erdogdu, and Animesh Garg. "Convergence and Optimality of Policy Gradient Methods in Weakly Smooth Settings." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9066–73. http://dx.doi.org/10.1609/aaai.v36i8.20891.

Full text
Abstract:
Policy gradient methods have been frequently applied to problems in control and reinforcement learning with great success, yet existing convergence analysis still relies on non-intuitive, impractical and often opaque conditions. In particular, existing rates are achieved in limited settings, under strict regularity conditions. In this work, we establish explicit convergence rates of policy gradient methods, extending the convergence regime to weakly smooth policy classes with L2 integrable gradient. We provide intuitive examples to illustrate the insight behind these new conditions. Notably, our analysis also shows that convergence rates are achievable for both the standard policy gradient and the natural policy gradient algorithms under these assumptions. Lastly we provide performance guarantees for the converged policies.
APA, Harvard, Vancouver, ISO, and other styles
4

Akella, Ravi Tej, Kamyar Azizzadenesheli, Mohammad Ghavamzadeh, Animashree Anandkumar, and Yisong Yue. "Deep Bayesian Quadrature Policy Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 6600–6608. http://dx.doi.org/10.1609/aaai.v35i8.16817.

Full text
Abstract:
We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample efficient alternatives like Bayesian quadrature methods have received little attention due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with a significantly lower variance, (ii) a consistent improvement in the sample complexity and average return for several deep policy gradient algorithms, and, (iii) the uncertainty in gradient estimation that can be incorporated to further improve the performance.
APA, Harvard, Vancouver, ISO, and other styles
5

Wang, Lin, Xingang Xu, Xuhui Zhao, Baozhu Li, Ruijuan Zheng, and Qingtao Wu. "A randomized block policy gradient algorithm with differential privacy in Content Centric Networks." International Journal of Distributed Sensor Networks 17, no. 12 (December 2021): 155014772110599. http://dx.doi.org/10.1177/15501477211059934.

Full text
Abstract:
Policy gradient methods are effective means to solve the problems of mobile multimedia data transmission in Content Centric Networks. Current policy gradient algorithms impose high computational cost in processing high-dimensional data. Meanwhile, the issue of privacy disclosure has not been taken into account, even though privacy protection is important in data training. Therefore, we propose a randomized block policy gradient algorithm with differential privacy. To reduce computational complexity when processing high-dimensional data, we randomly select a block of coordinates to update the gradients at each round. To address the privacy protection problem, we add a differential privacy mechanism to the algorithm and prove that it preserves the [Formula: see text]-privacy level. We conduct extensive simulations in four environments: CartPole, Walker, HalfCheetah, and Hopper. Compared with methods such as importance-sampling momentum-based policy gradient, Hessian-aided momentum-based policy gradient, and REINFORCE, our algorithm shows a faster convergence rate in the same environments.
APA, Harvard, Vancouver, ISO, and other styles
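
As a rough illustration of the two ingredients in the entry above, a randomly selected coordinate block per round and Gaussian noise for differential privacy, here is a numpy sketch; the block size, clipping threshold, noise scale, and gradient oracle are assumed placeholders, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)

def noisy_block_update(theta, grad_fn, block_size=16, lr=0.01, clip=1.0, noise_std=0.1):
    """One randomized-block policy gradient step with Gaussian noise added for privacy."""
    idx = rng.choice(theta.size, size=block_size, replace=False)  # random coordinate block
    g = grad_fn(theta)[idx]                                       # gradient restricted to the block
    g = g / max(1.0, np.linalg.norm(g) / clip)                    # clip to bound sensitivity
    g = g + rng.normal(0.0, noise_std * clip, size=g.shape)       # Gaussian mechanism
    theta = theta.copy()
    theta[idx] += lr * g                                          # gradient ascent on the block
    return theta

# Example with a dummy gradient oracle (placeholder for a policy gradient estimator).
theta = np.zeros(128)
theta = noisy_block_update(theta, lambda th: -th + 1.0)
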
6

Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (June 28, 2022): 7317–25. http://dx.doi.org/10.1609/aaai.v36i7.20694.

Full text
Abstract:
We introduce a novel training procedure for policy gradient methods wherein episodic memory is used to optimize the hyperparameters of reinforcement learning algorithms on-the-fly. Unlike other hyperparameter searches, we formulate hyperparameter scheduling as a standard Markov Decision Process and use episodic memory to store the outcome of used hyperparameters and their training contexts. At any policy update step, the policy learner refers to the stored experiences, and adaptively reconfigures its learning algorithm with the new hyperparameters determined by the memory. This mechanism, dubbed as Episodic Policy Gradient Training (EPGT), enables an episodic learning process, and jointly learns the policy and the learning algorithm's hyperparameters within a single run. Experimental results on both continuous and discrete environments demonstrate the advantage of using the proposed method in boosting the performance of various policy gradient algorithms.
APA, Harvard, Vancouver, ISO, and other styles
7

Cohen, Andrew, Xingye Qiao, Lei Yu, Elliot Way, and Xiangrong Tong. "Diverse Exploration via Conjugate Policies for Policy Gradient Methods." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3404–11. http://dx.doi.org/10.1609/aaai.v33i01.33013404.

Full text
Abstract:
We address the challenge of effective exploration while maintaining good performance in policy gradient methods. As a solution, we propose diverse exploration (DE) via conjugate policies. DE learns and deploys a set of conjugate policies which can be conveniently generated as a byproduct of conjugate gradient descent. We provide both theoretical and empirical results showing the effectiveness of DE at achieving exploration, improving policy performance, and the advantage of DE over exploration by random policy perturbations.
APA, Harvard, Vancouver, ISO, and other styles
8

Zhang, Junzi, Jongho Kim, Brendan O'Donoghue, and Stephen Boyd. "Sample Efficient Reinforcement Learning with REINFORCE." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10887–95. http://dx.doi.org/10.1609/aaai.v35i12.17300.

Full text
Abstract:
Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of "bad" episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.
APA, Harvard, Vancouver, ISO, and other styles
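
The entry above analyzes the classical REINFORCE estimator under soft-max parametrization. Here is a minimal tabular numpy sketch of that estimator; the episode format, step size, and toy dimensions are assumptions, not the authors' analysis code.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.1, gamma=0.99):
    """theta: (num_states, num_actions) logits; episode: list of (state, action, reward)."""
    G = 0.0
    grad = np.zeros_like(theta)
    # Walk the episode backwards, accumulating the reward-to-go G for each step.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax(theta[s])
        score = -probs
        score[a] += 1.0              # d/dtheta[s] of log softmax(theta[s])[a]
        grad[s] += score * G         # score-function (likelihood-ratio) term
    return theta + lr * grad         # one gradient-ascent step

# Example: one fake 3-step episode in a 4-state, 2-action problem.
theta = np.zeros((4, 2))
theta = reinforce_update(theta, [(0, 1, 0.0), (2, 0, 0.0), (3, 1, 1.0)])
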
9

Yu, Hai-Tao, Degen Huang, Fuji Ren, and Lishuang Li. "Diagnostic Evaluation of Policy-Gradient-Based Ranking." Electronics 11, no. 1 (December 23, 2021): 37. http://dx.doi.org/10.3390/electronics11010037.

Full text
Abstract:
Learning-to-rank has been intensively studied and has shown significantly increasing values in a wide range of domains, such as web search, recommender systems, dialogue systems, machine translation, and even computational biology, to name a few. In light of recent advances in neural networks, there has been a strong and continuing interest in exploring how to deploy popular techniques, such as reinforcement learning and adversarial learning, to solve ranking problems. However, armed with the aforesaid popular techniques, most studies tend to show how effective a new method is. A comprehensive comparison between techniques and an in-depth analysis of their deficiencies are somehow overlooked. This paper is motivated by the observation that recent ranking methods based on either reinforcement learning or adversarial learning boil down to policy-gradient-based optimization. Based on the widely used benchmark collections with complete information (where relevance labels are known for all items), such as MSLRWEB30K and Yahoo-Set1, we thoroughly investigate the extent to which policy-gradient-based ranking methods are effective. On one hand, we analytically identify the pitfalls of policy-gradient-based ranking. On the other hand, we experimentally compare a wide range of representative methods. The experimental results echo our analysis and show that policy-gradient-based ranking methods are, by a large margin, inferior to many conventional ranking methods. Regardless of whether we use reinforcement learning or adversarial learning, the failures are largely attributable to the gradient estimation based on sampled rankings, which significantly diverge from ideal rankings. In particular, the larger the number of documents per query and the more fine-grained the ground-truth labels, the greater the impact policy-gradient-based ranking suffers. Careful examination of this weakness is highly recommended for developing enhanced methods based on policy gradient.
APA, Harvard, Vancouver, ISO, and other styles
10

Baxter, J., and P. L. Bartlett. "Infinite-Horizon Policy-Gradient Estimation." Journal of Artificial Intelligence Research 15 (November 1, 2001): 319–50. http://dx.doi.org/10.1613/jair.806.

Full text
Abstract:
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura et al. (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter beta (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter beta is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
APA, Harvard, Vancouver, ISO, and other styles
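
A minimal sketch of a GPOMDP-style estimator as described above: an eligibility trace discounted by the free parameter beta, needing storage only on the order of the number of policy parameters. The interfaces are assumptions, not the original implementation.

import numpy as np

def gpomdp_estimate(score_fn, rollout, beta=0.95):
    """Estimate the average-reward gradient from a single long trajectory.

    score_fn(obs, action) -> gradient of log pi(action | obs) w.r.t. the policy parameters.
    rollout: iterable of (obs, action, reward) tuples.
    beta trades bias against variance, as discussed in the paper.
    """
    z = None          # eligibility trace, same size as the parameter vector
    delta = None      # running sum of reward-weighted traces
    T = 0
    for obs, action, reward in rollout:
        g = score_fn(obs, action)
        z = g if z is None else beta * z + g
        contrib = reward * z
        delta = contrib if delta is None else delta + contrib
        T += 1
    return delta / max(T, 1)

# Example with a toy one-parameter policy (purely illustrative).
g_hat = gpomdp_estimate(lambda obs, a: np.array([1.0 if a == 1 else -1.0]),
                        [((0,), 0, 1.0), ((0,), 1, 0.0)])
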
More sources

Dissertations / Theses on the topic "Policy gradient methods"

1

Greensmith, Evan. "Policy Gradient Methods: Variance Reduction and Stochastic Convergence." The Australian National University, Research School of Information Sciences and Engineering, 2005. http://thesis.anu.edu.au./public/adt-ANU20060106.193712.

Full text
Abstract:
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies, and using a policy from the class, and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.

In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular, when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment, and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.

In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm which is based on observing trajectories between visits to a special state of the environment. However, the algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
APA, Harvard, Vancouver, ISO, and other styles
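
To make the baseline discussion concrete, here is a small numpy sketch of the variance-minimizing constant baseline for a score-function estimator, the kind of quantity analyzed in this thesis; the inputs (per-trajectory squared score norms and returns) are assumed to be precomputed.

import numpy as np

def optimal_constant_baseline(score_norms_sq, returns):
    """Variance-minimizing constant baseline b* = E[||g||^2 R] / E[||g||^2],
    where g is the score of a sampled trajectory and R its return."""
    score_norms_sq = np.asarray(score_norms_sq, dtype=float)
    returns = np.asarray(returns, dtype=float)
    return np.sum(score_norms_sq * returns) / np.sum(score_norms_sq)

# Compare with the common choice of the average return as the baseline.
b_opt = optimal_constant_baseline([1.0, 4.0, 2.0], [3.0, 1.0, 2.0])
b_avg = np.mean([3.0, 1.0, 2.0])
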
2

Greensmith, Evan. "Policy gradient methods : variance reduction and stochastic convergence /." View thesis entry in Australian Digital Theses Program, 2005. http://thesis.anu.edu.au/public/adt-ANU20060106.193712/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Yuan, Rui. "Stochastic Second Order Methods and Finite Time Analysis of Policy Gradient Methods." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAT010.

Full text
Abstract:
To solve large-scale machine learning problems, first-order methods such as stochastic gradient descent and ADAM are the methods of choice because of their low cost per iteration. The issue with first-order methods is that they can require extensive parameter tuning and/or knowledge of the parameters of the problem. There is now a concerted effort to develop efficient stochastic second-order methods for large-scale machine learning, motivated by the fact that they require less parameter tuning and converge for a wider variety of models and datasets. In the first part of the thesis, we present a principled approach to designing stochastic Newton methods for solving both nonlinear equations and optimization problems in an efficient manner. Our approach has two steps. First, we re-write the nonlinear equations or the optimization problem as a desired system of nonlinear equations. Second, we apply new stochastic second-order methods to solve this system of nonlinear equations. Through our general approach, we showcase many specific new second-order algorithms that can solve large machine learning problems efficiently without requiring knowledge of the problem or parameter tuning. In the second part of the thesis, we focus on optimization algorithms applied in a specific domain: reinforcement learning (RL). This part is independent of the first part of the thesis. Policy gradient (PG) and its variant, natural policy gradient (NPG), are the foundations of several state-of-the-art algorithms (e.g., TRPO and PPO) used in deep RL. In spite of the empirical success of RL and PG methods, a solid theoretical understanding of even the "vanilla" PG has long been elusive. By leveraging the RL structure of the problem together with modern optimization proof techniques, we derive new finite-time analyses of both PG and NPG. Through our analysis, we also bring new insights to the methods with better hyperparameter choices.
APA, Harvard, Vancouver, ISO, and other styles
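
A compact tabular sketch of the vanilla and natural policy gradient steps studied in this thesis, for a soft-max policy at a single state; the step size, damping term, and dimensions are illustrative assumptions rather than choices from the thesis.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def natural_pg_step(logits, q_values, lr=0.1, damping=1e-3):
    """One natural policy gradient step for a single-state soft-max policy.

    Vanilla gradient: g = pi * q - pi * (pi . q)   (score-weighted value).
    Natural gradient: F^{-1} g, with F the Fisher information of the soft-max.
    """
    pi = softmax(logits)
    g = pi * q_values - pi * np.dot(pi, q_values)     # vanilla policy gradient
    F = np.diag(pi) - np.outer(pi, pi)                # Fisher information matrix
    nat_g = np.linalg.solve(F + damping * np.eye(len(pi)), g)
    return logits + lr * nat_g

logits = np.zeros(3)
logits = natural_pg_step(logits, np.array([1.0, 0.0, -1.0]))
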
4

Pianazzi, Enrico. "A deep reinforcement learning approach based on policy gradient for mobile robot navigation." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2022.

Find full text
Abstract:
Reinforcement learning is a model-free technique to solve decision-making problems by learning the best behavior to solve a specific task in a given environment. This thesis work focuses on state-of-the-art reinforcement learning methods and their application to mobile robot navigation and control. Our work is inspired by the recent developments in deep reinforcement learning and by the ever-growing need for complex control and navigation capabilities in autonomous mobile robots. We propose a reinforcement learning controller based on an actor-critic approach to navigate a mobile robot in an initially unknown environment. The task is to navigate the robot from a random initial point on the map to a fixed goal point, while trying to stay within the environment limits and to avoid obstacles on the path. The agent has no initial knowledge of the environment's characteristics, including the goal and obstacle positions. The adopted algorithm is the so-called Deep Deterministic Policy Gradient (DDPG), which is able to deal with continuous states and inputs thanks to the use of neural networks in the actor-critic architecture and of the policy gradient to update the neural network representing the control policy. The learned controller directly outputs velocity commands to the robot, basing its decisions on the robot's position, without the need for additional sensory data. The robot is simulated as a unicycle kinematic model, and we present an implementation of the learning algorithm and robot simulation developed in Python that is able to solve the goal-reaching task while avoiding obstacles with a success rate above 95%.
APA, Harvard, Vancouver, ISO, and other styles
5

Greensmith, Evan. "Policy Gradient Methods: Variance Reduction and Stochastic Convergence." Phd thesis, 2005. http://hdl.handle.net/1885/47105.

Full text
Abstract:
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies, and using a policy from the class, and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance. ...
APA, Harvard, Vancouver, ISO, and other styles
6

"Adaptive Curvature for Stochastic Optimization." Master's thesis, 2019. http://hdl.handle.net/2286/R.I.53675.

Full text
Abstract:
This thesis presents a family of adaptive curvature methods for gradient-based stochastic optimization. In particular, a general algorithmic framework is introduced along with a practical implementation that yields an efficient, adaptive curvature gradient descent algorithm. To this end, a theoretical and practical link between curvature matrix estimation and shrinkage methods for covariance matrices is established. The use of shrinkage improves estimation accuracy of the curvature matrix when data samples are scarce. This thesis also introduces several insights that result in data- and computation-efficient update equations. Empirical results suggest that the proposed method compares favorably with existing second-order techniques based on the Fisher or Gauss-Newton matrix and with adaptive stochastic gradient descent methods on both supervised and reinforcement learning tasks.
Dissertation/Thesis
Masters Thesis Computer Science 2019
APA, Harvard, Vancouver, ISO, and other styles
7

Pereira, Bruno Alexandre Barbosa. "Deep reinforcement learning for robotic manipulation tasks." Master's thesis, 2021. http://hdl.handle.net/10773/33654.

Full text
Abstract:
The recent advances in Artificial Intelligence (AI) present new opportunities for robotics on many fronts. Deep Reinforcement Learning (DRL) is a sub-field of AI which results from the combination of Deep Learning (DL) and Reinforcement Learning (RL). It categorizes machine learning algorithms which learn directly from experience and offers a comprehensive framework for studying the interplay among learning, representation and decision-making. It has already been successfully used to solve tasks in many domains. Most notably, DRL agents learned to play Atari 2600 video games directly from pixels and achieved human-comparable performance in 49 of those games. Additionally, recent efforts using DRL in conjunction with other techniques produced agents capable of playing the board game of Go at a professional level, which has long been viewed as an intractable problem due to its enormous search space. In the context of robotics, DRL is often applied to planning, navigation, optimal control and others. Here, the powerful function approximation and representation learning properties of Deep Neural Networks enable RL to scale up to problems with high-dimensional state and action spaces. Additionally, inherent properties of DRL make transfer learning useful when moving from simulation to the real world. This dissertation aims to investigate the applicability and effectiveness of DRL to learn successful policies in the domain of robot manipulator tasks. Initially, a set of three classic RL problems were solved using RL and DRL algorithms in order to explore their practical implementation and arrive at a class of algorithms appropriate for these robotic tasks. Afterwards, a task in simulation is defined such that an agent is set to control a 6 DoF manipulator to reach a target with its end effector. This is used to evaluate the effects on performance of different state representations, hyperparameters and state-of-the-art DRL algorithms, resulting in agents with high success rates. The emphasis is then placed on the speed and time restrictions of the end effector's positioning. To this end, different reward systems were tested for an agent learning a modified version of the previous reaching task with faster joint speeds. In this setting, a number of improvements were verified in relation to the original reward system. Finally, an application of the best reaching agent obtained from the previous experiments is demonstrated on a simplified ball catching scenario.
Master's degree in Computer and Telematics Engineering
APA, Harvard, Vancouver, ISO, and other styles
8

Chong, Kiah-Yang (張家揚). "Design and Implementation of Fuzzy Policy Gradient Gait Learning Method for Humanoid Robot." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/90100127378597192142.

Full text
Abstract:
Master's thesis, National Cheng Kung University, Department of Electrical Engineering (Master's and Doctoral Program), ROC academic year 98 (2009–2010).
The design and implementation of a Fuzzy Policy Gradient Learning (FPGL) method for a small-sized humanoid robot is proposed in this thesis. The thesis introduces the mechanical structure of the humanoid robot, named aiRobots-V, and its hardware system, and also improves and parameterizes the robot's gait pattern. Arm movement is added to the gait pattern to reduce the tilt of the trunk while walking. FPGL is an integrated machine learning method that combines Policy Gradient Reinforcement Learning (PGRL) with fuzzy logic concepts in order to improve the efficiency and speed of gait learning. The humanoid robot is trained with the FPGL method, which uses the walking distance over a fixed number of walking cycles as the reward, to automatically learn a faster and more stable gait. The tilt of the trunk is used as the reward for learning the arm movement within the walking cycle. The experimental results show that the FPGL method could improve the gait from a walking speed of 9.26 mm/s to 162.27 mm/s in about an hour, and the training data show that the method improves the efficiency of the basic PGRL method by up to 13%. The experiments also confirm that the arm movement reduces trunk tilt. The robot was also used in the throw-in technical challenge at RoboCup 2010.
APA, Harvard, Vancouver, ISO, and other styles
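
The gait learning described above builds on policy gradient reinforcement learning over a parameterized gait. A rough finite-difference sketch of such a perturb-and-evaluate loop is given below; the perturbation scheme, reward oracle, and parameter names are assumptions for illustration, not the thesis implementation.

import numpy as np

rng = np.random.default_rng(0)

def estimate_gait_gradient(params, walk_reward, num_policies=10, eps=0.05):
    """Estimate a gradient over gait parameters from perturbed trial walks.

    walk_reward(params) -> scalar reward, e.g. distance walked in a fixed number of cycles.
    Each parameter of each trial policy is perturbed by -eps, 0, or +eps.
    """
    grad = np.zeros_like(params)
    perturbations = rng.choice([-eps, 0.0, eps], size=(num_policies, params.size))
    rewards = np.array([walk_reward(params + p) for p in perturbations])
    for j in range(params.size):
        plus = rewards[perturbations[:, j] > 0]
        zero = rewards[perturbations[:, j] == 0]
        minus = rewards[perturbations[:, j] < 0]
        if plus.size and zero.size and minus.size and zero.mean() >= max(plus.mean(), minus.mean()):
            grad[j] = 0.0                        # leaving this parameter unchanged looks best
        elif plus.size and minus.size:
            grad[j] = plus.mean() - minus.mean()
    return grad

# Example with a toy quadratic "walking speed" objective.
params = np.zeros(5)
params += 0.1 * estimate_gait_gradient(params, lambda p: -np.sum((p - 0.3) ** 2))
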

Books on the topic "Policy gradient methods"

1

Deep Reinforcement Learning Hands-On: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Packt Publishing, 2018.

Find full text
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Policy gradient methods"

1

Zeugmann, Thomas, Pascal Poupart, James Kennedy, Xin Jin, Jiawei Han, Lorenza Saitta, Michele Sebag, et al. "Policy Gradient Methods." In Encyclopedia of Machine Learning, 774–76. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_640.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Hu, Michael. "Policy Gradient Methods." In The Art of Reinforcement Learning, 177–96. Berkeley, CA: Apress, 2023. http://dx.doi.org/10.1007/978-1-4842-9606-6_9.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Peters, Jan, and J. Andrew Bagnell. "Policy Gradient Methods." In Encyclopedia of Machine Learning and Data Mining, 1–4. Boston, MA: Springer US, 2016. http://dx.doi.org/10.1007/978-1-4899-7502-7_646-1.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Peters, Jan, and J. Andrew Bagnell. "Policy Gradient Methods." In Encyclopedia of Machine Learning and Data Mining, 982–85. Boston, MA: Springer US, 2017. http://dx.doi.org/10.1007/978-1-4899-7687-1_646.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Hu, Michael. "Advanced Policy Gradient Methods." In The Art of Reinforcement Learning, 205–20. Berkeley, CA: Apress, 2023. http://dx.doi.org/10.1007/978-1-4842-9606-6_11.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Semmler, Markus. "Fisher Information Approximations in Policy Gradient Methods." In Reinforcement Learning Algorithms: Analysis and Applications, 59–67. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-41188-6_6.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Hansel, Kay, Janosch Moos, and Cedric Derstroff. "Benchmarking the Natural Gradient in Policy Gradient Methods and Evolution Strategies." In Reinforcement Learning Algorithms: Analysis and Applications, 69–84. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-41188-6_7.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Jiang, Xuesong, Zhipeng Li, and Xiumei Wei. "Asynchronous Methods for Multi-agent Deep Deterministic Policy Gradient." In Neural Information Processing, 711–21. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-04179-3_63.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Levy, Kfir Y., and Nahum Shimkin. "Unified Inter and Intra Options Learning Using Policy Gradient Methods." In Lecture Notes in Computer Science, 153–64. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-29946-9_17.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Sabbioni, Luca, Francesco Corda, and Marcello Restelli. "Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes." In Machine Learning and Knowledge Discovery in Databases: Research Track, 506–23. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-43421-1_30.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Policy gradient methods"

1

Peters, Jan, and Stefan Schaal. "Policy Gradient Methods for Robotics." In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2006. http://dx.doi.org/10.1109/iros.2006.282564.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Ståhlberg, Simon, Blai Bonet, and Hector Geffner. "Learning General Policies with Policy Gradient Methods." In 20th International Conference on Principles of Knowledge Representation and Reasoning {KR-2023}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/kr.2023/63.

Full text
Abstract:
While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
APA, Harvard, Vancouver, ISO, and other styles
3

Li, Dong, Dongbin Zhao, Qichao Zhang, and Chaomin Luo. "Policy gradient methods with Gaussian process modelling acceleration." In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017. http://dx.doi.org/10.1109/ijcnn.2017.7966065.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Shi, Wenjie, Shiji Song, and Cheng Wu. "Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/475.

Full text
Abstract:
Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large off-policy data or cannot scale to tasks with very high state and action dimensionality such as 3D humanoid locomotion. Besides, the optimality of the desired Boltzmann policy set for a non-optimal soft value function is not persuasive enough. In this paper, we first derive the soft policy gradient based on an entropy-regularized expected reward objective for RL with continuous actions. Then, we present an off-policy actor-critic, model-free maximum entropy deep RL algorithm called deep soft policy gradient (DSPG) by combining the soft policy gradient with the soft Bellman equation. To ensure stable learning while eliminating the need for two separate critics for soft value functions, we leverage a double sampling approach to make the soft Bellman equation tractable. The experimental results demonstrate that our method outperforms prior off-policy methods.
APA, Harvard, Vancouver, ISO, and other styles
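
To illustrate the entropy-regularized objective behind the soft policy gradient above, here is a minimal PyTorch sketch that adds an entropy bonus to a standard score-function loss. It uses a discrete-action policy and placeholder networks, whereas the paper targets continuous actions, so treat it only as a sketch of the objective, not of DSPG itself.

import torch
import torch.nn as nn

obs_dim, num_actions = 4, 3
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, num_actions))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def soft_pg_loss(states, actions, returns, alpha=0.2):
    """Score-function loss with an entropy bonus: maximize E[R] + alpha * H(pi)."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    pg_term = -(log_probs * returns).mean()        # REINFORCE-style term
    entropy_term = -alpha * dist.entropy().mean()  # encourages stochastic policies
    return pg_term + entropy_term

# One update on a random batch (placeholders for real rollout data).
loss = soft_pg_loss(torch.randn(16, obs_dim),
                    torch.randint(num_actions, (16,)),
                    torch.randn(16))
opt.zero_grad(); loss.backward(); opt.step()
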
5

Ma, Xiaobai, Katherine Driggs-Campbell, Zongzhang Zhang, and Mykel J. Kochenderfer. "Monte Carlo Tree Search for Policy Optimization." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/432.

Full text
Abstract:
Gradient-based methods are often used for policy optimization in deep reinforcement learning, despite being vulnerable to local optima and saddle points. Although gradient-free methods (e.g., genetic algorithms or evolution strategies) help mitigate these issues, poor initialization and local optima are still concerns in highly nonconvex spaces. This paper presents a method for policy optimization based on Monte-Carlo tree search and gradient-free optimization. Our method, called Monte-Carlo tree search for policy optimization (MCTSPO), provides a better exploration-exploitation trade-off through the use of the upper confidence bound heuristic. We demonstrate improved performance on reinforcement learning tasks with deceptive or sparse reward functions compared to popular gradient-based and deep genetic algorithm baselines.
APA, Harvard, Vancouver, ISO, and other styles
6

Ziemann, Ingvar, Anastasios Tsiamis, Henrik Sandberg, and Nikolai Matni. "How are policy gradient methods affected by the limits of control?" In 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE, 2022. http://dx.doi.org/10.1109/cdc51059.2022.9992612.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Ding, Yuhao, Junzi Zhang, and Javad Lavaei. "Local Analysis of Entropy-Regularized Stochastic Soft-Max Policy Gradient Methods." In 2023 European Control Conference (ECC). IEEE, 2023. http://dx.doi.org/10.23919/ecc57647.2023.10178123.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Peng, Zilun, Ahmed Touati, Pascal Vincent, and Doina Precup. "SVRG for Policy Evaluation with Fewer Gradient Evaluations." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/374.

Full text
Abstract:
Stochastic variance-reduced gradient (SVRG) is an optimization method originally designed for tackling machine learning problems with a finite sum structure. SVRG was later shown to work for policy evaluation, a problem in reinforcement learning in which one aims to estimate the value function of a given policy. SVRG makes use of gradient estimates at two scales. At the slower scale, SVRG computes a full gradient over the whole dataset, which could lead to prohibitive computation costs. In this work, we show that two variants of SVRG for policy evaluation could significantly diminish the number of gradient calculations while preserving a linear convergence speed. More importantly, our theoretical result implies that one does not need to use the entire dataset in every epoch of SVRG when it is applied to policy evaluation with linear function approximation. Our experiments demonstrate large computational savings provided by the proposed methods.
APA, Harvard, Vancouver, ISO, and other styles
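
The entry above rests on SVRG's two gradient scales: a full gradient computed once per epoch and cheap per-sample corrections in between. Here is a generic numpy sketch of that structure on a finite-sum objective; the least-squares example is an assumption chosen for brevity, not the paper's saddle-point formulation of policy evaluation.

import numpy as np

rng = np.random.default_rng(0)

def svrg(grad_i, n, w, epochs=10, inner_steps=100, lr=0.01):
    """Generic SVRG loop.

    grad_i(w, i) -> gradient of the i-th summand at w; n = number of summands.
    """
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)  # slow scale
        for _ in range(inner_steps):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad  # variance-reduced gradient (fast scale)
            w = w - lr * g
    return w

# Toy least-squares example: one summand per (feature, target) pair.
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w = svrg(lambda w, i: (X[i] @ w - y[i]) * X[i], 50, np.zeros(3))
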
9

Gronauer, Sven, Martin Gottwald, and Klaus Diepold. "The Successful Ingredients of Policy Gradient Algorithms." In Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}. California: International Joint Conferences on Artificial Intelligence Organization, 2021. http://dx.doi.org/10.24963/ijcai.2021/338.

Full text
Abstract:
Despite the sublime success in recent years, the underlying mechanisms powering the advances of reinforcement learning are yet poorly understood. In this paper, we identify these mechanisms - which we call ingredients - in on-policy policy gradient methods and empirically determine their impact on the learning. To allow an equitable assessment, we conduct our experiments based on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not need sophisticated algorithms but can also be accomplished by the combination of a few simple ingredients.
APA, Harvard, Vancouver, ISO, and other styles
10

Riedmiller, Martin, Jan Peters, and Stefan Schaal. "Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark." In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning. IEEE, 2007. http://dx.doi.org/10.1109/adprl.2007.368196.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Policy gradient methods"

1

Umberger, Pierce. Experimental Evaluation of Dynamic Crack Branching in Poly(methyl methacrylate) (PMMA) Using the Method of Coherent Gradient Sensing. Fort Belvoir, VA: Defense Technical Information Center, February 2010. http://dx.doi.org/10.21236/ada518614.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

A Decision-Making Method for Connected Autonomous Driving Based on Reinforcement Learning. SAE International, December 2020. http://dx.doi.org/10.4271/2020-01-5154.

Full text
Abstract:
At present, with the development of Intelligent Vehicle Infrastructure Cooperative Systems (IVICS), decision-making for automated vehicles based on connected environment conditions has attracted more attention. Reliability, efficiency and generalization performance are the basic requirements for a vehicle decision-making system. Therefore, this paper proposes a decision-making method for connected autonomous driving based on the Wasserstein Generative Adversarial Nets-Deep Deterministic Policy Gradient (WGAIL-DDPG) algorithm. The key component of the reinforcement learning (RL) model, the reward function, is designed from the perspective of vehicle serviceability, covering safety, ride comfort and handling stability. To reduce the complexity of the proposed model, an imitation learning strategy is introduced to improve the RL training process. Meanwhile, a model training strategy based on cloud computing effectively solves the problem of insufficient computing resources of the vehicle-mounted system. Test results show that the proposed method can improve the efficiency of the RL training process with reliable decision-making performance and reveals excellent generalization capability.
APA, Harvard, Vancouver, ISO, and other styles