Dissertations / Theses on the topic 'Reinforcement Learning Algorithms'

Consult the top 50 dissertations / theses for your research on the topic 'Reinforcement Learning Algorithms.'

1

Janagam, Anirudh, and Saddam Hossen. "Analysis of Network Intrusion Detection System with Machine Learning Algorithms (Deep Reinforcement Learning Algorithm)." Thesis, Blekinge Tekniska Högskola, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-17126.

Full text
2

Song, Yupu. "A Forex Trading System Using Evolutionary Reinforcement Learning." Digital WPI, 2017. https://digitalcommons.wpi.edu/etd-theses/1240.

Full text
Abstract:
Building automated trading systems has long been one of the most cutting-edge and exciting fields in the financial industry. In this research project, we built a trading system based on machine learning methods. We used the Recurrent Reinforcement Learning (RRL) algorithm as our fundamental algorithm, and by introducing Genetic Algorithms (GA) into the optimization procedure, we tackled the problems of picking good initial parameter values and dynamically updating the learning speed in the original RRL algorithm. We call this optimization algorithm the Evolutionary Recurrent Reinforcement Learning (ERRL) algorithm, or the GA-RRL algorithm. ERRL finds many locally optimal solutions more easily and quickly than the original RRL algorithm. Finally, we implemented the GA-RRL system on EUR/USD data at the 5-minute level, and the backtest performance showed that our GA-RRL system has potentially promising profitability. In future research we plan to introduce a risk control mechanism, apply the system to different markets and assets, and perform backtests at higher frequencies.
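As a rough illustration of the idea summarised above (and not the author's implementation), the sketch below evolves weight vectors for an RRL-style trader whose position is the tanh of a weighted window of recent returns; the synthetic returns, window size, transaction-cost constant and GA settings are all assumptions made for the example.

```python
# Minimal GA-over-RRL sketch on synthetic data; not the thesis code.
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 1e-4, size=2000)      # placeholder for 5-minute EUR/USD returns

def trade(w, rets, window=8):
    """Run the RRL-style trader and return its cumulative profit."""
    pos, profit = 0.0, 0.0
    for t in range(window, len(rets)):
        x = np.append(rets[t - window:t], pos)           # recent returns plus previous position
        new_pos = np.tanh(w @ x)                          # position in [-1, 1]
        profit += new_pos * rets[t] - 1e-5 * abs(new_pos - pos)   # PnL minus transaction cost
        pos = new_pos
    return profit

def evolve(pop_size=20, dim=9, generations=10):
    """Toy GA: keep the best half of the population, refill it with mutated copies."""
    pop = rng.normal(0.0, 0.1, size=(pop_size, dim))
    for _ in range(generations):
        fitness = np.array([trade(w, returns) for w in pop])
        elite = pop[np.argsort(fitness)[-pop_size // 2:]]
        pop = np.vstack([elite, elite + rng.normal(0.0, 0.05, size=elite.shape)])
    return pop[np.argmax([trade(w, returns) for w in pop])]

best_w = evolve()
print("cumulative profit of the evolved trader:", trade(best_w, returns))
```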
3

Moturu, Krishna Priya Darsini. "Application of reinforcement learning algorithms to software verification." Master's thesis, Québec : Université Laval, 2006. http://www.theses.ulaval.ca/2006/23583/23583.pdf.

Full text
4

Cuningham, Blake. "Evolutionary algorithms for optimising reinforcement learning policy approximation." Master's thesis, Faculty of Science, 2019. http://hdl.handle.net/11427/31170.

Full text
Abstract:
Reinforcement learning methods have become more efficient in recent years. In particular, the A3C (asynchronous advantage actor-critic) approach demonstrated in Mnih et al. (2016) was able to halve the training time of the existing state-of-the-art approaches. However, these methods still require relatively large amounts of training resources due to the fundamentally exploratory nature of reinforcement learning. Other machine learning approaches can improve the training of reinforcement learning agents by better processing input information to help map states to actions: convolutional and recurrent neural networks are helpful when the input data is in image form and does not satisfy the Markov property. The required architecture of these convolutional and recurrent neural network models is not obvious given the infinite possible permutations. There is very limited research giving clear guidance on neural network structure in an RL (reinforcement learning) context, and grid-search-like approaches require too many resources and do not always find good optima. In order to address these and other challenges associated with traditional parameter optimization methods, an evolutionary approach similar to that taken by Dufourq and Bassett (2017) for image classification tasks was used to find the optimal model architecture when training an agent that learns to play Atari Pong. The approach found models that were able to train reinforcement learning agents faster, and with fewer parameters, than the OpenAI model in Blackwell et al. (2018), which achieved a superhuman level of performance.
5

Chalup, Stephan Konrad. "Incremental learning with neural networks, evolutionary computation and reinforcement learning algorithms." Thesis, Queensland University of Technology, 2001.

Find full text
6

Frank, Jordan William 1980. "Reinforcement learning in the presence of rare events." Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=111576.

Full text
Abstract:
Learning agents often find themselves in environments in which rare significant events occur independently of their current choice of action. Traditional reinforcement learning algorithms sample events according to their natural probability of occurring, and therefore tend to exhibit slow convergence and high variance in such environments. In this thesis, we assume that learning is done in a simulated environment in which the probability of these rare events can be artificially altered. We present novel algorithms for both policy evaluation and control, using both tabular and function approximation representations of the value function. These algorithms automatically tune the rare event probabilities to minimize the variance and use importance sampling to correct for changes in the dynamics. We prove that these algorithms converge, provide an analysis of their bias and variance, and demonstrate their utility in a number of domains, including a large network planning task.
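A hedged, one-step toy version of the mechanism described above (not the thesis algorithms): the simulator inflates the probability of a rare, costly event, and an importance-sampling weight corrects the TD(0) update back toward the natural dynamics. The probabilities and rewards are invented.

```python
import random

P_RARE = 0.001      # natural probability of the rare event
Q_RARE = 0.05       # inflated probability used by the simulator

def step(q_rare):
    """One-step toy transition: a rare event yields a large penalty."""
    if random.random() < q_rare:
        return "rare", -100.0
    return "normal", 1.0

def td0_with_rare_events(episodes=20000, alpha=0.01, gamma=0.95):
    value = {"start": 0.0, "normal": 0.0, "rare": 0.0}
    for _ in range(episodes):
        s2, reward = step(Q_RARE)
        # likelihood ratio of the sampled transition: natural dynamics / altered dynamics
        rho = P_RARE / Q_RARE if s2 == "rare" else (1 - P_RARE) / (1 - Q_RARE)
        td_error = reward + gamma * value[s2] - value["start"]
        value["start"] += alpha * rho * td_error
    return value["start"]

# Should approach the true expected reward 0.999 * 1 - 0.001 * 100 = 0.899
print(td0_with_rare_events())
```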
7

Lee, Siu-keung, and 李少強. "Reinforcement learning for intelligent assembly automation." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2002. http://hub.hku.hk/bib/B31244397.

Full text
8

Qi, Dehu. "Multi-agent systems : integrating reinforcement learning, bidding and genetic algorithms /." free to MU campus, to others for purchase, 2002. http://wwwlib.umi.com/cr/mo/fullcit?p3060133.

Full text
9

Brunnström, Jesper, and Kamil Kaminski. "Exploring Deep Reinforcement Learning Algorithms for Homogeneous Multi-Agent Systems." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239364.

Full text
Abstract:
Despite advances in Deep Reinforcement Learning, multi-agent systems remain somewhat unexplored in comparison to single-agent systems, with few clear conclusions. In order to investigate this, two algorithms have been implemented and tested on a simple multi-agent system: Deep Q-Learning with several improvements (EDQN) and Asynchronous Advantage Actor-Critic (A3C). The results show that with an increasing number of agents, learning a well-performing policy takes more time. When only a few agents are used, the fully trained performance of both algorithms is similar and can be viewed as satisfactory. With more than 3-4 agents the performance of the A3C algorithm decreases while EDQN maintains its good performance. Certain hyperparameters of these algorithms have been investigated and the results presented. In conclusion, EDQN performs better than A3C with multiple agents. For both algorithms, there is a strong sensitivity to the hyperparameters.
10

Dalla, Libera Alberto. "Learning algorithms for robotics systems." Doctoral thesis, Università degli studi di Padova, 2019. http://hdl.handle.net/11577/3422839.

Full text
Abstract:
Robotic systems are now increasingly widespread in our daily life. For instance, robots have been successfully used in several fields, such as agriculture, construction, defense, aerospace, and hospitality. However, there are still several issues to be addressed to allow the large-scale deployment of robots. Issues related to security and to manufacturing and operating costs are particularly relevant. Indeed, unlike in industrial applications, service robots should be cheap and capable of operating in unknown, or partially unknown, environments, possibly with minimal human intervention. To deal with these challenges, in recent years the research community has focused on deriving learning algorithms capable of providing flexibility and adaptability to robots. In this context, the application of Machine Learning and Reinforcement Learning techniques turns out to be especially useful. In this manuscript, we propose different learning algorithms for robotic systems. In Chapter 2, we propose a solution for learning the geometrical model of a robot directly from data, combining proprioceptive measures with data collected with a 2D camera. Besides testing the accuracy of the derived kinematic models in real experiments, we validate the possibility of deriving a kinematic controller based on the identified model. In Chapter 3, we address the robot inverse dynamics problem. Our strategy relies on the fact that the robot inverse dynamics is a polynomial function in a particular input space. Besides characterizing the input space, we propose a data-driven solution based on Gaussian Process Regression (GPR). Given the type of each joint, we define a kernel named the Geometrically Inspired Polynomial (GIP) kernel, which is given by the product of several polynomial kernels. To cope with the dimensionality of the resulting polynomial, we use a variation of the standard polynomial kernel, named the Multiplicative Polynomial kernel, further discussed in Chapter 6. Tests performed in simulated and real environments show that, compared to other data-driven solutions, the GIP-kernel-based estimator is more accurate and data-efficient. In Chapter 4, we propose a proprioceptive collision detection algorithm based on GPR. Compared to other proprioceptive approaches, we closely inspect the robot behaviors in quasi-static configurations, namely configurations in which joint velocities are null or close to zero. Such configurations are particularly relevant in the Collaborative Robotics context, where humans and robots work side by side sharing the same environment. Experimental results obtained with a UR10 robot confirm the relevance of the problem and the effectiveness of the proposed solution. Finally, in Chapter 5, we present MC-PILCO, a model-based policy search algorithm inspired by the PILCO algorithm. Like the original PILCO algorithm, MC-PILCO models the system evolution relying on GPR and improves the control policy by minimizing the expected value of a cost function. However, instead of approximating the expected cost by moment matching, MC-PILCO approximates it with a Monte Carlo particle-based approach; no assumption about the type of GPR model is necessary. Thus, MC-PILCO allows more freedom in designing the GPR models, possibly leading to better models of the system dynamics. Results obtained in a simulated environment show consistent improvements with respect to the original algorithm, both in terms of speed and success rate.
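As a hedged illustration of the kernel structure mentioned above (a simplification, not the thesis implementation), the sketch below performs Gaussian Process Regression with a kernel built as a product of polynomial kernels, each acting on one group of joint-related inputs; the toy "inverse dynamics" target, the grouping and the noise level are invented.

```python
import numpy as np

def poly_kernel(a, b, degree=2, c=1.0):
    """Standard polynomial kernel on a block of input columns."""
    return (a @ b.T + c) ** degree

def product_kernel(X1, X2, groups):
    """Product of polynomial kernels, one per group of columns (echoing the GIP idea)."""
    k = np.ones((X1.shape[0], X2.shape[0]))
    for idx in groups:
        k *= poly_kernel(X1[:, idx], X2[:, idx])
    return k

def gpr_predict(X_train, y_train, X_test, groups, noise=1e-2):
    K = product_kernel(X_train, X_train, groups) + noise * np.eye(len(X_train))
    K_star = product_kernel(X_test, X_train, groups)
    return K_star @ np.linalg.solve(K, y_train)    # GP posterior mean

# Toy data: "torque" as a polynomial function of two joints' positions and velocities.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))              # columns: q1, dq1, q2, dq2
y = X[:, 0] * X[:, 2] + X[:, 1] ** 2 - X[:, 3]     # invented dynamics
groups = [[0, 1], [2, 3]]                          # one kernel per joint's inputs
X_test = rng.uniform(-1, 1, size=(3, 4))
print(gpr_predict(X, y, X_test, groups))
print(X_test[:, 0] * X_test[:, 2] + X_test[:, 1] ** 2 - X_test[:, 3])   # ground truth
```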
11

Bertozzi, Enrico. "Development of Reinforcement Learning Algorithms for Non-cooperative Target Localization and Tracking." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020.

Find full text
Abstract:
The problem addressed in this thesis is to use swarm agents to find the placement that yields optimal localization performance for a target node in a wireless sensor network scenario. Localization can be based simply on received signal strength (RSSI) and trilateration. To measure the accuracy of the localization process, the geometric dilution of precision (GDOP) has been used. Trilateration is performed by mobile anchors, which in this work are assumed to be drones. Three anchors are used. The anchors are free to move in an environment represented by a grid, and each drone occupies one grid cell. To move from one cell to another, five actions are allowed: each agent can move one cell north, south, east or west, or remain in its current position, where possible. Localization is performed on a target node arbitrarily positioned in the environment. Each time the drones make a move, they receive a reward that depends on the estimated distance from the target and the GDOP. This allows the drones to determine whether or not the action taken in a particular cell was useful. Three different algorithms have been proposed and implemented. The first, called 'multi-agent Q-learning', is used in small gridworlds. Each action executable in a cell is assigned a value, called a Q-value, indicating how useful that action is for reaching the final goal. The tested scenarios include environments both with and without obstacles. A deep reinforcement learning approach was used to extend the problem to larger environments: thanks to the use of neural networks, an 'actor-critic' algorithm has been implemented, in which the action is chosen from a probability distribution. Finally, the two algorithms have been combined in a hybrid technique that allows trilateration to be performed even on mobile targets.
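Since the reward above hinges on the GDOP metric, here is a hedged sketch of one standard way to compute GDOP for range-based 2-D trilateration from the anchor (drone) geometry; the coordinates are arbitrary and the thesis' exact distance/GDOP reward combination is not reproduced.

```python
import numpy as np

def gdop(anchors, target):
    """Geometric dilution of precision for range-based 2-D localization."""
    diff = anchors - target
    unit = diff / np.linalg.norm(diff, axis=1, keepdims=True)   # unit line-of-sight vectors
    q = np.linalg.inv(unit.T @ unit)                            # inverse of the geometry matrix
    return float(np.sqrt(np.trace(q)))

target = np.array([5.0, 5.0])
spread_out = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])   # anchors surrounding the target
clustered = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])      # anchors bunched to one side
print(gdop(spread_out, target), "<", gdop(clustered, target))   # lower GDOP means better geometry
```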
12

Dhandayuthapani, Sumithra. "Automatic selection of dynamic loop scheduling algorithms for load balancing using reinforcement learning." Master's thesis, Mississippi State : Mississippi State University, 2004. http://library.msstate.edu/etd/show.asp?etd=etd-06292004-144402.

Full text
13

Andrag, Walter H. "Reinforcement learning for routing in communication networks." Thesis, Stellenbosch : Stellenbosch University, 2003. http://hdl.handle.net/10019.1/53570.

Full text
Abstract:
Routing policies for packet-switched communication networks must be able to adapt to changing traffic patterns and topologies. We study the feasibility of implementing an adaptive routing policy using the Q-Learning algorithm, which learns sequences of actions from delayed rewards. The Q-Routing algorithm adapts a network's routing policy based on local information alone and converges toward an optimal solution. We demonstrate that Q-Routing is a viable alternative to other adaptive routing methods, outperforming the Bellman-Ford algorithm in many respects. We also study variations of Q-Routing designed to better explore possible routes, to take limited buffer sizes at network elements into consideration, and to optimize multiple objectives.
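For context, the Q-Routing rule the abstract builds on (introduced by Boyan and Littman) is compact enough to sketch. The snippet below is purely illustrative, with made-up node names, delays and a flat dictionary for the per-node tables; it is not the thesis code.

```python
from collections import defaultdict

Q = defaultdict(float)     # keys: (node, destination, neighbour) -> estimated delivery time
ALPHA = 0.5                # learning rate

def best_estimate(node, dest, neighbours):
    """Smallest estimated delivery time from `node` to `dest` over its neighbours."""
    return min(Q[(node, dest, n)] for n in neighbours) if neighbours else 0.0

def q_routing_update(x, dest, y, wait_time, transit_time, neighbours_of_y):
    """After x forwards a packet for `dest` to neighbour y, y reports back its own best
    remaining estimate and x moves its estimate toward the observed cost."""
    target = wait_time + transit_time + best_estimate(y, dest, neighbours_of_y)
    Q[(x, dest, y)] += ALPHA * (target - Q[(x, dest, y)])

# Node 'A' sent a packet for 'D' via neighbour 'B': 2 time units queueing, 1 unit on the link.
q_routing_update('A', 'D', 'B', wait_time=2.0, transit_time=1.0, neighbours_of_y=['C', 'D'])
print(Q[('A', 'D', 'B')])
```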
14

Crandall, Jacob W. "Learning Successful Strategies in Repeated General-sum Games." Diss., Brigham Young University, 2005. http://contentdm.lib.byu.edu/ETD/image/etd1156.pdf.

Full text
15

White, Spencer Kesson. "Reinforcement Programming: A New Technique in Automatic Algorithm Development." Diss., Brigham Young University, 2006. http://contentdm.lib.byu.edu/ETD/image/etd1368.pdf.

Full text
16

Mancini, Riccardo. "Optimizing cardboard-blank picking in a packaging machine by using Reinforcement Learning algorithms." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2022.

Find full text
Abstract:
Artificial Intelligence (AI) has been one of the most promising research topics for years, and the world of industrial process control is beginning to approach the possibilities offered by so-called Machine Learning. Problems without a model, with multiple degrees of freedom and difficult to interpret, where traditional control technologies lose efficiency, seem like an ideal benchmark for AI. This thesis focuses on a coffee-capsule packaging machine currently being optimized at the Research and Innovation department of IMA S.p.A. In particular, the analysis concerns the apparatus responsible for picking the cardboard blanks that will subsequently and progressively be formed to envelop the capsules. The success of this first operation depends on various controllable parameters and as many disturbances. The relationship between the former and the latter is not easily identifiable, and its understanding has so far been entrusted to the experiential knowledge of operators called to intervene in the event of incorrect picking cycles. This thesis aims to achieve adaptive control using Machine Learning algorithms, specifically from the branch of Reinforcement Learning, to identify and autonomously apply the parameter corrections that best avoid missing or incorrect picking cycles, which affect the productivity of the production system and the quality of the final product. After a review of the theoretical foundations and the state of the art of RL, the case study is introduced and adapted to the RL framework. The subsequent choice of the best training modality is then followed by a description of the implementation steps, and the results obtained online during the tests of the controller are finally presented.
17

Gustafsson, Robin, and Lucas Fröjdendahl. "Machine Learning for Traffic Control of Unmanned Mining Machines : Using the Q-learning and SARSA algorithms." Thesis, KTH, Hälsoinformatik och logistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-260285.

Full text
Abstract:
Manual configuration of rules for unmanned mining machine traffic control can be time-consuming and therefore expensive. This paper presents a Machine Learning approach for the automatic configuration of traffic-control rules in mines with autonomous mining machines, using Q-learning and SARSA. The results show that automation might be able to cut the time taken to configure traffic rules from 1-2 weeks to a maximum of approximately 6 hours, which would decrease the cost of deployment. Tests show that, in the worst case, the developed solution is able to run continuously for 24 hours with at least 82% accuracy, compared to the 100% accuracy of the manual configuration. The conclusion is that machine learning can plausibly be used for the automatic configuration of traffic rules. Further work on increasing the accuracy to 100% is needed for it to replace manual configuration. It remains to be examined whether the conclusion retains pertinence in more complex environments with larger layouts and more machines.
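The two algorithms in the title differ only in the bootstrap target of their table update; the hedged sketch below shows that difference in isolation, with placeholder table sizes and constants rather than the authors' traffic-control implementation.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])          # off-policy: value of the best next action
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s_next, a_next]          # on-policy: value of the action actually taken
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((5, 2))                                # toy table: 5 states, 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=2, a=0, r=0.0, s_next=3, a_next=1)
print(Q)
```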
18

Ludwig, Jeremy R. "Extending dynamic scripting." Thesis, University of Oregon, 2008. http://hdl.handle.net/1794/9222.

Full text
Abstract:
Thesis (Ph. D.)--University of Oregon, 2008. Typescript. Includes vita and abstract. Includes bibliographical references (leaves 163-167). Also available online in Scholars' Bank and in ProQuest, free to University of Oregon users.
19

Aberdeen, Douglas Alexander. "Policy-Gradient Algorithms for Partially Observable Markov Decision Processes." The Australian National University, Research School of Information Sciences and Engineering, 2003. http://thesis.anu.edu.au./public/adt-ANU20030410.111006.

Full text
Abstract:
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder. The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks. The high variance of Monte-Carlo methods requires lengthy simulation and hence a super-computer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind and was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
20

Karlsson, Daniel. "Hyperparameter optimisation using Q-learning based algorithms." Thesis, Karlstads universitet, Fakulteten för hälsa, natur- och teknikvetenskap (from 2013), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-78096.

Full text
Abstract:
Machine learning algorithms have many applications, both for academic and industrial purposes. Examples of applications are the classification of diffraction patterns in materials science and the classification of properties of chemical compounds in the pharmaceutical industry. For these algorithms to be successful they need to be optimised; part of this is achieved by training the algorithm, but there are components of the algorithms that cannot be trained. These hyperparameters have to be tuned separately. The focus of this work was the optimisation of hyperparameters in classification algorithms based on convolutional neural networks. The purpose of this thesis was to investigate the possibility of using reinforcement learning algorithms, primarily Q-learning, as the optimising algorithm. Three different algorithms were investigated: Q-learning, double Q-learning, and a Q-learning-inspired algorithm designed during this work. The algorithms were evaluated on different problems and compared to a random search algorithm, which is one of the most common optimisation tools for this type of problem. All three algorithms were capable of some learning; however, only the Q-learning-inspired algorithm outperformed the random search algorithm on the test problems. Further, an iterative scheme of the Q-learning-inspired algorithm was implemented, in which the algorithm was allowed to refine the search space available to it. This showed further improvements in performance, and the results indicate that performance similar to that of random search may be achieved in a shorter period of time, sometimes reducing the computational time by up to 40%.
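One way to picture the framing above is to treat positions in a discretised hyperparameter grid as states and single-step adjustments as actions. The sketch below follows that framing under stated assumptions: the grid, the action set and especially the synthetic validation_score surrogate (standing in for actually training a network) are invented for illustration.

```python
import numpy as np

learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]
batch_sizes = [16, 32, 64, 128]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # nudge one hyperparameter up or down

def validation_score(i, j):
    """Stand-in for training a CNN and measuring validation accuracy."""
    return -((i - 1) ** 2 + (j - 2) ** 2)          # fictitious optimum: lr=1e-3, batch=64

rng = np.random.default_rng(0)
Q = np.zeros((len(learning_rates), len(batch_sizes), len(ACTIONS)))
alpha, gamma, eps = 0.2, 0.9, 0.2
state, best = (0, 0), (float("-inf"), (0, 0))
for _ in range(2000):
    i, j = state
    a = int(rng.integers(len(ACTIONS))) if rng.random() < eps else int(np.argmax(Q[i, j]))
    di, dj = ACTIONS[a]
    ni = int(np.clip(i + di, 0, len(learning_rates) - 1))
    nj = int(np.clip(j + dj, 0, len(batch_sizes) - 1))
    r = validation_score(ni, nj)                   # "evaluate" the new configuration
    best = max(best, (r, (ni, nj)))
    Q[i, j, a] += alpha * (r + gamma * np.max(Q[ni, nj]) - Q[i, j, a])
    state = (ni, nj)

score, (bi, bj) = best
print("best configuration found: lr =", learning_rates[bi], ", batch size =", batch_sizes[bj])
```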
21

Thirunavukkarasu, Muthukumar. "Reinforcing Reachable Routes." Thesis, Virginia Tech, 2004. http://hdl.handle.net/10919/9904.

Full text
Abstract:
Reachability routing is a newly emerging paradigm in networking, where the goal is to determine all paths between a sender and a receiver. It is becoming relevant with the changing dynamics of the Internet and the emergence of low-bandwidth wireless/ad hoc networks. This thesis presents the case for reinforcement learning (RL) as the framework of choice to realize reachability routing, within the confines of the current Internet backbone infrastructure. The setting of the reinforcement learning problem offers several advantages, including loop resolution, multi-path forwarding capability, cost-sensitive routing, and minimizing state overhead, while maintaining the incremental spirit of the current backbone routing algorithms. We present the design and implementation of a new reachability algorithm that uses a model-based approach to achieve cost-sensitive multi-path forwarding. Performance assessment of the algorithm in various troublesome topologies shows consistently superior performance over classical reinforcement learning algorithms. Evaluations of the algorithm based on different criteria on many types of randomly generated networks as well as realistic topologies are presented.
22

Mendonça, Matheus Ribeiro Furtado de. "Evolution of reward functions for reinforcement learning applied to stealth games." Universidade Federal de Juiz de Fora (UFJF), 2016. https://repositorio.ufjf.br/jspui/handle/ufjf/4771.

Full text
Abstract:
Many modern games present stealth elements that allow the player to accomplish a certain objective without being spotted by enemy patrols. This gave rise to a new genre called stealth games, where covertness plays a major role. Although quite popular in modern games, stealthy behaviors have not been extensively studied. In this work, we tackle three different problems: (i) how to use a machine learning approach in order to allow the stealthy agent to learn good behaviors for any environment, (ii) how to create an efficient stealthy path planning method that can be coupled with our machine learning formulation, and (iii) how to use evolutionary computing in order to define specific parameters for our machine learning approach without any prior knowledge of the problem. We use Reinforcement Learning in order to learn good covert behavior capable of achieving a high success rate in random trials of a stealth game. We also propose an evolutionary approach that is capable of automatically defining a good reward function for our reinforcement learning approach.
23

Cook, Philip R. "Limitations and Extensions of the WoLF-PHC Algorithm." Diss., Brigham Young University, 2007. http://contentdm.lib.byu.edu/ETD/image/etd2109.pdf.

Full text
24

Romandini, Nicolò. "Evaluation and implementation of reinforcement learning and pattern recognition algorithms for task automation on web interfaces." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
Automated task execution in a web context is a major challenge today. One of the main fields in which it is needed is undoubtedly Information Security, where it is becoming increasingly necessary to find techniques that allow security tests to be carried out without human intervention, not only to relieve programmers from performing repetitive tasks, but above all to be able to perform many more tests in the same amount of time. Although techniques already exist to automate the execution of actions on web interfaces, these solutions are often limited to running in the environment for which they were designed: they cannot execute the learnt behaviour in different, unseen environments. The aim of this thesis project is to analyse different Machine Learning techniques in order to find an optimal solution to this problem, that is, an agent capable of executing a task in all the environments in which it operates. The approaches analysed and implemented belong to two areas of Machine Learning: Reinforcement Learning and Pattern Recognition. Each approach was tested using real web applications in order to measure its abilities in a context as close to reality as possible. Although the Reinforcement Learning approaches were found to be the most automated, they failed to achieve satisfactory results. On the contrary, the Pattern Recognition approach was found to be the most capable of executing tasks, even complex ones, in different and unseen environments, although it requires a lot of preliminary work.
25

Pentapalli, Mridul. "A comparative study of Roth-Erev and modified Roth-Erev reinforcement learning algorithms for uniform-price double auctions." [Ames, Iowa : Iowa State University], 2008.

Find full text
26

Saha, Avijit. "Development of a Software Platform with Distributed Learning Algorithms for Building Energy Efficiency and Demand Response Applications." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/74423.

Full text
Abstract:
In the United States, over 40% of the country's total energy consumption is in buildings, most of which are either small-sized (<5,000 sqft) or medium-sized (5,000-50,000 sqft). These buildings offer excellent opportunities for energy saving and demand response (DR), but these opportunities are rarely utilized due to the lack of effective building energy management systems and automated algorithms that can assist a building to participate in a DR program. Considering the low load factor in the US and many other countries, DR can serve as an effective tool to reduce peak demand through demand-side load curtailment. A convenient option for the customer to benefit from a DR program is to use automated DR algorithms within software that can learn user comfort preferences for the building loads and make automated load curtailment decisions without affecting customer comfort. The objective of this dissertation is to provide such a solution. First, this dissertation contributes to the development of key features of a building energy management open source software platform that enable ease-of-use through plug and play and interoperability of devices in a building, cost-effectiveness through deployment in a low-cost computer, and DR through communication infrastructure between building and utility and among multiple buildings, while ensuring security of the platform. Second, a set of reinforcement learning (RL) based algorithms is proposed for the three main types of loads in a building: heating, ventilation and air conditioning (HVAC) loads, lighting loads and plug loads. In the absence of a DR program, these distributed agent-based learning algorithms are designed to learn the user comfort ranges through explorative interaction with the environment and accumulating user feedback, and then operate through policies that favor maximum user benefit in terms of saving energy while ensuring comfort. Third, two sets of DR algorithms are proposed for an incentive-based DR program in a building. A user-defined priority based DR algorithm with smart thermostat control and utilization of distributed energy resources (DER) is proposed for residential buildings. For commercial buildings, a learning-based algorithm is proposed that utilizes the learning from the RL algorithms to use a pre-cooling/pre-heating based load reduction method for HVAC loads and a mixed integer linear programming (MILP) based optimization method for other loads to dynamically maintain total building demand below a demand limit set by the utility during a DR event, while minimizing total user discomfort. A user-defined priority based DR algorithm is also proposed for multiple buildings in a community so that they can participate in realizing combined DR objectives. The software solution proposed in this dissertation is expected to encourage increased participation of smaller and medium-sized buildings in demand response and energy saving activities. This will help in alleviating power system stress conditions by employing the untapped DR potential in such buildings.
27

Lee, Jong Min. "A Study on Architecture, Algorithms, and Applications of Approximate Dynamic Programming Based Approach to Optimal Control." Diss., Georgia Institute of Technology, 2004. http://hdl.handle.net/1853/5048.

Full text
Abstract:
This thesis develops approximate dynamic programming (ADP) strategies suitable for process control problems, aimed at overcoming the limitations of model predictive control (MPC): the potentially exorbitant on-line computational requirement and the inability to consider the future interplay between uncertainty and estimation in the optimal control calculation. The suggested approach solves the DP only for the state points visited by closed-loop simulations with judiciously chosen control policies. The approach helps combat the well-known 'curse of dimensionality' of traditional DP, while it allows the user to derive an improved control policy from the initial ones. The critical issue of the suggested method is a proper choice and design of the function approximator. A local averager with a penalty term is proposed to guarantee a stably learned control policy as well as acceptable on-line performance. The thesis also demonstrates the versatility of the proposed ADP strategy on difficult process control problems. First, a stochastic adaptive control problem is presented. In this application an ADP-based control policy shows an "active" probing property to reduce uncertainties, leading to better control performance. The second example is a dual-mode controller, which is a supervisory scheme that actively prevents the progression of abnormal situations under a local controller at their onset. Finally, two ADP strategies for controlling nonlinear processes based on input-output data are suggested. They are model-based and model-free approaches, and have the advantage of conveniently incorporating knowledge of the identification data distribution into the control calculation, with performance improvement.
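The "local averager with a penalty term" can be pictured as a nearest-neighbour value estimator that becomes pessimistic away from the sampled data; the snippet below is only a guess at that flavour of approximator, with invented data, weights and penalty constant, not the thesis design.

```python
import numpy as np

def local_averager(query, states, costs, k=5, penalty=10.0):
    """Estimate the cost-to-go at `query` from nearby sampled states."""
    dists = np.linalg.norm(states - query, axis=1)
    idx = np.argsort(dists)[:k]                      # the k closest visited states
    weights = 1.0 / (dists[idx] + 1e-8)
    estimate = np.sum(weights * costs[idx]) / np.sum(weights)
    return estimate + penalty * dists[idx].mean()    # extra cost far from the data

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(500, 2))           # states visited in closed-loop simulation
costs = np.sum(states ** 2, axis=1)                  # placeholder cost-to-go samples
print(local_averager(np.array([0.1, 0.2]), states, costs))   # near the data: small estimate
print(local_averager(np.array([3.0, 3.0]), states, costs))   # far from the data: heavily penalised
```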
28

Bountourelis, Theologos. "Efficient pac-learning for episodic tasks with acyclic state spaces and the optimal node visitation problem in acyclic stochastic digaphs." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/28144.

Full text
Abstract:
Thesis (M. S.)--Industrial and Systems Engineering, Georgia Institute of Technology, 2009. Committee Chair: Reveliotis, Spyros; Committee Member: Ayhan, Hayriye; Committee Member: Goldsman, Dave; Committee Member: Shamma, Jeff; Committee Member: Zwart, Bert.
29

Björck, Erik, and Fredrik Omstedt. "A comparison of algorithms used in traffic control systems." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-229709.

Full text
Abstract:
A challenge in today's society is handling the large number of vehicles traversing an intersection. Traffic lights are often used to control the traffic flow in these intersections. However, there are inefficiencies, since the algorithms used to control the traffic lights do not adapt perfectly to the traffic situation. The purpose of this paper is to compare three different types of algorithms used in traffic control systems to find out how to minimize vehicle waiting times. A pretimed, a deterministic and a reinforcement learning algorithm were compared with each other. Tests were conducted on a four-way intersection with various traffic demands using the program Simulation of Urban MObility (SUMO). The results showed that the deterministic algorithm performed best for all demands tested. The reinforcement learning algorithm performed better than the pretimed one for low demands, but worse for varied and higher demands. The reasons behind these results are the deterministic algorithm's knowledge of vehicular movement and the negative effect the curse of dimensionality has on the training of the reinforcement learning algorithm. However, more research must be conducted to ensure that the results obtained are trustworthy in similar and different traffic situations.
30

Mämmelä, O. (Olli). "Algorithms for efficient and energy-aware network resource management in autonomous communications systems." Doctoral thesis, Oulun yliopisto, 2017. http://urn.fi/urn:isbn:9789526216089.

Full text
Abstract:
According to industry estimates, monthly global mobile data traffic will surpass 30.6 exabytes by 2020 and global mobile data traffic will increase nearly eightfold between 2015 and 2020. Most mobile data traffic is generated by smartphones, and the total number of smartphones is expected to continue growing by 2020, which results in rapid traffic growth. In addition, the upcoming 5G networks and Internet of Things based communication are estimated to involve a large amount of network traffic. The increase in mobile data traffic and in the number of connected devices poses a challenge to network operators, service providers, and data center operators. If the transmission capacity of the network and the amount of data traffic are not in line with each other, congestion may occur and ultimately the quality of experience degrades. Mobile networks are also becoming more reliant on data centers that provide efficient computing power. However, the energy consumption of data centers has grown in recent years, which is a problem for data center operators. A traditional strategy to overcome these problems is to scale up the resources or to provide more efficient hardware. Resource over-provisioning increases operating and capital expenditures without a guarantee of increased average revenue per user. In addition, the growing complexity and dynamics of communication systems is a challenge for efficient resource management. Intelligent and resilient methods that can efficiently use existing resources by making autonomous decisions without intervention from human administrators are thus needed. The goal of this research is to implement, develop, model, and test algorithms that enable efficient and energy-aware network resource management in autonomous communications systems. First, an energy-aware algorithm is introduced for high-performance computing data centers to reduce the energy consumption within a single data center and across a federation of data centers. For network access selection in heterogeneous wireless networks, two algorithms are proposed: a client-side algorithm that tries to optimize users' quality of experience and a network-side algorithm that focuses on optimizing the global resource usage of the network. Finally, for a video service, an algorithm is presented that can enhance video content delivery in a controllable and resource-efficient way without major changes to the mobile network infrastructure.
31

Jedor, Matthieu. "Bandit algorithms for recommender system optimization." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASM027.

Full text
Abstract:
In this PhD thesis, we study the optimization of recommender systems with the objective of providing more refined item suggestions for the user. The task is modeled using the multi-armed bandit framework. In the first part, we address two problems that commonly occur in recommender systems: the large number of items to handle and the management of sponsored content. In the second part, we investigate the empirical performance of bandit algorithms, and in particular how to tune conventional algorithms to improve results in the stationary and non-stationary environments that arise in practice. This leads us to analyze, both theoretically and empirically, the greedy algorithm, which in some cases outperforms the state of the art.
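As a toy illustration of the comparison mentioned above (not the thesis experiments), the sketch below runs a purely greedy policy against an epsilon-greedy one on an invented three-armed Bernoulli "recommendation" problem; arm means, horizon and seed are arbitrary.

```python
import numpy as np

def run(eps, horizon=5000, means=(0.2, 0.5, 0.7), seed=1):
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(means))
    sums = np.zeros(len(means))
    total = 0.0
    for t in range(horizon):
        if t < len(means):
            arm = t                                   # pull each arm once to initialise
        elif rng.random() < eps:
            arm = int(rng.integers(len(means)))       # explore
        else:
            arm = int(np.argmax(sums / counts))       # exploit the empirical means
        reward = float(rng.random() < means[arm])
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

print("greedy total reward:        ", run(eps=0.0))
print("epsilon-greedy total reward:", run(eps=0.1))
```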
32

Allmendinger, Richard. "Tuning evolutionary search for closed-loop optimization." Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/tuning-evolutionary-search-for-closedloop-optimization(d54e63e2-7927-42aa-b974-c41e717298cb).html.

Full text
Abstract:
Closed-loop optimization deals with problems in which candidate solutions are evaluated by conducting experiments, e.g. physical or biochemical experiments. Although this form of optimization is becoming more popular across the sciences, it may be subject to rather unexplored resourcing issues, as any experiment may require resources in order to be conducted. In this thesis we are concerned with understanding how evolutionary search is affected by three particular resourcing issues -- ephemeral resource constraints (ERCs), changes of variables, and lethal environments -- and the development of search strategies to combat these issues. The thesis makes three broad contributions. First, we motivate and formally define the resourcing issues considered. Here, concrete examples in a range of applications are given. Secondly, we theoretically and empirically investigate the effect of the resourcing issues considered on evolutionary search. This investigation reveals that resourcing issues affect optimization in general, and that clear patterns emerge relating specific properties of the different resourcing issues to performance effects. Thirdly, we develop and analyze various search strategies augmented on an evolutionary algorithm (EA) for coping with resourcing issues. To cope specifically with ERCs, we develop several static constraint-handling strategies, and investigate the application of reinforcement learning techniques to learn when to switch between these static strategies during an optimization process. We also develop several online resource-purchasing strategies to cope with ERCs that leave the arrangement of resources to the hands of the optimizer. For problems subject to changes of variables relating to the resources, we find that knowing which variables are changed provides an optimizer with valuable information, which we exploit using a novel dynamic strategy. Finally, for lethal environments, where visiting parts of the search space can cause the permanent loss of resources, we observe that a standard EA's population may be reduced in size rapidly, complicating the search for innovative solutions. To cope with such scenarios, we consider some non-standard EA setups that are able to innovate genetically whilst simultaneously mitigating risks to the evolving population.
33

Besson, Lilian. "Multi-Players Bandit Algorithms for Internet of Things Networks." Thesis, CentraleSupélec, 2019. http://www.theses.fr/2019CSUP0005.

Full text
Abstract:
In this PhD thesis, we study wireless networks and reconfigurable end-devices that can access Cognitive Radio networks, in unlicensed bands and without central control. We focus on Internet of Things (IoT) networks, with the objective of extending the devices' battery life by equipping them with low-cost but efficient machine learning algorithms that let them automatically improve the efficiency of their wireless communications. We propose different models of IoT networks, and we show empirically, on both numerical simulations and a real-world validation, the possible gain of our methods, which use Reinforcement Learning. The different network access problems are modeled as Multi-Armed Bandits (MAB), but we found that analyzing the realistic models was intractable, because proving the convergence of many IoT devices playing a collaborative game, without communication or coordination, is hard when they all follow random activation patterns. The rest of this manuscript thus studies two restricted models: first multi-player bandits in stationary problems, then non-stationary single-player bandits. We also detail another contribution, SMPyBandits, our open-source Python library for numerical MAB simulations, which covers all the studied models and more.
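A drastically simplified sketch of the kind of model studied: each IoT device independently runs UCB1 over a few channels and earns reward 1 only when its transmission does not collide with another device and the channel succeeds. The channel qualities, device count and collision model are invented, and this is not code from SMPyBandits or the thesis.

```python
import math
import random

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def choose(self):
        self.t += 1
        for arm, c in enumerate(self.counts):
            if c == 0:                       # play every arm once first
                return arm
        score = [self.sums[a] / self.counts[a]
                 + math.sqrt(2 * math.log(self.t) / self.counts[a])
                 for a in range(len(self.counts))]
        return max(range(len(score)), key=score.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

random.seed(0)
quality = [0.9, 0.8, 0.7, 0.6]                      # per-channel success probability
devices = [UCB1(len(quality)) for _ in range(3)]    # three end-devices, no coordination
for _ in range(2000):
    picks = [d.choose() for d in devices]
    for d, ch in zip(devices, picks):
        ok = picks.count(ch) == 1 and random.random() < quality[ch]   # collision means failure
        d.update(ch, float(ok))
print([d.counts for d in devices])                  # how often each device used each channel
```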
34

Tarbouriech, Jean. "Goal-oriented exploration for reinforcement learning." Electronic Thesis or Diss., Université de Lille (2022-....), 2022. http://www.theses.fr/2022ULILB014.

Full text
Abstract:
Learning to reach goals is a competence of high practical relevance for intelligent agents to acquire. For instance, it encompasses many navigation tasks ("go to target X"), robotic manipulation ("attain position Y of the robotic arm"), and game-playing scenarios ("win the game by fulfilling objective Z"). As a living being interacting with the world, I am constantly driven by goals to reach, varying in scope and difficulty. Reinforcement Learning (RL) holds the promise to frame and learn goal-oriented behavior. Goals can be modeled as specific configurations of the environment that must be attained via sequential interaction and exploration of the unknown environment. 
Although various deep RL algorithms have been proposed for goal-oriented RL, existing methods often lack principled understanding, sample efficiency and general-purpose effectiveness. In fact, very limited theoretical analysis of goal-oriented RL was available, even in the basic scenario of finitely many states and actions. We first focus on a supervised scenario of goal-oriented RL, where a goal state to be reached in minimum total expected cost is provided as part of the problem definition. After formalizing the online learning problem in this setting, often known as Stochastic Shortest Path (SSP), we introduce two no-regret algorithms (one is the first available in the literature, the other attains nearly optimal guarantees). Beyond training our RL agent to solve only one task, we then aspire that it learns to autonomously solve a wide variety of tasks, in the absence of any reward supervision. In this challenging unsupervised RL scenario, we advocate to "Set Your Own Goals" (SYOG), which suggests the agent learn the ability to intrinsically select and reach its own goal states. We derive finite-time guarantees of this popular heuristic in various settings, each with its specific learning objective and technical challenges. As an illustration, we propose a rigorous analysis of the algorithmic principle of targeting "uncertain" goals, which we also anchor in deep RL. The main focus and contribution of this thesis are to instigate a principled analysis of goal-oriented exploration in RL, both in the supervised and unsupervised scenarios. We hope that it helps suggest promising research directions to improve the interpretability and sample efficiency of goal-oriented RL algorithms in practical applications.
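For orientation on the Stochastic Shortest Path model referred to in this abstract, the sketch below shows plain tabular value iteration for a known SSP instance (absorbing goal state, strictly positive costs); this is a textbook baseline written for illustration, not the thesis's no-regret online algorithms, and the function and variable names are assumptions.

    import numpy as np

    def ssp_value_iteration(P, c, goal, n_iter=1000, tol=1e-8):
        """Value iteration for a known Stochastic Shortest Path instance.
        P[s][a] : transition probability vector over states (length S)
        c[s][a] : positive cost of taking action a in state s
        goal    : index of the absorbing goal state (its value is fixed to 0)"""
        S, A = len(c), len(c[0])
        V = np.zeros(S)
        for _ in range(n_iter):
            V_new = np.array([
                0.0 if s == goal else
                min(c[s][a] + np.dot(P[s][a], V) for a in range(A))
                for s in range(S)])
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # greedy goal-reaching policy with respect to the converged values
        policy = [0 if s == goal else
                  int(np.argmin([c[s][a] + np.dot(P[s][a], V) for a in range(A)]))
                  for s in range(S)]
        return V, policy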
APA, Harvard, Vancouver, ISO, and other styles
35

Del Ben, Enrico <1997>. "Reinforcement Learning: a Q-Learning Algorithm for High Frequency Trading." Master's Degree Thesis, Università Ca' Foscari Venezia, 2021. http://hdl.handle.net/10579/20411.

Full text
Abstract:
The scope of this work is to test the implementation of an automated trading system based on Reinforcement Learning: a machine learning paradigm in which an intelligent agent acts to maximize its rewards given the environment around it. Given the environmental inputs and the environmental responses to the actions taken, the agent learns how to behave in the best way possible. In particular, in this work a Q-Learning algorithm has been used to produce trading signals on the basis of high-frequency Limit Order Book data for selected stocks.
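As a rough illustration of the kind of tabular Q-learning update such a trading system rests on, a minimal sketch follows; the discretised order-book state, the sell/hold/buy action set, the reward signal and all hyperparameters are assumptions for illustration and are not taken from the thesis.

    import numpy as np

    class QLearningTrader:
        """Minimal tabular Q-learning agent: states are discretised order-book
        features, actions are 0=sell, 1=hold, 2=buy (illustrative choices)."""

        def __init__(self, n_states, n_actions=3, alpha=0.1, gamma=0.99, eps=0.1):
            self.Q = np.zeros((n_states, n_actions))
            self.alpha, self.gamma, self.eps = alpha, gamma, eps

        def act(self, s):
            # epsilon-greedy exploration over trading signals
            if np.random.rand() < self.eps:
                return np.random.randint(self.Q.shape[1])
            return int(np.argmax(self.Q[s]))

        def update(self, s, a, reward, s_next):
            # standard one-step Q-learning update
            td_target = reward + self.gamma * np.max(self.Q[s_next])
            self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])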
APA, Harvard, Vancouver, ISO, and other styles
36

Brokking, Alexander, and Michael Wink. "Algorithmic Stock Trading using Deep Reinforcement learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302521.

Full text
Abstract:
Recent breakthroughs in Deep Learning and Reinforcement Learning have enabled the new field of Deep Reinforcement Learning. This study explores some state-of-the-art applications of deep reinforcement learning in the field of finance and algorithmic trading. By building on previous research by Yang et al. at Columbia University, this study aims to validate their findings and explore ways to improve their proposed trading model by using the Sharpe ratio in the reward function. We show that there is significant variability in the performance of their trading model and question their premise of basing their results on the best-performing model iteration. Moreover, we explore how the Sharpe ratio calculated over a 21-day and a 63-day rolling period can be used as a reward function. However, this did not result in any significant change in outcome, which can be attributed to the high performance variability of both the original algorithm and our modified algorithm, which prevents consistent conclusions.
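A rolling Sharpe ratio of the kind used as a reward above can be computed along these lines; the annualisation factor, the use of simple per-step portfolio returns and the zero reward during warm-up are assumptions, and only the 21- and 63-day windows come from the study.

    import numpy as np

    def rolling_sharpe(returns, window=21, eps=1e-8):
        """Sharpe ratio of the most recent `window` portfolio returns,
        intended as a per-step reward for the trading agent."""
        recent = np.asarray(returns[-window:])
        if len(recent) < window:
            return 0.0                      # not enough history yet
        return np.sqrt(252) * recent.mean() / (recent.std() + eps)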
APA, Harvard, Vancouver, ISO, and other styles
37

Olsson, Rasmus, and Jens Egeland. "Reinforcement Learning Routing Algorithm for Bluetooth Mesh Networks." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-234287.

Full text
Abstract:
Today's office and home environments are moving towards more connected digital infrastructures, meaning there are multiple heterogeneous devices that use short-range communication to stay connected. Mobile phones, tablets, laptops, sensors, and printers are examples of devices in such environments. From this, the Internet of Things (IoT) paradigm arises, and to enable it, energy-efficient machine-to-machine (M2M) communications are needed. Our study uses Bluetooth Low Energy (BLE) technology for communication between devices, and it demonstrates the impact of routing algorithms in such networks. With the goal of increasing the network lifetime, a distributed and dynamic Reinforcement Learning (RL) routing algorithm is proposed. The algorithm is based on an RL technique called Q-learning. Performance analysis is carried out in different scenarios comparing the proposed algorithm against two static and centralized reference routing algorithms. The results show that our proposed RL routing algorithm performs better as the node degree of the topology increases. Compared to the reference algorithms, the proposed algorithm can handle a higher load on the network with a significant performance improvement, due to the dynamic change of routes. The increase in network lifetime is 124% with 75 devices and 349% with 100 devices, owing to the ability to change routes over time, which becomes more pronounced as the node degree increases. For 35, 55 and 75 devices the average node degrees are 2.21, 2.39 and 2.54. With fewer devices, our RL routing algorithm performs nearly as well as the best reference algorithm, the Energy Aware Routing (EAR) algorithm, with a decrease in network lifetime of around 19% with 35 devices and 10% with 55 devices. The decrease in network lifetime with fewer devices occurs because the cost of learning new paths is higher than the gain from exploring multiple paths.
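One standard way to realise Q-learning-based routing of this kind is the Q-routing update of Boyan and Littman, sketched below with plain dictionaries; the thesis's actual state, reward and energy-awareness terms may differ, and the cost estimate and learning rate here are assumptions.

    def q_routing_update(Q, node, neighbor, dest, link_cost, neighbors_of, alpha=0.5):
        """One Q-routing update at `node` after forwarding a packet for `dest`
        via `neighbor`. Q[x][(y, d)] estimates the cost for x to reach d via y."""
        if neighbor == dest:
            best_from_neighbor = 0.0
        else:
            # neighbor's own best estimate of delivering to the destination
            best_from_neighbor = min(
                Q[neighbor].get((n, dest), 1e9) for n in neighbors_of[neighbor])
        old = Q[node].get((neighbor, dest), 0.0)
        Q[node][(neighbor, dest)] = old + alpha * (link_cost + best_from_neighbor - old)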
APA, Harvard, Vancouver, ISO, and other styles
38

Kwok, Hing-Wah, Computer Science & Engineering, Faculty of Engineering, UNSW. "Hierarchical reinforcement learning in adversarial environments." Publisher: University of New South Wales. Computer Science & Engineering, 2009. http://handle.unsw.edu.au/1959.4/43424.

Full text
Abstract:
It is known that one of the downfalls of reinforcement learning is the amount of time required to learn an optimal policy. This especially holds true for environments with large state spaces or environments with multiple agents. It is also known that standard Q-Learning develops a deterministic policy, so in games where a stochastic policy is required (such as rock, paper, scissors) a Q-Learner opponent can be defeated without too much difficulty once learning has ceased. Initially we investigated the impact that the MAXQ hierarchical reinforcement learning algorithm had in an adversarial environment. We found that it was difficult to conduct state-space abstraction, especially when an unpredictable or co-evolving opponent was involved. We noticed that to keep the domains zero-sum, discounted learning was required. We also found that a speed increase could be obtained through the use of hierarchy in the adversarial environment. We then investigated whether similar learning-speed increases could be obtained in adversarial reinforcement learning through the use of this hierarchical methodology. Applying the hierarchical decomposition to Bowling's Win or Learn Fast (WoLF) algorithm, we were able to maintain the accelerated learning rate whilst simultaneously retaining the stochastic elements of the WoLF algorithm. We assessed the impact of the adversarial component of the hierarchy at both the higher and lower tiers of the hierarchical tree. Finally, we introduce the idea of pivot points. A pivot point is the last possible time you can wait before having to make a decision and thus revealing your strategy to the opponent, which maximises confusion for the opponent. Through the use of these pivot points, which could only have been discovered through the use of hierarchy, we were able to perform improved state-space abstraction, since no decision needed to be made with regard to the opponent until this point was reached.
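For reference, the WoLF ("Win or Learn Fast") principle mentioned above uses two policy-update step sizes, learning faster when losing; a minimal single-state WoLF-PHC-style sketch follows, with the concrete step sizes as assumptions and without the hierarchical decomposition studied in the thesis.

    import numpy as np

    def wolf_phc_step(Q, pi, avg_pi, count, delta_win=0.01, delta_lose=0.04):
        """One WoLF policy-hill-climbing step for a single state.
        Q      : action values, shape (A,)
        pi     : current stochastic policy over actions (modified in place)
        avg_pi : running average policy (modified in place)
        count  : number of visits to this state so far"""
        A = len(Q)
        count += 1
        avg_pi += (pi - avg_pi) / count                 # update average policy
        # "winning" means the current policy outperforms the average policy
        winning = np.dot(pi, Q) > np.dot(avg_pi, Q)
        delta = delta_win if winning else delta_lose    # learn fast when losing
        best = int(np.argmax(Q))
        for a in range(A):                              # move probability mass
            if a == best:
                pi[a] = min(1.0, pi[a] + delta)
            else:
                pi[a] = max(0.0, pi[a] - delta / (A - 1))
        pi /= pi.sum()                                  # renormalise
        return count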
APA, Harvard, Vancouver, ISO, and other styles
39

Backstad, Sebastian. "Federated Averaging Deep Q-Network: A Distributed Deep Reinforcement Learning Algorithm." Thesis, Umeå universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-149637.

Full text
Abstract:
In the telecom sector, a huge amount of rich data is generated every day. This trend will increase with the launch of 5G networks. Telco companies are interested in analyzing their data to shape and improve their core businesses. However, a number of limiting factors can prevent them from logging data to central data centers for analysis, for example data privacy, data transfer costs, and network latency. In this work, we present a distributed Deep Reinforcement Learning (DRL) method called Federated Averaging Deep Q-Network (FADQN), which employs a distributed hierarchical reinforcement learning architecture. It utilizes gradient averaging to decrease communication cost. Privacy concerns are also addressed by training the agent locally and only sending aggregated information to the centralized server. We introduce two versions of FADQN: synchronous and asynchronous. Results on the cart-pole environment show an 80-fold reduction in communication without any significant loss in performance. Additionally, in the case of the asynchronous approach, we see a great improvement in convergence.
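The aggregation at the heart of such a scheme can be sketched as follows; the single-step application, learning rate and optional sample-count weighting are assumptions, and the synchronous/asynchronous scheduling that distinguishes the two FADQN versions is not shown.

    import numpy as np

    def federated_gradient_step(global_weights, worker_grads, lr=1e-3, worker_sizes=None):
        """Average local DQN gradients from several workers and apply one
        update to the global parameters (FedAvg-style aggregation).
        global_weights : list of numpy arrays (one per layer)
        worker_grads   : list of gradient lists, one per worker, same shapes
        worker_sizes   : optional local sample counts used as averaging weights"""
        n = len(worker_grads)
        coeffs = ([1.0 / n] * n if worker_sizes is None
                  else [s / float(sum(worker_sizes)) for s in worker_sizes])
        avg_grads = [sum(c * g[layer] for c, g in zip(coeffs, worker_grads))
                     for layer in range(len(global_weights))]
        # gradient descent step on the globally shared parameters
        return [w - lr * g for w, g in zip(global_weights, avg_grads)]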
APA, Harvard, Vancouver, ISO, and other styles
40

Xiang, Ziyi. "A comparison of genetic algorithm and reinforcement learning for autonomous driving." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-261595.

Full text
Abstract:
This paper compares two different methods, reinforcement learning and a genetic algorithm, for designing an autonomous car's control system in a dynamic environment. The research problem can be formulated as follows: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment? In conclusion, the genetic algorithm outperforms reinforcement learning on mean learning time, despite the fact that the former shows a large variance; i.e., the genetic algorithm provides better learning efficiency.
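A bare-bones genetic algorithm loop of the kind compared in such studies might look like the sketch below; the real-valued genome, the fitness function and all hyperparameters are assumptions rather than details taken from the paper.

    import random

    def genetic_algorithm(evaluate, genome_len, pop_size=50, generations=100,
                          mutation_rate=0.05, elite=5):
        """Evolve real-valued genomes (e.g. controller weights) that maximise
        evaluate(genome), such as distance driven before a collision."""
        pop = [[random.uniform(-1, 1) for _ in range(genome_len)]
               for _ in range(pop_size)]
        for _ in range(generations):
            scored = sorted(pop, key=evaluate, reverse=True)
            next_pop = scored[:elite]                        # elitism
            while len(next_pop) < pop_size:
                p1, p2 = random.sample(scored[:pop_size // 2], 2)
                cut = random.randrange(1, genome_len)        # one-point crossover
                child = p1[:cut] + p2[cut:]
                child = [g + random.gauss(0, 0.1) if random.random() < mutation_rate else g
                         for g in child]                     # Gaussian mutation
                next_pop.append(child)
            pop = next_pop
        return max(pop, key=evaluate)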
APA, Harvard, Vancouver, ISO, and other styles
41

Masoudi, Mohammad Amin. "Robust Deep Reinforcement Learning for Portfolio Management." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42743.

Full text
Abstract:
In Finance, the use of Automated Trading Systems (ATS) on markets is growing every year, and the trades generated by algorithms now account for most of the orders that arrive at stock exchanges (Kissell, 2020). Historically, these systems were based on advanced statistical methods and signal processing designed to extract trading signals from financial data. The recent success of Machine Learning has attracted the interest of the financial community. Reinforcement Learning is a subcategory of machine learning and has been broadly applied by investors and researchers in building trading systems (Kissell, 2020). In this thesis, we address the issue that deep reinforcement learning may be susceptible to sampling errors and over-fitting, and we propose a robust deep reinforcement learning method that integrates techniques from reinforcement learning and robust optimization. We back-test and compare the performance of the developed algorithm, Robust DDPG, with the UBAH (Uniform Buy and Hold) benchmark and other RL algorithms, and show that the robust algorithm of this research can reduce the downside risk of an investment strategy significantly and can ensure a safer path for the investor's portfolio value.
APA, Harvard, Vancouver, ISO, and other styles
42

Cunha, João Alexandre da Silva Costa e. "Techniques for batch reinforcement learning in robotics." Doctoral thesis, Universidade de Aveiro, 2015. http://hdl.handle.net/10773/15735.

Full text
Abstract:
Doctorate in Informatics Engineering. This thesis addresses Batch Reinforcement Learning methods in Robotics. This sub-class of Reinforcement Learning has shown promising results and has been the focus of recent research. Three contributions are proposed that aim to extend state-of-the-art methods, allowing for a faster and more stable learning process, as required for learning in Robotics. The Q-learning update rule is widely applied, since it allows learning without a model of the environment. However, this update rule is transition-based and does not take advantage of the underlying episodic structure of the collected batch of interactions. The Q-Batch update rule is proposed in this thesis to process experiences along the trajectories collected in the interaction phase. This allows a faster propagation of obtained rewards and penalties, resulting in faster and more robust learning. Non-parametric function approximations, such as Gaussian Processes, are also explored. This type of approximator allows prior knowledge about the latent function to be encoded in the form of kernels, providing a higher level of flexibility and accuracy. The application of Gaussian Processes in Batch Reinforcement Learning yielded higher performance in learning tasks than other function approximations used in the literature. Lastly, in order to extract more information from the experiences collected by the agent, model-learning techniques are incorporated to learn the system dynamics. In this way, it is possible to augment the set of collected experiences with experiences generated through planning using the learned models. Experiments were carried out mainly in simulation, with some tests performed on a physical robotic platform. The obtained results show that the proposed approaches are able to outperform the classical Fitted Q Iteration.
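One simple way to propagate rewards along collected trajectories, in the spirit of the Q-Batch idea described above, is to sweep each episode backwards with the standard Q-learning target; the exact Q-Batch update rule is defined in the thesis and may differ from this sketch.

    import numpy as np

    def backward_episode_sweep(Q, episode, alpha=0.1, gamma=0.95):
        """Apply Q-learning updates backwards along one collected episode so
        that terminal rewards propagate to earlier states in a single pass.
        episode : list of (state, action, reward, next_state, done) tuples
        Q       : table of shape (n_states, n_actions), updated in place"""
        for (s, a, r, s_next, done) in reversed(episode):
            bootstrap = 0.0 if done else np.max(Q[s_next])
            Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])

    def process_batch(Q, batch_of_episodes, alpha=0.1, gamma=0.95):
        # process the whole collected batch episode by episode
        for episode in batch_of_episodes:
            backward_episode_sweep(Q, episode, alpha, gamma)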
APA, Harvard, Vancouver, ISO, and other styles
43

Abdalla, Alaa Eatzaz. "A reinforcement learning algorithm for operations planning of a hydroelectric power multireservoir system." Thesis, University of British Columbia, 2007. http://hdl.handle.net/2429/30702.

Full text
Abstract:
The main objective of reservoir operations planning is to determine the optimum operation policies that maximize the expected value of the system resources over the planning horizon. This control problem is challenged by the different sources of uncertainty that a reservoir system planner has to deal with. In the reservoir operations planning problem, there is a trade-off between the marginal value of water in storage and the electricity market price. The marginal value of water is itself uncertain and depends largely on storage in the reservoir as well as storage in other reservoirs. The challenge is how to deal with this large-scale multireservoir problem under the encountered uncertainties. This thesis presents a novel methodology for establishing a good approximation of the optimal control of a large-scale hydroelectric power system by applying Reinforcement Learning (RL). RL is an artificial intelligence approach to machine learning that offers key advantages in handling problems that are too large to be solved by conventional dynamic programming methods. In this approach, a control agent progressively learns the optimal strategies that maximize rewards through interaction with a dynamic environment. This thesis introduces the main concepts and computational aspects of using RL for the multireservoir operations planning problem. A scenario generation-moment matching technique was adopted to generate a set of scenarios for the natural river inflow, electricity load, and market price random variables. In this way, the statistical properties of the original distributions are preserved. The developed reinforcement learning reservoir optimization model (RLROM) was successfully applied to the BC Hydro main reservoirs on the Peace and Columbia Rivers. The model was used to derive optimal control policies for this multireservoir system, to estimate the value of water in storage, and to establish the marginal value of water/energy. The RLROM outputs were compared to the classical method of optimizing reservoir operations, namely stochastic dynamic programming (SDP), and the results for one- and two-reservoir systems were identical. The results suggest that the RL model is much more efficient at handling large-scale reservoir operations problems and can give a very good approximate solution to this complex problem.
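For context on the stochastic dynamic programming baseline mentioned above, a compact backward-recursion sketch over discretised storage levels for a single reservoir is given below; the scenario handling, nearest-grid-point interpolation and reward function are assumptions, and the thesis's RLROM itself is not reproduced.

    import numpy as np

    def sdp_single_reservoir(storages, releases, inflow_scenarios, probs,
                             reward_fn, horizon):
        """Backward stochastic dynamic programming over discretised storage.
        storages         : sorted 1-D array of feasible storage volumes
        releases         : 1-D array of candidate release decisions
        inflow_scenarios : sampled inflows with probabilities `probs`
        reward_fn(t, s, u): immediate benefit (e.g. generation revenue)"""
        S = len(storages)
        V = np.zeros((horizon + 1, S))
        policy = np.zeros((horizon, S), dtype=int)
        for t in range(horizon - 1, -1, -1):
            for i, s in enumerate(storages):
                best_val, best_u = -np.inf, 0
                for j, u in enumerate(releases):
                    # expectation of next-stage value over inflow scenarios
                    exp_next = 0.0
                    for w, p in zip(inflow_scenarios, probs):
                        s_next = np.clip(s - u + w, storages[0], storages[-1])
                        k = int(np.argmin(np.abs(storages - s_next)))
                        exp_next += p * V[t + 1, k]
                    val = reward_fn(t, s, u) + exp_next
                    if val > best_val:
                        best_val, best_u = val, j
                V[t, i], policy[t, i] = best_val, best_u
        return V, policy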
APA, Harvard, Vancouver, ISO, and other styles
44

Reed, Jane C. S. B. Massachusetts Institute of Technology. "Application of genetic algorithm and deep reinforcement learning for in-core fuel management." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127308.

Full text
Abstract:
Thesis: S.B., Massachusetts Institute of Technology, Department of Nuclear Science and Engineering, May 2020. The nuclear reactor core is composed of a few hundred assemblies. The loading of these assemblies is done with the goal of reducing the overall cost while maintaining safety limits. Typically, core designers choose a unique position and fuel enrichment for each assembly through expert judgement. In this thesis, alternatives to the current core reload design process are explored. A genetic algorithm and deep Q-learning are applied in an attempt to reduce core design time and improve the final core layout. The reference core represents a 4-loop pressurized water reactor in which a fixed number of fuel enrichments and burnable poison distributions is assumed. The algorithms automatically shuffle the assembly positions to find the optimum loading pattern. It is determined that both algorithms are able to start with a poorly performing core loading pattern and discover a well-performing one, by the metrics of boron concentration, cycle exposure, enthalpy-rise factor, and pin power peaking. This shows potential for further applications of these algorithms to core design with a more expanded search space.
APA, Harvard, Vancouver, ISO, and other styles
45

Gardini, Lorenzo. "Studio e Sperimentazione di Algoritmi di Reinforcement Learning Applicati a Video Game." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/21847/.

Full text
Abstract:
The first part of my work presents a study of an initial "from scratch" solution developed by Andrew Karpathy. Two improvements of mine follow: the first directly modifies the code of the previous solution and introduces, as an additional objective for the network in the early stages of play, the interception of the ball by the paddle, improving the initial training; the second is my own implementation using more complex algorithms, which are state of the art on Atari games and which lead to much faster training of the network.
APA, Harvard, Vancouver, ISO, and other styles
46

Elkind, Daniel (Daniel Harris). "A reinforcement learning algorithm for efficient dynamic trading execution in the presence of signals." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/124585.

Full text
Abstract:
Thesis: S.M. in Management Research, Massachusetts Institute of Technology, Sloan School of Management, 2019. This paper focuses on the optimal trading execution problem, where a trader seeks to maximize the proceeds from trading a given quantity of shares of a financial asset over a fixed-duration trading period, considering that trading impacts the future trajectory of prices. I propose a reinforcement learning (RL) algorithm to solve this maximization problem. I prove that the algorithm converges to the optimal solution in a large class of settings and point out a useful duality between the learning contraction and the dynamic programming PDE. Using simulations calibrated to historical exchange trading data, I show that (i) the algorithm reproduces the analytical solution for the case of random-walk prices with a linear absolute price impact function and (ii) matches the output of classical dynamic programming methods for the case of geometric Brownian motion prices with linear relative price impact. In the most relevant case, when a signal containing information about prices is introduced to the environment, traditional computational methods become intractable. My algorithm still finds the optimal execution policy, leading to a statistically and economically meaningful reduction in trading costs.
APA, Harvard, Vancouver, ISO, and other styles
47

Robards, Matthew Walters. "Online learning algorithms for reinforcement learning with function approximation." Phd thesis, 2011. http://hdl.handle.net/1885/150825.

Full text
Abstract:
Reinforcement learning deals with the problem of sequential decision making in uncertain stochastic environments. In this thesis I deal with agents that attempt to solve the reinforcement learning problem online and in real time. This presents experimental challenges for which I introduce novel kernelised algorithms. Kernel algorithms are very useful in reinforcement learning settings as they enable learning in situations where a very high-dimensional or hand-engineered feature vector would otherwise be required. Furthermore, I attempt to address the theoretical challenges which arise from online on-policy algorithms, for which I introduce a type of analysis that is novel (and useful) to reinforcement learning in its lack of restrictive assumptions on the behaviour policy. I will introduce three novel algorithms attempting to advance the areas of kernel, empirical and theoretical reinforcement learning. The first of these algorithms presents a kernel extension of SARSA for its empirical properties, namely its incorporation of eligibility traces with sparse kernel algorithms. I then present a model-free/model-based ensemble which uses gradient-based methods for online learning. I present it with a regret analysis which enables an analysis of the value functions learned with no probabilistic assumptions, and hence no assumptions on the behaviour policy. Along the way I also make a novel "sub-contribution", namely non-squared loss functions for reinforcement learning. The use of different loss functions constitutes a running theme through the algorithms I introduce, as I show that various non-traditional (to reinforcement learning) loss functions can be useful both for the efficiency of the algorithm and for accuracy, by ensuring smooth function approximations. I present thorough experimental and theoretical analyses along the way.
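For orientation, tabular SARSA(lambda) with accumulating eligibility traces, the non-kernel ancestor of the first algorithm described above, can be sketched as follows; the kernelised, sparse version developed in the thesis replaces the table with a kernel expansion and is not shown.

    import numpy as np

    def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=0.99, lam=0.9):
        """One SARSA(lambda) step with accumulating eligibility traces.
        Q, E : arrays of shape (n_states, n_actions); updated in place."""
        delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # on-policy TD error
        E[s, a] += 1.0                                    # accumulate trace
        Q += alpha * delta * E                            # credit all visited pairs
        E *= gamma * lam                                  # decay traces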
APA, Harvard, Vancouver, ISO, and other styles
48

Akchurina, Natalia [Verfasser]. "Multi-agent reinforcement learning algorithms / Natalia Akchurina." 2010. http://d-nb.info/1005943761/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Sindhu, P. R. "Algorithms for Challenges to Practical Reinforcement Learning." Thesis, 2020. https://etd.iisc.ac.in/handle/2005/4983.

Full text
Abstract:
Reinforcement learning (RL) in real-world applications faces major hurdles, the foremost being the safety of the physical system controlled by the learning agent and the varying environment conditions in which the autonomous agent functions. An RL agent learns to control a system by exploring available actions. In some operating states, when the RL agent exercises an exploratory action, the system may enter unsafe operation, which can lead to safety hazards both for the system and for the humans supervising it. RL algorithms thus need to respect these safety constraints and must do so with limited available information. Additionally, autonomous RL agents learn optimal decisions in the presence of a stationary environment. However, the stationarity assumption on the environment is very restrictive. In many real-world problems like traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios RL algorithms yield sub-optimal decisions. The first part of this thesis develops algorithmic solutions to the challenges of safety and non-stationary environmental conditions. In order to handle safety restrictions and facilitate safe exploration during learning, this thesis proposes a sample-efficient learning algorithm based on the cross-entropy method. This algorithm is developed within a constrained optimization framework and utilizes very limited information for the learning of feasible policies. During the learning iterations, exploration is also guided in a manner that minimizes safety violations. The first part also describes an algorithm for the second challenge. The goal of this algorithm is to maximize the long-term discounted reward accrued when the latent model of the environment changes with time. To achieve this, the algorithm leverages a change-point detection procedure to detect changes in the statistics of the environment, and its output is used to reset the learning of policies. The second part of this thesis describes the application of RL in networked intelligent systems. We consider two such systems: aerial quadrotor navigation and an industrial internet of things (IIoT) system. In the quadrotor navigation problem, with improved usage of machine learning computational frameworks, our proposed method is able to improve upon previously proposed obstacle avoidance algorithms for aerial vehicles. Obstacle avoidance in quadrotor navigation brings additional challenges compared to ground vehicles, because an aerial vehicle has to navigate around more types of obstacles; for example, decorative items, furnishings, ceiling fans, sign-boards, and tree branches are all potential obstacles for a quadrotor. Thus, methods of obstacle avoidance developed for ground robots are clearly inadequate for UAV navigation. Our algorithm improves the efficiency of learning by inferring navigation decisions from temporal information about the ambient surroundings. This information is represented using monocular camera images collected by the quadrotor. An industrial internet-of-things (IIoT) system has multiple IoT devices and a user equipment (UE), together with a base station (BS) that receives the UE and IoT data. To circumvent the issue of numerous IoT-to-BS connections and to conserve the IoT devices' energy, the UE serves as a relay to forward the IoT data to the BS.
In this thesis, we consider a specific multi-objective optimization problem that arises in this simple IIoT setup. The UE employs frame-based uplink transmissions, wherein it shares a few slots of every frame to relay the IoT data. The IIoT system experiences a transmission failure, called an outage, when IoT data is not transmitted. Unsent UE data is stored in the UE's buffer and is discarded once its storage time exceeds the age threshold. As the UE and IoT devices share the transmission slots, trade-offs exist between system outages and aged UE data loss. To resolve this outage versus data-ageing challenge, we adapt the Q-learning algorithm for slot sharing between UE and IoT data and show numerical results for it.
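A toy version of the slot-sharing decision described above could be set up with Q-learning as in the sketch below; the state variables, the reward trading off outages against aged-data loss, and all constants are assumptions made for illustration only.

    import numpy as np

    class SlotSharingAgent:
        """Q-learning agent that decides, each frame, how many of the UE's
        uplink slots to lend to IoT data. State = (discretised UE buffer age,
        IoT backlog); reward penalises both IoT outages and aged UE data loss."""

        def __init__(self, n_age_levels, n_backlog_levels, n_slot_choices,
                     alpha=0.1, gamma=0.9, eps=0.1):
            self.Q = np.zeros((n_age_levels, n_backlog_levels, n_slot_choices))
            self.alpha, self.gamma, self.eps = alpha, gamma, eps

        def act(self, age, backlog):
            # epsilon-greedy choice of how many slots to share this frame
            if np.random.rand() < self.eps:
                return np.random.randint(self.Q.shape[2])
            return int(np.argmax(self.Q[age, backlog]))

        def update(self, state, action, reward, next_state):
            age, backlog = state
            n_age, n_backlog = next_state
            target = reward + self.gamma * np.max(self.Q[n_age, n_backlog])
            self.Q[age, backlog, action] += self.alpha * (
                target - self.Q[age, backlog, action])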
APA, Harvard, Vancouver, ISO, and other styles
50

Weaver, Lex. "Reinforcement learning : some algorithmic improvements." Phd thesis, 2003. http://hdl.handle.net/1885/148526.

Full text
APA, Harvard, Vancouver, ISO, and other styles