
Journal articles on the topic 'Bandit learning'


Consult the top 50 journal articles for your research on the topic 'Bandit learning.'

Next to every source in the list of references there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pulls a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide range of applications, such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and
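The interaction protocol described in this abstract is easy to make concrete. Below is a minimal, self-contained simulation of the bandit loop, given only as an illustrative sketch with hypothetical arm parameters (it is not code from the cited paper), using a simple epsilon-greedy rule to trade off exploration and exploitation:

```python
import random

def epsilon_greedy_bandit(true_means, horizon=10_000, epsilon=0.1, seed=0):
    """Simulate the basic bandit loop: pick an arm, observe a stochastic reward.

    true_means: Bernoulli reward probabilities, one per arm (hidden from the agent).
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k           # pulls per arm
    estimates = [0.0] * k      # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Explore with probability epsilon, otherwise exploit the best estimate so far.
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # environment responds
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update
        total_reward += reward
    return estimates, total_reward

if __name__ == "__main__":
    est, total = epsilon_greedy_bandit([0.2, 0.5, 0.7])
    print("estimated means:", [round(e, 3) for e in est], "total reward:", total)
```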
2

Azizi, Javad, Branislav Kveton, Mohammad Ghavamzadeh, and Sumeet Katariya. "Meta-Learning for Simple Regret Minimization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (2023): 6709–17. http://dx.doi.org/10.1609/aaai.v37i6.25823.

Abstract:
We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm has access to a prior distribution over the meta-parameters and its meta simple regret over m bandit tasks with horizon n is mere O(m / √n). On the other hand, the meta simple regret of the frequentist
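For readers unfamiliar with the objective, simple regret measures only the quality of the arm recommended at the end of a task, in contrast to cumulative regret. A standard formalization, stated here as background rather than quoted from the paper (the meta version below is one common aggregate over the m tasks, consistent with the O(m/√n) rate above), is:

```latex
% Simple regret of the arm \hat{a}_n recommended after n rounds of a task
% whose best mean reward is \mu^*:
r_n = \mu^{*} - \mu_{\hat{a}_n},
\qquad
% one common aggregate over m sampled tasks:
R_{m,n} = \sum_{j=1}^{m} \mathbb{E}\big[\mu_j^{*} - \mu_{j,\hat{a}_{j,n}}\big].
```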
3

Sharaf, Amr, and Hal Daumé III. "Meta-Learning Effective Exploration Strategies for Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (2021): 9541–48. http://dx.doi.org/10.1609/aaai.v35i11.17149.

Abstract:
In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, Mêlée, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. Mêlée uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual…
4

Charniauski, Uladzimir, and Yao Zheng. "Autoregressive Bandits in Near-Unstable or Unstable Environment." American Journal of Undergraduate Research 21, no. 2 (2024): 15–25. http://dx.doi.org/10.33697/ajur.2024.116.

Abstract:
AutoRegressive Bandits (ARBs) is a novel model of a sequential decision-making problem as an autoregressive (AR) process. In this online learning setting, the observed reward follows an autoregressive process, whose action parameters are unknown to the agent and create an AR dynamic that depends on actions the agent chooses. This study empirically demonstrates how assigning the extreme values of systemic stability indexes and other reward-governing parameters severely impairs the ARBs learning in the respective environment. We show that this algorithm suffers numerically larger regrets of high
5

Zhao, Yunfan, Tonghan Wang, Dheeraj Mysore Nagaraj, Aparna Taneja, and Milind Tambe. "The Bandit Whisperer: Communication Learning for Restless Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 22 (2025): 23404–13. https://doi.org/10.1609/aaai.v39i22.34508.

Abstract:
Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors, a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning…
6

Wan, Zongqi, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun. "Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (2023): 10087–94. http://dx.doi.org/10.1609/aaai.v37i8.26202.

Abstract:
Multi-arm bandit (MAB) and stochastic linear bandit (SLB) are important models in reinforcement learning, and it is well-known that classical algorithms for bandits with time horizon T suffer regret of at least the square root of T. In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with the order of the polylog T regrets, exponentially improving the dependence in terms of T. To the best of our knowledge, this is the first provable quantum speedup for regrets of bandit problems and in general exploitation in reinforcement learning…
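As background for the square-root-of-T barrier mentioned here, the cumulative (pseudo-)regret and the rates being contrasted can be written as follows. This uses standard notation (K is the number of arms) and is not quoted from the paper; only the polylog T rate comes from the abstract:

```latex
% Cumulative (pseudo-)regret over horizon T, where \mu^* is the best arm's mean
% and A_t is the arm pulled at round t:
R(T) = T\,\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{A_t}\Big],
\qquad
\text{classical worst case: } R(T) = \Omega\big(\sqrt{KT}\big),
\qquad
\text{quantum reward oracles (this paper): } R(T) = O(\mathrm{poly}\log T).
```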
7

Yang, Luting, Jianyi Yang, and Shaolei Ren. "Contextual Bandits with Delayed Feedback and Semi-supervised Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 18 (2021): 15943–44. http://dx.doi.org/10.1609/aaai.v35i18.17968.

Abstract:
Contextual multi-armed bandit (MAB) is a classic online learning problem, where a learner/agent selects actions (i.e., arms) given contextual information and discovers optimal actions based on reward feedback. Applications of contextual bandit have been increasingly expanding, including advertisement, personalization, resource allocation in wireless networks, among others. Nonetheless, the reward feedback is delayed in many applications (e.g., a user may only provide service ratings after a period of time), creating challenges for contextual bandits. In this paper, we address delayed feedback
8

Zhou, Pengjie, Haoyu Wei, and Huiming Zhang. "Selective Reviews of Bandit Problems in AI via a Statistical View." Mathematics 13, no. 4 (2025): 665. https://doi.org/10.3390/math13040665.

Abstract:
Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes multi-armed bandit (MAB) and stochastic continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration–exploitation…
9

Qu, Jiaming. "Survey of dynamic pricing based on Multi-Armed Bandit algorithms." Applied and Computational Engineering 37, no. 1 (2024): 160–65. http://dx.doi.org/10.54254/2755-2721/37/20230497.

Abstract:
Dynamic pricing seeks to determine the most optimal selling price for a product or service, taking into account factors like limited supply and uncertain demand. This study aims to provide a comprehensive exploration of dynamic pricing using the multi-armed bandit problem framework in various contexts. The investigation highlights the prevalence of Thompson sampling in dynamic pricing scenarios with a Bayesian backdrop, where the seller possesses prior knowledge of demand functions. On the other hand, in non-Bayesian situations, the Upper Confidence Bound (UCB) algorithm family gains traction
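As a concrete point of reference for the Bayesian approach mentioned here, Beta-Bernoulli Thompson sampling over a discrete grid of candidate prices can be sketched in a few lines. The prices and purchase probabilities below are hypothetical, and this is generic textbook code rather than any method from the cited survey:

```python
import random

def thompson_pricing(prices, buy_probs, horizon=5000, seed=1):
    """Beta-Bernoulli Thompson sampling over a discrete set of candidate prices.

    prices: candidate selling prices (the 'arms').
    buy_probs: true purchase probability at each price (hidden from the seller).
    """
    rng = random.Random(seed)
    k = len(prices)
    alpha = [1.0] * k  # Beta posterior parameters for each price's purchase probability
    beta = [1.0] * k
    revenue = 0.0
    for _ in range(horizon):
        # Sample a purchase probability for each price from its posterior, then
        # post the price with the highest expected revenue under the sample.
        sampled = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: sampled[i] * prices[i])
        bought = rng.random() < buy_probs[arm]  # stochastic demand response
        alpha[arm] += bought
        beta[arm] += 1 - bought
        revenue += prices[arm] if bought else 0.0
    return revenue

if __name__ == "__main__":
    print("total revenue:", round(thompson_pricing([5, 10, 20], [0.8, 0.5, 0.2]), 2))
```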
10

Kapoor, Sayash, Kumar Kshitij Patel, and Purushottam Kar. "Corruption-tolerant bandit learning." Machine Learning 108, no. 4 (2018): 687–715. http://dx.doi.org/10.1007/s10994-018-5758-5.

11

Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.

Abstract:
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high probability constraints on the expected reward, we focus on a sample-path constraint on the actual
12

Cheung, Wang Chi, David Simchi-Levi, and Ruihao Zhu. "Hedging the Drift: Learning to Optimize Under Nonstationarity." Management Science 68, no. 3 (2022): 1696–713. http://dx.doi.org/10.1287/mnsc.2021.4024.

Abstract:
We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of nonstationary stochastic bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) nonstationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Beginning with the linear bandit setting, we design and analyze a sliding window-upper confidence bound…
13

Lupu, Andrei, Audrey Durand, and Doina Precup. "Leveraging Observations in Bandits: Between Risks and Benefits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6112–19. http://dx.doi.org/10.1609/aaai.v33i01.33016112.

Abstract:
Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target's actions, no
14

Caro, Felipe, and Onesun Steve Yoo. "INDEXABILITY OF BANDIT PROBLEMS WITH RESPONSE DELAYS." Probability in the Engineering and Informational Sciences 24, no. 3 (2010): 349–74. http://dx.doi.org/10.1017/s0269964810000021.

Abstract:
This article considers an important class of discrete time restless bandits, given by the discounted multiarmed bandit problems with response delays. The delays in each period are independent random variables, in which the delayed responses do not cross over. For a bandit arm in this class, we use a coupling argument to show that in each state there is a unique subsidy that equates the pulling and nonpulling actions (i.e., the bandit satisfies the indexability criterion introduced by Whittle (1988)). The result allows for infinite or finite horizon and holds for arbitrary delay lengths and infinite…
15

Buchholz, Simon, Jonas M. Kübler, and Bernhard Schölkopf. "Multi-Armed Bandits and Quantum Channel Oracles." Quantum 9 (March 25, 2025): 1672. https://doi.org/10.22331/q-2025-03-25-1672.

Abstract:
Multi-armed bandits are one of the theoretical pillars of reinforcement learning. Recently, the investigation of quantum algorithms for multi-armed bandit problems was started, and it was found that a quadratic speed-up (in query complexity) is possible when the arms and the randomness of the rewards of the arms can be queried in superposition. Here we introduce further bandit models where we only have limited access to the randomness of the rewards, but we can still query the arms in superposition. We show that then the query complexity is the same as for classical algorithms. This generalizes…
16

Zhao, Shanshan, Wenhai Cui, Bei Jiang, Linglong Kong, and Xiaodong Yan. "Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 19 (2024): 21815–22. http://dx.doi.org/10.1609/aaai.v38i19.30182.

Abstract:
To ensure the safety of users by protecting their privacy, the traditional privacy-preserving bandit algorithm aiming to maximize the mean reward has been widely studied in scenarios such as online ride-hailing, advertising recommendations, and personalized healthcare. However, classical bandit learning is irresponsible in such practical applications, as it fails to account for risks in online decision-making and ignores external system information. This paper firstly proposes privacy-protected mean-volatility utility as the objective of bandit learning and proves its responsibility, because it…
17

Narita, Yusuke, Shota Yasui, and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.

Abstract:
What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm w
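The standard baseline that such counterfactual estimators are compared against is inverse propensity scoring (IPS), which reweights logged rewards by the ratio of the target policy's action probability to the logging policy's. The sketch below is a minimal IPS estimator with hypothetical field names, not the authors' estimator:

```python
def ips_value_estimate(logged_data, target_policy):
    """Inverse propensity scoring estimate of a counterfactual policy's expected reward.

    logged_data: iterable of (context, action, reward, logging_prob) tuples, where
                 logging_prob is the probability the logging policy gave to `action`.
    target_policy: function (context, action) -> probability under the evaluated policy.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logged_data:
        weight = target_policy(context, action) / logging_prob  # importance weight
        total += weight * reward
        n += 1
    return total / n if n else 0.0

# Toy usage: a uniform logging policy over two actions, evaluated against a
# deterministic target policy that always plays action 1.
logs = [("u1", 0, 0.0, 0.5), ("u2", 1, 1.0, 0.5), ("u3", 1, 0.0, 0.5)]
always_one = lambda context, action: 1.0 if action == 1 else 0.0
print(ips_value_estimate(logs, always_one))  # -> (0 + 2*1 + 2*0) / 3 ≈ 0.667
```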
18

Varatharajah, Yogatheesan, and Brent Berry. "A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials." Life 12, no. 8 (2022): 1277. http://dx.doi.org/10.3390/life12081277.

Abstract:
Clinical trials are conducted to evaluate the efficacy of new treatments. Clinical trials involving multiple treatments utilize the randomization of treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of not having received the most appropriate treatments during the trial. Reinforcement-learning…
19

Zhu, Zhaowei, Jingxuan Zhu, Ji Liu, and Yang Liu. "Federated Bandit." Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, no. 1 (2021): 1–29. http://dx.doi.org/10.1145/3447380.

Abstract:
In this paper, we study Federated Bandit, a decentralized Multi-Armed Bandit problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions, while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim…
20

Lopez, Romain, Inderjit S. Dhillon, and Michael I. Jordan. "Learning from eXtreme Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (2021): 8732–40. http://dx.doi.org/10.1609/aaai.v35i10.17058.

Abstract:
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases c
21

Sharma, Dravyansh, and Arun Suggala. "Offline-to-Online Hyperparameter Transfer for Stochastic Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 19 (2025): 20362–70. https://doi.org/10.1609/aaai.v39i19.34243.

Abstract:
Classic algorithms for stochastic bandits typically use hyperparameters that govern their critical properties such as the trade-off between exploration and exploitation. Tuning these hyperparameters is a problem of great practical significance. However, this is a challenging problem, and in certain cases it is information-theoretically impossible. To address this challenge, we consider a practically relevant transfer learning setting where one has access to offline data collected from several bandit problems (tasks) coming from an unknown distribution over the tasks. Our aim is to use this offline
22

Asanov, Igor. "Bandit cascade: A test of observational learning in the bandit problem." Journal of Economic Behavior & Organization 189 (September 2021): 150–71. http://dx.doi.org/10.1016/j.jebo.2021.06.006.

23

Dimakopoulou, Maria, Zhengyuan Zhou, Susan Athey, and Guido Imbens. "Balanced Linear Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3445–53. http://dx.doi.org/10.1609/aaai.v33i01.33013445.

Abstract:
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms…
24

Cohen, Saar, and Noa Agmon. "Online Learning of Coalition Structures by Selfish Agents." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 13 (2025): 13709–17. https://doi.org/10.1609/aaai.v39i13.33498.

Abstract:
Coalition formation concerns autonomous agents that strategically interact to form self-organized coalitions. When agents lack sufficient initial information to evaluate their preferences before interacting with others, they learn them online through repeated feedback while iteratively forming coalitions. In this work, we introduce online learning in coalition formation from a non-cooperative perspective, studying the impact of collective data utilization where selfish agents aim to accelerate their learning by leveraging a shared data platform. Thus, the efficiency and dynamics of the learning…
25

Nobari, Sadegh. "DBA: Dynamic Multi-Armed Bandit Algorithm." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9869–70. http://dx.doi.org/10.1609/aaai.v33i01.33019869.

Abstract:
We introduce Dynamic Bandit Algorithm (DBA), a practical solution to address the shortcoming of the pervasively employed reinforcement learning algorithm called Multi-Arm Bandit, aka Bandit. Bandit makes real-time decisions based on the prior observations. However, Bandit is so heavily biased toward the priors that it cannot quickly adapt itself to a changing trend. As a result, Bandit cannot make profitable decisions quickly enough when the trend is changing. Unlike Bandit, DBA focuses on quickly adapting itself to detect these trends early enough. Furthermore, DBA remains as almost a
26

Shi, Chengshuai, and Cong Shen. "Federated Multi-Armed Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (2021): 9603–11. http://dx.doi.org/10.1609/aaai.v35i11.17156.

Abstract:
Federated multi-armed bandits (FMAB) is a new bandit paradigm that parallels the federated learning (FL) framework in supervised learning. It is inspired by practical applications in cognitive radio and recommender systems, and enjoys features that are analogous to FL. This paper proposes a general framework of FMAB and then studies two specific federated bandit models. We first study the approximate model where the heterogeneous local models are random realizations of the global model from an unknown distribution. This model introduces a new uncertainty of client sampling, as the global model
27

Tran, Alasdair, Cheng Soon Ong, and Christian Wolf. "Combining active learning suggestions." PeerJ Computer Science 4 (July 23, 2018): e157. http://dx.doi.org/10.7717/peerj-cs.157.

Abstract:
We study the problem of combining active learning suggestions to identify informative training examples by empirically comparing methods on benchmark datasets. Many active learning heuristics for classification problems have been proposed to help us pick which instance to annotate next. But what is the optimal heuristic for a particular source of data? Motivated by the success of methods that combine predictors, we combine active learners with bandit algorithms and rank aggregation methods. We demonstrate that a combination of active learners outperforms passive learning in large benchmark datasets…
28

Yang, Jianyi, and Shaolei Ren. "Robust Bandit Learning with Imperfect Context." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (2021): 10594–602. http://dx.doi.org/10.1609/aaai.v35i12.17267.

Abstract:
A standard assumption in contextual multi-arm bandits is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a novel contextual bandit setting in which only imperfect context is available for arm selection while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: MaxMinUCB (Maximize Minimum UCB), which maximizes…
29

Truong, Quoc-Tuan, and Hady W. Lauw. "Variational learning from implicit bandit feedback." Machine Learning 110, no. 8 (2021): 2085–105. http://dx.doi.org/10.1007/s10994-021-06028-0.

30

Lai, Tze-Leung, and S. Yakowitz. "Machine learning and nonparametric bandit theory." IEEE Transactions on Automatic Control 40, no. 7 (1995): 1199–209. http://dx.doi.org/10.1109/9.400491.

31

He-Yueya, Joy, Jonathan Lee, Matthew Jörke, and Emma Brunskill. "Cost-Aware Near-Optimal Policy Learning." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 27 (2025): 28088–96. https://doi.org/10.1609/aaai.v39i27.35027.

Abstract:
It is often of interest to learn a context-sensitive decision policy, such as in contextual multi-armed bandit processes. To quantify the efficiency of a machine learning algorithm for such settings, probably approximately correct (PAC) bounds, which bound the number of samples required, or cumulative regret guarantees, are typically used. However, real-world settings often have limited resources for experimentation, and decisions/interventions may differ in the amount of resources required (e.g., money or time). Therefore, it is of interest to consider how to design an experiment strategy that…
32

Wu, Jiazhen. "In-depth Exploration and Implementation of Multi-Armed Bandit Models Across Diverse Fields." Highlights in Science, Engineering and Technology 94 (April 26, 2024): 201–5. http://dx.doi.org/10.54097/d3ez0n61.

Abstract:
This paper presents an in-depth analysis of the Multi-Armed Bandit (MAB) problem, tracing its evolution from its origins in the gambling domain of the 1940s to its current prominence in machine learning and artificial intelligence. The analysis begins with a historical overview, noting key developments like Herbert Robbins' probabilistic framework and the expansion of the problem into strategic decision-making in the 1970s. The emergence of algorithms like the Upper Confidence Bound (UCB) and Thompson Sampling in the late 20th century is highlighted, demonstrating the MAB problem's transition
33

Karpov, Nikolai, and Qin Zhang. "Instance-Sensitive Algorithms for Pure Exploration in Multinomial Logit Bandit." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (2022): 7096–103. http://dx.doi.org/10.1609/aaai.v36i7.20669.

Abstract:
Motivated by real-world applications such as fast fashion retailing and online advertising, the Multinomial Logit Bandit (MNL-bandit) is a popular model in online learning and operations research, and has attracted much attention in the past decade. In this paper, we give efficient algorithms for pure exploration in MNL-bandit. Our algorithms achieve instance-sensitive pull complexities. We also complement the upper bounds by an almost matching lower bound.
34

Huo, Xiaoguang, and Feng Fu. "Risk-aware multi-armed bandit problem with application to portfolio selection." Royal Society Open Science 4, no. 11 (2017): 171377. http://dx.doi.org/10.1098/rsos.171377.

Abstract:
Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct a portfolio. Through
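One common way to make arm selection risk-aware, given here only as background and not necessarily the formulation used in the cited paper, is to score each arm by an empirical mean-variance trade-off:

```latex
% Mean-variance index of arm a with empirical mean \hat{\mu}_a, empirical
% variance \hat{\sigma}_a^2, and risk-aversion coefficient \rho \ge 0;
% a bandit algorithm then adds its usual exploration bonus to this index.
\mathrm{MV}_a = \hat{\mu}_a - \rho\,\hat{\sigma}_a^{2},
\qquad
a_t = \arg\max_a \big(\mathrm{MV}_a + \text{exploration bonus}_a\big).
```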
35

Gao, Xuefeng, and Tianrun Xu. "Order scoring, bandit learning and order cancellations." Journal of Economic Dynamics and Control 134 (January 2022): 104287. http://dx.doi.org/10.1016/j.jedc.2021.104287.

36

Xu, Yiming, Vahid Keshavarzzadeh, Robert M. Kirby, and Akil Narayan. "A Bandit-Learning Approach to Multifidelity Approximation." SIAM Journal on Scientific Computing 44, no. 1 (2022): A150–A175. http://dx.doi.org/10.1137/21m1408312.

37

Brezzi, Monica, and Tze Leung Lai. "Optimal learning and experimentation in bandit problems." Journal of Economic Dynamics and Control 27, no. 1 (2002): 87–108. http://dx.doi.org/10.1016/s0165-1889(01)00028-8.

38

Rosenberg, Dinah, Eilon Solan, and Nicolas Vieille. "Social Learning in One-Arm Bandit Problems." Econometrica 75, no. 6 (2007): 1591–611. http://dx.doi.org/10.1111/j.1468-0262.2007.00807.x.

39

Lefebvre, Germain, Christopher Summerfield, and Rafal Bogacz. "A Normative Account of Confirmation Bias During Reinforcement Learning." Neural Computation 34, no. 2 (2022): 307–37. http://dx.doi.org/10.1162/neco_a_01455.

Abstract:
Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding…
40

Narita, Yusuke, Kyohei Okumura, Akihiro Shimizu, and Kohei Yata. "Counterfactual Learning with General Data-Generating Policies." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (2023): 9286–93. http://dx.doi.org/10.1609/aaai.v37i8.26113.

Abstract:
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandit (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We v
41

Zhu, Zhaowei, Jingxuan Zhu, Ji Liu, and Yang Liu. "Federated Bandit: A Gossiping Approach." ACM SIGMETRICS Performance Evaluation Review 49, no. 1 (2022): 3–4. http://dx.doi.org/10.1145/3543516.3453919.

Abstract:
We study Federated Bandit, a decentralized Multi-Armed Bandit (MAB) problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions, while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution…
42

Tang, Qiao, Hong Xie, Yunni Xia, Jia Lee, and Qingsheng Zhu. "Robust Contextual Bandits via Bootstrapping." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (2021): 12182–89. http://dx.doi.org/10.1609/aaai.v35i13.17446.

Abstract:
Upper confidence bound (UCB) based contextual bandit algorithms require one to know the tail property of the reward distribution. Unfortunately, such tail property is usually unknown or difficult to specify in real-world applications. Using a tail property heavier than the ground truth leads to a slow learning speed of the contextual bandit algorithm, while using a lighter one may cause the algorithm to diverge. To address this fundamental problem, we develop an estimator (evaluated from historical rewards) for the contextual bandit UCB based on the multiplier bootstrapping technique. We first
43

Kaibel, Chris, and Torsten Biemann. "Rethinking the Gold Standard With Multi-armed Bandits: Machine Learning Allocation Algorithms for Experiments." Organizational Research Methods 24, no. 1 (2019): 78–103. http://dx.doi.org/10.1177/1094428119854153.

Abstract:
In experiments, researchers commonly allocate subjects randomly and equally to the different treatment conditions before the experiment starts. While this approach is intuitive, it means that new information gathered during the experiment is not utilized until after the experiment has ended. Based on methodological approaches from other scientific disciplines such as computer science and medicine, we suggest machine learning algorithms for subject allocation in experiments. Specifically, we discuss a Bayesian multi-armed bandit algorithm for randomized controlled trials and use Monte Carlo simulations…
44

Garcelon, Evrard, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. "Improved Algorithms for Conservative Exploration in Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (2020): 3962–69. http://dx.doi.org/10.1609/aaai.v34i04.5812.

Abstract:
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning…
45

Fei, Bo. "Comparative analysis and applications of classic multi-armed bandit algorithms and their variants." Applied and Computational Engineering 68, no. 1 (2024): 17–30. http://dx.doi.org/10.54254/2755-2721/68/20241389.

Abstract:
The multi-armed bandit problem, a pivotal aspect of Reinforcement Learning (RL), presents a classic dilemma in sequential decision-making, balancing exploration with exploitation. Renowned bandit algorithms like Explore-Then-Commit, Epsilon-Greedy, SoftMax, Upper Confidence Bound (UCB), and Thompson Sampling have demonstrated efficacy in addressing this issue. Nevertheless, each algorithm exhibits unique strengths and weaknesses, necessitating a detailed comparative evaluation. This paper executes a series of implementations of various established bandit algorithms and their derivatives, aiming…
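Of the algorithms compared in this paper, UCB is the most compact to state. The sketch below is a minimal UCB1 implementation with hypothetical arm means, given only as an illustration of the general technique and not as the paper's own code:

```python
import math
import random

def ucb1(true_means, horizon=10_000, seed=2):
    """UCB1: pull each arm once, then always pull the arm with the highest
    empirical mean plus a confidence bonus that shrinks as the arm is pulled more."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli feedback
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means

if __name__ == "__main__":
    pulls, est = ucb1([0.3, 0.55, 0.6])
    print("pulls per arm:", pulls, "estimated means:", [round(m, 3) for m in est])
```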
46

Wu, Wen, Nan Cheng, Ning Zhang, Peng Yang, Weihua Zhuang, and Xuemin Shen. "Fast mmwave Beam Alignment via Correlated Bandit Learning." IEEE Transactions on Wireless Communications 18, no. 12 (2019): 5894–908. http://dx.doi.org/10.1109/twc.2019.2940454.

47

He, Di, Wei Chen, Liwei Wang, and Tie-Yan Liu. "Online learning for auction mechanism in bandit setting." Decision Support Systems 56 (December 2013): 379–86. http://dx.doi.org/10.1016/j.dss.2013.07.004.

48

Cayci, Semih, Atilla Eryilmaz, and R. Srikant. "Learning to Control Renewal Processes with Bandit Feedback." ACM SIGMETRICS Performance Evaluation Review 47, no. 1 (2019): 41–42. http://dx.doi.org/10.1145/3376930.3376957.

49

Cayci, Semih, Atilla Eryilmaz, and R. Srikant. "Learning to Control Renewal Processes with Bandit Feedback." Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, no. 2 (2019): 1–32. http://dx.doi.org/10.1145/3341617.3326158.

50

Zhang, Shuning. "Utilizing Reinforcement Learning Bandit Algorithms in Advertising Optimization." Highlights in Science, Engineering and Technology 94 (April 26, 2024): 195–200. http://dx.doi.org/10.54097/z976ty46.

Abstract:
This research provides a comprehensive analysis of the application of Multi-Armed Bandit (MAB) algorithms in the field of advertising, particularly highlighting the crucial balance between exploration and exploitation strategies. The implementation of MAB algorithms, especially within the framework of reinforcement learning, introduces a dynamic approach to optimizing advertisement placements and mixtures. This paper conducts a critical review of traditional advertising technologies such as rule engines and keyword targeting, drawing a comparison with more advanced techniques like the Explore-