
Reinforcement Learning

ECTS: 6

Course coordinator:

Teaching hours: 24

Course content description:

Outline:

  • Introduction to reinforcement learning (RL) and Markov decision processes (MDP)
  • Algorithms for MDPs: dynamic programming, value iteration, policy iteration, linear programming
  • Multi-armed bandits: exploration vs. exploitation, stochastic bandits, adversarial bandits
  • Value-based RL: general ideas and basic algorithms (TD-learning, SARSA, Q-learning)
  • Function approximation: linear function approximation, deep neural networks and DQN
  • Policy-based and actor-critic RL: policy gradient, natural policy gradient, actor-critic algorithms
  • Bandits and RL: regret bounds and UCRL algorithm
  • Decentralized MDPs and multi-agent RL: complexity results, simple algorithms (e.g. independent learners) and their applications (e.g. EV charging, wind farm control)
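To give a concrete flavor of the dynamic-programming algorithms listed above, here is a minimal value iteration sketch on a tiny MDP. The transition tensor and rewards below are invented for illustration; they are not part of the course material.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers are made up for the example).
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                  # R[s, a] expected immediate reward
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality update: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy with respect to the converged values
```

Because the Bellman operator is a gamma-contraction, the loop converges geometrically; policy iteration and linear programming (also on the outline) solve the same fixed-point problem by different means.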

Each lecture is organized in two parts (with a 15-minute break):

  • First part: lecture covering main notions
  • Second part: more advanced material and/or exercises

Required prerequisites:

Basic notions in linear algebra and probability theory; Python

Skills to acquire:

Reinforcement Learning (RL) refers to scenarios where the learning algorithm operates in closed loop, simultaneously using past data to adjust its decisions and taking actions that will influence future observations. RL algorithms combine ideas from control, machine learning, statistics, and operations research. A common thread of all RL algorithms is the need to balance exploration (trying new things) and exploitation (choosing the most successful actions so far).
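The exploration/exploitation trade-off can be sketched with a simple ε-greedy strategy on a multi-armed bandit. The arm success probabilities below are invented for the example; ε-greedy is only one of several strategies (UCB-style algorithms are covered in the course).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Bernoulli bandit; success probabilities are made up.
true_means = np.array([0.2, 0.5, 0.8])
eps, n_rounds = 0.1, 5000

counts = np.zeros(3)      # number of pulls per arm
estimates = np.zeros(3)   # empirical mean reward per arm

for t in range(n_rounds):
    # Explore uniformly with probability eps, otherwise exploit the best estimate.
    if rng.random() < eps:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(estimates))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    # Incremental update of the empirical mean of the pulled arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

With exploration the learner keeps sampling apparently inferior arms just often enough to correct mistaken estimates; pure exploitation can lock onto a suboptimal arm forever.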

This course will introduce the main models (multi-armed bandits and Markov decision processes) and key ideas for algorithm design (e.g. model-based vs. model-free RL, value-based vs. policy-based algorithms, on-policy vs. off-policy learning, function approximation).
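As an example of a model-free, off-policy, value-based method, here is a tabular Q-learning sketch on a toy chain environment. The 5-state chain and all hyperparameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D chain of 5 states; reaching the rightmost state yields reward 1.
n_states, n_actions = 5, 2   # actions: 0 = move left, 1 = move right
gamma, alpha, eps = 0.95, 0.1, 0.2
Q = np.zeros((n_states, n_actions))

for episode in range(2000):
    s = 0
    for _ in range(50):
        # epsilon-greedy behavior policy
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Off-policy TD update toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == n_states - 1:
            break

greedy = Q.argmax(axis=1)  # learned greedy policy: move right toward the goal
```

The max over next-state actions in the update is what makes Q-learning off-policy: it learns the greedy policy's values while following an exploratory one (contrast with SARSA, which bootstraps on the action actually taken).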

Assessment:

Homework assignments and project

Bibliography, recommended reading:

Books on MDPs:

  • Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. 2014.
  • Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol I, 4th Edition, Athena Scientific. 2017.
  • Dimitri P. Bertsekas. Dynamic Programming and Optimal Control: Approximate Dynamic Programming, Vol II, 4th Edition, Athena Scientific. 2012.

Books on RL:

Bandit algorithms:

  • Tor Lattimore and Csaba Szepesvari. Bandit algorithms. Cambridge University Press. 2020.

Implementation:
