
Direct Policy Optimization

1 July 2025

12:45 PM
Manufacture des Tabacs
Room MH003

Jairo Gudiño, IRIT

Abstract: In this presentation, we introduce Direct Policy Optimization (DPO), an algorithm designed to align policies with human preferences through a purely supervised learning framework. Starting from pairwise comparisons between model outputs—where one response is preferred over another—DPO defines a loss that encourages the policy to assign higher probability to preferred actions. Unlike traditional reinforcement learning approaches that rely on interaction with an environment or the estimation of value functions, DPO directly optimizes the policy by minimizing a contrastive objective based on the softmax of log-likelihood ratios. We outline the theoretical foundations of the method, including its connections to reward shaping and KL control, and describe how it can be implemented efficiently. Finally, we present empirical results from our ongoing research on the robustness of LLMs to prompt-injection attacks in augmented democracy systems.
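As a concrete illustration of the contrastive objective mentioned in the abstract, the following is a minimal Python/PyTorch sketch of a DPO-style pairwise loss over log-likelihood ratios. The frozen reference policy, the temperature beta, and all function and variable names are assumptions made for illustration; they are not specified in the abstract and do not represent the speaker's exact implementation.

# Minimal sketch of a pairwise contrastive loss over log-likelihood ratios.
# The reference policy and the temperature beta are assumed ingredients of
# DPO-style objectives; they are not spelled out in the abstract.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Loss for (preferred, rejected) pairs.

    Each argument is a tensor of per-example sequence log-probabilities.
    """
    # Log-likelihood ratios of the policy against the frozen reference policy.
    ratio_chosen = logp_chosen - ref_logp_chosen
    ratio_rejected = logp_rejected - ref_logp_rejected
    # A softmax over the two responses reduces to a sigmoid of the scaled
    # difference: the loss pushes the preferred response toward a higher
    # relative likelihood than the rejected one.
    return -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()

# Toy log-probabilities for a batch of two comparison pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-11.0, -10.2]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-11.2, -10.0]))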
Updated 24 June 2025