Reward Hacking in Reinforcement Learning
Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high reward, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and accurately specifying a reward function is fundamentally challenging.

With language models generalizing to a broad spectrum of tasks and RLHF becoming the de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a user's preference, are pretty concerning and are likely among the major blockers for real-world deployment of more autonomous use cases of AI models.

Most of the past work on this topic has been quite theoretical, focused on defining reward hacking or demonstrating its existence. However, research into practical mitigations, especially in the context of RLHF and LLMs, remains limited. I especially want to call for more research efforts directed toward understanding and developing mitigations for reward hacking in the future. I hope to cover the mitigation part in a dedicated post soon.

Background

Reward Function in RL

The reward function defines the task, and reward shaping significantly impacts learning efficiency and accuracy in reinforcement learning. Designing a reward function for an RL task often feels like a "dark art". Many factors contribute to this complexity: How do you decompose a big goal into small goals? Is the reward sparse or dense? How do you measure success? Various choices may lead to good or problematic learning dynamics, including unlearnable tasks or hackable reward functions.

There is a long history of research on how to do reward shaping in RL. For example, in a 1999 paper, Ng et al. studied how to modify the reward function in Markov Decision Processes (MDPs) such that the optimal policy remains unchanged, and showed that potential-based shaping, an additive transformation of the reward, has this property. Given an MDP $M = (S, A, T, \gamma, R)$, we want to create a transformed MDP $M' = (S, A, T, \gamma, R')$ where $R' = R + F$ and $F: S \times A \times S \mapsto \mathbb{R}$, such that we can guide the learning algorithm to be more efficient. Given a real-valued function $\Phi: S \mapsto \mathbb{R}$, $F$ is a potential-based shaping function if for all $s \in S - \{s_0\}, a \in A, s' \in S$:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s)$$

This guarantees that the sum of discounted $F$, $F(s_1, a_1, s_2) + \gamma F(s_2, a_2, s_3) + \dots$, telescopes to a constant that depends only on the boundary states of the trajectory and not on the actions taken, so adding $F$ cannot change which policy is optimal. If $F$ is such a potential-based shaping function, it is both sufficient and necessary to ensure $M$ and $M'$ share the same optimal policies.

When $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$, if we further assume that $\Phi(s_0) = 0$, where $s_0$ is the absorbing state, and $\gamma = 1$, then for all $s \in S, a \in A$:

$$\begin{aligned}
Q^*_{M'}(s, a) &= Q^*_M(s, a) - \Phi(s) \\
V^*_{M'}(s) &= V^*_M(s) - \Phi(s)
\end{aligned}$$

This form of reward shaping allows us to incorporate heuristics into the reward function to speed up learning without changing the optimal policy.
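As a quick numerical sanity check of the telescoping claim above, here is a minimal sketch (mine, not from Ng et al.; the potential values and trajectory are arbitrary choices for illustration) that sums the discounted shaping terms along a trajectory and compares against the closed form $\gamma^{T-1} \Phi(s_T) - \Phi(s_1)$, which is independent of the actions taken:

```python
import numpy as np

# Hypothetical check: the discounted sum of a potential-based shaping term
# F(s, a, s') = gamma * Phi(s') - Phi(s) telescopes to
# gamma^(T-1) * Phi(s_T) - Phi(s_1), regardless of which actions were taken.
rng = np.random.default_rng(0)
gamma = 0.9
phi = rng.normal(size=10)            # arbitrary potential Phi(s) over 10 states
traj = rng.integers(0, 10, size=8)   # arbitrary state trajectory s_1, ..., s_T

shaping_sum = sum(
    gamma**t * (gamma * phi[s_next] - phi[s])
    for t, (s, s_next) in enumerate(zip(traj[:-1], traj[1:]))
)
closed_form = gamma ** (len(traj) - 1) * phi[traj[-1]] - phi[traj[0]]
assert np.isclose(shaping_sum, closed_form)
```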
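And a slightly bigger sketch, again hypothetical (a random tabular MDP solved with standard value iteration, not anything from the paper), that checks the policy-invariance result: the shaped MDP's optimal Q-values equal the original ones shifted by $\Phi(s)$, so the greedy policy is identical even though the values differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random tabular MDP: T[s, a, s'] are transition probabilities, R[s, a, s'] rewards.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_states, n_actions, n_states))

# Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s).
# Note that F does not depend on the action a (shape (S, 1, S) broadcasts over a).
phi = rng.normal(size=n_states)
F = gamma * phi[None, None, :] - phi[:, None, None]

def q_star(reward, tol=1e-10):
    """Optimal Q-values by value iteration on the Bellman optimality equation."""
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)
        Q_new = (T * (reward + gamma * V[None, None, :])).sum(axis=-1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

Q_M, Q_M_shaped = q_star(R), q_star(R + F)

# Q*_{M'}(s, a) = Q*_M(s, a) - Phi(s): the per-state argmax is unchanged,
# so the shaped MDP has the same optimal (greedy) policy.
assert np.allclose(Q_M_shaped, Q_M - phi[:, None])
assert (Q_M_shaped.argmax(axis=1) == Q_M.argmax(axis=1)).all()
```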

