"Reward hacking" has to be a similar problem space as "sycophancy", no?

cubefox · 2025-05-25T11:55:53 1748174153

Sycophancy is one form of RLHF induced reward hacking, but reasoning training (RLVR) can also induce other forms of reward hacking. OpenAIs models are particularly affected. See https://www.lesswrong.com/posts/rKC4xJFkxm6cNq4i9/reward-hac...

klysm · 2025-05-25T20:06:43 1748203603

Reward hacking is literally just overfitting with a different name no?

n2d4 · 2025-05-25T22:01:34 1748210494

They're different concepts with similar symptoms. Overfitting is when a model doesn't generalize well during training. Reward hacking happens after training, and it's when the model does something that's technically correct but probably not what a human would've done or wanted; like hardcoding fixes for test cases.

cyanydeez · 2025-05-25T12:31:28 1748176288

keep in mind these models are being taught to talk to each other, so, probably a trick theyre using on each other