Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

"Reward hacking" has to be a similar problem space as "sycophancy", no?


Sycophancy is one form of RLHF induced reward hacking, but reasoning training (RLVR) can also induce other forms of reward hacking. OpenAIs models are particularly affected. See https://www.lesswrong.com/posts/rKC4xJFkxm6cNq4i9/reward-hac...


Reward hacking is literally just overfitting with a different name no?


They're different concepts with similar symptoms. Overfitting is when a model doesn't generalize well during training. Reward hacking happens after training, and it's when the model does something that's technically correct but probably not what a human would've done or wanted; like hardcoding fixes for test cases.


keep in mind these models are being taught to talk to each other, so, probably a trick theyre using on each other




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: