https://arxiv.org/abs/2306.14111

Is RLHF More Difficult than Standard RL? (Yuanhao Wang, Qinghua Liu, Chi Jin)

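RLHF learns from a preference signal rather than a reward. Since preferences carry less information than rewards, doesn't that make learning harder? That is the question this paper asks. Its conclusion is that the problem can be solved with existing RL methods, and that RLHF does not appear to be harder than standard RL. I hadn't thought about this question before, but it's an interesting one.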
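To make "preferences carry less information than rewards" concrete, here is a minimal sketch. It assumes a Bradley-Terry preference model, which is a common modeling choice and not necessarily the setup used in the paper. The function names `preference_prob` and `sample_preference` are made up for illustration. The labeler only reports which of two options it prefers, so adding a constant to both rewards leaves the preference distribution unchanged.

```python
import math
import random


def preference_prob(reward_a: float, reward_b: float) -> float:
    """Probability that A is preferred over B under an assumed Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sample_preference(reward_a: float, reward_b: float, rng: random.Random) -> int:
    """Return 1 if A is preferred, else 0. This single bit is all the learner sees."""
    return 1 if rng.random() < preference_prob(reward_a, reward_b) else 0


# The hidden rewards are never shown to the learner; only binary comparisons are.
rng = random.Random(0)
labels = [sample_preference(2.0, 1.0, rng) for _ in range(10_000)]
empirical = sum(labels) / len(labels)

# Shifting both rewards by the same constant gives the same preference probability,
# so the absolute reward scale cannot be recovered from comparisons alone.
shifted = preference_prob(102.0, 101.0)
original = preference_prob(2.0, 1.0)
```

Here `empirical` estimates `original` from sampled comparisons, while `shifted` and `original` agree up to floating-point error, which is one sense in which a preference signal is coarser than a reward signal.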

#alignment #rl

230625 Is RLHF More Difficult than Standard RL.md
