Reward Difference Optimization For Sample Reweighting In Offline RLHF
Document Type
Conference Proceeding
Publication Date
1-1-2024
Abstract
With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning from Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of "how much" one response is preferred over another. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, RDO for short. Specifically, we introduce reward difference coefficients to reweight sample pairs in offline RLHF. We then develop a difference model that captures rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, highlighting its potential for aligning LLMs with human intent and values.
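To illustrate the idea of reweighting sample pairs described above, the PyTorch sketch below applies a per-pair reward difference coefficient to an RRHF-style hinge ranking loss. This is a minimal sketch, not the paper's exact objective: the function name, the hinge form of the ranking loss, and the toy numbers are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def reward_diff_weighted_ranking_loss(logp_chosen, logp_rejected, reward_diff):
    """Hypothetical sketch: a pairwise ranking loss reweighted by a
    reward-difference coefficient for each (chosen, rejected) pair.

    logp_chosen / logp_rejected: policy log-probabilities of the preferred
    and dispreferred responses, shape (batch,).
    reward_diff: non-negative coefficients reflecting how much the chosen
    response is preferred over the rejected one, shape (batch,).
    """
    # Plain pairwise ranking term (hinge on log-probabilities).
    per_pair_loss = F.relu(logp_rejected - logp_chosen)
    # Reweight each pair: pairs with a larger predicted reward gap
    # contribute more to the training objective.
    return (reward_diff * per_pair_loss).mean()

# Toy usage with made-up numbers; in the paper's setting the coefficients
# would come from a learned difference model over response pairs.
logp_c = torch.tensor([-12.3, -8.1, -15.0])
logp_r = torch.tensor([-11.0, -9.5, -14.2])
w = torch.tensor([1.8, 0.3, 1.1])
print(reward_diff_weighted_ranking_loss(logp_c, logp_r, w))
```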
Identifier
85217617154 (Scopus)
ISBN
9798891761681
Publication Title
EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
External Full Text Location
https://doi.org/10.18653/v1/2024.findings-emnlp.115
First Page
2109
Last Page
2123
Fund Ref
Graduate Research and Innovation Projects of Jiangsu Province
Recommended Citation
Wang, Shiqi; Zhang, Zhengze; Zhao, Rui; Tan, Fei; and Nguyen, Cam Tu, "Reward Difference Optimization For Sample Reweighting In Offline RLHF" (2024). Faculty Publications. 725.
https://digitalcommons.njit.edu/fac_pubs/725