Reward Difference Optimization for Sample Reweighting in Offline RLHF

Document Type

Conference Proceeding

Publication Date

1-1-2024

Abstract

With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences becomes increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) has proven effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF methods capture only the “ordinal relationship” between responses, overlooking the crucial aspect of “how much” one response is preferred over another. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then develop a difference model that captures rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
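For intuition, a minimal sketch of the reweighting idea described in the abstract is given below. This is an illustrative assumption rather than the paper's implementation: the function name `reweighted_ranking_loss`, the log-sigmoid base loss, and the way `reward_diff` is produced are placeholders standing in for the actual RDO objective and difference model.

```python
# Minimal sketch (an assumption, not the paper's released code) of reweighting
# a pairwise offline ranking loss with reward-difference coefficients.
import torch
import torch.nn.functional as F

def reweighted_ranking_loss(logp_chosen: torch.Tensor,
                            logp_rejected: torch.Tensor,
                            reward_diff: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss where each pair is scaled by how much the
    chosen response is preferred over the rejected one."""
    # Generic log-sigmoid ranking term on the policy's sequence
    # log-probabilities; the actual offline objective may differ.
    per_pair = -F.logsigmoid(logp_chosen - logp_rejected)
    # Reweigh each pair by its (detached) reward-difference coefficient,
    # which RDO would obtain from a separate difference model.
    return (reward_diff.detach() * per_pair).mean()

# Toy usage: random numbers stand in for policy log-probabilities and for the
# coefficients a difference model would predict.
logp_chosen, logp_rejected = torch.randn(4), torch.randn(4)
reward_diff = torch.rand(4)  # larger value => larger preference gap
print(reweighted_ranking_loss(logp_chosen, logp_rejected, reward_diff).item())
```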

Identifier

85217617154 (Scopus)

ISBN

9798891761681

Publication Title

EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

External Full Text Location

https://doi.org/10.18653/v1/2024.findings-emnlp.115

First Page

2109

Last Page

2123

Fund Ref

Graduate Research and Innovation Projects of Jiangsu Province
