Reward Difference Optimization for Sample Reweighting in Offline RLHF

Document Type

Conference Proceeding

Publication Date

1-1-2024

Abstract

With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences becomes increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) has proven effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF methods capture only the “ordinal relationship” between responses, overlooking the crucial aspect of “how much” one response is preferred over another. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then develop a difference model that captures rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
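For intuition, a minimal sketch of the reweighting idea described in the abstract is given below. This is an illustrative assumption rather than the paper's implementation: the function name `reweighted_ranking_loss`, the log-sigmoid base loss, and the way `reward_diff` is produced are placeholders standing in for the actual RDO objective and difference model.

```python
# Minimal sketch (an assumption, not the paper's released code) of reweighting
# a pairwise offline ranking loss with reward-difference coefficients.
import torch
import torch.nn.functional as F

def reweighted_ranking_loss(logp_chosen: torch.Tensor,
                            logp_rejected: torch.Tensor,
                            reward_diff: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss where each pair is scaled by how much the
    chosen response is preferred over the rejected one."""
    # Generic log-sigmoid ranking term on the policy's sequence
    # log-probabilities; the actual offline objective may differ.
    per_pair = -F.logsigmoid(logp_chosen - logp_rejected)
    # Reweigh each pair by its (detached) reward-difference coefficient,
    # which RDO would obtain from a separate difference model.
    return (reward_diff.detach() * per_pair).mean()

# Toy usage: random numbers stand in for policy log-probabilities and for the
# coefficients a difference model would predict.
logp_chosen, logp_rejected = torch.randn(4), torch.randn(4)
reward_diff = torch.rand(4)  # larger value => larger preference gap
print(reweighted_ranking_loss(logp_chosen, logp_rejected, reward_diff).item())
```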

Identifier

85217617154 (Scopus)

ISBN

9798891761681

Publication Title

EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

External Full Text Location

https://doi.org/10.18653/v1/2024.findings-emnlp.115

First Page

2109

Last Page

2123

Fund Ref

Graduate Research and Innovation Projects of Jiangsu Province
