[Quick Pitch] On Extending Direct Preference Optimization to Accommodate Ties @NeurIPS 2025
Direct Preference Optimization (DPO) trains language models on pairs of preferred and dispreferred responses, \(y_w \succ y_l\). But not all pairs have a clear winner. For pairs without a clear preference, i.e., ties, a common approach is simply to discard them (as in Llama3 and Qwen2).
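As a refresher, the standard DPO objective maximizes the log-sigmoid of a scaled log-ratio margin between the preferred and dispreferred response. A minimal per-pair sketch, assuming sequence log-probabilities under the policy and reference models are already computed (the function and argument names here are illustrative, not from the paper):

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (y_w, y_l) pair, given sequence log-probs.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log sigmoid(margin) = log(1 + exp(-margin))
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written via log1p for numerical accuracy near 0
    return math.log1p(math.exp(-margin))

# When the policy matches the reference on both responses, the margin is 0
# and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

A tied pair gives no \(y_w\)/\(y_l\) labeling to plug into this loss, which is why such pairs are typically dropped.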