Superalignment With Dynamic Human Values

Published in ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign), 2025

Abstract

Two core challenges of alignment are (1) scalable oversight and (2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address the first, they do not simultaneously account for the second. We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate measuring this generalization and propose ways to improve it in the future.
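
To make the two ideas in the abstract concrete, here is a minimal illustrative sketch in Python. It is not the paper's algorithm: in the proposed framework a trained reasoning model produces the decomposition, whereas here a hand-written task tree stands in for it. The names `Task`, `complexity`, `human_budget`, `decompose`, and `generalization_gap` are all hypothetical, and treating the gap between subtask-level and complete-solution alignment rates as the quantity to measure is one possible reading of the hypothesis, assumed here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    # Hypothetical stand-in for how hard the task is for a human to supervise.
    complexity: int
    subtasks: List["Task"] = field(default_factory=list)


def decompose(task: Task, human_budget: int) -> List[Task]:
    """Return the subtasks that fit within human_budget, splitting recursively."""
    if task.complexity <= human_budget:
        return [task]
    if not task.subtasks:
        raise ValueError(f"{task.name!r} exceeds the budget and cannot be split")
    leaves: List[Task] = []
    for sub in task.subtasks:
        leaves.extend(decompose(sub, human_budget))
    return leaves


def generalization_gap(subtask_aligned: List[bool],
                       complete_aligned: List[bool]) -> float:
    """Subtask alignment rate minus complete-solution alignment rate.

    Under the assumption made here, a gap near zero would be consistent with
    part-to-complete generalization, and a large positive gap would not.
    """
    part_rate = sum(subtask_aligned) / len(subtask_aligned)
    complete_rate = sum(complete_aligned) / len(complete_aligned)
    return part_rate - complete_rate


# Toy task tree (made up for illustration, not data from the paper).
report = Task("report", 8, [
    Task("outline", 2),
    Task("draft", 5, [Task("intro", 2), Task("body", 3)]),
])
leaf_names = [t.name for t in decompose(report, human_budget=3)]
# leaf_names == ['outline', 'intro', 'body']
```

In this toy setup, human judgments would be collected only on the leaf subtasks, while the alignment of the complete solution would be assessed separately in order to compute the gap.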

BibTeX

@article{mai2025superalignment,
  title={Superalignment with Dynamic Human Values},
  author={Mai, Florian and Kacz{\'e}r, David and Corr{\^e}a, Nicholas Kluge and Flek, Lucie},
  journal={arXiv preprint arXiv:2503.13621},
  year={2025}
}