This home page is generated by GPT-5.5-xhigh.

Z-Reward: Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
Xin Jin1,2,*    Huanqia Cai1,*,†    Zhen Li1    Zechao Zhan1    Dengyang Jiang1    Aiming Hao1    Yuming Jiang1    Chunle Guo2,#    Peng Gao1,#    Ming-Ming Cheng2    Steven C.H. Hoi1
1Z-Image Team, Alibaba Group    2VCIP, CS, Nankai University
*Equal contribution    †Project lead    #Corresponding authors

Tech Report, 2026

Due to copyright reasons, we cannot provide the source code at this time.
Z-Reward teaser

Reasoning-Internalized Reward Modeling

Scalar reward models compress subjective visual preference into a single number. Z-Reward instead learns rubric-aligned score distributions and transfers reasoning-heavy judgments from a large teacher VLM into a compact student VLM for efficient deployment.
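To make the contrast with scalar rewards concrete, here is a minimal sketch of what a score distribution carries that a single number does not. The 1-to-5 rubric scale, the two example distributions, and the function names are assumptions chosen for illustration; they are not taken from the report.

```python
import math

# Hypothetical rubric levels; the actual scale used by Z-Reward is not stated here.
LEVELS = [1, 2, 3, 4, 5]

def softmax(logits):
    """Turn raw logits over rubric levels into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(probs, levels=LEVELS):
    """Mean of the distribution: the scalar a conventional reward model would report."""
    return sum(p * s for p, s in zip(probs, levels))

def score_std(probs, levels=LEVELS):
    """Standard deviation of the distribution: information a scalar reward discards."""
    mean = expected_score(probs, levels)
    var = sum(p * (s - mean) ** 2 for p, s in zip(probs, levels))
    return math.sqrt(var)

# Two illustrative distributions with the same mean but very different shapes.
peaked = [0.0, 0.0, 1.0, 0.0, 0.0]  # every judgment agrees on a score of 3
split = [0.5, 0.0, 0.0, 0.0, 0.5]   # judgments split between 1 and 5

print(expected_score(peaked), score_std(peaked))  # 3.0 0.0
print(expected_score(split), score_std(split))    # 3.0 2.0
```

Both distributions collapse to the same scalar score of 3, yet only the first reflects a confident judgment. A distributional reward keeps that difference available to downstream training.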

Reasoning example for distributional reward modeling

Pointwise Annotation with Score Adjustment

The annotation process first obtains rubric scores for generated candidates, then performs quality checks and score adjustment so the supervision better reflects relative image quality.

Pointwise annotation and score adjustment pipeline

Stronger Teachers, Compact Students

The GDSO teacher improves preference accuracy through distributional supervision, and RISD transfers the teacher's reasoning-conditioned score distribution to a smaller model without explicit reasoning at inference time.
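One way to picture transferring a teacher's score distribution to a student is a distribution-matching loss. The sketch below uses the KL divergence from teacher to student, which is a common choice in knowledge distillation; this page does not state the actual RISD objective, so the loss, the 5-level distributions, and the function name are all assumptions.

```python
import math

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over discrete score levels.

    Terms where the teacher assigns zero probability contribute nothing.
    Student probabilities are clamped to eps to avoid log(0).
    """
    if len(teacher) != len(student):
        raise ValueError("distributions must have the same number of levels")
    total = 0.0
    for t, s in zip(teacher, student):
        if t > 0.0:
            total += t * math.log(t / max(s, eps))
    return total

# Hypothetical distributions over five rubric levels.
teacher_probs = [0.05, 0.10, 0.20, 0.40, 0.25]  # produced after explicit reasoning
student_probs = [0.10, 0.20, 0.30, 0.30, 0.10]  # produced directly, without reasoning

loss = kl_divergence(teacher_probs, student_probs)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero only when the student reproduces the teacher's distribution exactly, so minimizing it pushes the student toward the teacher's reasoning-conditioned judgment without the student generating any reasoning itself.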

Human preference accuracy across methods and model sizes

Method and model-size comparison.

Human preference accuracy training curve

Training dynamics on human preference accuracy.
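Human preference accuracy is typically measured on image pairs: the reward model is counted as correct when it scores the human-preferred image higher. The sketch below shows that computation under stated assumptions; the tuple layout, the tie handling (ties count as incorrect), and the example scores are hypothetical, and the report's exact evaluation protocol may differ.

```python
def preference_accuracy(pairs):
    """Fraction of pairs where the model ranks the human-preferred image higher.

    Each item is (score_a, score_b, human_choice) with human_choice in {"a", "b"}.
    Ties in model scores are counted as incorrect (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = 0
    for score_a, score_b, human_choice in pairs:
        if score_a > score_b:
            predicted = "a"
        elif score_b > score_a:
            predicted = "b"
        else:
            predicted = None  # tie: no prediction, counted as wrong
        if predicted == human_choice:
            correct += 1
    return correct / len(pairs)

# Hypothetical model scores for four image pairs.
example_pairs = [
    (4.2, 3.1, "a"),  # model agrees with the human
    (2.0, 3.5, "b"),  # model agrees with the human
    (3.0, 3.9, "a"),  # model disagrees
    (1.5, 1.5, "a"),  # tie, counted as wrong
]

print(preference_accuracy(example_pairs))  # 0.5
```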

Reward-Guided Text-to-Image Optimization

Z-Reward can be used as a differentiable reward signal during text-to-image optimization. Reward scores for alignment, aesthetics, realism, and physical plausibility all improve steadily over the course of reinforcement learning (RL).

Figures: reward curves for text-image alignment, aesthetics, realism, and physical plausibility.

Visual Examples

Reward-guided optimization improves prompt following, text rendering, and visual fidelity while preserving strong image composition.

Figures: before-and-after comparisons of reward-guided optimization for text rendering, a butterfly example, and a portrait example.

BibTeX

@article{jin2026beyond,
  title={Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions},
  author={Jin, Xin and Cai, Huanqia and Li, Zhen and Zhan, Zechao and Jiang, Dengyang and Hao, Aiming and Jiang, Yuming and Guo, Chunle and Gao, Peng and Cheng, Ming-Ming and Hoi, Steven C.H.},
  journal={arXiv preprint arXiv:2606.09076},
  year={2026}
}

Contact

Feel free to contact us at srameojin@gmail.com!