End-to-end RL for multi-turn tool-integrated reasoning · corresponding author · ICLR 2026
Plug-and-play RL that stabilizes multi-turn TIR by filtering void turns
(no code block / no final answer) during policy updates, mitigating
gradient explosions and distributional drift from tool feedback.
From Qwen2.5-7B base (no SFT): AIME24 22.1 → 50.5, with emergent
self-correction and cross-validation. Guided research direction and
experimental design.
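A minimal sketch of the void-turn filter, under stated assumptions: a turn counts as void when it contains neither a complete fenced code block nor a final answer, and any trajectory with a void turn is dropped from the policy-update batch. The `\boxed{...}` answer marker, the fence detection, and all function names are hypothetical choices for illustration, not the released implementation.

```python
import re
from typing import List

FENCE = "`" * 3  # triple backtick, built indirectly to keep this block readable

# Assumed markers (hypothetical): a complete fenced code block, or a
# \boxed{...} final answer. A real pipeline may use different conventions.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


def is_void_turn(turn: str) -> bool:
    """Return True if the turn has neither a code block nor a final answer."""
    return CODE_BLOCK.search(turn) is None and FINAL_ANSWER.search(turn) is None


def filter_void_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Keep only trajectories in which no model turn is void."""
    return [t for t in trajectories if not any(is_void_turn(turn) for turn in t)]


batch = [
    # Turn 1 calls a tool, turn 2 gives a final answer: kept.
    [f"Let me compute.\n{FENCE}python\nprint(2 + 2)\n{FENCE}",
     "The answer is \\boxed{4}."],
    # Turn 1 has no code and no answer, so it is void: trajectory dropped.
    ["Let me think about this some more...", "The answer is \\boxed{5}."],
]
kept = filter_void_trajectories(batch)
```

Dropping the whole trajectory rather than masking a single turn is one possible reading of "filtering void turns"; the filter only changes which samples reach the update, so it can sit in front of an existing RL trainer.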
Zero RL for open base models · co-first author · COLM 2025
Systematic study of zero RL (RL on base models without SFT) across 10
open bases (Llama3, Mistral, Qwen2.5, …). Key recipes: format rewards,
query difficulty control. First observation of verification / “aha moment”
behaviors in small non-Qwen models. Open-source toolkit for zero-RL research.
Co-led direction and experimental design.
Automated data mixture optimization for pre-training · co-first author · ICLR 2025 Spotlight
Formulates mixture selection as regression: train small proxy models on
random mixtures, fit mixture → metric, extrapolate to the large run.
Uses ~10% of the compute of prior methods; outperforms human expert selection
and DoReMi; transfers from 1M-parameter proxies to billion-scale models. Adopted in production
pre-training (e.g. Sailor). Proposed the idea and led the project.
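The regression formulation above can be sketched as follows. This is an illustrative toy, not the project's code: `proxy_loss` is a hypothetical stand-in for training a small proxy model and measuring its validation loss, `TRUE_COEFFS` is invented ground truth, and a plain linear least-squares fit is assumed as the regressor (the entry only says "regression").

```python
import random
from typing import List

# Hypothetical ground truth: assumed per-domain contribution to proxy loss.
TRUE_COEFFS = [3.2, 2.5, 2.9]


def sample_mixture(k: int, rng: random.Random) -> List[float]:
    """Draw random mixture weights over k domains (non-negative, sum to 1)."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def proxy_loss(mixture: List[float], rng: random.Random) -> float:
    """Stand-in for a proxy training run: linear loss plus small noise."""
    clean = sum(c * w for c, w in zip(TRUE_COEFFS, mixture))
    return clean + rng.gauss(0.0, 0.01)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(xs: List[List[float]], ys: List[float]) -> List[float]:
    """Least-squares fit of y ~ sum_i beta_i * x_i via the normal equations.

    No intercept term: mixture weights sum to 1, so an intercept would be
    collinear with the features.
    """
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta: List[float], mixture: List[float]) -> float:
    """Predicted loss for a mixture under the fitted linear model."""
    return sum(b * w for b, w in zip(beta, mixture))


rng = random.Random(0)

# Step 1: "train" proxies on random mixtures and record their losses.
train_mixtures = [sample_mixture(3, rng) for _ in range(64)]
losses = [proxy_loss(m, rng) for m in train_mixtures]

# Step 2: fit mixture -> metric.
beta = fit_linear(train_mixtures, losses)

# Step 3: score many candidate mixtures cheaply; keep the best prediction.
candidates = [sample_mixture(3, rng) for _ in range(1000)]
best = min(candidates, key=lambda m: predict(beta, m))
```

In this sketch, `best` is the mixture that would be carried over to the large run; only the 64 proxy evaluations are expensive, while scoring the 1000 candidates costs almost nothing.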