Large language models (LLMs) often rely on explicit chain-of-thought (CoT) traces to solve multi-step reasoning problems, but these traces increase inference cost, introduce brittle prompt dependence, and complicate training objectives. We study an alternative: \emph{latent deliberation}, implemented as a small recurrent refinement module that performs multiple internal thinking steps while keeping the external sequence length fixed. We introduce \textbf{Recursive Latent Reinforcement Pretraining (RLRP)}, a training recipe that augments a base causal LLM with a shared latent head executed for a fixed number of refinement steps on \emph{every token}. The head updates a latent state via bounded residual iterations and projects it back to the hidden space to produce step-wise logits. Training combines (i) deep supervision via a convex combination of per-step next-token cross-entropies, (ii) data-aware routing that interleaves reasoning-focused and fluency-focused batches, and (iii) soft reinforcement learning on reasoning batches that maximizes the model's probability mass on the ground-truth next token, optionally restricted to answer spans. We additionally consider an improvement penalty that encourages later refinement steps to outperform the first. The approach is simple, compatible with standard autoregressive LMs and distributed training, and adds iterative latent refinement without increasing the number of output tokens.
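The core mechanism described above (a shared latent head performing bounded residual refinement steps, trained with a convex combination of per-step cross-entropies) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the dimensions, weight matrices (\texttt{W\_lat}, \texttt{W\_out}), the specific \texttt{tanh} residual update, and the step weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 8, 3, 16  # hidden dim, refinement steps, vocab size (toy values)

# Hypothetical parameters of the shared latent head (illustrative names).
W_lat = rng.normal(scale=0.1, size=(D, D))  # latent-state update
W_out = rng.normal(scale=0.1, size=(D, V))  # projection back to vocab logits

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def latent_refine(h, n_steps=K, alpha=0.5):
    """Bounded residual refinement of a latent state:
    z_{k+1} = z_k + alpha * tanh(z_k @ W_lat), with step-wise logits z_k @ W_out.
    The tanh keeps each residual increment bounded."""
    z = h.copy()
    step_logits = []
    for _ in range(n_steps):
        z = z + alpha * np.tanh(z @ W_lat)
        step_logits.append(z @ W_out)
    return step_logits

def deep_supervision_loss(step_logits, target, weights):
    """Convex combination of per-step next-token cross-entropies."""
    assert np.isclose(sum(weights), 1.0) and all(w >= 0 for w in weights)
    losses = [-np.log(softmax(l)[target] + 1e-12) for l in step_logits]
    return float(sum(w * l for w, l in zip(weights, losses)))

h = rng.normal(size=D)               # hidden state at one token position
logits_per_step = latent_refine(h)   # one logit vector per refinement step
loss = deep_supervision_loss(logits_per_step, target=3, weights=[0.2, 0.3, 0.5])
```

Because the refinement operates purely on the latent state, the external sequence length is unchanged: each token position still emits a single output distribution, taken from the final (or weighted) refinement step.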
ICLR 2026 Workshop LLM Reasoning


