Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

Research

Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

February 4, 2026

Paris Lab

Share this Post:

current-url

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across diverse reasoning tasks, from mathematical problem solving to code generation. Policy optimization methods have emerged as effective techniques for post-training enhancement, using task-specific rewards to guide model improvement. However, all existing group-based policy optimization methods exclusively use KL divergence for regularization, leaving the fundamental question of divergence choice unexplored despite its critical impact on optimization dynamics, training stability, and generation characteristics.

Traditional policy optimization methods like Proximal Policy Optimization (PPO) [25] require training separate value networks and can suffer from training instability. Group Relative Policy Optimization (GRPO) [26] introduced a simpler alternative that eliminates critic networks by processing rewards at the group level, normalizing advantages across multiple sampled responses for each prompt. This foundational insight has spawned numerous variants: Dr. GRPO [20] addresses optimization bias; GSPO [32] reformulates the objective at the sequence level; off-policy GRPO [21] adapts to batch-based training; G2RPO-A [14] incorporates adaptive guidance mechanisms; GTPO [27] introduces gradient and entropy control; and methods like DAPO [18], TreeRPO [31], and multi-layer GRPO [12] explore different reward processing strategies.

Despite this rich landscape of algorithmic innovations, all existing GRPO variants maintain a fundamental commonality: they exclusively use KL divergence for policy regularization. The choice of divergence function critically shapes the geometry of policy updates and influences both solution quality and generation characteristics. While the community has extensively explored reward processing, training strategies, and gradient control, the divergence function itself has remained unchanged. This raises a natural question: Can alternative divergences improve solution quality, training stability, and generation efficiency beyond what KL divergence provides?

This observation motivates our work on Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based methods to support flexible Bregman divergences [7] while preserving their stability benefits. GBMPO enables principled exploration of alternative policy regularization schemes, including hand-designed divergences like L2 in probability space (ProbL2) and learned neural mirror maps.

We evaluate GBMPO on mathematical reasoning (GSM8K) and code generation (MBPP/HumanEval). Our results demonstrate that alternative Bregman divergences substantially improve accuracy across both tasks. On GSM8K, ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over Dr. GRPO’s 81.2% baseline. On MBPP, NM-GRPO-ES achieves 60.8% pass@1 (best result) while generating 36% shorter responses than Dr. GRPO, demonstrating task-specific efficiency gains in code generation. The framework also provides substantial variance reduction: ProbL2-GRPO reduces training variance from ±0.7 to ±0.4 on GSM8K compared to Dr. GRPO.

Our key contributions are:

We introduce GBMPO, a general framework that extends any group-based method (GRPO, GSPO) to flexible Bregman divergences, enabling systematic investigation of how divergence choice affects solution quality, training stability, and generation characteristics

We demonstrate that KL divergence is not optimal for policy regularization in group-based RL. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy (+5.5 points over Dr. GRPO). On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1 while reducing response length by 24-36%, with random initialization capturing most benefits.

We establish that alternative Bregman divergences provide substantial training stability improvements: ProbL2-GRPO reduces variance from ±0.7 to ±0.4 on GSM8K, while ES-optimized neural mirrors achieve ±0.2 variance on MBPP (70% reduction). This stability gain, combined with task-specific efficiency improvements on code generation, makes GBMPO variants attractive for production deployments.

We develop an evolutionary meta-learning approach for discovering neural mirror maps that provides marginal accuracy improvements (+0.3-0.7 points) but substantial variance reduction and efficiency gains. Given the computational cost (180 training runs), we find that randomly initialized neural mirrors offer a practical alternative for most applications.

Our results challenge the default use of KL divergence in group-based policy optimization, establishing divergence choice as a critical design dimension that affects accuracy, training stability, and task-specific generation characteristics.

2 Background and Related Work

2.1 Policy Optimization for Language Models

Policy gradient methods form the foundation of modern policy optimization for language models. The REINFORCE algorithm [29] optimizes the policy by maximizing expected rewards, but suffers from high variance. Proximal Policy Optimization (PPO) [25] addresses this through trust region constraints, typically implemented via importance sampling with clipped ratios. However, PPO requires training a value network to estimate advantages, adding complexity and potential instability.

Direct Preference Optimization (DPO) [23] eliminates the need for explicit reward models by deriving a closed-form solution that optimizes preferences directly. While elegant, DPO is limited to pairwise preference data and may not fully capture the richness of reward signals available in outcome-based tasks.

2.2 Group-Based Policy Optimization

Group Relative Policy Optimization (GRPO) [26] introduces a key simplification: instead of normalizing rewards globally across the entire batch, it normalizes within groups of responses generated from the same prompt. This group-based normalization provides several benefits: (1) sample efficiency through multiple responses per prompt without value networks, (2) training stability via group-wise variance reduction, and (3) computational efficiency by eliminating critic networks.

GRPO’s success has inspired numerous algorithmic variants that modify different components while preserving the group-based structure. Dr. GRPO [20] identifies and corrects optimization bias in GRPO that artificially inflates response length, providing an unbiased optimization method that improves token efficiency. GSPO [32] addresses GRPO’s token-level importance sampling by reformulating the objective at the sequence level, improving stability for large-scale training. Off-policy GRPO [21] enables batch-based training with reduced communication costs. G2RPO-A [14] injects ground-truth reasoning steps with adaptive guidance strength to compensate for model weaknesses. GTPO [27] introduces trajectory-based optimization with gradient filtering and entropy control to prevent policy collapse. Additional variants like DAPO [18] (dense advantages), TreeRPO [31] (tree-structured sampling), and multi-layer GRPO [12] explore different reward processing and training strategies.

Despite this rich ecosystem of variants, all maintain KL divergence for policy regularization. While the community has extensively explored reward processing, training strategies, and gradient control mechanisms, the divergence function, which fundamentally determines the geometry of policy updates, has remained unchanged. Our work addresses this gap by extending group-based methods to flexible Bregman divergences.

2.3 Mirror Descent and Bregman Divergences

Mirror descent [22, 6] is a generalization of gradient descent that operates in a transformed space defined by a convex potential function ϕ. The Bregman divergence associated with ϕ is:

\[ D_{\phi}(p \parallel q) \;=\; \phi(p) - \phi(q) - \langle \nabla \phi(q),\, p - q \rangle. \tag{1} \]

Different choices of $\phi$ yield different divergences. For example:

$\phi(p)=\sum_i p_i \log p_i$ yields KL divergence.
$\phi(p)=\lVert p \rVert_2^2$ yields squared Euclidean distance.
More complex $\phi$ can encode task-specific geometries.

Mirror descent has been successfully applied to reinforcement learning [13, 3], but primarily in the tabular or low-dimensional continuous control settings. Recent work has explored learning mirror maps through evolutionary strategies [2], demonstrating that the choice of mirror map significantly affects convergence speed and final performance. However, extending these ideas to high-dimensional language model policy spaces for reasoning tasks remains largely unexplored.

2.4 Meta-Learning for Optimization

Meta-learning aims to discover learning algorithms that generalize across tasks. Evolutionary strategies (ES) [24, 15] provide a gradient-free approach to meta-optimization, making them suitable for optimizing non-differentiable objectives like final task performance. Recent work has applied meta-learning to discover preference optimization objectives [1], using ES to search the space of mirror descent-based algorithms for LLM alignment.

Building on these insights, our work investigates whether meta-learning can discover improved neural mirror maps within the GRPO framework. While prior work [9, 5, 1] has explored learned optimizers and preference optimization objectives, we focus specifically on learning the divergence structure that regularizes policy updates. We examine the tradeoffs between ES-discovered mirror maps and simpler alternatives like random initialization, considering both accuracy improvements and computational costs.

2.5 Reasoning in Language Models

Mathematical reasoning and code generation are two domains where LLMs still face significant challenges. GSM8K [10] provides grade-school math problems requiring multi-step reasoning. MBPP [4] tests code generation with short Python programming challenges.

Recent work has shown that reinforcement learning can substantially improve reasoning capabilities beyond supervised fine-tuning alone [26, 11]. Our work builds on this foundation by investigating how divergence choice affects the learning of reasoning skills.

3 Method: Group-Based Mirror Policy Optimization

We now present GBMPO, our framework for group-based policy optimization with flexible Bregman divergences. We first review GRPO and its debiased variant Dr. GRPO (Section 3.1), then introduce our generalization to Bregman divergences (Section 3.2), formalize the GBMPO framework (Section 3.3), and describe practical instantiations (Section 3.4).

3.1 Group Relative Policy Optimization and Dr. GRPO

GRPO [26] optimizes the following objective for each prompt $x$:

\[ \mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \Bigg[ A(x,y)\,\log\!\frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;-\; \beta\,\mathrm{KL}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}) \Bigg], \tag{2} \]

where $\pi_{\theta}$ is the policy being optimized, $\pi_{\mathrm{ref}}$ is a reference policy (typically the base pretrained model), $A(x,y)$ is the advantage computed via group-based normalization, and $\beta$ controls the strength of KL regularization.

The key innovation of GRPO is group-based advantage estimation. For each prompt $x$, we sample $K$ responses $\{y_1,\ldots,y_K\}$ and compute:

\[ A_{\mathrm{GRPO}}(x,y_i) \;=\; \frac{r(x,y_i)-\mu_x}{\sigma_x+\epsilon}, \tag{3} \]

where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of rewards within the group $\{r(x,y_1),\ldots,r(x,y_K)\}$.

However, Liu et al. [20] identified two optimization biases in standard GRPO: (1) response-level length bias, where normalizing the loss by sequence length causes longer incorrect responses to be under-penalized, and (2) question-level difficulty bias, where dividing by standard deviation gives disproportionate weight to questions with very low or very high reward variance. These biases can cause the model to generate progressively longer incorrect responses during training.

Dr. GRPO (GRPO Done Right) addresses these biases through two modifications. First, it removes standard deviation normalization to eliminate difficulty bias. Second, it normalizes the loss by a fixed constant $L$ (typically the maximum completion length) instead of the actual sequence length, eliminating length bias. The Dr. GRPO advantage becomes:

\[ A_{\mathrm{Dr.\,GRPO}}(x,y_i) \;=\; r(x,y_i)-\mu_x, \tag{4} \]

This debiased formulation ensures consistent gradient magnitudes across responses of different lengths and questions of different difficulties, leading to better token efficiency while maintaining reasoning performance. Our GBMPO framework builds on Dr. GRPO’s advantage estimation, extending it with flexible Bregman divergences.

3.2 Bregman Mirror Policy Optimization

We generalize the regularization term by replacing KL divergence with an arbitrary Bregman divergence. Given a strictly convex potential function $\phi : \Delta^{|V|} \to \mathbb{R}$, where $\Delta^{|V|}$ is the probability simplex over vocabulary $V$, the Bregman divergence is:

\[ D_{\phi}(p \parallel q) \;=\; \phi(p) - \phi(q) - \langle \nabla \phi(q),\, p - q \rangle. \tag{5} \]

For language model policies, we compute the divergence token-by-token:

\[ D_{\phi}(\pi_{\theta}\parallel\pi_{\mathrm{ref}})(y \mid x) \;=\; \sum_{t=1}^{|y|} D_{\phi}\!\Big( \pi_{\theta}(\cdot \mid x, y_{\lt t}) \parallel \pi_{\mathrm{ref}}(\cdot \mid x, y_{\lt t}) \Big). \tag{6} \]

This token-level decomposition is crucial for computational tractability, as it allows us to compute divergences during the forward pass without requiring full vocabulary marginalization.

3.3 The GBMPO Framework

GBMPO builds on Dr. GRPO’s debiased advantage estimation, combining it with flexible Bregman divergences. The objective for each prompt $x$ becomes:

\[ \mathcal{L}_{\mathrm{GBMPO}}(\theta,\phi) \;=\; \mathbb{E}_{y \sim \pi_{\theta}} \Bigg[ A(x,y)\,\log\!\frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;-\; \alpha\,D_{\phi}\!\big(\pi_{\theta}\parallel\pi_{\mathrm{ref}}\big)(y \mid x) \Bigg], \tag{7} \]

where $\alpha$ is a coefficient controlling the strength of Bregman regularization (analogous to $\beta$ in standard GRPO), and we adopt Dr. GRPO’s debiased advantage estimation:

\[ A(x,y_i) \;=\; r(x,y_i) - \frac{1}{K}\sum_{j=1}^{K} r(x,y_j) \;=\; r(x,y_i) - \mu_x. \tag{8} \]

This formulation eliminates the optimization biases identified in standard GRPO while enabling flexible divergence structures. The choice of $\phi$ determines the geometry of policy updates, allowing us to explore divergences beyond KL that may better suit specific reasoning tasks.

3.4 Practical Instantiations

We explore two main approaches to defining $\phi$:

Hand-designed divergences. We can use closed-form convex functions:

KL (standard GRPO): $\phi(p)=\sum_i p_i \log p_i$, giving $D_{\phi}(p\parallel q)=\mathrm{KL}(p\parallel q)$.
ProbL2: $\phi(p)=\tfrac{1}{2}\lVert p\rVert_2^2$, giving $D_{\phi}(p\parallel q)=\tfrac{1}{2}\lVert p-q\rVert_2^2$.
$\alpha$-divergence: $\phi(p)=\dfrac{1}{\alpha(\alpha-1)}\sum_i\big(p_i^{\alpha}-p_i\big)$ for $\alpha \neq 0,1$.

Neural mirror maps. Following Alfano et al. [2], we use the $\omega$-potential mirror map framework, where the mirror map $h$ is defined through an inverse potential $\phi^{-1}:[0,1]\to\mathbb{R}$ satisfying specific monotonicity and smoothness conditions. The KL divergence corresponds to $\phi^{-1}(y)=\log y$ (or equivalently $\phi(x)=e^x$), while L2 divergence arises from $\phi^{-1}(y)=y$ (or $\phi(x)=x$). This framework allows us to parameterize $\phi^{-1}$ using a neural network with 126 neurons distributed across 6 activation types (cubic, quadratic, square root, cube root, logarithm, and exponential), plus linear and logarithmic terms to ensure KL and L2 are recoverable as special cases. The full architecture is described in Appendix C.

The gradient of the Bregman divergence with respect to policy parameters can be computed efficiently:

\[ \nabla_{\theta} D_{\phi}(\pi_{\theta}\parallel\pi_{\mathrm{ref}}) \;=\; \mathbb{E}_{y\sim\pi_{\theta}} \Big[ \nabla_{\theta}\log \pi_{\theta}(y\mid x)\cdot\big(\nabla\phi(\pi_{\theta})-\nabla\phi(\pi_{\mathrm{ref}})\big) \Big]. \tag{9} \]

This allows us to implement GBMPO as a simple modification to the GRPO training loop, adding only the divergence computation and its gradient.

‍

4 Evolutionary Meta-Learning for Mirror Maps

While hand-designed divergences like ProbL2 are simple and interpretable, they may not capture task-specific optimization geometries. Inspired by recent work on meta-learning objectives for preference optimization [1] and learning mirror maps through evolutionary search [2], we use evolutionary strategies (ES) to discover neural mirror maps that perform well across a distribution of related tasks.

4.1 Meta-Learning Setup

We frame mirror map discovery as a meta-learning problem. Let $\mathcal{T}$ be a distribution over tasks (e.g., different mathematical reasoning datasets or code generation benchmarks). Our goal is to find mirror map parameters $\psi^{*}$ that perform well on average:

\[ \psi^{*} \;=\; \arg\max_{\psi}\; \mathbb{E}_{\tau \sim \mathcal{T}} \Big[ \mathrm{Performance}\big(\mathrm{GBMPO}(\psi),\,\tau\big) \Big], \tag{10} \]

where $\mathrm{Performance}$ measures task-specific metrics (e.g., accuracy on GSM8K, pass@1 on MBPP).

4.2 Evolutionary Optimization

Sample $N/2$ Gaussian perturbations $\{\epsilon_j\}_{j=1}^{N/2} \sim \mathcal{N}(0, I)$ and create antithetic pairs $[\epsilon_1, -\epsilon_1, \ldots, \epsilon_{N/2}, -\epsilon_{N/2}]$ for variance reduction.
Generate $N$ perturbed mirror maps: $\psi_i = \psi + \sigma \epsilon_i$, where $\sigma$ is a scalar noise parameter.
For each $\psi_i$, train a policy using GBMPO with the corresponding mirror map on the inner training split.
Evaluate the trained policies on the validation split to obtain fitness scores $F_i$.
Compute the ES gradient estimate:

$$\nabla J \approx \frac{1}{N\sigma}\sum_{i=1}^{N} F_i\,\epsilon_i.$$

(11)
Accept/Reject Decision: If the mean fitness $F = \frac{1}{N}\sum_i F_i$ improves over the best fitness seen so far, accept the update $\psi \leftarrow \psi + \alpha \nabla J$ where $\alpha$ is the learning rate. Otherwise, reject the update and keep $\psi$ unchanged. This safeguard is critical because training costs force us to use very small population sizes $(N \ll \mathrm{dim}(\psi))$, making gradient estimates extremely noisy. With limited iterations $G$, blindly accepting every update would cause performance degradation as noise accumulates.
Elite Sample Retention: On rejection, save the top 25% of samples (ranked by fitness) as elites for reuse in the next iteration. This avoids discarding well-performing perturbations and reduces computational cost by training fewer new policies in subsequent iterations.
Decay the noise: $\sigma \leftarrow \sigma \cdot \text{decay\_rate}$ to gradually refine the search around the current best parameters.

4.3 Data Splitting for Meta-Learning

To prevent overfitting, we use hierarchical data splits: inner train (80%) for policy training, inner validation (20%) for fitness evaluation, and outer test for final evaluation. We conduct separate ES runs for GSM8K and MBPP, discovering task-specific mirror maps for each domain. The complete meta-learning algorithm is presented in Algorithm 1.

Algorithm 1 Evolutionary Meta-Learning for Mirror Maps

Input: Task dataset with inner train/validation splits, population size $N$, iterations $G$, noise $\sigma_0$, learning rate $\alpha$, decay rate $\gamma$

Initialize mirror map parameters $\psi_0$, best fitness $F_{\text{best}}=-\infty$

for $g=1$ to $G$ do

$\sigma_g = \sigma_0 \cdot \gamma^{g-1}$ $\{$Decay noise over time$\}$

Sample $N/2$ perturbations: $\{\epsilon_j\}_{j=1}^{N/2}\sim \mathcal{N}(0,I)$

Create antithetic pairs: $\{\epsilon_1,-\epsilon_1,\ldots,\epsilon_{N/2},-\epsilon_{N/2}\}$

for $i=1$ to $N$ do

$\psi_i = \psi_{g-1} + \sigma_g \epsilon_i$ $\{$Perturbed mirror map$\}$

Train policy $\pi_{\theta_i}$ with GBMPO using $\psi_i$ on inner train split

Evaluate $\pi_{\theta_i}$ on validation split: $F_i=\mathrm{Performance}$

end for

Compute ES gradient: $\nabla J = \frac{1}{N\sigma_g}\sum_{i=1}^{N} F_i \epsilon_i$

if $\mathrm{mean}(F_1,\ldots,F_N) > F_{\text{best}}$ then

$\psi_g = \psi_{g-1} + \alpha \nabla J$ $\{$Accept: update parameters$\}$

$F_{\text{best}}=\mathrm{mean}(F_1,\ldots,F_N)$

else

$\psi_g = \psi_{g-1}$ $\{$Reject: keep current parameters$\}$

Save top 25% samples as elites for next iteration

end if

end for

Return: Best mirror map $\phi_{\psi_G}$

5 Experiments

We evaluate GBMPO on two challenging reasoning domains: mathematical reasoning with GSM8K and code generation with MBPP and HumanEval. Our experiments investigate whether divergence choice significantly impacts final performance, comparing hand-designed divergences like ProbL2 against standard KL and learned neural mirror maps. We assess whether learned divergences improve zero-shot transfer from MBPP to HumanEval using Qwen3-1.7B, providing insights into the effectiveness of different Bregman divergences for distinct reasoning tasks.

5.1 Experimental Setup

Models. We use Qwen3-1.7B, starting directly from the base pretrained checkpoint for RL training.

Datasets. For mathematical reasoning, we use GSM8K [10], which contains 7473 training examples that we split into 5978 for inner training and 1495 for validation, along with 1319 test problems. We evaluate using exact numeric match accuracy between generated and gold answers. For code generation, we train on MBPP [4], which provides 374 training examples, 90 validation samples, and 500 test problems. To assess generalization across code generation tasks, we evaluate zero-shot transfer to HumanEval [8], a collection of 164 hand-written programming problems with unit tests. Both code benchmarks use pass@1 accuracy with greedy decoding.

Training details. We train all methods with 1 prompt and 8 responses per prompt. For mathematical reasoning, we use gradient accumulation over 2 steps for an effective batch size of 16 and train for 4000 steps with maximum prompt length of 256 tokens and completion length of 1024 tokens. For code generation, we use gradient accumulation of 1 step for an effective batch size of 8 and train for 1000 steps with maximum prompt length of 768 tokens and completion length of 512 tokens. All models use LoRA fine-tuning with rank 32 and alpha 64, targeting all linear layers. We use bfloat16 mixed precision and cosine learning rate decay. The Bregman regularization coefficient for neural mirror methods is set to 0.0001, while KL-based baselines use β=0.01. GSPO methods use clipping parameters ϵ=3×10−4 and ϵhigh=4×10−4. Complete hyperparameters are in Appendix A.

Baselines. We compare against three baselines to isolate the impact of Bregman divergence choice. First, we report zero-shot performance of the base pretrained model (Qwen3-1.7B) without any RL training. Our primary RL baseline is Dr. GRPO [20], a debiased variant of GRPO that addresses optimization bias causing artificial response length inflation through careful normalization. We also compare against GSPO [32], which uses sequence-level importance ratios and clipping instead of response-level aggregation, providing better training stability than token-level policy gradient methods.

GBMPO variants. We apply our framework to both Dr. GRPO and GSPO baselines, testing three divergence configurations. ProbL2-GRPO/GSPO uses hand-designed L2 divergence in probability space, providing a simple but theoretically grounded alternative to KL. NM-GRPO/GSPO employs learned neural mirror maps with 126 neurons distributed across 6 activation types (380 total parameters), allowing the divergence to adapt during training. NM-GRPO/GSPO-ES extends the neural mirror approach with evolutionary strategies meta-learning, using vanilla ES with antithetic sampling (population size 12, 15 iterations) to discover mirror map initializations that maximize validation performance. The ES algorithm uses accept/reject decisions and elite sample retention to handle the challenging regime where population size is much smaller than parameter dimensionality (N=12≪380).

Evaluation metrics. We distinguish between three types of metrics. During GBMPO training, the reward function r uses accuracy for mathematical reasoning and pass@1 (greedy decoding) for code generation tasks. For ES meta-learning, we evaluate fitness on the inner validation split using accuracy for GSM8K and pass@10 for MBPP. Using pass@10 (sampling 10 solutions with temperature 0.8) instead of pass@1 provides a more robust and stable fitness signal by averaging over multiple samples, significantly reducing variance in ES gradient estimates. This is critical given the small population size (N≪dim(ψ)), as noisy fitness evaluations would otherwise dominate the optimization. Final test evaluation for all methods uses greedy decoding: accuracy for GSM8K and pass@1 for MBPP/HumanEval. We provide comprehensive pass@k results (k=1,2,5,10) and evaluation on the harder MBPP+ and HumanEval+ benchmarks in Appendix B.

All experiments report mean and standard deviation across 3 random seeds.

5.2 Main Results

‍

Table 1: Main results on GSM8K and MBPP for Qwen3-1.7B, comparing Dr. GRPO and GSPO baselines with GBMPO variants. Numbers show accuracy (%) and average response length (tokens) with standard deviation across 3 seeds. GBMPO consistently improves both baselines.

METHOD	GSM8K Acc (%)	GSM8K Len	MBPP Acc (%)	MBPP Len
Qwen3-1.7B
Base Model	77.7±0.6	209.4±5.7	58.3±0.5	49.1±2.8
GRPO-based methods
Dr. GRPO	81.2±0.7	271.3±15.7	59.8±0.5	75.5±4.3
ProbL2-GRPO	86.7±0.4	310.5±4.6	60.2±0.3	57.8±9.1
NM-GRPO	85.1±0.5	430.6±12.4	60.1±0.6	56.9±3.5
NM-GRPO-ES	85.5±0.4	441.8±13.2	60.8±0.2	48.5±3.1
GSPO-based methods
GSPO	80.7±0.9	269.0±11.3	57.6±0.7	33.4±2.1
ProbL2-GSPO	83.6±0.8	411.3±5.7	58.3±0.4	32.4±2.3
NM-GSPO	85.3±0.6	448.1±1.4	57.9±0.5	35.6±2.5
NM-GSPO-ES	85.6±0.5	445.2±7.8	58.0±0.4	34.8±2.2

‍

Table 1 presents our main results on GSM8K mathematical reasoning and MBPP code generation for Qwen3-1.7B, showing both accuracy and average response length in tokens. We observe several key patterns across methods and tasks that provide insights into the effectiveness of different Bregman divergences for policy optimization.

Bregman divergences consistently improve baselines. Across both GRPO and GSPO families, our GBMPO variants with alternative Bregman divergences substantially outperform the standard KL-based baselines. For GRPO-based methods, ProbL2-GRPO achieves 86.7% on GSM8K (+5.5 points over Dr. GRPO’s 81.2%), while neural mirror methods reach 85.1-85.5%, representing +3.9 to +4.3 point improvements. For GSPO-based methods, ProbL2-GSPO improves by +2.9 points on GSM8K (83.6% versus 80.7%), and neural mirror variants achieve 85.3-85.6%, gains of +4.6 to +4.9 points. These consistent improvements across both hand-designed (ProbL2) and learned (neural mirror) divergences demonstrate that KL divergence, while mathematically convenient, is not universally optimal for policy optimization across reasoning tasks.

Task-specific divergence selection matters. Different Bregman divergences excel at different tasks, revealing task-specific inductive biases. For mathematical reasoning on GSM8K, ProbL2-GRPO achieves the best performance at 86.7%, suggesting that L2 divergence in probability space provides effective regularization for numerical computation tasks. For code generation on MBPP, neural mirror methods achieve 60.1-60.8% accuracy, with random initialization (NM-GRPO at 60.1%) already capturing most of the benefit. Evolutionary meta-learning provides a marginal improvement to 60.8%, but the small gain (+0.7 points) suggests that random initialization of neural mirror maps is sufficient for practical applications.

GRPO versus GSPO reveals accuracy-brevity tradeoffs. The two baseline families exhibit markedly different characteristics. GRPO-based methods achieve strong generalization, with MBPP accuracy ranging from 59.8% to 60.8%, all surpassing the base model’s 58.3%. However, they produce longer responses (48-75 tokens on MBPP). GSPO-based methods generate extremely concise solutions (32-36 tokens on MBPP) but sacrifice some accuracy, with most variants at or below the base model performance. This tradeoff between brevity and correctness stems from GSPO’s sequence-level importance sampling, which more strongly encourages conciseness compared to GRPO’s response-level approach.

Evolutionary meta-learning provides variance reduction and efficiency gains. Comparing neural mirror methods with and without evolutionary strategies reveals that ES provides marginal accuracy improvements but substantial variance reduction and efficiency benefits. On mathematical reasoning, NM-GRPO-ES achieves 85.5% compared to NM-GRPO’s 85.1% on GSM8K (+0.4 points), while NM-GSPO-ES reaches 85.6% versus NM-GSPO’s 85.3% (+0.3 points). On code generation, NM-GRPO-ES improves from 60.1% to 60.8% on MBPP (+0.7 points) while simultaneously reducing response length from 56.9 to 48.5 tokens, a 15% reduction that improves efficiency. However, the primary benefit of ES lies in variance reduction: NM-GRPO-ES achieves ±0.2 standard deviation on MBPP compared to ±0.6 for methods without ES (see variance analysis below). Given the computational cost of ES (180 full training runs for N=12 population over 15 iterations), the marginal accuracy gains may not justify the expense, making randomly initialized neural mirrors a practical alternative for most applications.

Response length patterns reveal optimization characteristics. The average response length provides insight into how different methods balance solution completeness with conciseness. On GSM8K, all RL-trained methods generate longer responses than the base model (209 tokens), with Dr. GRPO at 271 tokens and neural mirror methods producing the longest reasoning chains (430-448 tokens). This length increase correlates with accuracy improvements, suggesting that RL training encourages more detailed step-by-step reasoning. For MBPP code generation, GRPO family methods maintain adequate solution detail (48-75 tokens), whereas GSPO methods produce ultra-brief code (32-36 tokens) that may sacrifice necessary implementation details. These patterns suggest that optimal response length varies significantly by task type and optimization method.

Bregman regularization reduces training variance. Beyond improving mean performance, alternative Bregman divergences also stabilize training. ProbL2-GRPO shows notably lower variance on GSM8K (±0.4) compared to Dr. GRPO (±0.7), and on MBPP (±0.3 versus ±0.5). Neural mirror methods with evolutionary strategies achieve the lowest variance overall, with NM-GRPO-ES at ±0.2 on MBPP, representing a 70% reduction compared to the ±0.6 variance of randomly initialized neural mirrors. This variance reduction represents the primary practical benefit of evolutionary meta-learning: while accuracy gains are marginal (+0.3-0.7 points), the improved training stability and 15% efficiency gains in response length make ES-discovered mirror maps attractive for production deployments where consistency matters. However, for research applications or when computational budget is limited, randomly initialized neural mirrors provide most of the accuracy benefits at a fraction of the cost.

‍

Table 2: Zero-shot transfer to HumanEval for Qwen3-1.7B after training on MBPP. We report pass@1 accuracy and average response length in tokens (mean±std across 3 seeds). Lower token counts indicate more concise solutions.

METHOD	Pass@1 (%)	Avg Length (tokens)
Qwen3-1.7B
Base Model	53.5±1.0	72.1±3.2
GRPO-based methods
Dr. GRPO	62.1±1.4	132.2±5.3
ProbL2-GRPO	62.1±1.2	113.4±4.6
NM-GRPO	59.5±1.5	95.9±3.8
NM-GRPO-ES	60.7±0.4	92.1±3.5
GSPO-based methods
GSPO	51.6±1.2	59.8±2.6
ProbL2-GSPO	53.0±1.1	63.6±2.8
NM-GSPO	50.2±1.3	61.4±2.5
NM-GSPO-ES	51.0±1.0	60.2±2.4

‍

Table 2 shows zero-shot transfer performance on HumanEval after training on MBPP. Transfer performance provides crucial evidence about whether learned optimization strategies generalize beyond their training distribution.

GRPO methods excel at zero-shot transfer. The HumanEval results reveal a striking divergence between GRPO and GSPO families in terms of generalization capability. GRPO-based methods achieve substantial improvements over the base model, with Dr. GRPO and ProbL2-GRPO both reaching 62.1% pass@1 (+8.6 points over the base 53.5%), while NM-GRPO-ES achieves 60.7%. In sharp contrast, GSPO-based methods struggle to generalize, with all variants performing at or below the base model level (50.2-53.0%). This degradation is particularly notable for NM-GSPO, which achieves 50.2%, actually worse than the untrained base model. The pattern suggests that GSPO’s sequence-level importance sampling, while effective for generating concise solutions on the training distribution (MBPP), does not learn representations that transfer well to related but distinct tasks.

Response length correlates with transfer success. Examining response lengths on HumanEval provides insight into why GRPO methods generalize better. GRPO variants produce solutions with 92-132 tokens, maintaining sufficient detail to implement correct functionality in the new domain. GSPO methods generate much shorter solutions (60-64 tokens) that, while concise, may omit necessary implementation details for problems they have not seen during training. Notably, NM-GRPO-ES achieves strong transfer performance (60.7%) with the most efficient GRPO-family solutions (92.1 tokens), demonstrating that neural mirror maps, even with random initialization, provide efficiency benefits that transfer across related code generation tasks.

6 Conclusion

We introduced Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization methods with flexible Bregman divergences beyond the standard KL regularization. Through both hand-designed divergences (L2 in probability space) and learned neural mirror maps (126 neurons with 6 activation types), we demonstrated that divergence choice significantly impacts performance on reasoning tasks. On mathematical reasoning (GSM8K), hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On code generation (MBPP), neural mirror maps achieve 60.1-60.8% pass@1, with random initialization already capturing most of the benefit.

Our evolutionary strategies experiments reveal an important practical insight: while meta-learning provides marginal accuracy improvements (+0.3 to +0.7 points), its primary value lies in variance reduction and efficiency. NM-GRPO-ES reduces training variance from ±0.6 to ±0.2 on MBPP and generates 15% shorter responses while maintaining comparable accuracy. However, the cost of ES optimization (180 full training runs for N=12 population over 15 iterations) may not be justified given that random initialization of neural mirror maps already achieves most of the accuracy gains. For practitioners, randomly initialized neural mirrors offer a practical alternative that captures the benefits of flexible divergences without the computational overhead of meta-learning.

This work opens several directions for future research. First, investigating why certain Bregman divergences excel at specific tasks could reveal deeper connections between task structure and optimal regularization geometry. Second, exploring whether mirror maps learned on one task can transfer to related domains would test the generality of discovered divergence structures. Third, scaling to larger models and more complex reasoning tasks could reveal whether the benefits of flexible divergences compound with model capacity. Finally, combining learned divergences with other recent advances such as tree-based search or multi-turn reasoning could yield further improvements.

‍