Evolution as a Comparable Strategy to Reinforcement Learning

I implemented an Evolution Strategies (ES) algorithm for LLMs and put it through its paces on Prime RL environments. It is a simple version that uses LoRA updates and vLLM as the inference engine. It is comparable to RL but may be learning from different signals.

Let’s start with a paper (https://arxiv.org/pdf/1901.11503) that tried to determine when each kind of zeroth-order optimization (RL vs ES) should be used. There are a couple of quotes I want to highlight.

“As our analysis in Sections 4.1 and 4.2 points out, the stochasticity of the environment plays an important role in controlling the variance of our gradient estimates in zeroth order optimization procedures.”

“Such a gradient estimation algorithm (REINFORCE and its derivatives) can be considered a combination of a zeroth-order approach and a first-order approach…”

“In summary, our analysis and experimental results suggests that the complexity of exploration in action space depends on both the dimensionality of action space and horizon, while the complexity of exploration in parameter space solely depends on dimensionality of parameter space, providing a natural way to trade-off between these approaches.”

Here is another paper which explicitly defines the dimensionality argument (https://arxiv.org/pdf/2509.24372).

“Vemula et al. (2019) performed a theoretical analysis of different exploration strategies, and found that the complexity of the parameter space exploration increased quadratically with the number of parameters, whereas the complexity of action space exploration depended on action dimensionality quadratically and horizon length of the reward quartically.”

I highly recommend reading both papers, but these quotes are the most pertinent to this post. RL explores action space, where exploration complexity scales quadratically with action dimensionality and quartically with horizon. ES explores parameter space, where complexity scales quadratically with the number of parameters. Both rely on stochastic exploration, as zeroth-order optimizers do. RL is not purely zeroth order because its updates use first-order gradients, while ES is purely zeroth-order optimization.
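To make the scaling argument concrete, here is a rough sketch that tracks only the exponents from the quote above and ignores every constant. The function names are illustrative and do not come from either paper, so treat the outputs as growth ratios rather than real costs.

```python
def action_space_growth(action_dim_ratio: float, horizon_ratio: float) -> float:
    """Relative growth in exploration complexity for action-space methods (RL).

    Uses only the quoted exponents: quadratic in action dimensionality and
    quartic in horizon. Constants are ignored, so this is a ratio, not a cost.
    """
    return action_dim_ratio ** 2 * horizon_ratio ** 4


def parameter_space_growth(param_ratio: float) -> float:
    """Relative growth for parameter-space methods (ES): quadratic in parameters."""
    return param_ratio ** 2


if __name__ == "__main__":
    # Doubling the horizon with the same action space.
    print(action_space_growth(1.0, 2.0))
    # Doubling the number of perturbed parameters; horizon does not appear at all.
    print(parameter_space_growth(2.0))
```

Under these exponents, doubling the horizon costs action-space exploration a factor of 16, while parameter-space exploration is unaffected by horizon and only pays when the perturbed parameter count grows.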

A common prior assumption is that ES would not work at scale because the parameter space is immense. The worst case is a signal too sparse to parse among LLM-scale parameters. This is wrong for the same reason that RL is viable in LLMs: the environment has limited stochasticity due to prior training. For RL, pretraining develops bias in the action space which beneficially limits exploration. For ES, training develops neural thickets (https://arxiv.org/pdf/2603.12228) for ES to exploit. As such, there is no issue with ES on a well-pretrained model, which raises the question of why it is not more commonly deployed. I don’t have an answer.

TL;DR

On-device concision environment - ES clearly learned. The model decreased from 221 to 149 completion tokens over 50 steps.

A6000 concision - ES was faster and had better final performance than DAPO. Our DAPO implementation stalled because it failed to generate diverse outputs. The concision model clearly learned with a LoRA and vLLM implementation.

GSM8K/Math - This test hit an evaluation reward of 75% from a baseline of 53%. None of the completions were truncated at 2048 tokens. There was an issue on an early run where our sigma and LR were too low.

Alphabet Sort - ES allowed my model to match RL competency in 40 steps. The reward increased from 0.264 to 0.834 with a peak of 65% of the evaluations as perfect. Our ES environment was set to a longer horizon but easier reward calculation. This was a mistake but the ES model learned regardless.

Wordle - ES continued to learn even at step 100. We matched RL reward (~0.88) within our environment's eval. The original training of the RL model used a different eval set. When comparing with that set, the RL model outperforms ES in reward (RL improved to ~1.05). RL beat ES in solve rate under both evaluations (65% vs 20%). ES had significantly higher adherence to format in later turns (81% vs 29% valid guesses at turn 6). ES had lower total token output (1500 vs 3000). Each method seems to pick up on different signals related to its exploration strategy.

On-Device Concision

This was a local MLX run using the instruct variant of Gemma 270M. The model was trained to output a shorter answer. The 50-step ES run improved train raw reward from -189.42 to -122.34. The train normalized reward improved from 0.9053 to 0.9388. Eval improved from val_norm=0.8911, mean_len=221.0 at step 5 to val_norm=0.9271, mean_len=148.9 at step 50. I did not take an initial baseline.

So clearly, there was something happening. The algorithm was from this paper (https://arxiv.org/abs/2509.24372). The local concision experiment did not use a LoRA but the upcoming experiments did. The use of LoRA became the primary difference between my implementations and the linked paper.
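For readers who have not seen ES before, here is a minimal sketch of one generic update: perturb the parameters with Gaussian noise, score each perturbed copy, z-score the rewards, and step along the reward-weighted average of the noise. This is a textbook-style illustration on a toy reward, not the implementation used in these experiments and not necessarily the exact algorithm from the linked paper. The names es_step, toy_reward, and run_demo are hypothetical, and theta stands in for whatever is being perturbed (full weights or LoRA weights).

```python
import random
import statistics


def es_step(theta, reward_fn, rng, population=32, sigma=0.1, lr=0.05):
    """One vanilla ES update on a flat list of parameters (hypothetical helper)."""
    # Sample one Gaussian perturbation per population member.
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
    # Score each perturbed copy; only rewards are needed, never gradients.
    rewards = [
        reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
    ]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every candidate scored the same, so there is no direction to move in.
        return list(theta)
    # Z-score the rewards so the step size does not depend on reward scale.
    scores = [(r - mean) / std for r in rewards]
    # Move along the reward-weighted average of the perturbations.
    return [
        t + lr / population * sum(s * eps[i] for s, eps in zip(scores, noises))
        for i, t in enumerate(theta)
    ]


def toy_reward(params, target=(1.0, -2.0, 0.5)):
    """Dense, smooth toy reward: negative squared distance to a target."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def run_demo(steps=200, seed=0):
    """Run repeated ES steps from the origin and return the final parameters."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        theta = es_step(theta, toy_reward, rng)
    return theta


if __name__ == "__main__":
    final = run_demo()
    print("start reward:", toy_reward([0.0, 0.0, 0.0]))
    print("final reward:", toy_reward(final))
```

The whole loop is forward passes plus a weighted sum, which is why an inference engine does most of the work in an LLM setting.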

This should be an easy case where ES works because the reward is dense, cheap, and smooth. If the outputs were shorter then they received improved rewards. I then wanted a direct comparison with RL. So I had Codex write a DAPO trainer. DAPO and ES were deployed on an A6000 GPU rented from PI.

GPU vLLM concision direct DAPO vs ES comparison

I ran a quick DAPO vs ES run on a rented GPU using google/gemma-3-270m-it, steps=50, group_size=32, and max_tokens=1024.

For the final 50-step segment:

Method	Final Train Reward	Final Eval Reward	Eval Content	Eval Len	KL	Wall Time
DAPO	-7.000	-6.375	0.375	16.8	0.04969	294.2s
ES	0.000	-2.500	0.438	13.4	0.08891	106.4s

ES was better and faster. I’m not confident that this was a strong baseline for DAPO. I did debug the initial Codex implementation of DAPO so that it functioned, but I’ve had issues in this domain before with LLM agent implementations. Regardless, the ES algorithm was learning and it was learning fast. The trouble with the DAPO implementation also speaks to the ease with which ES was implemented in comparison.

The important context is that DAPO stalled. In the segmented logs, DAPO improved for the first few steps, then from about step 10 onward it had grad=0.0, qualified_groups=0, and rounds=10.0. The reward collapsed because the DAPO variant converged to no variance. It had 41/50 zero-gradient steps.
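The stall can be illustrated with a minimal sketch (this is not the trainer used in the run). It assumes group-relative advantages and a simplified stand-in for a dynamic-sampling filter that drops groups with no reward variance. The function names are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each reward minus the group mean, over the std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def qualified_groups(groups):
    """Keep only groups with some reward variance (simplified filter, an assumption)."""
    return [g for g in groups if statistics.pstdev(g) > 0.0]


if __name__ == "__main__":
    collapsed = [-7.0, -7.0, -7.0, -7.0]  # every sample in the group scored the same
    diverse = [-7.0, -3.0, 0.0, -5.0]
    print(group_advantages(collapsed))     # all zeros, so the gradient is zero
    print(len(qualified_groups([collapsed, collapsed])))  # no group survives
    print(len(qualified_groups([collapsed, diverse])))    # one group survives
```

Once every group produces identical rewards, the advantages are all zero and nothing passes the filter, which matches the grad=0.0 and qualified_groups=0 pattern in the logs.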

ES train reward moved roughly -10.266 -> 0.000 and eval improved to -2.500. It drifted more in KL than DAPO did. The higher KL was expected because the DAPO implementation had stalled and was no longer learning. You can’t diverge if you are not learning.

I don’t really care about the DAPO implementation here. I could use an ES algorithm to quickly train an LLM. This is not a fair comparison and I don’t claim it to be. The fastest way to get to a fair comparison is to run the ES algorithm within known RL environments and models. Prime Intellect has plenty of environments with tuned baselines.

- MLX/on-device ES showed clean concision learning.
- The GPU vLLM direct comparison showed ES materially outperforming DAPO on this toy concision task.
- ES won on wall clock by about 2.8x.
- The DAPO failure was a no-variance/group-filter collapse.
- This GPU run used shaped concision plus a correctness/content reward. It was an extension of the MLX run.

Before I begin the next few sections, it is important to highlight that the train reward and eval reward for my ES implementation often differed by a significant amount. In the next few studies we would often see a train reward ¼ - ½ the values of the eval reward. This has to do with the variance seen in LoRA perturbations. It was not always the case but happened enough I wanted to highlight that this looked to be normal. The eval results were unaffected.

GSM8K / Math

The original full-run setup used Qwen/Qwen3-0.6B, GSM8K, LoRA rank 32, group_size=64, train_examples=16, sigma=0.001, lr=0.0005.

The GSM8K test showed the model would learn. It increased from a baseline of 53% to 67 - 69%. After the model appeared to learn, I wanted to do a LoRA rank sweep. RL is interesting because it tends to learn at lower rank as well as it does in a full finetune. This is why LoRAs are often used preferentially in RL setups. Ideally, ES shows something similar. What is the lowest LoRA rank with idealized learning?

I ran a LoRA rank search from 2 - 64. The grid search evals completed with cells clustered around 0.625-0.672, with the best observed eval 0.6875 for rank 2. I found the search worrying because there was no clear differentiation or trend. My result did not show a clean advantage for higher LoRA rank or broader matrix targeting.

New GSM8K / Math

The new full-run setup used Qwen/Qwen3-0.6B-Base, GSM8K, LoRA rank 16, population_size=64, candidate_chunk_size=32, examples_per_env=8 for 512 candidate rollouts per step, sigma=0.012, lr=0.01, max completion/context budget 2048 tokens. Eval used 128 GSM8K examples, baseline at step 0 and then every 5 steps.
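As a sanity check on the rollout budget, here is a small sketch of the arithmetic. It assumes each candidate is scored on every example (which is consistent with the 512 figure above) and that candidate_chunk_size means candidates are evaluated in batches of that size. The second point is an assumption about the config, not something confirmed here.

```python
import math


def rollouts_per_step(population_size: int, examples_per_env: int) -> int:
    """Total candidate rollouts if every candidate is scored on every example."""
    return population_size * examples_per_env


def chunks_per_step(population_size: int, candidate_chunk_size: int) -> int:
    """Number of candidate batches per step (assumed meaning of the chunk size)."""
    return math.ceil(population_size / candidate_chunk_size)


if __name__ == "__main__":
    print(rollouts_per_step(64, 8))   # matches the 512 rollouts quoted above
    print(chunks_per_step(64, 32))
```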

There was a previous issue in which my sigma and learning rate were too low. One indication was that the grid search yielded functionally no difference across values. I expected the model to easily learn GSM8K, but I did not yet have a good sense for the capacity of ES, so it was hard to recognize that the learning was sub-ideal for this method. I knew it did not compare to previous RL runs, yet when I saw some learning I figured the algorithm was working as intended. I had swept my learning rate an order of magnitude on either side and did not see improvement during initial setup runs. In later experiments I finally determined that both the LR and sigma values were too low! The ‘New’ GSM8K math run was done with my ES implementation, which is intended to be added to Prime-RL (if they will have it).

Updated GSM8K (Best Eval)
step = 45
reward_mean = 0.7500
truncated = 0.0%

Using the new setup, the model learned better. By step 45 we achieved 75% on the environment and no completions were truncated. The math environment does not use length shaping, so the absence of truncation is a side effect of the ES method as opposed to specific training configurations. This is comparable in ‘correct’ answers to what I have seen with other RL methods (except that truncation tends not to improve in RL without explicit rewards).

Word Reversal

One of the environments I tested was Word Reversal. It had a sneaky characteristic! I spent a full day trying to get the model to learn from the SFT checkpoint. I took a break and got the alphabet sort environment to work then tried word reversal again (this is when I had determined working sigma and LR). This time, I baselined the SFT model and carefully reviewed the final reporting from the Prime RL Env. THE SFT MODEL IS AT THE REPORTED RL EVAL VALUES! Our model was not expected to learn and was at RL/SFT equivalence.

I decided to bypass this environment for further testing.

Alphabet Sort

Config: Qwen/Qwen3-4B-Instruct-2507, LoRA ES, group_size=128, train_examples=16, sigma=0.012, lr=0.01, temp 0.0, similarity_power=4, power_per_turn=false, 3-5 turns, 1-5 names per turn. It was configured for 100 steps but we stopped at 60 steps.

Prime’s Alphabet Sort README reports baseline reward 0.264, post-RL reward 0.805, perfect attempts 73.3%, and perfect examples 65%. Our ES run essentially reached that same endpoint: step 60 reward 0.808, and step 50 exact examples 0.65. It matched the reported RL model, then plateaued.

Once we got the LR and Sigma worked out, our model learned quite well. So well, it matched the RL performance (step 100) in only 40 steps. I pulled this run early because it was large and slow (relative to what I am used to on sub 1B models). This is where I ask for sponsorship for GPUs so I may work with larger models (please?).

Train reward improved from 0.202 to 0.614, and train exact from 0.004 to 0.142. Evals were stronger: step 30 reward 0.684, exact 0.05; step 40 reward 0.831, exact 0.60; step 50 reward 0.794, exact 0.65; step 60 reward 0.808, exact 0.55.

Now technically, I messed up. I lowered the similarity power to 4 instead of the 8 used in the RL environment. This inflates our reward. However, I also made the task harder. I increased the number of potential turns (up to 5 as opposed to max of 3) and allowed more names per turn (5 instead of 4). I’m okay calling this approximately equivalent. We ended at parity on exactness (on a harder task) and parity for reward (on an easier reward calculation).
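To see why the lower power inflates reward, here is a tiny sketch. It assumes the reward is a similarity score in [0, 1] raised to similarity_power, which is my reading of the config and not a copy of the environment's code. The function name is hypothetical.

```python
def shaped_reward(similarity: float, power: int) -> float:
    """Similarity in [0, 1] raised to a power; higher powers punish near misses more."""
    return similarity ** power


if __name__ == "__main__":
    for s in (0.5, 0.8, 0.9, 1.0):
        # The same near-miss answer earns more under power 4 than under power 8.
        print(s, shaped_reward(s, 4), shaped_reward(s, 8))
```

For example, a 0.9-similar answer scores about 0.66 under power 4 and about 0.43 under power 8, while a perfect answer scores 1.0 either way. That is why exactness is the safer number to compare across the two setups.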

Our next experiment is apples to apples. The result was interesting.

Wordle Full Run

Config: PrimeIntellect/Qwen3-1.7B-Wordle-SFT, LoRA ES, 100 steps, group_size=64, train_examples=16, sigma=0.01, lr=0.01, temp 0.0, 1024 max tokens per turn, full history. Runtime was about 23.6h.

Baseline review was reward 0.513, won 0.05, partial 0.35, format 0.48. Final step-100 review was reward 0.885, won 0.20, partial 0.465, format 0.80. So ES learned useful behavior: better formatting, shorter outputs, more partial progress, and some exact wins. It did not become a strong Wordle solver.

Prime’s Wordle README reports SFT at roughly 5% wins and avg reward ~0.6 and final RL at roughly 60% wins and avg reward ~1.5. Our baseline matched the 5% win claim. Our ES improved substantially but stayed far below RL on exact wins. In an eval comparison, the RL model scored 58.3% exact, while ES scored 25% at temp 0.0 and 26.7% at temp 0.3.

So we can see here that RL beat ES on solve rate. However, when we look at cap hits and format, ES training beats the RL model. These were not rewarded as heavily as early solves, so the reward of ES lags that of RL, as we will see in a moment.

ES Review Progression

Step	Reward	Wins	Partial	Format	Len Bonus	Invalid Rows / Moves	Guess Tags	Avg Tokens
0	0.513	1/20	0.350	0.480	0.017	18 / 64	2.65	3836
10	0.575	1/20	0.400	0.540	0.017	17 / 55	3.25	3424
20	0.672	2/20	0.415	0.650	0.027	14 / 34	4.35	2626
30	0.794	3/20	0.445	0.725	0.054	9 / 19	4.80	2121
40	0.728	2/20	0.445	0.770	0.029	5 / 12	5.45	1828
50	0.865	5/20	0.390	0.770	0.071	6 / 8	5.20	1655
60	0.808	3/20	0.445	0.795	0.054	5 / 8	5.50	1486
70	0.802	2/20	0.505	0.775	0.042	8 / 11	5.40	1636
80	0.851	4/20	0.435	0.780	0.060	3 / 6	5.30	1609
90	0.823	3/20	0.480	0.800	0.033	8 / 11	5.85	1550
100	0.885	4/20	0.465	0.800	0.060	4 / 7	5.55	1513

The above shows eval results from the ES training. Even at the end of training, the overall trend still appears to be improving. The model learned to stay in format, produce usable guesses across the whole game, avoid invalid moves, avoid loops, and stop rambling.

Behavior Changes
- Format became nearly saturated: 0.48 -> 0.80.
- Invalid rows dropped from 18/20 -> 4/20; total invalid moves dropped 64 -> 7.
- Blank/unparsed final guesses dropped 16/20 -> 3/20.
- Mean guess tags increased 2.65 -> 5.55, meaning it was much more consistently playing all turns instead of rambling/truncating before valid guesses.
- Output length dropped: 3836 -> 1513 completion tokens and 12,807 -> 4,830 chars.
- Partial score improved 0.350 -> 0.465 while peaking at 0.505 on step 70.
- Exact wins improved 1/20 -> 4/20 while peaking at 5/20 on step 50.

The turn-level win pattern also improved. Baseline only solved south in 3 turns. By step 50 it solved paste, store, south, scale, and snake. By step 100 it solved earth, store, scale, and snake.

Cap / Truncation Comparison

Truncated rows essentially disappeared by step 100. The worst-performing ES evaluation had a 5% truncation rate. We will see ES was far less cap-prone than RL, but the RL model had a higher solve rate.

Run	Win Rows	Mean Reward	Mean Output Tokens	Truncated Rows	Invalid Rows
RL default 20x3	35/60	1.044	2529	28/60	32/60
ES default 20x3	9/60	0.663	1336	1/60	53/60
ES temp 0, 20x1	5/20	0.818	1214	1/20	15/20
ES temp 0.3, 20x3	16/60	0.810	1301	2/60	45/60

The default temp was 1. The ES model had a better solve rate at a lower temp (0.3) than at the RL default (1.0). Raising the temperature above 0 made the majority of guesses invalid. This followed an existing trend in which the ES model tended to combine tokens into a guess that was not valid English, or to output guesses which were not 5 letters. It would output these in the valid format.

RL vs ES Qualitative Difference

On the same eval rows that we used to evaluate ES during training, RL had lower format and partial scores but a higher exact solve rate. When I initially ran the RL model, I was surprised to see a lower total reward. When I ran the RL model on the eval set used during its original training, it got a higher reward. Our ES model did not improve or regress substantially across different seeds. The RL model shows more variance.

Model / Harness	Reward	Exact	Partial	Format	Len Bonus	Turns	Tokens
Baseline review	0.513	0.05	0.350	0.480	0.017	5.85	3836
ES final review	0.885	0.20	0.465	0.800	0.060	5.55	1513
Prime RL on same rows	0.824	0.35	0.250	0.653	0.093	5.40	3024
Prime RL seed0 rows	1.045	0.55	0.190	0.690	0.167	4.55	2407

This is the key distinction: ES learned to adhere to format and play within token constraints. RL learned to solve as soon as possible. Both models improved over baseline.

One interesting difference in their failures emerged. RL’s invalid moves were mostly repeats, while ES’s invalid moves were mostly non-English or wrong-length guesses. In the evals, RL had 45 repeat invalids, 10 length, 10 non-English. ES default had 68 non-English and 48 length invalids when run with non-zero temperature. That suggests ES learned the interaction format but not the valid-word manifold as well as RL.

Let’s dig even deeper into the solutions.

Wordle ES Run
Training-side metrics:

metric	value
step 1 train reward	0.636
step 50 train reward	0.802
step 100 train reward	0.759
best train reward	0.820 at step 88
best train win rate	0.253 at step 55
candidate reward std mean / median	0.076 / 0.073
candidate reward std min / max	0.049 / 0.128
total train tokens	199.4M
avg train tok/s	2394
elapsed train wall time	23.56h
step wall time change	1379s step 1 to 575s step 100

Review/eval trajectory on the fixed 20-row review set:

step	reward	wins	partial	format	avg tokens	invalid rows	invalid moves	cap hits
0	0.513	1/20	0.350	0.480	3836	18	64	64
10	0.575	1/20	0.400	0.540	3424	17	55	52
20	0.672	2/20	0.415	0.650	2626	14	34	30
30	0.794	3/20	0.445	0.725	2121	9	19	15
40	0.728	2/20	0.445	0.770	1828	5	12	6
50	0.865	5/20	0.390	0.770	1655	6	8	6
70	0.802	2/20	0.505	0.775	1636	8	11	5
100	0.885	4/20	0.465	0.800	1513	4	7	0

The turn-based improvement is the clearest signal:

turn	baseline valid	final valid	baseline cap hits	baseline info	final info	final wins	baseline tok	final tok
1	1.00	1.00	0	1.20	1.20	0	90	72
2	0.80	0.95	4	1.80	2.05	1	384	238
3	0.40	1.00	12	0.93	2.18	0	751	346
4	0.21	0.95	15	0.34	2.82	2	871	357
5	0.11	0.88	17	0.18	2.82	1	952	325
6	0.16	0.81	16	0.32	2.50	0	927	324

Here, info = green_count + 0.5 * yellow_count. ES learned to keep playing valid later turns. Baseline collapsed after turn 2/3 into cap-hitting or invalid behavior. Often it looped in the later turns. The final model kept valid <guess>[word]</guess> structure through all 6 turns.
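The info column can be reproduced with a short sketch. The info_score function follows the formula above directly. The feedback function is a standard Wordle colouring (greens first, then yellows limited by the remaining letter counts) and is an assumption about how the environment counts greens and yellows. Both names are hypothetical.

```python
from collections import Counter


def feedback(guess: str, answer: str) -> tuple[int, int]:
    """Return (green_count, yellow_count) using standard Wordle rules."""
    greens = sum(g == a for g, a in zip(guess, answer))
    # Letters of the answer that were not matched in place are available for yellows.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    yellows = 0
    for g, a in zip(guess, answer):
        if g != a and remaining[g] > 0:
            yellows += 1
            remaining[g] -= 1
    return greens, yellows


def info_score(greens: float, yellows: float) -> float:
    """info = green_count + 0.5 * yellow_count, as defined in the text."""
    return greens + 0.5 * yellows


if __name__ == "__main__":
    g, y = feedback("stare", "store")
    print(g, y, info_score(g, y))
    g, y = feedback("earth", "heart")
    print(g, y, info_score(g, y))
```

A guess with four greens scores 4.0, while an anagram with five yellows scores only 2.5, so the metric rewards placement over letter discovery.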

As mentioned, the failure of the baseline is often looping in later turns. ES allowed the model to learn the turn based system. However, the failures in the final model’s turns were often non-English words derived from tokens which maximized letter exploration. It was playing the game to maximize the reward through adherence instead of direct word solving. It would solve for the words and was learning this strategy but almost as a side effect.

We can look at the RL model with the same 20 examples.

turn	active rows	valid	format	bracketed	cap hits	info	green	yellow	wins	mean tokens
1	20	1.00	1.00	1.00	0	1.20	0.55	1.30	0	130
2	20	0.95	0.95	0.95	1	1.95	1.20	1.50	0	368
3	20	0.90	0.90	0.90	2	2.30	1.70	1.20	2	574
4	18	0.56	0.61	0.61	7	1.28	0.83	0.89	2	744
5	16	0.44	0.44	0.44	9	1.41	1.06	0.69	2	802
6	14	0.29	0.29	0.29	10	0.86	0.64	0.43	1	915

Although the RL model continues to solve at later turns, it produces fewer valid completions as the game goes on. This highlights the reported differences between RL and ES. Since RL works in action space, its exploration complexity has a quartic relationship with horizon, so it will find more signal where the task horizon is shorter. ES is independent of horizon but has a worst-case quadratic relationship with the number of parameters, so it will exploit signal with less stochasticity in parameter space.

RL learned to maximize reward using a strategy which required a lower action horizon (get to the correct word) while ES maximized reward through something that likely showed up more often in training (following a specific format). Formatting under that assumption is an easier task to find in the neural thicket. The training strategies seem to pick up on different signals.

That difference is cool! It implies that the signal source is different between RL and ES algorithms (at least within some tasks). There is bound to be overlap in the signal, but because the two training regimes learned different strategies, there must be ample non-overlapping signal. These two training strategies could be used together to increase reward and vary the training.

Ending Notes

This post came about because I was reading a new paper on quantization and had questions. It led down a rabbit hole. I now don’t believe one zeroth order optimizer method is better than the other. Both have their benefits. In fact, I am looking forward to using a joint training strategy. I am looking forward to training quantized models directly (QES like). I’m looking forward to exploring more niches. There are many left.

I am currently generating an addition to Prime-RL for the ES used in this blog post. Please let me know your thoughts and consider thoroughly roasting the code when it is released. It should be pretty simple and will be out within a couple weeks of this post. Right now I have a sequential method which benchmarks at 97% of its time in inference. The 3% is weight updates and LoRA loading. The engineering is pretty straightforward so I mostly need to verify it will scale across multi-node training.