What we are still checking
An agent stalls on one puzzle. It loops on a single action, misreads the pixel grid, and never reaches the discrete code it must set. We replay that stuck memory and ablate the pipeline to find out why — and what is left to fix.
Ground truth: the pattern field (5) is already correct. Color needs the cycler at (29,45), which steps +1 mod 4, advanced until it reads 1; rotation needs the cycler at (49,10), also +1 mod 4, advanced until it reads 3. Over-cycling wraps past the target. The agent never represents these three numbers (pattern, color, rotation) at all.
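The wrap-around behaviour of a +1 mod 4 cycler can be sketched in a few lines. This is only an illustration: the helper names `press` and `presses_needed` are hypothetical, and the starting value of 0 is an assumption, since the text gives the targets (1 and 3) but not the state each cycler starts in.

```python
MODULUS = 4  # each cycler advances its field by +1 mod 4


def press(value: int, times: int = 1, modulus: int = MODULUS) -> int:
    """Return the field value after pressing the cycler `times` times."""
    return (value + times) % modulus


def presses_needed(current: int, target: int, modulus: int = MODULUS) -> int:
    """Minimum number of presses that moves `current` onto `target`."""
    return (target - current) % modulus


# Assumed starting value of 0 for both fields (not stated in the source).
print(presses_needed(0, 1))  # color target 1 -> 1 press
print(presses_needed(0, 3))  # rotation target 3 -> 3 presses

# Over-cycling: one press too many on color lands on 2, and the only way
# back to 1 is to wrap all the way around.
overshot = press(0, 2)
print(overshot, presses_needed(overshot, 1))  # 2 3
```

Under that assumption, an agent that tracked the register explicitly would need four presses in total, which is the contrast with a 184-turn trace that never sets a field.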
It reads the wrong picture
Level 2 is a 64×64 frame. To the right is a fuel bar that depletes as the agent moves, and a code-setter the agent calls a stencil. The agent treats the depleting fuel as a code grid, and the whole puzzle as movement.
The trace runs 184 turns and never sets a discrete register. It loops on ACTION4, narrating the pixels instead of cycling the color/rotation activators.
What we are doing about it
We replay the stuck memory and run short-turn ablations across the four components (World Model, Miner, Judge, Use), varying options to find why it fails. These are not full episodes.
The abductive hypothesis is plausible but wrong: "a single hidden constraint — the yellow checkpoint/refill tiles act like a reset trap that reverts the key and code state." With decode on, the committed skill finally names the real move: "count covers from the current state and stop on the first state that opens the lock; if the activator cycles, prefer the minimum additional covers." Still framed around the stencil, never the three-number register.
Three lenses mine in parallel. Scores: explore 0.79, exploit 0.88 (wins), divergent 0.74. Under the world-model arm, explore collapses to 0.21 and exploit (0.86) carries. Every lens converges on the same stencil model, so diversity does not break the misread.
combined = advantage(ΔIG) × grounding = 0.88 ≥ 0.7 → COMMIT. But discovery-lift ΔP = 0.00 for every arm: the no-skill baseline finds the cycler at the same rate. The judge commits a wrong stencil skill at 0.88 with zero usefulness. (Eval axis: novel-state cross-judge 0.80 vs 0.10.)
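As a rough sketch of why the current bar lets this commit through, and of what gating on ΔP would change, the two decision rules can be written side by side. The `Candidate` type, the function names, and the `min_lift` threshold are hypothetical; only the 0.7 bar, the 0.88 combined score, and ΔP = 0.00 come from the text.

```python
from dataclasses import dataclass

COMMIT_BAR = 0.7  # commit threshold quoted in the text


@dataclass
class Candidate:
    name: str
    combined: float  # advantage(ΔIG) × grounding, as reported
    delta_p: float   # discovery-lift over the no-skill baseline


def current_judge(c: Candidate, bar: float = COMMIT_BAR) -> bool:
    """Current rule: commit whenever the combined score clears the bar."""
    return c.combined >= bar


def gated_judge(c: Candidate, bar: float = COMMIT_BAR, min_lift: float = 0.0) -> bool:
    """Sketch of a ΔP gate: also require strictly positive discovery-lift.

    `min_lift` is an assumed parameter; the source only says "gate on ΔP".
    """
    return c.combined >= bar and c.delta_p > min_lift


stencil = Candidate("stencil skill", combined=0.88, delta_p=0.00)
print(current_judge(stencil))  # True: plausible, so it commits
print(gated_judge(stencil))    # False: zero lift, so it is rejected
```

A strict inequality is used so that a skill matching the no-skill baseline exactly, as the stencil skill does, is rejected rather than committed.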
The field is only legible when handed the decoded state: correct field 0/12 from text, 0/12 from raw grid, 7/12 from decoded state (Fisher p=0.0046). Skill direction matters: toward the principle helps (ΔP +0.36, survives target reseed); world-model-as-text is harmful (ΔP −0.17).
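The reported p-value can be checked by hand. The sketch below assumes the comparison is a two-sided Fisher exact test on a 2×2 table of correct versus incorrect reads, with one no-decode condition (0 of 12) against the decoded-state condition (7 of 12). The source does not say which pair of conditions was tested, and text and raw grid give the same table. It uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes landing in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: (correct, incorrect) for raw grid and for decoded state.
p = fisher_exact_two_sided(0, 12, 7, 5)
print(f"{p:.4f}")  # 0.0046
```

Under these assumptions the result is 2 × 792 / 346104 ≈ 0.0046, which agrees with the figure quoted above.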
Options we are sweeping
Each component × each option, with status. The sweep is cheap: short-turn replays over the recorded prefix, not new episodes.
| Component | Option | What it tests | Status |
|---|---|---|---|
| World Model | none / hypothesis / worldmodel | Does a theorist arm escape the stencil model, or just elaborate it? | done |
| World Model | decode ON / OFF | Does feeding the decoded register change the committed skill? | running |
| Miner | explore / exploit / divergent | Can lens diversity break convergence on one wrong model? | done |
| Miner | register-targeted lens | A lens that mines for the discrete code, not the pixels. | planned |
| Judge | grounding × advantage (0.7 bar) | Current bar: commits plausible skills at 0.88 with ΔP 0. | done |
| Judge | mechanism-correctness gate | Reward discovery-lift (ΔP), reject plausibility-only commits. | planned |
| Use | text / raw grid / decoded state | Which observation makes the code field legible? Decoded only. | done |
| Use | skill direction · toward / as-text | Toward-principle +0.36 vs world-model-as-text −0.17. | done |
| Use | decode primitive in the act loop | Expose (pattern,color,rotation) to the actor each turn. | running |
Roadmap
- Fix the judge. Reward mechanism-correctness, not plausibility. The 0.88 / ΔP 0 commit is the canonical failure: a wrong stencil skill passes the bar while adding zero discovery-lift. Gate on ΔP.
- Full-pipeline escape. With the decode primitive exposed and the judge gated, run the four components end-to-end from the stuck prefix and check whether the register reaches (5,1,3) and the level clears.
- Carry findings forward. Feed what each ablation learned back into memory so the learning loop keeps the toward-principle skill and drops the world-model-as-text one.
- Decode primitive to production. Promote the decoded-state observation from the ablation harness into the live act loop, so the agent never has to read the register off pixels again.