Sokoban RL

A playable lesson inspired by gym-sokoban. Manual mode · PPO Agent mode.
Last Reward: 0.0
Total Score: 0.0
Steps: 0
Manual mode. Use WASD or ↑↓←→ to move / push. R reset · U undo · N new level. Push all orange boxes onto the yellow targets.

Legend

Wall
Target
Box
Box on Target
Player

Rewards (gym-sokoban)

Per step: −0.1
Box onto target: +1.0
Box off target: −1.0
All boxes solved: +10.0
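
The reward table above can be read as a per-step function. The following is a minimal sketch, assuming the terms are additive within a single step (so the step that places the final box earns the step penalty, the box bonus, and the completion bonus together). The function name and its arguments are hypothetical and are not taken from the page.

```python
# Reward constants copied from the table above.
STEP_PENALTY = -0.1
BOX_ON_TARGET = 1.0
BOX_OFF_TARGET = -1.0
ALL_SOLVED = 10.0


def step_reward(on_target_before: int, on_target_after: int, total_boxes: int) -> float:
    """Reward for one step, given how many boxes sit on targets before and after it.

    Assumption: the terms add up within a step; the page lists the values
    but does not say how they combine.
    """
    reward = STEP_PENALTY
    delta = on_target_after - on_target_before
    if delta > 0:
        reward += BOX_ON_TARGET * delta        # a box was pushed onto a target
    elif delta < 0:
        reward += BOX_OFF_TARGET * (-delta)    # a box was pushed off a target
    if on_target_after == total_boxes:
        reward += ALL_SOLVED                   # every box is now on a target
    return reward
```

Under this reading, a plain move costs 0.1, and the final push in a two-box level returns −0.1 + 1.0 + 10.0.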

PPO Agent

Status: idle
Episode: 0
Avg return (last 20)
Best return
Policy entropy
Solve rate (last 20)
[Chart: training return per episode]
[Chart: policy π(a | current state) over the four actions: ↑ push · ↓ push · ← push · → push]
PPO with a small MLP policy (two hidden layers), trained live in your browser with clipped-surrogate updates. Training first bootstraps the policy by curriculum imitation of an A* teacher, then refines it with pure self-play. This keeps in-browser training tractable (seconds, not hours).
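
To make "clipped-surrogate updates" and the "Policy entropy" readout concrete, here is a minimal sketch of the two quantities in plain Python. The clip range of 0.2 is an assumed, commonly used default; the page does not state the value it uses, and the function names are illustrative only.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped-surrogate objective (the quantity PPO maximizes).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps=0.2 is an assumption.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                               # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)                # pessimistic of the two terms
    return total / len(advantages)


def policy_entropy(probs):
    """Shannon entropy (in nats) of an action distribution pi(. | state)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform policy over the four push actions has entropy ln 4; the value falls toward 0 as the policy commits to a single action.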

Generalist (Transfer)

A single policy is trained on N source levels, then evaluated zero-shot on unseen levels. It uses a 7×7 egocentric observation (a window centered on the player), so the same policy works on any grid size. D₄ symmetry augmentation (four rotations, each with and without a mirror flip) gives 8× more training data for free.
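
The egocentric observation can be sketched as a fixed-size crop around the player. The example below assumes the level is a list of equal-length strings and that cells outside the level are padded as walls; both the character encoding and the padding rule are assumptions for illustration, not details given on the page.

```python
WALL = "#"  # assumed wall symbol; also used to pad cells outside the level


def egocentric_window(grid, player_row, player_col, size=7):
    """Return a size x size window of `grid` centered on the player.

    grid: list of equal-length strings, one character per cell (assumed encoding).
    Cells that fall outside the level are filled with WALL (assumption).
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    window = []
    for dr in range(-half, half + 1):
        line = []
        for dc in range(-half, half + 1):
            r, c = player_row + dr, player_col + dc
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(grid[r][c] if inside else WALL)
        window.append("".join(line))
    return window
```

Because the output shape is always 7×7 regardless of the level's dimensions, the policy's input layer never has to change when the grid size does.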
Status: untrained
Source levels: 0
Training samples
Train solve rate
Test solve rate
Train Generalist: generates N source puzzles, runs A* on each, and performs imitation learning with D₄ augmentation.
Test Zero-Shot: generates fresh unseen puzzles and reports the greedy solve rate.
Apply to Current: plays the generalist policy on whichever level is loaded on the board.

Event Log
