A playable lesson inspired by gym-sokoban. Manual mode · PPO Agent mode.
Last Reward: 0.0
Total Score: 0.0
Steps: 0
Manual mode. Use WASD or ↑↓←→ to move / push.
R reset · U undo · N new level.
Push all orange boxes onto the yellow targets.
Legend
Wall
Target
Box
Box on Target
Player
Rewards (gym-sokoban)
Per step: −0.1
Box onto target: +1.0
Box off target: −1.0
All boxes solved: +10.0
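The reward table above can be sketched as a single function. This is a minimal illustration, not the lesson's actual code: the function name `step_reward` and its arguments are hypothetical, and it assumes the reward is derived from the change in the number of boxes sitting on targets.

```python
# Reward constants taken from the table above.
STEP_PENALTY = -0.1
BOX_ON_TARGET = 1.0
BOX_OFF_TARGET = -1.0
ALL_SOLVED = 10.0


def step_reward(on_target_before, on_target_after, total_boxes):
    """Reward for one step (hypothetical helper).

    Assumes the reward depends only on how the count of boxes
    on targets changed during this step.
    """
    reward = STEP_PENALTY
    if on_target_after > on_target_before:
        reward += BOX_ON_TARGET   # a box was pushed onto a target
    elif on_target_after < on_target_before:
        reward += BOX_OFF_TARGET  # a box was pushed off a target
    if on_target_after == total_boxes:
        reward += ALL_SOLVED      # every box is on a target
    return reward


# A plain move costs the step penalty; finishing a two-box level
# adds the box bonus and the solve bonus on top of it.
print(step_reward(0, 0, 2))
print(step_reward(1, 2, 2))
```

Under these assumptions the per-step penalty is always paid, so shorter solutions score higher.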
PPO Agent
Status: idle
Episode: 0
Avg return (last 20): —
Best return: —
Policy entropy: —
Solve rate (last 20): —
Training return per episode
Policy π(a | current state)
Actions: ↑ push, ↓ push, ← push, → push, ↑, ↓, ←, →
The agent is a small MLP policy (two hidden layers) trained live in your browser with PPO clipped-surrogate updates. Training first bootstraps the policy by curriculum imitation of an A* teacher, then refines it with pure self-play. The imitation stage keeps in-browser training tractable (seconds, not hours).
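For readers unfamiliar with the clipped-surrogate update, here is a minimal sketch of the objective in plain Python. It is an illustration under stated assumptions, not the page's implementation: the function name `clipped_surrogate`, its arguments, and the clip range of 0.2 are all assumed, and advantage estimation and the gradient step are left out.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped-surrogate objective over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under
    the current policy and the policy that collected the data.
    clip_eps=0.2 is an assumed value, not taken from the lesson.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the raw and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# With an unchanged policy the ratio is 1, so the objective is just
# the mean advantage.
print(clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

The clipping removes the incentive to move the probability ratio far from 1 in a single update, which is what keeps small, frequent updates stable.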
Generalist (Transfer)
A single policy is trained on N source levels, then evaluated zero-shot on unseen levels. It uses a 7×7 egocentric observation (a player-centered window), so the same policy works on any grid size. D₄ symmetry augmentation (the four rotations and four reflections of the window) yields 8× more training data at no extra cost.
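The player-centered window can be sketched as follows. This is a hypothetical helper, not the lesson's code: the character-grid encoding (`#` wall, `@` player, `$` box, `.` target) and the choice to pad out-of-bounds cells with walls are both assumptions.

```python
WALL = "#"  # assumed padding symbol for cells outside the level


def egocentric_window(grid, player_row, player_col, size=7):
    """Crop a size x size window centered on the player.

    grid is a list of equal-length strings. Cells that fall outside
    the level are filled with walls (an assumption), so the output
    shape does not depend on the level's dimensions.
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    window = []
    for r in range(player_row - half, player_row + half + 1):
        line = []
        for c in range(player_col - half, player_col + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(grid[r][c] if inside else WALL)
        window.append("".join(line))
    return window


level = ["#####",
         "#@$.#",
         "#####"]
for line in egocentric_window(level, 1, 1):
    print(line)
```

Because the player always sits at the center cell, the policy input has a fixed size of 7×7 regardless of how large the level is.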
Status: untrained
Source levels: 0
Training samples: —
Train solve rate: —
Test solve rate: —
Train Generalist: generates N source puzzles, runs A* on each, and performs imitation learning with D₄ augmentation.
Test Zero-Shot: generates fresh unseen puzzles and reports the greedy solve rate.
Apply to Current: plays the generalist policy on whatever level is loaded on the board.
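The D₄ augmentation step can be illustrated with a short sketch. It assumes each training sample is a square observation (a list of strings) paired with the teacher's action, and that the action label must be transformed together with the observation. The names `d4_augment`, `rotate_cw`, and the action strings are hypothetical.

```python
# Actions as (row delta, column delta); names are illustrative.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
VEC_TO_ACTION = {vec: name for name, vec in ACTIONS.items()}


def rotate_cw(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return ["".join(col) for col in zip(*grid[::-1])]


def flip_lr(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def rotate_action_cw(action):
    dr, dc = ACTIONS[action]
    return VEC_TO_ACTION[(dc, -dr)]  # up -> right -> down -> left


def flip_action_lr(action):
    dr, dc = ACTIONS[action]
    return VEC_TO_ACTION[(dr, -dc)]  # left <-> right


def d4_augment(grid, action):
    """Return the 8 symmetric copies of one (observation, action) pair."""
    samples = []
    g, a = grid, action
    for _ in range(4):
        samples.append((g, a))                            # rotation
        samples.append((flip_lr(g), flip_action_lr(a)))   # its mirror image
        g, a = rotate_cw(g), rotate_action_cw(a)
    return samples


obs = ["   ",
       " @$",
       "   "]
for g, a in d4_augment(obs, "right"):
    print(a, g)
```

In every augmented copy the labeled action still points from the player toward the box, which is the property that makes the extra samples valid demonstrations.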