A playable lesson inspired by gym-sokoban. Manual mode · PPO Agent mode.
Last Reward
0.0
Total Score
0.0
Steps
0
Manual mode. Use WASD or ↑↓←→ to move / push.
R reset · U undo · N new level.
Push all orange boxes onto the yellow targets.
Legend
Wall
Target
Box
Box on Target
Player
Rewards (gym-sokoban)
Per step−0.1
Box onto target+1.0
Box off target−1.0
All boxes solved+10.0
PPO Agent
Statusidle
Episode0
Avg return (last 20)—
Best return—
Policy entropy—
Solve rate (last 20)—
Training return per episode
Policy π(a | current state)
↑ push↓ push← push→ push↑↓←→
PPO with a small MLP policy (two hidden layers) trained live in your browser via clipped-surrogate updates. Training uses curriculum imitation from an A* teacher to bootstrap, then pure self-play refinement — this keeps in-browser training tractable (seconds, not hours).