Sokoban RL

A playable lesson inspired by gym-sokoban. Manual mode · PPO Agent mode.

Last Reward: 0.0

Total Score: 0.0

Steps

Manual mode. Use WASD or ↑↓←→ to move / push. R reset · U undo · N new level. Push all orange boxes onto the yellow targets.

Legend

Wall

Target

Box

Box on Target

Player

Rewards (gym-sokoban)

Per step: −0.1

Box onto target: +1.0

Box off target: −1.0

All boxes solved: +10.0
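A minimal sketch of how the rewards listed above could combine into a single per-step reward. The function name step_reward and its arguments are hypothetical, and the rule of comparing the number of boxes on targets before and after the move is an assumption about the bookkeeping, not necessarily how this page or gym-sokoban computes it.

```python
# Reward constants copied from the table above.
STEP_PENALTY = -0.1
BOX_ON_TARGET = 1.0
BOX_OFF_TARGET = -1.0
ALL_SOLVED = 10.0


def step_reward(on_target_before, on_target_after, total_boxes):
    """Reward for one move (hypothetical helper).

    on_target_before / on_target_after: number of boxes sitting on
    targets before and after the move. total_boxes: boxes in the level.
    """
    reward = STEP_PENALTY  # every step costs a little
    if on_target_after > on_target_before:
        reward += BOX_ON_TARGET   # a box was pushed onto a target
    elif on_target_after < on_target_before:
        reward += BOX_OFF_TARGET  # a box was pushed off a target
    if on_target_after == total_boxes:
        reward += ALL_SOLVED      # level finished
    return reward
```

Under these assumptions a plain move costs 0.1, and the push that places the last box earns the box bonus and the solve bonus on the same step.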

PPO Agent

Status: idle

Episode: 0

Avg return (last 20): —

Best return: —

Policy entropy: —

Solve rate (last 20): —

Training return per episode

Policy π(a | current state)

Actions: ↑ push, ↓ push, ← push, → push, ↑, ↓, ←, →


The agent is PPO with a small MLP policy (two hidden layers), trained live in your browser via clipped-surrogate updates. Training bootstraps with curriculum imitation from an A* teacher, then switches to pure self-play refinement; this keeps in-browser training tractable (seconds, not hours).
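For readers new to the clipped-surrogate update, here is a minimal sketch of the standard PPO policy loss in plain Python. The function names and the clip_eps value of 0.2 are assumptions for illustration; the page does not state its clipping range, and the in-browser trainer is not written in Python.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped objective for one (state, action) sample.

    new_logp / old_logp: log-probability of the taken action under the
    current policy and under the policy that collected the data.
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_policy_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Negative mean clipped objective over a batch (to be minimized)."""
    terms = [
        clipped_surrogate(n, o, a, clip_eps)
        for n, o, a in zip(new_logps, old_logps, advantages)
    ]
    return -sum(terms) / len(terms)
```

The clipping removes the incentive to move the probability ratio outside [1 − clip_eps, 1 + clip_eps], which keeps each update close to the policy that generated the data.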

Generalist (Transfer)

A single policy is trained on N source levels, then evaluated zero-shot on unseen levels. It uses a 7×7 egocentric observation (a player-centered window), so the same policy works on any grid size. D₄ symmetry augmentation (the 8 rotations and reflections of the square) gives 8× more training data for free.
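A minimal sketch of the player-centered window, assuming the level is stored as a list of equal-length strings and that cells outside the grid are treated as walls. The character encoding ('#' wall, '@' player, '$' box, '.' target) and the function name egocentric_window are hypothetical; the page does not say how it pads or encodes the observation.

```python
WALL = "#"  # assumed padding for cells outside the level


def egocentric_window(grid, player_row, player_col, size=7):
    """Return a size×size window of the grid centered on the player."""
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    window = []
    for dr in range(-half, half + 1):
        cells = []
        for dc in range(-half, half + 1):
            r, c = player_row + dr, player_col + dc
            if 0 <= r < rows and 0 <= c < cols:
                cells.append(grid[r][c])
            else:
                cells.append(WALL)  # out of bounds: pad with wall
        window.append("".join(cells))
    return window


# Usage: a tiny 3×5 level still yields a 7×7 observation.
level = ["#####",
         "#@$.#",
         "#####"]
obs = egocentric_window(level, 1, 1)
```

Because the window always has the same shape and the player is always at its center, the policy input does not depend on the size of the level.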

Status: untrained

Source levels: 0

Training samples: —

Train solve rate: —

Test solve rate: —


Train Generalist: generates N source puzzles, runs A* on each, and does imitation learning with D₄ augmentation.

Test Zero-Shot: generates fresh unseen puzzles and reports the greedy solve rate.

Apply to Current: plays the generalist policy on whatever level is loaded on the board.
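A minimal sketch of D₄ augmentation for one imitation sample, assuming a square observation window and a hypothetical action encoding (0 up, 1 right, 2 down, 3 left). Each of the 8 symmetries transforms the window and remaps the teacher's action so the pair stays consistent. The names here are illustrative, not the page's own code.

```python
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3  # assumed action encoding

# Mirroring left-right swaps LEFT and RIGHT, keeps UP and DOWN.
FLIP_ACTION = {UP: UP, RIGHT: LEFT, DOWN: DOWN, LEFT: RIGHT}


def rotate_cw(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def flip_lr(grid):
    """Mirror a grid left-to-right."""
    return [list(row[::-1]) for row in grid]


def d4_augment(window, action):
    """Return the 8 (window, action) pairs under the D4 symmetries."""
    samples = []
    g, a = [list(row) for row in window], action
    for _ in range(4):
        samples.append((g, a))
        samples.append((flip_lr(g), FLIP_ACTION[a]))
        # A clockwise turn maps up->right->down->left->up.
        g, a = rotate_cw(g), (a + 1) % 4
    return samples


# Usage: 'X' marks the cell the teacher moves toward (UP from 'P').
pairs = d4_augment(["aXb", "cPd", "efg"], UP)
```

Each A* demonstration step therefore yields 8 training pairs, which is where the 8× figure comes from.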

Event Log

✓ Solved!