HW3: DQN and Its Variants

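Building on the Gridworld environment from Chapter 3 of Deep Reinforcement Learning in Action, this homework implements DQN and its successive improvements in three stages and compares them.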

GitHub source code · PyTorch 2.5 + Lightning 2.6 · Fully reproducible on Windows / CPU

HW3-1 · Naive DQN + Experience Replay (static mode)

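Improvement	Key idea
Double DQN	The online network selects the action and the target network evaluates it ⇒ mitigates the overestimation bias introduced by the max operator.
Dueling DQN	Splits the head into two streams, V(s) and A(s, a) ⇒ learns "how good this state is in itself" correctly, even when the Q-values of different actions differ only slightly.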
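A minimal NumPy sketch of the two ideas in the table, operating on precomputed Q-value arrays rather than real networks. The function names, the discount `gamma=0.9`, and the mean-subtracted dueling aggregation (the identifiability trick used in the original Dueling DQN paper) are assumptions for illustration, not necessarily what this homework's code does.

```python
import numpy as np


def vanilla_dqn_target(rewards, next_q_target, dones, gamma=0.9):
    # Vanilla DQN: the target network both selects and evaluates the next
    # action, so taking max() over noisy estimates is biased upward.
    # gamma=0.9 is an assumed default.
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)


def double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.9):
    # Double DQN: the online network picks argmax_a' Q_online(s', a'),
    # and the target network supplies the value of that chosen action.
    best_actions = next_q_online.argmax(axis=1)
    chosen = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * chosen


def dueling_q(value, advantage):
    # Dueling aggregation (assumed mean-subtracted form):
    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
    # value: shape (batch,), advantage: shape (batch, n_actions)
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)


if __name__ == "__main__":
    # Toy batch of two transitions with 4 actions; the second one is terminal.
    rewards = np.array([-1.0, 10.0])
    dones = np.array([0.0, 1.0])
    next_q_online = np.array([[1.0, 3.0, 2.0, 0.0], [0.5, 0.2, 0.1, 0.0]])
    next_q_target = np.array([[2.0, 1.0, 4.0, 0.0], [9.0, 9.0, 9.0, 9.0]])
    print("vanilla:", vanilla_dqn_target(rewards, next_q_target, dones))
    print("double: ", double_dqn_target(rewards, next_q_online, next_q_target, dones))
    print("dueling:", dueling_q(np.array([1.0]), np.array([[1.0, 2.0, 3.0, 2.0]])))
```

Because the value the target network assigns to the online network's choice can never exceed the target network's own maximum, the Double DQN target is always less than or equal to the vanilla one; with real networks this is what curbs the upward bias.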

Experience Replay has little effect on the static 4×4 Gridworld (the environment is too simple), but it significantly speeds up convergence in player / random mode.
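A minimal experience replay buffer in plain Python, assuming each transition is stored as a `(state, action, reward, next_state, done)` tuple. The class name, the `capacity` default of 1000, and the optional `seed` argument are hypothetical choices for this sketch.

```python
import random
from collections import deque, namedtuple

# One stored transition; the field names are illustrative.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size FIFO memory: once full, the oldest transitions are evicted first."""

    def __init__(self, capacity=1000, seed=None):
        # capacity=1000 is an assumed value, not necessarily the homework's.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch without replacement; this breaks the
        # correlation between consecutive steps of the same episode.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

In a training loop one would typically `push` a transition on every step and only start calling `sample(batch_size)` once `len(buffer) >= batch_size`.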
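In player mode, Vanilla / Double / Dueling DQN perform about the same once converged (the ceiling is 100%); they differ mainly in stability during mid-training. The advantages of Dueling and Double DQN only become clearly visible in random mode.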
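Lightning by itself does not make the agent learn better, but once training is wrapped in it, trainer-level settings (gradient clipping, LR schedule, soft update) can all be attached at once, and together they noticeably stabilize training in random mode.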
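Two of these trainer-level settings reduce to simple update rules. Below is a NumPy sketch of them, operating on dicts of parameter or gradient arrays (an assumption; real code would use PyTorch tensors). `tau=0.005` and `max_norm=1.0` are assumed values, not the homework's. In Lightning 2.x, norm-based clipping can be requested with `Trainer(gradient_clip_val=..., gradient_clip_algorithm="norm")`, whereas a soft target update usually has to be written by hand, for example inside a LightningModule hook.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    # Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target.
    # The target network drifts slowly every step instead of being hard-copied
    # every N steps. Mutates target_params in place and also returns it.
    for name, online in online_params.items():
        target_params[name] = tau * online + (1.0 - tau) * target_params[name]
    return target_params


def clip_grad_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together so their combined L2 norm is at most
    # max_norm (the same rule as torch.nn.utils.clip_grad_norm_).
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads.values()))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = {name: g * scale for name, g in grads.items()}
    return clipped, total_norm
```

Clipping the global norm, rather than each tensor separately, preserves the direction of the overall update and only shrinks its length, which is why it helps against the occasional large TD-error spike.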