An autonomous Connect 4 agent. This project features a Dueling Double Deep Q-Network (D3QN) architecture trained through Self-Play, with a Minimax baseline for performance benchmarking.
- Deep RL Architecture: Implements a Dueling Double DQN with Convolutional layers (CNN) to capture spatial patterns in the 6x7 grid.
- Self-Play Training: The agent learns by competing against historical versions of itself, ensuring continuous policy improvement and robustness.
- Vectorized Environment: Custom Gymnasium environment that uses 2D convolutions for fast, fully vectorized win detection and reward shaping (see the sketch after this list).
- Symmetry-Based Data Augmentation: Experience Replay buffer automatically mirrors transitions (horizontal flip) to double training data and improve generalization.
- Minimax Baseline: Includes a Minimax agent with Alpha-Beta pruning for performance validation and benchmarking.
- Interactive Mode: Play against the trained AI directly in the terminal with a colored UI.
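To illustrate the convolution-based win check, here is a minimal sketch (not the project's exact implementation) that slides four line-detector kernels over a 6x7 NumPy board:

```python
import numpy as np
from scipy.signal import convolve2d

# Four line detectors: horizontal, vertical, and both diagonals.
KERNELS = [
    np.ones((1, 4)),       # horizontal
    np.ones((4, 1)),       # vertical
    np.eye(4),             # diagonal \
    np.fliplr(np.eye(4)),  # diagonal /
]

def has_won(board: np.ndarray, player: int) -> bool:
    """Return True if `player` has four connected pieces on a 6x7 board."""
    mask = (board == player).astype(np.int8)
    # A convolution output of 4 means all four cells under the kernel match.
    return any((convolve2d(mask, k, mode="valid") == 4).any() for k in KERNELS)
```

Because the check is a handful of array operations rather than a Python loop over cells, it stays cheap even when called after every move during training.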
The agent uses a Dueling DQN structure to decouple state-value estimation from action-advantage estimation. This is particularly effective in Connect 4, where many actions in a given state do not significantly change the outcome.
The network is structured as follows; the two streams are recombined into Q-values using the dueling aggregation shown after the list:
- Input: 6x7 game board (1x6x7 tensor).
- Backbone: 3 Convolutional layers (64 to 128 filters) with Batch Normalization and LeakyReLU activation.
- Heads: Two separate fully connected streams for State Value ($V$) and Action Advantage ($A$).
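The streams are recombined with the standard mean-centered dueling aggregation:

$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\right)$$

A minimal PyTorch sketch of this architecture follows; the hidden sizes and kernel shapes are illustrative assumptions, not the exact values used in training:

```python
import torch
import torch.nn as nn

class DuelingD3QN(nn.Module):
    """CNN backbone with separate value and advantage heads."""

    def __init__(self, n_actions: int = 7):
        super().__init__()
        # Backbone: 3 conv layers (64 -> 128 filters), BatchNorm + LeakyReLU.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(),
            nn.Flatten(),
        )
        hidden = 128 * 6 * 7  # flattened feature size for a 6x7 board
        self.value = nn.Sequential(nn.Linear(hidden, 256), nn.LeakyReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(hidden, 256), nn.LeakyReLU(), nn.Linear(256, n_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 1, 6, 7)
        h = self.backbone(x)
        v = self.value(h)      # (batch, 1)
        a = self.advantage(h)  # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # mean-centered aggregation
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to $A$ and subtracting it from $V$ would otherwise leave $Q$ unchanged.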
To avoid overfitting to a single opponent strategy, the training script implements Self-Play (a minimal sketch follows the list):
- Opponent Selection: In 20% of episodes, the agent faces a randomly selected "past version" of itself stored in the history directory.
- Symmetry Invariance: The Replay Memory stores both the original move and its horizontal mirror, teaching the agent that the game is axially symmetrical.
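The sketch below illustrates both mechanics, assuming a history/ directory of .pt snapshots and NumPy board states; load_agent is a hypothetical loader, and the 20% rate matches the description above:

```python
import glob
import random
import numpy as np

PAST_OPPONENT_RATE = 0.2  # 20% of episodes face a historical snapshot

def pick_opponent(current_agent, history_dir="history"):
    """Choose the sparring partner for the next episode."""
    snapshots = glob.glob(f"{history_dir}/*.pt")
    if snapshots and random.random() < PAST_OPPONENT_RATE:
        return load_agent(random.choice(snapshots))  # hypothetical loader
    return current_agent  # otherwise the agent mirror-matches its live self

def mirrored(state, action, reward, next_state, done, n_cols=7):
    """Horizontal mirror of a transition: flip columns, remap the action."""
    flip = lambda s: np.flip(s, axis=-1).copy()  # works for 6x7 or 1x6x7 arrays
    return flip(state), n_cols - 1 - action, reward, flip(next_state), done
```

Both the original transition and its mirrored counterpart are pushed to the replay buffer, doubling the effective data collected per step.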
To accelerate convergence, the environment provides a composite reward signal:
- Terminal States: Win (+1.0), Loss (-1.0), Draw (0.0).
- Intermediate Heuristics: Small rewards for controlling the center column and creating "3-in-a-row" threats, with penalties for failing to block opponent threats.
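A minimal sketch of how such a composite signal can be assembled; the shaping magnitudes below are assumed for illustration (the actual values belong in config/hyperparameters.yml):

```python
# Terminal rewards from the list above; shaping coefficients are assumptions.
WIN, LOSS, DRAW = 1.0, -1.0, 0.0
CENTER_BONUS = 0.05   # assumed: played the center column
THREAT_BONUS = 0.10   # assumed: per new 3-in-a-row threat created
MISSED_BLOCK = -0.20  # assumed: left an opponent threat unanswered

def shaped_reward(outcome, played_center, threats_created, missed_block):
    """Composite reward: terminal result, else a sum of small heuristics."""
    if outcome is not None:  # "win" / "loss" / "draw"
        return {"win": WIN, "loss": LOSS, "draw": DRAW}[outcome]
    reward = CENTER_BONUS if played_center else 0.0
    reward += THREAT_BONUS * threats_created
    if missed_block:
        reward += MISSED_BLOCK
    return reward
```

Keeping the heuristic terms small relative to the terminal rewards ensures shaping accelerates learning without overriding the true objective of winning the game.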
Install dependencies:
```bash
pip install -r requirements.txt
```
Start the training process using a hyperparameter set defined in config/hyperparameters.yml:
```bash
python train.py 4MasterDuelingD3QN
```

Visualize training metrics (win rate, loss, rewards) in your browser at http://localhost:6006:

```bash
tensorboard --logdir=runs/logs
```

Evaluate the performance of a trained model against the classical Minimax baseline:

```bash
python play_minimax.py runs/4MasterDuelingD3QN.pt -d 2 -n 20
```

Watch a few games between the AI and Minimax to analyze the agent's strategy:

```bash
python play_minimax.py runs/4MasterDuelingD3QN.pt -d 2 -n 2 --render
```

Play against your trained model directly in the terminal (you start first):

```bash
python play_human.py runs/4MasterDuelingD3QN.pt
```

Play against the model where the AI takes the first move:

```bash
python play_human.py runs/4MasterDuelingD3QN.pt --first model
```

Evaluate the performance of a trained model against another model:

```bash
python compare_model.py runs/model_v1.pt runs/model_v2.pt -n 100
```

Watch a few games between two models:

```bash
python compare_model.py runs/model_v1.pt runs/model_v2.pt -n 2 --render
```