Q-Learning is a model-free, value-based reinforcement learning algorithm that learns the quality (Q-value) of actions in each state. The agent explores the environment and updates Q-values using the Bellman equation.
Live training stats: Episode · Total Reward · Steps · Success Rate
Grid World Environment
Legend: 🤖 Agent · 🎯 Goal · 💀 Danger · ⬛ Wall
Training Progress
// Q-Learning Update Rule
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
where:
α = learning rate
γ = discount factor
r = reward received after taking action a in state s
s = current state
s' = next state
a' = candidate next action (maximized over)
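The update rule above can be sketched as a tabular Q-Learning step. This is a minimal illustration, not the demo's actual implementation; the grid size, learning rate, and the sample transition are hypothetical.

```python
import numpy as np

# Hypothetical 4x4 grid world flattened to 16 states, 4 actions (up/down/left/right).
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.9          # learning rate, discount factor

Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """One Q-Learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: from state 0, take action 1, receive reward -1, land in state 4.
q_update(s=0, a=1, r=-1.0, s_next=4, done=False)
```

Because the table starts at zero, this first update moves Q(0, 1) toward α·r = -0.1; repeated episodes propagate goal rewards backward through the grid.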
🎲 Policy Gradient Demo
What is Policy Gradient?
Policy Gradient methods directly optimize the policy function π(a|s) that maps states to action probabilities. Instead of learning value functions, we learn a parameterized policy and update it using gradient ascent.
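A bare-bones version of this idea is REINFORCE with a softmax policy over per-state logits. The state/action counts and the sample episode below are hypothetical; the update implements θ ← θ + lr · G_t · ∇ log π(a_t|s_t).

```python
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # policy parameters: one logit per state-action

def policy(s):
    """pi(a|s): softmax over the logits for state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(episode, lr=0.01, gamma=0.99):
    """REINFORCE: gradient ascent on expected return using the log-derivative trick."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                 # discounted return from step t
        grad_log = -policy(s)             # d log pi(a|s) / d logits = onehot(a) - pi
        grad_log[a] += 1.0
        theta[s] += lr * G * grad_log

# One hypothetical episode of (state, action, reward) triples:
reinforce_update([(0, 1, 0.0), (2, 0, 1.0)])
```

Actions that preceded a positive return have their logits pushed up, so the policy shifts probability mass toward them without ever learning a value function.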
The reward function is crucial for RL success. Experiment with different reward structures to guide agent behavior effectively. Good reward shaping accelerates learning without changing the optimal policy.
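One shaping scheme known to preserve the optimal policy is potential-based reward shaping (Ng, Harada & Russell, 1999): r' = r + γΦ(s') − Φ(s). The distance-based potential below is a hypothetical choice for a grid world with a goal cell.

```python
# Potential-based reward shaping: adding gamma*Phi(s') - Phi(s) to the reward
# provably leaves the optimal policy unchanged while densifying feedback.
gamma = 0.9

def potential(state, goal=(3, 3)):
    """Hypothetical potential: negative Manhattan distance to the goal cell."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(r, s, s_next):
    """Augment the sparse environment reward with the shaping term."""
    return r + gamma * potential(s_next) - potential(s)

# Moving one step closer to the goal yields a positive bonus,
# moving away yields a penalty, even when the raw reward r is zero.
closer = shaped_reward(0.0, (0, 0), (0, 1))
away = shaped_reward(0.0, (0, 1), (0, 0))
```

The agent now gets a learning signal on every step instead of only at the goal, which is exactly the acceleration the text describes.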
Train multiple agents to cooperate or compete. Explore emergent behaviors, communication protocols, and coordination strategies in complex multi-agent systems.
Live training stats: Episode · Team Reward · Cooperation Score · Communication Rate
Multi-Agent Environment
Legend: 🔵 Agent 1 · 🟢 Agent 2 · 🟡 Agent 3 · 🔴 Target
Agent Performance
// Multi-Agent RL Architectures

// 1. Independent Q-Learning
for agent in agents:
    Q_agent(s, a) ← update using agent's local observation

// 2. Centralized Training, Decentralized Execution (CTDE)
// Training: use the global state
Q(s_global, a₁, a₂, ..., aₙ)
// Execution: each agent acts on its local observation
π_i(a_i | o_i) for each agent i

// 3. QMIX (Value Decomposition)
Q_total = f(Q₁, Q₂, ..., Qₙ)
where f is a monotonic mixing function (∂Q_total/∂Q_i ≥ 0)
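The simplest of the three architectures, Independent Q-Learning, can be sketched concretely: each agent keeps its own Q-table over its local observations and treats the other agents as part of the environment. The agent count, observation space, and sample transition are hypothetical.

```python
import numpy as np

# Independent Q-Learning: one Q-table per agent, updated only from that
# agent's local (o, a, r, o') experience. Non-stationarity caused by the
# other agents' learning is simply ignored.
n_agents, n_obs, n_actions = 3, 9, 4
alpha, gamma = 0.1, 0.95

Q = [np.zeros((n_obs, n_actions)) for _ in range(n_agents)]

def independent_q_step(obs, actions, rewards, next_obs):
    """One standard Q-Learning update per agent from its own experience."""
    for i in range(n_agents):
        o, a, r, o2 = obs[i], actions[i], rewards[i], next_obs[i]
        Q[i][o, a] += alpha * (r + gamma * Q[i][o2].max() - Q[i][o, a])

# One hypothetical joint step in which every agent receives the shared team reward:
independent_q_step(obs=[0, 1, 2], actions=[1, 0, 3],
                   rewards=[1.0, 1.0, 1.0], next_obs=[3, 4, 5])
```

CTDE and QMIX address exactly the weakness this sketch exposes: because each table is updated in isolation, the agents cannot credit the team reward to any particular agent's action.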