🎯 Q-Learning Demo

What is Q-Learning?

Q-Learning is a model-free, value-based reinforcement learning algorithm that learns the quality (Q-value) of each action in each state. The agent explores the environment and nudges its Q-values toward the Bellman optimality target after every step.

Live metrics: Episode · Total Reward · Steps · Success Rate

Grid World Environment

🤖 Agent · 🎯 Goal · 💀 Danger · ⬛ Wall

Training Progress

// Q-Learning Update Rule
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]

where:
  α = learning rate
  γ = discount factor
  r = reward
  s = current state
  s' = next state
  a = action taken
  a' = candidate next action
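As a concrete (if toy) illustration, the update rule above can be run as tabular Q-learning on a small grid world. The 4×4 grid, reward values, and hyperparameters below are invented for this sketch, not values used by the demo:

```python
import random

random.seed(0)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1   # learning rate, discount, exploration
SIZE, GOAL = 4, (3, 3)              # 4x4 grid, goal in the far corner
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

Q = {}  # Q[(state, action_index)] -> value, defaults to 0
def q(s, a):
    return Q.get((s, a), 0.0)

def step(s, a):
    dr, dc = ACTIONS[a]
    row = min(max(s[0] + dr, 0), SIZE - 1)
    col = min(max(s[1] + dc, 0), SIZE - 1)
    s2 = (row, col)
    return s2, (10.0 if s2 == GOAL else -1.0), s2 == GOAL

for episode in range(500):
    s, done = (0, 0), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda x: q(s, x))
        s2, r, done = step(s, a)
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = 0.0 if done else max(q(s2, x) for x in range(4))
        Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * best_next - q(s, a))
        s = s2

# After training, the greedy policy walks from the start to the goal
s, steps = (0, 0), 0
while s != GOAL and steps < 20:
    s, _, _ = step(s, max(range(4), key=lambda x: q(s, x)))
    steps += 1
print("reached goal:", s == GOAL)
```

The step penalty of -1 makes shorter paths strictly better, so the greedy policy that emerges heads straight for the goal.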

🎲 Policy Gradient Demo

What is Policy Gradient?

Policy Gradient methods directly optimize the policy function π(a|s) that maps states to action probabilities. Instead of learning value functions, we learn a parameterized policy and update it using gradient ascent.

Live metrics: Episode · Average Return · Policy Loss · Entropy

CartPole Environment

Balance the pole as long as possible!

Policy Performance

// REINFORCE Algorithm
∇J(θ) = E[∑ₜ ∇ log π(aₜ|sₜ,θ) Gₜ]
θ ← θ + α∇J(θ)

// PPO Objective
L(θ) = E[min(rₜ(θ)Âₜ, clip(rₜ(θ), 1-ε, 1+ε)Âₜ)]

where:
  θ = policy parameters
  Gₜ = return from time t
  Âₜ = advantage estimate
  rₜ(θ) = π(aₜ|sₜ,θ) / π(aₜ|sₜ,θ_old)  (probability ratio)
  ε = clip range
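The REINFORCE update above can be sketched on a toy two-armed bandit, where episodes are one step long so Gₜ is just the immediate reward. The arm payoff probabilities and hyperparameters are invented for this sketch:

```python
import math
import random

random.seed(1)
theta = [0.0, 0.0]        # one logit per action; pi = softmax(theta)
ALPHA = 0.1               # learning rate
TRUE_MEANS = [0.2, 0.8]   # made-up payoff probabilities; arm 1 is better

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    G = 1.0 if random.random() < TRUE_MEANS[a] else 0.0  # return = reward here
    # Gradient of log softmax: d/d theta_k log pi(a) = 1[k == a] - pi(k)
    for k in range(2):
        theta[k] += ALPHA * G * ((1.0 if k == a else 0.0) - probs[k])

print("P(better arm):", round(softmax(theta)[1], 3))
```

Because rewarded actions have their logits pushed up (gradient ascent on J), the policy's probability mass drifts toward the higher-paying arm over training.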

🌍 Environment Sandbox

Custom Environment Design

Create your own reinforcement learning environments! Define states, actions, transitions, and rewards, all compatible with the OpenAI Gym interface.

Environment Configuration

Actions

Click grid cells to:
• Set start position
• Place obstacles
• Add danger zones
• Set goal location

Environment Editor

# Custom Gym Environment (classic Gym API: step returns (obs, reward, done, info))
import gym
import numpy as np
from gym import spaces

class CustomGridWorld(gym.Env):
    def __init__(self):
        super().__init__()
        self.grid_size = 8
        self.goal = np.array([self.grid_size - 1, self.grid_size - 1])
        self.action_space = spaces.Discrete(4)  # 0=up, 1=down, 2=left, 3=right
        self.observation_space = spaces.Box(
            low=0, high=self.grid_size - 1,
            shape=(2,), dtype=np.int32
        )

    def step(self, action):
        # Execute action and return (obs, reward, done, info)
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.pos = np.clip(
            self.pos + moves[action], 0, self.grid_size - 1
        ).astype(np.int32)
        done = bool((self.pos == self.goal).all())
        reward = 100.0 if done else -1.0  # step penalty rewards short paths
        return self.pos.copy(), reward, done, {}

    def reset(self):
        # Reset the agent to the top-left corner and return the first observation
        self.pos = np.zeros(2, dtype=np.int32)
        return self.pos.copy()

⭐ Reward Shaping

Reward Function Design

The reward function is crucial for RL success. Experiment with different reward structures to guide agent behavior effectively. Good reward shaping accelerates learning without changing the optimal policy.

Reward Components

Shaping Strategy

Metrics compared: Learning Speed · Sample Efficiency · Convergence

Reward Comparison

// Reward Shaping Examples

// 1. Sparse Rewards
reward = 100 if goal_reached else 0

// 2. Dense Rewards with Distance
reward = -distance_to_goal - step_penalty

// 3. Potential-Based Shaping
Φ(s) = -distance_to_goal(s)
shaped_reward = reward + γΦ(s') - Φ(s)

// 4. Curiosity-Driven
intrinsic_reward = prediction_error(s, a, s')
total_reward = extrinsic_reward + β * intrinsic_reward
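Example 3 above (potential-based shaping) can be made concrete with Φ(s) = -distance_to_goal(s). The grid coordinates and γ below are arbitrary choices for the sketch:

```python
GAMMA = 0.99
GOAL = (7, 7)

def distance_to_goal(s):
    # Manhattan distance, negated to serve as the potential
    return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])

def phi(s):
    return -distance_to_goal(s)

def shaped_reward(reward, s, s2):
    # F(s, s') = gamma * Phi(s') - Phi(s); shaping of this potential-based
    # form leaves the optimal policy unchanged
    return reward + GAMMA * phi(s2) - phi(s)

# Same base reward (-1 per step), but stepping toward the goal now scores
# higher than stepping away from it
print("toward:", shaped_reward(-1.0, (2, 2), (2, 3)))
print("away:  ", shaped_reward(-1.0, (2, 2), (2, 1)))
```

The shaping terms telescope along a trajectory, which is why the dense guidance signal speeds up learning without changing which policy is optimal.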

🤝 Multi-Agent RL

Multi-Agent Reinforcement Learning

Train multiple agents to cooperate or compete. Explore emergent behaviors, communication protocols, and coordination strategies in complex multi-agent systems.

Live metrics: Episode · Team Reward · Cooperation Score · Communication Rate

Multi-Agent Environment

🔵 Agent 1 · 🟢 Agent 2 · 🟡 Agent 3 · 🔴 Target

Agent Performance

// Multi-Agent RL Architectures

// 1. Independent Q-Learning
for agent in agents:
    Q_agent(s, a) ← update using agent's local observation

// 2. Centralized Training, Decentralized Execution (CTDE)
// Training: Use global state
Q(s_global, a₁, a₂, ..., aₙ)

// Execution: Use local observations
π_i(a_i | o_i) for each agent i

// 3. QMIX (Value Decomposition)
Q_total = f(Q₁, Q₂, ..., Qₙ)
where f is a monotonic mixing function
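A minimal sketch of architecture 1 (independent Q-learning): two agents on a made-up 1-D corridor, each learning its own Q-table from its own local observation. The agents here do not interact, which sidesteps the non-stationarity that makes real multi-agent learning hard; the corridor, targets, and hyperparameters are invented for illustration:

```python
import random

random.seed(2)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2    # learning rate, discount, exploration
LINE = 5                             # corridor cells 0..4
TARGETS = [4, 0]                     # agent 0 heads right, agent 1 heads left
ACTIONS = [-1, +1]                   # move left / move right

# Independent Q-learning: one Q-table per agent, keyed on its LOCAL observation
Q = [{}, {}]
def q(i, s, a):
    return Q[i].get((s, a), 0.0)

def step(i, s, a):
    s2 = min(max(s + ACTIONS[a], 0), LINE - 1)
    done = s2 == TARGETS[i]
    return s2, (10.0 if done else -1.0), done

for episode in range(300):
    states, done = [0, LINE - 1], [False, False]
    for _ in range(20):
        for i in range(2):
            if done[i]:
                continue
            s = states[i]
            if random.random() < EPS:
                a = random.randrange(2)
            else:
                a = max(range(2), key=lambda x: q(i, s, x))
            s2, r, d = step(i, s, a)
            # Each agent runs the ordinary Q-learning update on its own table
            best_next = 0.0 if d else max(q(i, s2, x) for x in range(2))
            Q[i][(s, a)] = q(i, s, a) + ALPHA * (r + GAMMA * best_next - q(i, s, a))
            states[i], done[i] = s2, d
        if all(done):
            break

# Each agent's greedy policy now heads toward its own target from the middle
for i in range(2):
    direction = "right" if max(range(2), key=lambda x: q(i, 2, x)) == 1 else "left"
    print(f"agent {i} at cell 2 moves {direction}")
```

CTDE and QMIX extend this by conditioning training on the global state (or a monotonic mixture of per-agent values) while keeping execution decentralized.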