Q-Learning Algorithm Demo

Interactive implementation of the Q-Learning reinforcement learning algorithm with visualization

Q-Learning Interactive Demo
[Interactive demo: environment and Q-table visualizations, with environment and speed controls and live statistics for episode, steps, average reward, and success rate.]
How Q-Learning Works
Key Features of Q-Learning
  • Model-Free: Doesn't need a model of the environment
  • Q-Table: Maintains a table of state-action values
  • Off-Policy: Learns the optimal policy independently of the exploration policy
  • Exploration vs Exploitation: Uses ε-greedy strategy
  • Convergence: Converges to the optimal policy, provided every state-action pair is visited sufficiently often and the learning rate decays appropriately
  • Discrete: Works best with discrete spaces
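The ε-greedy strategy mentioned above can be sketched in a few lines. This is a minimal illustration (the function name and the dict-based Q-table keyed by (state, action) are our own conventions, not the demo's code):

```python
import random

def epsilon_greedy(q_table, state, epsilon, n_actions):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit the highest-valued action for this state.
    q_table is a dict keyed by (state, action); missing entries count as 0."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore
    return max(range(n_actions), key=lambda a: q_table.get((state, a), 0.0))  # exploit
```

With ε = 0 this always returns the greedy action; with ε = 1 it is pure random exploration.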
The Q-Learning Algorithm
  1. Initialize Q-table with zeros (or small random values)
  2. For each episode:
    1. Initialize the starting state (s)
    2. Repeat until terminal state:
      1. Choose action (a) using ε-greedy strategy
      2. Take action (a), observe reward (r), and next state (s')
      3. Update Q(s, a) using the update rule
      4. Set state s ← s'
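The steps above can be sketched as a complete training loop. The `ChainEnv` below is a hypothetical toy environment (a 1-D chain of five states) invented purely so the sketch is self-contained; the demo's own environment will differ:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Hypothetical toy environment: states 0..4 in a line.
    Action 1 moves right, action 0 moves left; reward 1 on reaching state 4."""
    n_actions = 2

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(4, self.s + 1) if a == 1 else max(0, self.s - 1)
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def greedy(Q, s, n_actions):
    """Argmax over actions, breaking ties randomly."""
    qs = [Q[(s, x)] for x in range(n_actions)]
    best = max(qs)
    return random.choice([x for x, v in enumerate(qs) if v == best])

def train(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # step 1: Q-table of zeros
    for _ in range(episodes):                   # step 2: for each episode
        s = env.reset()                         # 2.1 starting state
        done = False
        while not done:                         # 2.2 until terminal
            if random.random() < epsilon:       # 2.2.1 epsilon-greedy choice
                a = random.randrange(env.n_actions)
            else:
                a = greedy(Q, s, env.n_actions)
            s2, r, done = env.step(a)           # 2.2.2 act, observe r and s'
            # 2.2.3 apply the update rule (best future value is 0 at terminal)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in range(env.n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2                              # 2.2.4 s <- s'
    return Q
```

After training on the chain, the learned Q-values favor moving right from the start state, which is the optimal policy for this toy task.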
Q-Value Update Rule

The core update formula is:

Q(s,a) ← Q(s,a) + α[r + γ · max_{a'} Q(s',a') − Q(s,a)]

Where:

  • Q(s,a): Current Q-value
  • α: Learning rate (how much new info overrides old info)
  • r: Immediate reward
  • γ: Discount factor (importance of future rewards)
  • max_{a'} Q(s',a'): Best Q-value over all actions in the next state
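One application of the rule with illustrative numbers (chosen for this example, not taken from the demo):

```python
alpha, gamma = 0.1, 0.9
q_sa = 0.5             # current Q(s, a)
r = 1.0                # immediate reward
max_q_next = 0.8       # max over a' of Q(s', a')

td_target = r + gamma * max_q_next   # 1.0 + 0.9 * 0.8 = 1.72
td_error = td_target - q_sa          # 1.72 - 0.5 = 1.22
q_sa = q_sa + alpha * td_error       # 0.5 + 0.1 * 1.22 = 0.622
```

Note how the learning rate α keeps the update small: the estimate moves only a tenth of the way toward the TD target.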
Frequently Asked Questions

What is Q-Learning used for?

Q-Learning is a reinforcement learning algorithm for solving Markov decision processes. It's commonly used in:
  • Game AI (e.g., simple board games)
  • Robot movement and navigation
  • Inventory optimization
  • Elevator dispatch systems
  • Any problem that can be modeled as a discrete state-action space

What are the limitations of Q-Learning?

While powerful, Q-Learning has some limitations:
  • Scales poorly to large or continuous state and action spaces (the Q-table becomes huge); for those, Deep Q-Learning (DQN) is typically used instead
  • Struggles with partially observable environments
  • Requires many episodes to learn a good policy

What are the key features of Q-Learning?

Q-Learning has several distinctive features:
  • Model-Free: Doesn't need a model of the environment (no need to know transition probabilities)
  • Off-Policy: Learns the optimal policy independently of the behavior policy the agent follows while exploring
  • Tabular Method: Uses a Q-table to store state-action values (unlike policy gradient methods)
  • Temporal Difference: Updates estimates based on other learned estimates (bootstrapping)
Compared to SARSA (another popular algorithm), Q-Learning always updates toward the best possible next action (off-policy), while SARSA updates toward the action its current policy actually takes (on-policy).
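The contrast with SARSA comes down to one term in the update. A side-by-side sketch (function names and the dict-based Q-table are our own illustrative conventions):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstraps from the best next action,
    regardless of what the agent actually does next."""
    target = r + gamma * max(Q.get((s2, x), 0.0) for x in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy: bootstraps from a2, the action the
    current policy actually selected in s'."""
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

If the policy's actual next action a2 is not the greedy one, the two updates diverge; this is why Q-Learning is "more aggressive" and SARSA tends to learn safer paths under exploration.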

How should I choose the hyperparameters?

Choosing hyperparameters is crucial for Q-Learning performance:
  • Learning Rate (α): Typically between 0.1 and 0.5. Higher values learn faster but may overshoot.
  • Discount Factor (γ): Between 0.9 and 0.99 for long-term planning, lower for short-term focus.
  • Exploration Rate (ε): Start high (0.9-1.0) and decay over time (to 0.01-0.1).
  • Decay Rates: Linear or exponential decay for ε and α often works well.
The best approach is to experiment with different values and observe the learning curves. Our demo lets you adjust these parameters to see their effects.
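An exponential decay schedule for ε, matching the demo's default decay factor of 0.995, might look like this (the function name and defaults are illustrative assumptions):

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponential decay: epsilon shrinks by a fixed factor each episode,
    floored at eps_min so the agent never stops exploring entirely."""
    return max(eps_min, eps_start * decay ** episode)
```

The floor eps_min matters: without it, exploration vanishes and the agent may never correct early mistakes in the Q-table.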
Current Parameters
  • Learning Rate (α) 0.1
  • Discount Factor (γ) 0.9
  • Exploration Rate (ε) 0.1
  • Exploration Decay 0.995
  • Episodes 1000
Learning Statistics
[Live statistics: last episode reward, max Q-value, current exploration rate.]