In this article, we'll take a look at Reinforcement Learning (RL): its key concepts, the MDP framework, core algorithms such as Q-Learning, DQN, policy gradients, and actor-critic methods, and the exploration vs. exploitation trade-off.
6. REINFORCEMENT LEARNING (RL)
6.1) Introduction to Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. It is inspired by how humans learn from experience — by trial and error.
- The agent performs an action
- The environment responds with a reward
- The agent uses this feedback to learn better actions over time
Unlike supervised learning, RL doesn’t rely on labeled data. Instead, it uses rewards or penalties to guide learning.
6.2) Key Concepts
- Agent: The learner or decision maker (e.g., a robot, self-driving car).
- Environment: Everything the agent interacts with (e.g., a maze, a game).
- State: A snapshot of the current situation (e.g., position in a maze).
- Action: A move or decision made by the agent (e.g., turn left).
- Reward: Feedback from the environment (e.g., +1 for reaching the goal, -1 for hitting a wall).
The goal of the agent is to maximize cumulative rewards over time.
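To make these terms concrete, here is a minimal sketch of the agent-environment loop. The `CorridorEnv` class and the random agent below are hypothetical, invented just for illustration; they are not part of any specific library.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: the agent starts at cell 0 and wants to reach the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0  # current position (the "state")

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.state = max(0, min(self.length - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.01  # +1 for reaching the goal, small penalty per step
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for t in range(20):
    action = random.choice([0, 1])          # the agent performs an action (here: at random)
    state, reward, done = env.step(action)  # the environment responds with a new state and a reward
    total_reward += reward                  # the agent's goal is to maximize cumulative reward
    if done:
        break
print("cumulative reward:", total_reward)
```

A learning agent would replace the random action choice with a policy improved from the reward feedback, which is exactly what the algorithms below do.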
6.3) Markov Decision Process (MDP)
RL problems are often modeled as Markov Decision Processes (MDPs). An MDP includes:
- S: Set of states
- A: Set of actions
- P: Transition probabilities P(s' | s, a)
- R: Reward function
- γ (gamma): Discount factor (how much future rewards are valued)
The "Markov" property means the next state depends only on the current state and action, not on previous ones.
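As a rough illustration, a tiny MDP can be written down explicitly as the tuple (S, A, P, R, γ). The states and probabilities below are made up for this sketch.

```python
# A tiny, made-up MDP written out explicitly as (S, A, P, R, gamma).
S = ["s0", "s1", "goal"]          # set of states
A = ["left", "right"]             # set of actions
gamma = 0.9                       # discount factor

# Transition probabilities P(s' | s, a): dict mapping (state, action) -> {next_state: probability}
P = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"goal": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
}

# Reward function R(s, a, s')
def R(s, a, s_next):
    return 1.0 if s_next == "goal" else 0.0

# The Markov property: the distribution over s' depends only on (s, a), never on older history.
print(P[("s1", "right")])  # {'goal': 0.9, 's1': 0.1}
```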
6.4) Q-Learning and Deep Q-Networks (DQN)
Q-Learning: A model-free algorithm that learns the value (Q-value) of taking an action in a state. It uses the update rule (a minimal code sketch follows the list of terms below):
Q(s, a) <- Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
Where:
- s: current state
- a: action
- r: reward
- s’: next state
- α: learning rate
- γ: discount factor
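Here is a minimal tabular Q-Learning sketch that applies exactly this update rule. The 5-state chain environment is made up for the example; it is not from any library.

```python
import random

# Tabular Q-Learning on a tiny, made-up chain of 5 states.
n_states, n_actions = 5, 2          # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the rightmost (goal) state
    return s_next, r, s_next == n_states - 1

def choose_action(s):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit (ties broken randomly)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    best = max(Q[s])
    return random.choice([a for a in range(n_actions) if Q[s][a] == best])

for episode in range(500):
    s = 0
    for t in range(100):                          # cap episode length
        a = choose_action(s)
        s_next, r, done = step(s, a)
        # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if done:
            break

for s in range(n_states):
    print(s, [round(q, 2) for q in Q[s]])   # "right" should end up with the higher Q-value in every state
```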
Deep Q-Networks (DQN): When the state/action space is too large for a table, a neural network is used to approximate the Q-values. DQN combines Q-Learning with deep learning and has been used in Atari games and robotics.
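Below is a rough PyTorch sketch of the core DQN idea: a network that maps a state to one Q-value per action, trained toward the same TD target as tabular Q-Learning. It deliberately omits pieces a full DQN uses (experience replay, a separate target network), and the state dimensions and the single fake transition are made up.

```python
import random
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2        # made-up dimensions
gamma, epsilon = 0.99, 0.1

# A small neural network that approximates Q(s, a) for every action at once
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state):
    # epsilon-greedy over the network's Q-value estimates
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state).argmax().item()

def td_update(state, action, reward, next_state, done):
    # target: r + gamma * max_a' Q(s', a'), with no bootstrapping at terminal states
    with torch.no_grad():
        target = reward + (0.0 if done else gamma * q_net(next_state).max().item())
    q_sa = q_net(state)[action]
    loss = nn.functional.mse_loss(q_sa, torch.tensor(target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# One fake transition, just to show the call pattern
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
a = select_action(s)
td_update(s, a, reward=1.0, next_state=s_next, done=False)
```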
6.5) Policy Gradients and Actor-Critic Methods
Policy Gradients:
- Instead of learning value functions, learn the policy directly (probability distribution over actions).
- Use gradient ascent to improve the policy based on the reward received.
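As a rough illustration of "learning the policy directly", here is a minimal REINFORCE-style sketch in PyTorch. The episode data is faked with random tensors and all dimensions are made up; a real run would collect states, actions, and rewards from an environment.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # made-up dimensions
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One (fake) episode: states and rewards would normally come from the environment.
states = torch.randn(10, state_dim)
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()                      # sample actions from the policy's distribution
rewards = torch.randn(10)                    # placeholder rewards

# Discounted returns G_t for each step
gamma, G, returns = 0.99, 0.0, []
for r in reversed(rewards.tolist()):
    G = r + gamma * G
    returns.append(G)
returns = torch.tensor(list(reversed(returns)))

# Gradient *ascent* on E[log pi(a|s) * G], implemented as descent on its negative
loss = -(dist.log_prob(actions) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```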
Actor-Critic Methods:
- Combine the best of both worlds:
- Actor: chooses the action
- Critic: evaluates how good the action was (value function)
- More stable and efficient than pure policy gradients.
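A rough one-step actor-critic sketch, using the same hypothetical dimensions as above: the actor is a policy network, the critic is a value network, and the critic's TD error tells the actor how good the chosen action was.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99          # made-up dimensions
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# One fake transition (s, a, r, s'); in practice these come from the environment.
s, s_next, reward = torch.randn(state_dim), torch.randn(state_dim), 1.0

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()                                 # actor chooses the action

# Critic estimates V(s); the TD error measures "how good was this action?"
value = critic(s).squeeze()
with torch.no_grad():
    td_target = reward + gamma * critic(s_next).squeeze()
td_error = td_target - value

critic_loss = td_error.pow(2)                       # critic learns to predict returns
actor_loss = -dist.log_prob(a) * td_error.detach()  # actor follows the critic's evaluation
opt.zero_grad()
(critic_loss + actor_loss).backward()
opt.step()
```

Detaching the TD error in the actor loss keeps the actor's gradient from flowing into the critic, which is the usual way the two updates are kept separate.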
6.6) Exploration vs. Exploitation
Exploration: Trying new actions to discover their effects (important early in training).
Exploitation: Choosing the best-known action for maximum reward.
RL must balance both:
- Too much exploration = slow learning
- Too much exploitation = getting stuck in a local optimum
Common strategy: ε-greedy
- Choose a random action with probability ε
- Otherwise, choose the best-known action
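A tiny sketch of ε-greedy selection with a decaying ε, a common schedule; the exact decay values and Q-values below are made up.

```python
import random

def epsilon_greedy(q_values, epsilon):
    # with probability epsilon: explore (random action); otherwise: exploit (best-known action)
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Decay epsilon over training so the agent explores early and exploits later.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
q_values = [0.2, 0.5, 0.1]   # made-up Q-values for three actions
for step in range(1000):
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(epsilon_min, epsilon * decay)
print("final epsilon:", round(epsilon, 3))
```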
