In this article, we'll take a look at Reinforcement Learning (RL): its key concepts, the MDP framework, core algorithms such as Q-Learning, DQN, policy gradients, and actor-critic methods, and the exploration vs. exploitation trade-off.
6. REINFORCEMENT LEARNING (RL)
6.1) Introduction to Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. It is inspired by how humans learn from experience — by trial and error.
- The agent performs an action
- The environment responds with a reward
- The agent uses this feedback to learn better actions over time
Unlike supervised learning, RL doesn’t rely on labeled data. Instead, it uses rewards or penalties to guide learning.
6.2) Key Concepts
- Agent: The learner or decision maker (e.g., a robot, self-driving car).
- Environment: Everything the agent interacts with (e.g., a maze, a game).
- State: A snapshot of the current situation (e.g., position in a maze).
- Action: A move or decision made by the agent (e.g., turn left).
- Reward: Feedback from the environment (e.g., +1 for reaching the goal, -1 for hitting a wall).
The goal of the agent is to maximize cumulative rewards over time.
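To make these terms concrete, here is a minimal sketch of the agent-environment loop. The `CorridorEnv` class and the random agent below are hypothetical, invented just for illustration; they are not part of any specific library.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: the agent starts at cell 0 and wants to reach the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0  # current position (the "state")

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.state = max(0, min(self.length - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.01  # +1 for reaching the goal, small penalty per step
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for t in range(20):
    action = random.choice([0, 1])          # the agent performs an action (here: at random)
    state, reward, done = env.step(action)  # the environment responds with a new state and a reward
    total_reward += reward                  # the agent's goal is to maximize cumulative reward
    if done:
        break
print("cumulative reward:", total_reward)
```

A learning agent would replace the random action choice with a policy improved from the reward feedback, which is exactly what the algorithms below do.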
6.3) Markov Decision Process (MDP)
RL problems are often modeled as Markov Decision Processes (MDPs). An MDP includes:
- S: Set of states
- A: Set of actions
- P: Transition probabilities P(s' | s, a)
- R: Reward function
- γ (gamma): Discount factor (how much future rewards are valued)
The "Markov" property means the next state depends only on the current state and action, not on previous ones.
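As a rough illustration, a tiny MDP can be written down explicitly as the tuple (S, A, P, R, γ). The states and probabilities below are made up for this sketch.

```python
# A tiny, made-up MDP written out explicitly as (S, A, P, R, gamma).
S = ["s0", "s1", "goal"]          # set of states
A = ["left", "right"]             # set of actions
gamma = 0.9                       # discount factor

# Transition probabilities P(s' | s, a): dict mapping (state, action) -> {next_state: probability}
P = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"goal": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
}

# Reward function R(s, a, s')
def R(s, a, s_next):
    return 1.0 if s_next == "goal" else 0.0

# The Markov property: the distribution over s' depends only on (s, a), never on older history.
print(P[("s1", "right")])  # {'goal': 0.9, 's1': 0.1}
```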
6.4) Q-Learning and Deep Q-Networks (DQN)
Q-Learning: A model-free algorithm that learns the value (Q-value) of taking an action in a state. It uses the update rule (a minimal code sketch follows the list of terms below):
Q(s, a) <- Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
Where:
- s: current state
- a: action
- r: reward
- s’: next state
- α: learning rate
- γ: discount factor
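Here is a minimal tabular Q-Learning sketch that applies exactly this update rule. The 5-state chain environment is made up for the example; it is not from any library.

```python
import random

# Tabular Q-Learning on a tiny, made-up chain of 5 states.
n_states, n_actions = 5, 2          # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the rightmost (goal) state
    return s_next, r, s_next == n_states - 1

def choose_action(s):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit (ties broken randomly)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    best = max(Q[s])
    return random.choice([a for a in range(n_actions) if Q[s][a] == best])

for episode in range(500):
    s = 0
    for t in range(100):                          # cap episode length
        a = choose_action(s)
        s_next, r, done = step(s, a)
        # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if done:
            break

for s in range(n_states):
    print(s, [round(q, 2) for q in Q[s]])   # "right" should end up with the higher Q-value in every state
```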
Deep Q-Networks (DQN): When the state/action space is too large for a table, a neural network is used to approximate the Q-values. DQN combines Q-Learning with deep learning and has been used in Atari games and robotics.
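Below is a rough PyTorch sketch of the core DQN idea: a network that maps a state to one Q-value per action, trained toward the same TD target as tabular Q-Learning. It deliberately omits pieces a full DQN uses (experience replay, a separate target network), and the state dimensions and the single fake transition are made up.

```python
import random
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2        # made-up dimensions
gamma, epsilon = 0.99, 0.1

# A small neural network that approximates Q(s, a) for every action at once
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state):
    # epsilon-greedy over the network's Q-value estimates
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state).argmax().item()

def td_update(state, action, reward, next_state, done):
    # target: r + gamma * max_a' Q(s', a'), with no bootstrapping at terminal states
    with torch.no_grad():
        target = reward + (0.0 if done else gamma * q_net(next_state).max().item())
    q_sa = q_net(state)[action]
    loss = nn.functional.mse_loss(q_sa, torch.tensor(target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# One fake transition, just to show the call pattern
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
a = select_action(s)
td_update(s, a, reward=1.0, next_state=s_next, done=False)
```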
6.5) Policy Gradients and Actor-Critic Methods
Policy Gradients:
- Instead of learning value functions, learn the policy directly (probability distribution over actions).
- Use gradient ascent to improve the policy based on the reward received.
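As a rough illustration of "learning the policy directly", here is a minimal REINFORCE-style sketch in PyTorch. The episode data is faked with random tensors and all dimensions are made up; a real run would collect states, actions, and rewards from an environment.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # made-up dimensions
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One (fake) episode: states and rewards would normally come from the environment.
states = torch.randn(10, state_dim)
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()                      # sample actions from the policy's distribution
rewards = torch.randn(10)                    # placeholder rewards

# Discounted returns G_t for each step
gamma, G, returns = 0.99, 0.0, []
for r in reversed(rewards.tolist()):
    G = r + gamma * G
    returns.append(G)
returns = torch.tensor(list(reversed(returns)))

# Gradient *ascent* on E[log pi(a|s) * G], implemented as descent on its negative
loss = -(dist.log_prob(actions) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```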
Actor-Critic Methods:
- Combine the best of both worlds:
- Actor: chooses the action
- Critic: evaluates how good the action was (value function)
- More stable and efficient than pure policy gradients.
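A rough one-step actor-critic sketch, using the same hypothetical dimensions as above: the actor is a policy network, the critic is a value network, and the critic's TD error tells the actor how good the chosen action was.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99          # made-up dimensions
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# One fake transition (s, a, r, s'); in practice these come from the environment.
s, s_next, reward = torch.randn(state_dim), torch.randn(state_dim), 1.0

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()                                 # actor chooses the action

# Critic estimates V(s); the TD error measures "how good was this action?"
value = critic(s).squeeze()
with torch.no_grad():
    td_target = reward + gamma * critic(s_next).squeeze()
td_error = td_target - value

critic_loss = td_error.pow(2)                       # critic learns to predict returns
actor_loss = -dist.log_prob(a) * td_error.detach()  # actor follows the critic's evaluation
opt.zero_grad()
(critic_loss + actor_loss).backward()
opt.step()
```

Detaching the TD error in the actor loss keeps the actor's gradient from flowing into the critic, which is the usual way the two updates are kept separate.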
6.6) Exploration vs. Exploitation
Exploration: Trying new actions to discover their effects (important early in training).
Exploitation: Choosing the best-known action for maximum reward.
RL must balance both:
- Too much exploration = slow learning
- Too much exploitation = getting stuck in a local optimum
Common strategy: ε-greedy
- Choose a random action with probability ε
- Otherwise, choose the best-known action
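A tiny sketch of ε-greedy selection with a decaying ε, a common schedule; the exact decay values and Q-values below are made up.

```python
import random

def epsilon_greedy(q_values, epsilon):
    # with probability epsilon: explore (random action); otherwise: exploit (best-known action)
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Decay epsilon over training so the agent explores early and exploits later.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
q_values = [0.2, 0.5, 0.1]   # made-up Q-values for three actions
for step in range(1000):
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(epsilon_min, epsilon * decay)
print("final epsilon:", round(epsilon, 3))
```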
