In this article, we'll take a look at the most widely used supervised learning algorithms and the key ideas behind each one.
4. SUPERVISED LEARNING ALGORITHMS
Supervised learning uses labeled data, meaning the model learns from input-output pairs (X → y). The algorithm tries to map inputs (features) to correct outputs (targets/labels).
4.1) Linear Regression
Used for predicting continuous values (e.g., predicting house prices, temperature).
4.1.1) Simple vs. Multiple Linear Regression
Simple Linear Regression: One input (X) to predict one output (Y). Example: Predicting salary from years of experience.
Multiple Linear Regression: Multiple inputs (X1, X2, …, Xn). Example: Predicting price based on area, location, and age.
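For a concrete picture, here is a minimal scikit-learn sketch of both cases; the salary and house-price numbers below are invented purely for illustration.

```python
# Minimal scikit-learn sketch; the toy data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: years of experience -> salary
X_simple = np.array([[1], [2], [3], [4], [5]])   # one feature
y_salary = np.array([40, 45, 52, 58, 65])        # salary in thousands

simple_model = LinearRegression().fit(X_simple, y_salary)
print(simple_model.predict([[6]]))               # predicted salary for 6 years

# Multiple linear regression: area, location score, age -> price
X_multi = np.array([[1200, 8, 5],
                    [1500, 6, 10],
                    [900,  9, 2],
                    [2000, 7, 15]])
y_price = np.array([300, 320, 280, 400])

multi_model = LinearRegression().fit(X_multi, y_price)
print(multi_model.coef_, multi_model.intercept_) # one coefficient per feature
```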
4.1.2) Gradient Descent and Normal Equation
Gradient Descent: An iterative method to minimize error (cost function).
Normal Equation: A direct way to find weights using linear algebra: θ = (XᵀX)⁻¹ Xᵀy. Works for small datasets.
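To make the two approaches concrete, here is a small NumPy sketch that estimates the same weights both ways on an invented toy problem (the learning rate and iteration count are arbitrary choices):

```python
# Gradient descent vs. normal equation on a toy problem (data invented for illustration).
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(0, 10, 50)]   # column of ones for the intercept
y = 3 + 2 * X[:, 1] + rng.normal(0, 0.5, 50)     # true relationship: y = 3 + 2x + noise

# Normal equation: theta = (X^T X)^(-1) X^T y
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: repeatedly step against the gradient of the MSE cost
theta_gd = np.zeros(2)
lr = 0.01
for _ in range(5000):
    gradient = (2 / len(y)) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * gradient

print(theta_ne)   # both should be close to [3, 2]
print(theta_gd)
```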
4.1.3) Regularization (L1, L2)
Prevents overfitting by adding a penalty:
- L1 (Lasso): Can reduce coefficients to 0 (feature selection).
- L2 (Ridge): Shrinks coefficients but doesn’t make them 0.
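A minimal scikit-learn sketch of both penalties (the alpha values below are arbitrary):

```python
# L1 (Lasso) vs. L2 (Ridge) regularization; alpha values chosen arbitrarily.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)  # only two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(lasso.coef_)   # irrelevant features are typically driven to exactly 0
print(ridge.coef_)   # irrelevant features are shrunk, but stay non-zero
```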
4.2) Logistic Regression
Used for classification problems (e.g., spam vs. not spam).
4.2.1) Binary vs. Multiclass Classification
- Binary: 2 outcomes (e.g., 0 or 1)
- Multiclass: More than 2 classes (handled using One-vs-Rest or Softmax)
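A small scikit-learn sketch of both settings; LogisticRegression handles the multiclass case internally (one-vs-rest or softmax, depending on the solver):

```python
# Binary vs. multiclass classification with scikit-learn's LogisticRegression.
from sklearn.datasets import load_iris, make_classification
from sklearn.linear_model import LogisticRegression

# Binary: two classes (0 or 1)
X_bin, y_bin = make_classification(n_samples=200, n_classes=2, random_state=0)
binary_clf = LogisticRegression().fit(X_bin, y_bin)
print(binary_clf.predict(X_bin[:5]))          # outputs are 0 or 1

# Multiclass: the 3-class Iris dataset; one-vs-rest or softmax is applied internally
X_iris, y_iris = load_iris(return_X_y=True)
multi_clf = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)
print(multi_clf.predict_proba(X_iris[:2]))    # one probability per class, rows sum to 1
```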
4.2.2) Sigmoid and Cost Function
Sigmoid Function: Converts outputs to values between 0 and 1. Formula: sigmoid(z) = 1 / (1 + e^(-z))
Cost Function: Log loss used to measure prediction error.
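The two formulas written out directly in NumPy (a minimal sketch, not a full training loop):

```python
# Sigmoid and log loss written out directly with NumPy.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(y_true, y_pred_prob, eps=1e-15):
    p = np.clip(y_pred_prob, eps, 1 - eps)      # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))                               # values squashed into (0, 1)

y_true = np.array([0, 0, 1])
print(log_loss(y_true, sigmoid(z)))             # lower is better
```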
4.2.3) Regularization
L1 and L2 regularization help prevent overfitting in logistic regression as well.
4.3) K-Nearest Neighbors (KNN)
A simple classification (or regression) algorithm that uses proximity.
4.3.1) Distance Metrics
- Euclidean Distance: Straight line between two points.
- Manhattan Distance: Sum of absolute differences.
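Both metrics take one line of NumPy; the vectors below are invented for illustration:

```python
# Euclidean vs. Manhattan distance between two feature vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences

print(euclidean)   # 5.0
print(manhattan)   # 7.0
```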
4.3.2) Choosing K
- K is the number of neighbors to consider.
- Too low a K → sensitive to noise (overfitting)
- Too high a K → decision boundary becomes overly smooth (underfitting)
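A quick way to see the effect is to compare validation accuracy for a few values of K (a scikit-learn sketch on the Iris dataset; the K values are arbitrary):

```python
# Comparing a few values of K on a held-out validation split (K values chosen arbitrarily).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_val, y_val))   # very low or very high K usually scores worse
```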
4.3.3) Advantages & Disadvantages
- Simple and easy to implement
- Slow for large datasets, sensitive to irrelevant features
4.4) Support Vector Machines (SVM)
Powerful classification model for small to medium-sized datasets.
4.4.1) Hyperplanes and Margins
SVM finds the best hyperplane that separates data with maximum margin.
4.4.2) Linear vs. Non-Linear SVM
- Linear SVM: Works when data is linearly separable.
- Non-linear SVM: Uses kernel trick for complex datasets.
4.4.3) Kernel Trick
- Implicitly maps data into a higher-dimensional space where it becomes separable, without computing the transformation explicitly.
- Common kernels: RBF (Gaussian), Polynomial, Sigmoid
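A short sketch comparing a linear and an RBF-kernel SVM on a toy dataset that is not linearly separable (scikit-learn's make_moons, chosen here just for illustration):

```python
# Linear vs. RBF-kernel SVM on a non-linearly-separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)  # kernel trick: implicit higher-dimensional mapping

print(linear_svm.score(X, y))   # limited by the straight decision boundary
print(rbf_svm.score(X, y))      # usually much higher on this data
```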
4.5) Decision Trees
Tree-like structure used for classification and regression.
4.5.1) Gini Impurity and Entropy
- Both measure how impure (mixed) a node is:
- Gini Impurity: Probability of misclassification.
- Entropy: Measure of randomness/information.
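Both can be computed directly from a node's class proportions; a minimal sketch:

```python
# Gini impurity and entropy for a node's class distribution.
import numpy as np

def gini(p):
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)            # probability of misclassifying a random sample

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                         # ignore empty classes (log(0) is undefined)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # most impure 2-class node: 0.5, 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node: 0.0, 0.0
```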
4.5.2) Overfitting and Pruning
- Overfitting: The tree memorizes the training data.
- Pruning: Removes unnecessary branches to reduce overfitting.
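In scikit-learn, overfitting can be limited with pre-pruning limits such as max_depth or with cost-complexity pruning via ccp_alpha; the parameter values below are arbitrary:

```python
# Unpruned vs. pruned decision trees (pruning parameters chosen arbitrarily for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                     random_state=0).fit(X_train, y_train)

print(full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))      # ~1.0 train, lower test
print(pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))  # smaller gap
```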
4.6) Random Forest
An ensemble of decision trees to improve accuracy and reduce overfitting.
4.6.1) Bootstrapping
Randomly samples subsets of the training data (with replacement) to train each tree.
4.6.2) Bagging
Combines predictions of multiple trees (majority vote or average).
4.6.3) Feature Importance
Measures which features contribute most to the model's predictions.
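A short sketch tying these pieces together; scikit-learn's RandomForestClassifier performs the bootstrapping and bagging internally and exposes impurity-based importances:

```python
# Random forest with feature importances; bootstrapping and bagging happen internally.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names

forest = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=0).fit(X, y)

# Impurity-based importance: how much each feature contributes to the trees' splits
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(name, round(importance, 3))
```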
4.7) Gradient Boosting Machines (GBM)
Boosting is an ensemble method where models are trained sequentially, each new model correcting the errors of the previous ones.
4.7.1) XGBoost, LightGBM, CatBoost
Advanced boosting libraries:
- XGBoost: Popular, fast, and accurate
- LightGBM: Faster, uses leaf-wise growth
- CatBoost: Handles categorical features automatically
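As a minimal illustration, here is an XGBoost sketch (assuming the xgboost package is installed); LightGBM and CatBoost expose very similar fit/predict interfaces:

```python
# Minimal XGBoost sketch (assumes `pip install xgboost`); LightGBM and CatBoost are used similarly.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
```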
4.7.2) Hyperparameter Tuning
- Adjust parameters like:
- Learning rate
- Number of estimators (trees)
- Max depth
- Tools: GridSearchCV, RandomizedSearchCV
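A minimal GridSearchCV sketch over the parameters above, using scikit-learn's own GradientBoostingClassifier so no extra packages are needed (the grid values are arbitrary):

```python
# Grid search over a few boosting hyperparameters (grid values chosen arbitrarily).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```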
4.7.3) Early Stopping
Stops training if the model stops improving on the validation set.
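One way to do this with scikit-learn's GradientBoostingClassifier is the validation_fraction / n_iter_no_change pair; XGBoost and LightGBM offer their own early-stopping options:

```python
# Early stopping: training halts when the internal validation score stops improving.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

model = GradientBoostingClassifier(
    n_estimators=1000,         # upper bound on the number of trees
    validation_fraction=0.1,   # hold out 10% of the training data internally
    n_iter_no_change=10,       # stop after 10 rounds with no improvement
    random_state=0,
).fit(X, y)

print(model.n_estimators_)     # actual number of trees fitted, usually far below 1000
```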
4.8) Naive Bayes
Probabilistic classifier based on Bayes' Theorem and a strong feature-independence assumption.
4.8.1) Gaussian, Multinomial, Bernoulli
- Gaussian NB: For continuous features (assumes normal distribution)
- Multinomial NB: For text data, counts of words
- Bernoulli NB: For binary features (0/1)
4.8.2) Assumptions and Applications
- Assumes all features are independent (rarely true, but still works well)
- Commonly used in spam detection, sentiment analysis, and document classification
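A short spam-style sketch with MultinomialNB on word counts; the tiny corpus below is made up for illustration:

```python
# Multinomial Naive Bayes on word counts; the tiny "spam" corpus is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "free money click now",
         "meeting at noon tomorrow", "see you at the office"]
labels = [1, 1, 0, 0]                         # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)           # word counts per message

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free prize at the office"])))
```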
