In this article, we'll take a look at data preprocessing: cleaning messy data, scaling and encoding features, engineering new ones, and handling imbalanced classes.
3. DATA PREPROCESSING
Before data is fed into any machine learning model, it must be cleaned, transformed, and prepared. This step is called data preprocessing, and it is one of the most important stages in building accurate ML models.
3.1) Data Cleaning
Real-world data is often messy. Data cleaning means identifying and fixing errors in the dataset.
3.1.1) Missing Values
- Missing values can be due to incomplete forms, sensor errors, etc.
- Techniques to handle missing data:
  - Remove rows/columns with too many missing values
  - Fill (impute) missing values using:
    - Mean/Median/Mode
    - Forward/Backward fill
    - Predictive models (like KNN)
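Below is a minimal sketch of these options using pandas and scikit-learn; the toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data with gaps
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
})

# Drop rows where every value is missing
df = df.dropna(how="all")

# Mean imputation per column
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: fill each gap from the k most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Forward fill, then backward fill, for ordered (e.g., time-series) data
ordered_filled = df.ffill().bfill()
```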
3.1.2) Outliers
- Outliers are data points that are very different from the others.
- They can distort results and reduce model performance.
- Detection methods:
  - Box plot, Z-score, IQR method
- Handling outliers:
  - Remove them
  - Transform the data (e.g., log scaling)
  - Cap them (set a maximum/minimum)
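Here is a short sketch of Z-score and IQR detection plus capping and log scaling, on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 95])

# Z-score method: flag points far from the mean in standard-deviation units
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Handling: cap (clip) to the IQR fences, or compress with a log transform
capped = values.clip(lower, upper)
log_scaled = np.log1p(values)  # log(1 + x); requires non-negative data
```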
3.2) Data Normalization and Standardization
Scaling numeric data helps all features contribute equally to the model.
- Normalization (Min-Max Scaling): Scales all values between 0 and 1
  - Formula: x_scaled = (x - min) / (max - min)
  - Used when the data is not normally distributed
- Standardization (Z-score Scaling): Centers the data around mean = 0 and standard deviation = 1
  - Formula: z = (x - mean) / std
  - Used when the data is normally distributed
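Both transforms are one-liners in scikit-learn; a minimal sketch with a made-up single-feature array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])  # hypothetical single feature

# Min-Max: (x - min) / (max - min), values land in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score: (x - mean) / std, result has mean 0 and std 1
X_standard = StandardScaler().fit_transform(X)
```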
3.3) Encoding Categorical Variables
ML models work with numbers, not text. Categorical data needs to be converted into numerical form.
Label Encoding: Assigns each unique category a number.
Example: Red -> 0, Blue -> 1, Green -> 2
One-Hot Encoding: Creates new binary columns for each category.
Example: Red -> [1,0,0], Blue -> [0,1,0], Green -> [0,0,1]
Label encoding is good for ordinal data (ranked), while one-hot encoding is best for nominal data (non-ranked).
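A short sketch of both encodings with scikit-learn and pandas; the color column is the toy example from above (note: OneHotEncoder's sparse_output argument assumes scikit-learn 1.2+, where older versions call it sparse):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# Label encoding: one integer per category (assigned alphabetically here,
# so the order carries no real meaning)
labels = LabelEncoder().fit_transform(colors["color"])

# One-hot encoding: one binary column per category
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors[["color"]])

# pandas one-liner for one-hot encoding
dummies = pd.get_dummies(colors["color"])
```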
3.4) Feature Scaling
Ensures features are on the same scale so the model can learn effectively.
Min-Max Scaling:
- Scales each feature to the range 0 to 1.
- Good for algorithms like KNN, neural networks.
Z-score Scaling (Standardization):
- Useful for models that assume normality, like linear regression or logistic regression.
Scaling is crucial for models that use distance or gradient-based optimization.
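As a sketch of that point, scaling can be wrapped in a scikit-learn Pipeline so it is re-fit on each training fold and never leaks test data (the dataset and hyperparameters here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale first, then classify; KNN is distance-based, so scaling matters
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())
```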
3.5) Feature Engineering
Creating new features or modifying existing ones to improve model performance.
Polynomial Features:
- Create new features by raising existing features to a power.
- Example: From x, create x², x³
Binning (Discretization):
- Convert continuous data into categories.
- Example: Age → [0–18], [19–35], [36–60], 60+
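Both ideas in a minimal sketch (the inputs mirror the examples above and are otherwise arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0], [4.0]])

# Polynomial features: from x, generate x, x^2, x^3
poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)

# Binning: map ages into the categories from the example above
ages = pd.Series([4, 22, 41, 67])
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                  labels=["0-18", "19-35", "36-60", "60+"])
```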
Feature engineering can significantly boost the predictive power of a model.
3.6) Handling Imbalanced Data
In classification, if one class dominates (e.g., 95% non-fraud, 5% fraud), models may ignore the minority class. This is called class imbalance.
SMOTE (Synthetic Minority Oversampling Technique): Creates synthetic examples of the minority class using nearest neighbors.
Undersampling: Remove some samples from the majority class.
Oversampling: Duplicate or generate more samples of the minority class.
Balancing data improves the ability of the model to correctly predict both classes.
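A sketch of SMOTE on a synthetic imbalanced dataset; it relies on the third-party imbalanced-learn package (pip install imbalanced-learn), and the class weights are made up to mimic the fraud example:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # from imbalanced-learn

# Synthetic dataset: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)
print(Counter(y))

# SMOTE interpolates new minority samples between nearest neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```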
