In this article, we'll take a look at data preprocessing: cleaning messy data, scaling and encoding features, engineering new ones, and handling imbalanced classes.
3. DATA PREPROCESSING
Before data is fed into any machine learning model, it must be cleaned, transformed, and prepared. This step is called data preprocessing, and it is one of the most important stages in building accurate ML models.
3.1) Data Cleaning
Real-world data is often messy. Data cleaning means identifying and fixing errors in the dataset.
3.1.1) Missing Values
- Missing values can be due to incomplete forms, sensor errors, etc.
- Techniques to handle missing data:
  - Remove rows/columns with too many missing values
  - Fill (impute) missing values using:
    - Mean/Median/Mode
    - Forward/Backward fill
    - Predictive models (like KNN)
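Below is a minimal sketch of these options using pandas and scikit-learn; the toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data with gaps
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
})

# Drop rows where every value is missing
df = df.dropna(how="all")

# Mean imputation per column
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: fill each gap from the k most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Forward fill, then backward fill, for ordered (e.g., time-series) data
ordered_filled = df.ffill().bfill()
```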
3.1.2) Outliers
- Outliers are data points that are very different from the others.
- They can distort results and reduce model performance.
- Detection methods:
  - Box plot, Z-score, IQR method
- Handling outliers:
  - Remove them
  - Transform the data (e.g., log scaling)
  - Cap them (set a maximum/minimum)
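Here is a short sketch of Z-score and IQR detection plus capping and log scaling, on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 95])

# Z-score method: flag points far from the mean in standard-deviation units
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Handling: cap (clip) to the IQR fences, or compress with a log transform
capped = values.clip(lower, upper)
log_scaled = np.log1p(values)  # log(1 + x); requires non-negative data
```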
3.2) Data Normalization and Standardization
Scaling numeric data helps all features contribute equally to the model.
- Normalization (Min-Max Scaling): Scales all values between 0 and 1
  - Formula: x_scaled = (x - min) / (max - min)
  - Used when the data is not normally distributed
- Standardization (Z-score Scaling): Centers the data around mean = 0 and standard deviation = 1
  - Formula: z = (x - mean) / std
  - Used when the data is normally distributed
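Both transforms are one-liners in scikit-learn; a minimal sketch with a made-up single-feature array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])  # hypothetical single feature

# Min-Max: (x - min) / (max - min), values land in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score: (x - mean) / std, result has mean 0 and std 1
X_standard = StandardScaler().fit_transform(X)
```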
3.3) Encoding Categorical Variables
ML models work with numbers, not text. Categorical data needs to be converted into numerical form.
Label Encoding: Assigns each unique category a number.
Example: Red -> 0, Blue -> 1, Green -> 2
One-Hot Encoding: Creates new binary columns for each category.
Example: Red -> [1,0,0], Blue -> [0,1,0], Green -> [0,0,1]
Label encoding is good for ordinal data (ranked), while one-hot encoding is best for nominal data (non-ranked).
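A short sketch of both encodings with scikit-learn and pandas; the color column is the toy example from above (note: OneHotEncoder's sparse_output argument assumes scikit-learn 1.2+, where older versions call it sparse):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# Label encoding: one integer per category (assigned alphabetically here,
# so the order carries no real meaning)
labels = LabelEncoder().fit_transform(colors["color"])

# One-hot encoding: one binary column per category
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors[["color"]])

# pandas one-liner for one-hot encoding
dummies = pd.get_dummies(colors["color"])
```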
3.4) Feature Scaling
Ensures features are on the same scale so the model can learn effectively.
Min-Max Scaling:
- Scales each feature to the range 0 to 1.
- Good for algorithms like KNN, neural networks.
Z-score Scaling (Standardization):
- Useful for models that assume normality, like linear regression or logistic regression.
Scaling is crucial for models that use distance or gradient-based optimization.
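As a sketch of that point, scaling can be wrapped in a scikit-learn Pipeline so it is re-fit on each training fold and never leaks test data (the dataset and hyperparameters here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale first, then classify; KNN is distance-based, so scaling matters
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())
```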
3.5) Feature Engineering
Creating new features or modifying existing ones to improve model performance.
Polynomial Features:
- Create new features by raising existing features to a power.
- Example: From x, create x², x³
Binning (Discretization):
- Convert continuous data into categories.
- Example: Age → [0–18], [19–35], [36–60], 60+
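Both ideas in a minimal sketch (the inputs mirror the examples above and are otherwise arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0], [4.0]])

# Polynomial features: from x, generate x, x^2, x^3
poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)

# Binning: map ages into the categories from the example above
ages = pd.Series([4, 22, 41, 67])
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                  labels=["0-18", "19-35", "36-60", "60+"])
```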
Feature engineering can significantly boost the predictive power of a model.
3.6) Handling Imbalanced Data
In classification, if one class dominates (e.g., 95% non-fraud, 5% fraud), models may ignore the minority class. This is called class imbalance.
SMOTE (Synthetic Minority Oversampling Technique): Creates synthetic examples of the minority class using nearest neighbors.
Undersampling: Remove some samples from the majority class.
Oversampling: Duplicate or generate more samples of the minority class.
Balancing data improves the ability of the model to correctly predict both classes.
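A sketch of SMOTE on a synthetic imbalanced dataset; it relies on the third-party imbalanced-learn package (pip install imbalanced-learn), and the class weights are made up to mimic the fraud example:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # from imbalanced-learn

# Synthetic dataset: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)
print(Counter(y))

# SMOTE interpolates new minority samples between nearest neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```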
