8. NATURAL LANGUAGE PROCESSING (NLP)

NLP helps computers understand, interpret, and generate human language. It’s widely used in applications like chatbots, translation tools, and voice assistants.

8.1) Text Preprocessing

Before using text in machine learning models, we need to clean and convert it into a format the computer understands.
8.1.1) Tokenization
Breaking text into smaller parts, like words or sentences. Example: “I love AI” → [“I”, “love”, “AI”]
8.1.2) Stopwords
Removing common words that do not add much meaning (like “is”, “the”, “and”).
8.1.3) Stemming
Cutting words down to a crude root form (the stem may not be a real word). Example: “playing”, “played” → “play”
8.1.4) Lemmatization
Similar to stemming, but uses vocabulary and grammar to return a proper dictionary word (the lemma). Example: “better” → “good”
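
A minimal sketch of steps 8.1.1–8.1.4 using the NLTK library (assuming NLTK is installed along with its “punkt”, “stopwords”, and “wordnet” data packages):

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time download of the required NLTK data packages
    for pkg in ("punkt", "stopwords", "wordnet"):
        nltk.download(pkg)

    text = "I love playing with AI models"

    # 8.1.1 Tokenization: split the sentence into words
    tokens = word_tokenize(text)

    # 8.1.2 Stopwords: drop common low-information words
    stops = set(stopwords.words("english"))
    filtered = [t for t in tokens if t.lower() not in stops]

    # 8.1.3 Stemming: cut each word down to a crude root
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in filtered])        # 'playing' -> 'play'

    # 8.1.4 Lemmatization: find the proper base word using grammar
    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(t, pos="v") for t in filtered])
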
8.1.5) Bag of Words (BoW)
Converts text into numbers by counting how often each word appears in a document, ignoring word order.
8.1.6) TF-IDF
Short for Term Frequency-Inverse Document Frequency. Gives more weight to words that appear often in one document but rarely in others, which helps identify keywords.
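
A short sketch of both representations using scikit-learn’s vectorizers (the three toy documents are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["I love AI", "AI loves data", "data is everywhere"]

    # Bag of Words: each document becomes a vector of raw word counts
    bow = CountVectorizer()
    counts = bow.fit_transform(docs)
    print(bow.get_feature_names_out())    # vocabulary learned from the corpus
    print(counts.toarray())               # one count vector per document

    # TF-IDF: frequent-in-this-document but rare-overall words score higher
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(docs)
    print(weights.toarray().round(2))
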

8.2) Word Embeddings

Word embeddings turn words into vectors (numbers) so that a machine can understand their meaning and context.
8.2.1) Word2Vec
A model that learns how words are related based on their surrounding words.
8.2.2) GloVe
Learns word meanings from global co-occurrence statistics, i.e. how often words appear together across the whole corpus.
8.2.3) FastText
Similar to Word2Vec, but also learns from character n-grams (parts of words), which lets it build vectors even for unknown words.
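
A small sketch of both models with the gensim library (the tiny corpus and parameter values are illustrative, not recommended settings):

    from gensim.models import Word2Vec, FastText

    # Toy corpus: each sentence is a list of tokens
    sentences = [["i", "love", "ai"],
                 ["ai", "loves", "data"],
                 ["machines", "learn", "from", "data"]]

    # Word2Vec: learns a vector per word from its surrounding words
    w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
    print(w2v.wv.most_similar("data"))    # words with the closest vectors

    # FastText: also learns character n-grams, so even a word that never
    # appeared in training still gets a vector
    ft = FastText(sentences, vector_size=50, window=2, min_count=1)
    print(ft.wv["datapoint"][:5])         # out-of-vocabulary word
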
8.2.4) Sentence Embeddings (BERT, RoBERTa, GPT)
These models produce contextual embeddings and can represent full sentences as vectors. They capture context much better than older, static embeddings.
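
One convenient way to get sentence vectors is the sentence-transformers library; the model name below (“all-MiniLM-L6-v2”) is just one popular choice, assumed here for illustration:

    from sentence_transformers import SentenceTransformer, util

    # Pretrained sentence-embedding model (downloads on first use)
    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = ["I love AI.",
                 "Artificial intelligence is great.",
                 "It is raining outside."]
    embeddings = model.encode(sentences)   # one fixed-size vector per sentence

    # Sentences with similar meaning get similar vectors
    print(util.cos_sim(embeddings[0], embeddings[1]))   # high
    print(util.cos_sim(embeddings[0], embeddings[2]))   # low
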

8.3) Sequence Models

These models are good for processing data where order matters, like text.
8.3.1) RNN (Recurrent Neural Networks)
Processes a sequence one element at a time while keeping a hidden state, which makes it suited to sentences. Plain RNNs, however, struggle to remember distant context (the vanishing-gradient problem).
8.3.2) LSTM (Long Short-Term Memory)
An improved RNN that uses gates to decide what to remember and what to forget, so it can retain long-term information.
8.3.3) GRU (Gated Recurrent Unit)
A simplified version of the LSTM with fewer gates; it trains faster and often works just as well.
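
A minimal PyTorch sketch of an LSTM-based text classifier (the vocabulary size, dimensions, and class count are made-up examples); swapping nn.LSTM for nn.GRU or nn.RNN gives the other two variants:

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size=1000, embed_dim=64,
                     hidden_dim=128, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # token ids -> vectors
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, token_ids):        # token_ids: (batch, seq_len)
            x = self.embed(token_ids)        # (batch, seq_len, embed_dim)
            _, (h_n, _) = self.lstm(x)       # h_n: final hidden state
            return self.fc(h_n[-1])          # (batch, num_classes)

    model = LSTMClassifier()
    dummy = torch.randint(0, 1000, (4, 12))  # batch of 4 sequences, length 12
    print(model(dummy).shape)                # torch.Size([4, 2])
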

8.4) Transformer Architecture

Transformers are a powerful architecture used in almost all modern NLP systems.
8.4.1) Self-Attention Mechanism
This lets the model weigh how relevant every other word in a sentence is to each word, no matter where the words appear.
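
At its core, self-attention computes softmax(Q·Kᵀ / √d)·V, where the queries Q, keys K, and values V are projections of the same input. A NumPy sketch (random matrices stand in for the learned projection weights):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # Project the same input into queries, keys, and values
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Each word scores every other word; scaling keeps the scores stable
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = softmax(scores)        # attention weights, one row per word
        return weights @ V               # weighted mix of value vectors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))          # 5 "words", 16-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 16)
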
8.4.2) Encoder-Decoder Model
Used in tasks like translation, where the model reads input (encoder) and generates output (decoder).
8.4.3) Examples:
  • BERT: Great for understanding text.
  • GPT: Great for generating text.
  • T5: Can both understand and generate text for many tasks.
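
A quick way to try these model families is the Hugging Face transformers pipeline API (the model names below are common public checkpoints, assumed here for illustration):

    from transformers import pipeline

    # BERT-style model: fill in a masked word (understanding)
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("NLP is a [MASK] field.")[0]["token_str"])

    # GPT-style model: continue a prompt (generation)
    gen = pipeline("text-generation", model="gpt2")
    print(gen("Natural language processing is",
              max_new_tokens=20)[0]["generated_text"])
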

8.5) Text Classification

Assign a category label to a piece of text. Examples (a runnable sketch follows the list):
  • Sentiment Analysis: Is a review positive or negative?
  • Named Entity Recognition (NER): Find names, places, dates, etc. in text.
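
Both tasks are available as ready-made pipelines in Hugging Face transformers (default models are downloaded automatically, and exact outputs depend on the model):

    from transformers import pipeline

    # Sentiment analysis: positive or negative?
    clf = pipeline("sentiment-analysis")
    print(clf("This movie was absolutely wonderful!"))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99}]

    # Named Entity Recognition: find and group names, places, organizations
    ner = pipeline("ner", aggregation_strategy="simple")
    print(ner("Ada Lovelace was born in London."))
    # e.g. entities for 'Ada Lovelace' (person) and 'London' (location)
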

8.6) Language Generation

Generate new text from existing input.
8.6.1) Text Summarization
Shortens a long document while keeping important points.
8.6.2) Machine Translation
Translates text from one language to another (like English to Hindi).
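
A sketch of both tasks with transformers pipelines (the English-to-Hindi model name is an assumed example checkpoint):

    from transformers import pipeline

    # Summarization: shorten a document while keeping the key points
    summarizer = pipeline("summarization")
    article = ("Natural language processing lets computers work with human "
               "language. It powers chatbots, translation tools, and voice "
               "assistants, using techniques such as tokenization, embeddings, "
               "and transformers.")
    print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])

    # Machine translation: English to Hindi
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
    print(translator("I love artificial intelligence.")[0]["translation_text"])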