NLP Simplified Part 2 - Types of Vectorization Techniques

Hello friend! As an NLP enthusiast, I'm excited to dive deeper into vectorization – a critical technique for turning text into numerical representations that machine learning models can understand.

In Part 1 of this NLP series, we explored fundamental concepts like tokenization, normalization, and text cleaning to prepare data for downstream tasks.

Now we'll cover various vectorization methods in depth:

Bag of Words (BoW)
Term Frequency-Inverse Document Frequency (TF-IDF)
Word2Vec
GloVe (Global Vectors)
FastText

Each approach has distinct strengths and weaknesses depending on the use case. By the end, you'll have a comprehensive understanding of these techniques and when to apply them.

Let's start by recapping some key concepts.

Tokenization vs Vectorization

Before jumping into vectorization methods, let's clarify how tokenization differs from vectorization.

What is Tokenization?

Tokenization splits sentences into smaller units called tokens, helping computers process and understand text better.

For example, tokenizing the sentence "This article is good" would give us:

["This", "article", "is", "good"]

These tokens are the basic units that later steps, such as vectorization, turn into a format machine learning algorithms can ingest.
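As a rough sketch, the simplest tokenizer just splits on whitespace. Real tokenizers also deal with punctuation, contractions, and casing; the `simple_tokenize` helper below is a hypothetical name and one simple assumption about how to do that, not a standard.

```python
import re

sentence = "This article is good"

# Simplest case: split on whitespace
tokens = sentence.split()
print(tokens)  # ['This', 'article', 'is', 'good']

# Slightly more robust: lowercase, then keep runs of letters, digits, or apostrophes
def simple_tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

print(simple_tokenize("This article is good!"))  # ['this', 'article', 'is', 'good']
```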

What is Vectorization?

While tokenization separates text into tokens, vectorization encodes the tokens into numerical feature vectors.

Machine learning models require numerical data. Vectorization transforms text into vector representations capturing semantic meaning, allowing you to train models more accurately on text.

Why Do We Need Vectorization?

Here are some key reasons vectorization is crucial for NLP:

Tokenization breaks text down; vectorization encodes it as numbers.
Embedding-based methods capture semantic meaning: similar vectors imply similar meaning.
Dense embeddings reduce the dimensionality and sparsity of count-based representations, improving efficiency.
Algorithms like neural networks need numerical inputs.

Now let's explore popular vectorization techniques, starting with Bag of Words.

Bag of Words (BoW)

The Bag of Words model is a simple way to vectorize text for machine learning.

If you have a corpus of documents, BoW treats each doc like a "bag" filled with words. No word order or structure is considered.

Some key applications of Bag of Words include:

Text classification
Sentiment analysis
Document retrieval

For example, if you have a large collection of text data, a Bag of Words model will help represent the documents by creating a vocabulary of unique words across the corpus.

It encodes each document as a vector based on the frequency (count) of vocabulary words within that document.

The document vectors consist of non-negative integers (0, 1, 2, etc) representing word frequencies.

Let's walk through the 3 steps to create a Bag of Words model:

Step 1: Tokenization

Break documents into tokens.

text = "I love Pizza and Burgers"
tokens = ["I", "love", "Pizza", "and", "Burgers"]

Step 2: Build Vocabulary

Compile a list of all unique words (vocabulary).

vocab = ["I", "love", "Pizza", "and", "Burgers"]

Step 3: Vector Creation

Count each vocabulary term's frequency in each document and store the counts in a vector.

The result is a matrix with one row per document and one column per vocabulary word. Most entries are zero, so it is usually stored as a sparse matrix.
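To make the three steps concrete, here is a minimal from-scratch sketch using only the standard library. The second document and the `to_vector` helper are assumptions added for illustration, and lowercasing before counting is one common choice rather than a requirement.

```python
from collections import Counter

documents = [
    "I love Pizza and Burgers",
    "I love Pizza",  # second document added for illustration
]

# Step 1: tokenize (lowercased whitespace split)
tokenized = [doc.lower().split() for doc in documents]

# Step 2: build a sorted vocabulary of unique tokens across the corpus
vocab = sorted({token for tokens in tokenized for token in tokens})

# Step 3: one count vector per document, one column per vocabulary word
def to_vector(tokens, vocab):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [to_vector(tokens, vocab) for tokens in tokenized]

print(vocab)    # ['and', 'burgers', 'i', 'love', 'pizza']
print(vectors)  # [[1, 1, 1, 1, 1], [0, 0, 1, 1, 1]]
```

A dense list of lists is fine at this scale; with a realistic vocabulary of tens of thousands of words you would want a sparse format instead.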

# Sample Bag of Words in Python

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?", 
]

# Create vectorizer
vectorizer = CountVectorizer()

# Vectorize  
X = vectorizer.fit_transform(sentences)

# Vocabulary
print(vectorizer.get_feature_names_out()) 

# Vectorized data
print(X.toarray())

This prints the vocabulary and the vectorized sentences (labels added here for readability):

Vocabulary:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

Vectorized Sentences:
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

Bag of Words is simple to implement but has some limitations:

All words are treated equally, regardless of importance. Common words like "a" and "the" dominate.
Only word frequency considered – order/context lost.
Long documents can have inflated counts, making comparison difficult.
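A quick way to see the first two limitations is to count words in two sentences that mean opposite things. This is only a small illustration with made-up sentences, using `collections.Counter` as a stand-in for a Bag of Words vector.

```python
from collections import Counter

# Two sentences with opposite meanings but identical word counts
a = Counter("the dog bit the man".split())
b = Counter("the man bit the dog".split())

print(a == b)            # True: Bag of Words cannot tell these apart
print(a.most_common(1))  # [('the', 2)]: the top word says nothing about the topic
```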

TF-IDF improves upon Bag of Words by tackling the weighting and document-length problems, although word order is still ignored:

TF-IDF

TF-IDF (Term Frequency – Inverse Document Frequency) is a numerical representation that gives more weight to important words while reducing weights of common terms across documents.

It enhances Bag of Words by incorporating two metrics:

TF: Term Frequency – Frequency of word in a document

IDF: Inverse Document Frequency – Rarity of word across documents

The TF-IDF score is calculated as:

TF-IDF = TF * IDF

Where:

TF = (Number of times term t appears in the document) / (Total number of words in the document)

IDF = log(Total number of documents / Number of documents containing term t), using a base-10 logarithm in this article

Words with high TF-IDF are frequent in that document but rare overall, highlighting distinctive words for that document.

Let's walk through an example:

Doc 1: "I love machine learning"

Doc 2: "I love NLP"

Step 1: Create Vocabulary

Unique words: ["I", "love", "machine", "learning", "NLP"]

Step 2: Calculate TF

"I" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 1/3 ≈ 0.33

"love" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 1/3 ≈ 0.33

"machine" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 0

"learning" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 0

"NLP" TF: Doc 1 – 0, Doc 2 – 1/3 ≈ 0.33

Step 3: Calculate IDF

Total Documents (N) = 2

"I" IDF = log(N/2) = 0

"love" IDF = log(N/2) = 0

"machine" IDF = log(N/1) = log(2) ≈ 0.301

"learning" IDF = log(N/1) = log(2) ≈ 0.301

"NLP" IDF = log(N/1) = log(2) ≈ 0.301

Step 4: Calculate TF-IDF

"I" TF-IDF:

Doc 1 – 0.25 * 0 = 0
Doc 2 – 0.33 * 0 = 0

"love" TF-IDF:

Doc 1 – 0.25 * 0 = 0
Doc 2 – 0.33 * 0 = 0

And so on…

The final TF-IDF matrix would be:

          I       love    machine    learning    NLP
Doc 1     0       0       0.075      0.075       0
Doc 2     0       0       0          0           0.100

The TF-IDF method better highlights distinctive words for a document.
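If you'd like to check the arithmetic, the walkthrough can be reproduced in a few lines. This is a minimal sketch assuming the base-10 logarithm used in the example; the `tf` and `idf` helper names are just for illustration.

```python
import math

docs = {
    "Doc 1": "I love machine learning".split(),
    "Doc 2": "I love NLP".split(),
}
vocab = ["I", "love", "machine", "learning", "NLP"]
n_docs = len(docs)

def tf(term, tokens):
    # Share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def idf(term):
    # Base-10 log of (total documents / documents containing the term)
    df = sum(term in tokens for tokens in docs.values())
    return math.log10(n_docs / df)

tfidf = {
    name: [round(tf(term, tokens) * idf(term), 3) for term in vocab]
    for name, tokens in docs.items()
}

for name, row in tfidf.items():
    print(name, row)
# Doc 1 [0.0, 0.0, 0.075, 0.075, 0.0]
# Doc 2 [0.0, 0.0, 0.0, 0.0, 0.1]
```

At full precision the "NLP" score rounds to 0.100; multiplying the already-rounded 0.33 by 0.301 by hand gives 0.099 instead.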

Some key applications include:

Text classification
Document clustering
Building chatbots
Information retrieval

While TF-IDF improves upon Bag of Words, it still doesn't capture semantic meaning. More advanced techniques like Word2Vec overcome this limitation.

Word2Vec

Word2Vec is a popular NLP technique that represents words as continuous vectors capturing semantic meaning. It was introduced by Tomas Mikolov and colleagues at Google in 2013.

For example, "cat" and "dog" vectors would be closer than "cat" and "table" vectors, allowing mathematical similarity measurements.
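The usual similarity measure is cosine similarity, the cosine of the angle between two vectors. The sketch below uses tiny 3-dimensional vectors that are invented purely for illustration; real embeddings have tens or hundreds of dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for this example only
vectors = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["cat"], vectors["dog"]))    # about 0.99
print(cosine_similarity(vectors["cat"], vectors["table"]))  # about 0.30
```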

Word2Vec uses a shallow neural network with two architectures to learn word embeddings:

CBOW (Continuous Bag of Words): Predicts target word from surrounding context words.

Skip-gram: Predicts context words given target word. Better for rare words.
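The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The helpers below (`skipgram_pairs` and `cbow_examples`, both hypothetical names) are a simplified sketch: they only generate the examples and leave out the neural network itself, along with details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: the neighbors jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "python is a programming language".split()

print(skipgram_pairs(tokens, window=1)[:3])
# [('python', 'is'), ('is', 'python'), ('is', 'a')]

print(cbow_examples(tokens, window=1)[1])
# (['python', 'a'], 'is')
```

In gensim, the `sg` parameter of `Word2Vec` selects the architecture: `sg=0` (the default) trains CBOW and `sg=1` trains skip-gram.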

The embeddings created encode semantic relationships between words, useful for tasks like:

Document classification
Sentiment analysis
Named entity recognition
Question answering
Text summarization

Let's go through a Word2Vec example in Python with gensim:

# Import Word2Vec 
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    "I love Python",
    "Data science uses Python",   
    "Python is a programming language", 
]

# Tokenize (lowercase so that "Python" and "python" map to the same token)
tokenized_sents = [sent.lower().split() for sent in sentences]

# Train Word2Vec 
model = Word2Vec(sentences=tokenized_sents, vector_size=50, 
                 window=2, min_count=1) 

# Find similar words
similar_words = model.wv.most_similar("python")

print(similar_words)

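This prints the words nearest to "python" with their cosine-similarity scores. The toy corpus above is far too small to learn anything meaningful, and its neighbors can only be words from those three sentences. A model trained on a large corpus of programming text might instead return something like:

[('programming', 0.79959964752197266),
 ('language', 0.7734992504119873),
 ('code', 0.7659916877746582),
 ('Java', 0.76218009090423584),
 ('JavaScript', 0.7477637529373169)]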

The vectors capture semantic relationships. Word2Vec is extremely useful, but has some weaknesses:

Requires large datasets for training.
Context windows have a fixed size, so long-range dependencies are missed.
Only local context windows are used; corpus-wide co-occurrence statistics are ignored.

GloVe attempts to overcome some of these limitations.

GloVe

GloVe (Global Vectors) is another word embedding technique similar to Word2Vec. It combines the strengths of global matrix factorization and local context window methods.

Some advantages of GloVe over Word2Vec:

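Trains on global word co-occurrence counts, so corpus-wide statistics inform every vector.
Is often said to need less training data than Word2Vec, though this depends on the corpus and task.
Is reported by its authors to score higher on word-analogy benchmarks, although results vary by task.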

GloVe utilizes log-bilinear regression, training the model on global word-word co-occurrence counts.

The main idea is ratios of co-occurrence probabilities can capture meaningful semantic relationships between words.

These co-occurrence probabilities are encoded as word vector representations. Words with related meanings will have similar vector positioning.
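To give a feel for the raw material GloVe works with, here is a minimal sketch that counts word-word co-occurrences in a toy corpus. The corpus and the `cooccurrence_counts` helper are invented for illustration, and the sketch stops at counting: GloVe also down-weights distant pairs and then fits vectors to these counts, which is not shown here.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens of each other."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                # Symmetric counts: record the pair in both directions
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

corpus = [
    "ice is cold and solid",
    "steam is hot and gas",
    "ice is solid water",
]
counts = cooccurrence_counts(corpus, window=4)

print(counts[("ice", "solid")])    # 2
print(counts[("steam", "solid")])  # 0
```

Here "solid" co-occurs with "ice" but never with "steam", and it is exactly this kind of contrast that the ratio of co-occurrence probabilities picks up.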

Let's walk through a simple example to fetch GloVe vectors in Python:

import gensim.downloader as api

# Load pre-trained GloVe model 
model = api.load("glove-wiki-gigaword-300")   

# Get vector for word
model["phone"]

This returns a 300-dimensional vector encoding information about the word "phone".

We can also find similar words:

model.most_similar("phone")

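This returns a list of (word, similarity) pairs along these lines (illustrative; the exact neighbors and scores depend on the model, whose vocabulary is lowercase):

[('telephone', 0.704238224029541),
 ('phones', 0.7025121458053589),
 ('smartphone', 0.6222217321395874),
 ('mobile', 0.544503915309906),
 ('device', 0.4820767641067505)]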

The vectors capture semantic similarity.

GloVe works well for tasks like:

Named entity recognition
Co-reference resolution
Question answering
Text classification

However, GloVe still relies solely on words. This causes issues with out-of-vocabulary words not seen during training.

FastText overcomes this limitation by incorporating character n-gram information.

FastText

FastText is an efficient open-source library created by Facebook for text classification and representation learning.

Some advantages of FastText include:

Encodes subword information to handle unseen words.
Often copes better with small training sets than Word2Vec, because subword vectors are shared across words.
Provides pre-trained embeddings for 294 languages.

Rather than only using words, FastText represents each word as a bag of character n-grams. This allows encoding words not seen during training as the sum of their n-gram vectors.

For example, with n = 3, "language" is first wrapped in the boundary markers < and > and then split into:

<la, lan, ang, ngu, gua, uag, age, ge>

Where each n-gram has an associated vector. Summing these n-gram vectors produces the final word vector, allowing representations for unseen words.

This overcomes issues faced by Word2Vec and GloVe with rare and out-of-vocabulary words.
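Here is a small sketch of how those character n-grams can be generated. The `char_ngrams` helper is a hypothetical name; the 3-to-6 range mirrors the n-gram lengths described in the FastText paper, and the real library additionally hashes n-grams into buckets and keeps a vector for the whole word, neither of which is shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("language", min_n=3, max_n=3))
# ['<la', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age', 'ge>']

# A word never seen in training still shares most n-grams with a known word
shared = set(char_ngrams("language", 3, 3)) & set(char_ngrams("languages", 3, 3))
print(len(shared))  # 7
```

Because "languages" shares seven of its trigrams with "language", its vector ends up close to that of the known word even if it never appeared in the training data.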

Let's go through a simple text classification example with FastText in Python:

import fasttext

# Labeled training data: each line starts with __label__<name>
training_data = ["__label__cool fasttext is cool",
                 "__label__fun text classification is fun"]

# train_supervised reads from a file, so write the examples to disk first
with open("train.txt", "w") as f:
    f.write("\n".join(training_data) + "\n")

# Train model
model = fasttext.train_supervised(input="train.txt")

# Predict
print(model.predict("i love fasttext"))

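This returns the most likely label and its probability, in a form like the following. With only two training examples, the exact label and score will vary and mean little:

(('__label__cool',), array([0.5905291]))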

FastText works very well for text classification, and its models can be compressed (quantized) for memory-constrained settings. Some key applications include:

Sentiment analysis
Natural language understanding
Document classification
Predicting hashtags (recommender systems)

The subword approach also makes FastText suitable for morphologically rich languages like Turkish, Finnish, etc.

Overall, FastText is an efficient and flexible technique for text classification and representation learning.

Comparing Vectorization Methods

We've covered the most popular techniques for vectorizing text in NLP. Here's a quick comparison:

Method	Description	Strengths	Weaknesses	Use Cases
Bag of Words	Counts word frequencies	Simple to implement	No word importance weighting or order	Text classification, information retrieval
TF-IDF	Weights terms by uniqueness	Accounts for word importance	No semantics	Text classification, search engines
Word2Vec	Neural embedding method	Captures word semantics	Requires large data, fixed windows	Sentiment analysis, chatbots, QA
GloVe	Log-bilinear regression	Leverages global co-occurrences	Relies solely on words	Named entity recognition, analogies
FastText	Extension of Word2Vec with n-grams	Handles out-of-vocab words	More complex	Text classification, compression

To summarize:

BoW and TF-IDF work well for simpler text classification and search tasks.
Word2Vec creates high quality semantic embeddings but needs more data.
GloVe utilizes global statistics for improved analogies and rare words.
FastText incorporates character information to handle unseen words.

The optimal technique depends on your specific dataset, model capabilities, and end goals.

I hope this guide gave you a comprehensive understanding of these vectorization techniques! Let me know if you have any other questions.

We've only scratched the surface of NLP – stay tuned for more advanced natural language processing techniques in the next parts of this series. Happy learning!