
Hello friend! As an NLP enthusiast, I'm excited to dive deeper into vectorization – a critical technique for turning text into numerical representations that machine learning models can understand.
In Part 1 of this NLP series, we explored fundamental concepts like tokenization, normalization, and text cleaning to prepare data for downstream tasks.
Now we'll cover various vectorization methods in depth:
- Bag of Words (BoW)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Word2Vec
- GloVe (Global Vectors)
- FastText
Each approach has distinct strengths and weaknesses depending on the use case. By the end, you'll have a comprehensive understanding of these techniques and when to apply them.
Let's start by recapping some key concepts.
Tokenization vs Vectorization
Before jumping into vectorization methods, let's clarify how vectorization differs from tokenization.
What is Tokenization?
Tokenization splits sentences into smaller units called tokens, helping computers process and understand text better.
For example, tokenizing the sentence "This article is good" would give us:
["This", "article", "is", "good"]
These tokens are the basic units that later processing steps, including vectorization, operate on. They are still text, so a model cannot consume them directly.
What is Vectorization?
While tokenization separates text into tokens, vectorization encodes those tokens as numerical feature vectors.
Machine learning models require numerical input. Vectorization turns text into vectors that preserve useful information about it (word counts in the simplest methods, semantic similarity in embedding methods), so that models can be trained on text.
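As a minimal sketch of the difference, the snippet below tokenizes the sentence from the tokenization example and then encodes the tokens as a count vector. The `tokenize` and `vectorize` helpers and the five-word vocabulary are made up for this illustration; a real pipeline would build the vocabulary from a corpus.

```python
import re

def tokenize(text):
    """Tokenization: split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def vectorize(tokens, vocabulary):
    """Vectorization: count how often each vocabulary word occurs."""
    return [tokens.count(word) for word in vocabulary]

# Hypothetical vocabulary, fixed by hand for this illustration
vocabulary = ["article", "bad", "good", "is", "this"]

tokens = tokenize("This article is good")
vector = vectorize(tokens, vocabulary)

print(tokens)  # ['this', 'article', 'is', 'good']
print(vector)  # [1, 0, 1, 1, 1]
```

The tokens are still strings; only the final list of integers is something a model can take as input.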
Why Do We Need Vectorization?
Here are some key reasons vectorization is crucial for NLP:
- Tokenization only breaks text into pieces; vectorization encodes those pieces as numbers.
- Embedding methods capture semantic meaning: similar vectors imply similar meaning.
- Dense embeddings have far fewer dimensions than sparse count vectors, which improves efficiency.
- Algorithms like neural networks need numerical inputs.
Now let's explore popular vectorization techniques, starting with Bag of Words.
Bag of Words (BoW)
The Bag of Words model is a simple way to vectorize text for machine learning.
If you have a corpus of documents, BoW treats each document like a "bag" filled with words. Word order and sentence structure are ignored.
Some key applications of Bag of Words include:
- Text classification
- Sentiment analysis
- Document retrieval
For example, if you have a large collection of text data, a Bag of Words model will help represent the documents by creating a vocabulary of unique words across the corpus.
It encodes each document as a vector based on the frequency (count) of vocabulary words within that document.
The document vectors consist of non-negative integers (0, 1, 2, etc) representing word frequencies.
Let's walk through the 3 steps to create a Bag of Words model:
Step 1: Tokenization
Break documents into tokens.
text = "I love Pizza and Burgers"
tokens = ["I", "love", "Pizza", "and", "Burgers"]
Step 2: Build Vocabulary
Compile a list of all unique words (vocabulary).
vocab = ["I", "love", "Pizza", "and", "Burgers"]
Step 3: Vector Creation
Count each vocabulary term's frequency in each document and store the counts in a vector.
Stacking these vectors gives a sparse matrix with one row per document and one column per vocabulary term.
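The three steps above can be sketched in plain Python before reaching for a library. This is an illustrative sketch only: the second document ("I love Pizza") is an assumption added so that the two vectors differ, and tokenization is a simple case-sensitive whitespace split.

```python
from collections import Counter

documents = [
    "I love Pizza and Burgers",
    "I love Pizza",  # hypothetical second document, added for contrast
]

# Step 1: tokenization (whitespace split, case-sensitive for simplicity)
tokenized = [doc.split() for doc in documents]

# Step 2: vocabulary of unique words, in order of first appearance
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# Step 3: one count vector per document, one column per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['I', 'love', 'Pizza', 'and', 'Burgers']
print(vectors)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Each row has the same length as the vocabulary, which is why the matrix grows wide and mostly zero as the corpus grows.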
# Sample Bag of Words in Python
from sklearn.feature_extraction.text import CountVectorizer
sentences = [
"This is the first document.",
"This is the second document.",
"And this is the third one.",
"Is this the first document?",
]
# Create vectorizer
vectorizer = CountVectorizer()
# Vectorize
X = vectorizer.fit_transform(sentences)
# Vocabulary
print(vectorizer.get_feature_names_out())
# Vectorized data
print(X.toarray())
This prints the vocabulary and vectorized sentences:
Vocabulary:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Vectorized Sentences:
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Bag of Words is simple to implement but has some limitations:
- All words are treated equally, regardless of importance, so common words like "a" and "the" dominate.
- Only word frequency is considered; order and context are lost.
- Long documents have inflated counts, making comparison with short documents difficult.
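The loss of word order is easy to demonstrate. In this small sketch (the two sentences and the `bag_of_words` helper are made up for illustration), two sentences with opposite meanings produce exactly the same bag:

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts, ignoring order (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a == b)    # True: different meanings, identical bags
print(a["the"])  # 2: the common word has the largest count
```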
To overcome some of these issues, TF-IDF improves upon Bag of Words:
TF-IDF
TF-IDF (Term Frequency – Inverse Document Frequency) is a numerical representation that gives more weight to important words while reducing weights of common terms across documents.
It enhances Bag of Words by incorporating two metrics:
TF: Term Frequency – Frequency of word in a document
IDF: Inverse Document Frequency – Rarity of word across documents
The TF-IDF score is calculated as:
TF-IDF = TF * IDF
Where:
TF = Number of times term t appears in document / Total words in document
IDF = log(Total documents / Documents containing term t)
The base of the logarithm is a convention; the worked example below uses base 10.
Words with high TF-IDF are frequent in that document but rare overall, highlighting distinctive words for that document.
Let's walk through an example:
Doc 1: "I love machine learning"
Doc 2: "I love NLP"
Step 1: Create Vocabulary
Unique words: ["I", "love", "machine", "learning", "NLP"]
Step 2: Calculate TF
"I" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 1/3 ≈ 0.33
"love" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 1/3 ≈ 0.33
"machine" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 0
"learning" TF: Doc 1 – 1/4 = 0.25, Doc 2 – 0
"NLP" TF: Doc 1 – 0, Doc 2 – 1/3 ≈ 0.33
Step 3: Calculate IDF
Total Documents (N) = 2
"I" IDF = log(N/2) = 0
"love" IDF = log(N/2) = 0
"machine" IDF = log(N/1) = log(2) ≈ 0.301
"learning" IDF = log(N/1) = log(2) ≈ 0.301
"NLP" IDF = log(N/1) = log(2) ≈ 0.301
Step 4: Calculate TF-IDF
"I" TF-IDF:
- Doc 1 – 0.25 * 0 = 0
- Doc 2 – 0.33 * 0 = 0
"love" TF-IDF:
- Doc 1 – 0.25 * 0 = 0
- Doc 2 – 0.33 * 0 = 0
And so on…
The final TF-IDF matrix (rounded to three decimals) would be:
| | I | love | machine | learning | NLP |
|---|---|---|---|---|---|
| Doc 1 | 0 | 0 | 0.075 | 0.075 | 0 |
| Doc 2 | 0 | 0 | 0 | 0 | 0.100 |
The TF-IDF method better highlights distinctive words for a document.
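The worked example can be reproduced with a few lines of plain Python. This sketch follows the formulas given above with a base-10 logarithm; the `tf` and `idf` helper names are illustrative, and `idf` assumes every vocabulary term occurs in at least one document. With unrounded TF (1/3 rather than 0.33), the "NLP" score for Doc 2 comes out at 0.100.

```python
import math

docs = {
    "Doc 1": "I love machine learning".split(),
    "Doc 2": "I love NLP".split(),
}
vocab = ["I", "love", "machine", "learning", "NLP"]
n_docs = len(docs)

def tf(term, tokens):
    """Term frequency: occurrences of term divided by document length."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Inverse document frequency with a base-10 log (assumes df > 0)."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(n_docs / df)

tfidf = {
    name: [round(tf(term, tokens) * idf(term), 3) for term in vocab]
    for name, tokens in docs.items()
}
for name, row in tfidf.items():
    print(name, row)
# Doc 1 [0.0, 0.0, 0.075, 0.075, 0.0]
# Doc 2 [0.0, 0.0, 0.0, 0.0, 0.1]
```

Library implementations often use a different variant. scikit-learn's `TfidfVectorizer`, for instance, defaults to a smoothed natural-log IDF and L2-normalized rows, so its numbers will not match this table.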
Some key applications include:
- Text classification
- Document clustering
- Building chatbots
- Information retrieval
While TF-IDF improves upon Bag of Words, it still doesn't capture semantic meaning. More advanced techniques like Word2Vec overcome this limitation.
Word2Vec
Word2Vec is a popular NLP technique that represents words as continuous vectors capturing semantic meaning. It was introduced by Tomas Mikolov and colleagues at Google in 2013.
For example, "cat" and "dog" vectors would be closer than "cat" and "table" vectors, allowing mathematical similarity measurements.
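A common way to measure that closeness is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real Word2Vec vectors are learned from data and typically have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only (not learned embeddings)
vectors = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_table = cosine_similarity(vectors["cat"], vectors["table"])
print(cat_dog > cat_table)  # True
```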

Word2Vec trains a shallow neural network to learn word embeddings, using one of two architectures:
CBOW (Continuous Bag of Words): Predicts the target word from its surrounding context words.
Skip-gram: Predicts the context words given the target word. It tends to work better for rare words.
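The difference between the two architectures is mostly in how training examples are built from a context window. The sketch below only generates those examples (it does not train anything); the `context_windows` helper and the sample sentence are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "python is a programming language".split()

# Skip-gram: one (input=target, output=context word) pair per neighbour
skipgram_pairs = [
    (target, ctx)
    for target, context in context_windows(tokens)
    for ctx in context
]

# CBOW: one (input=all context words, output=target) example per position
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

print(skipgram_pairs[:3])
# [('python', 'is'), ('python', 'a'), ('is', 'python')]
print(cbow_examples[2])
# (['python', 'is', 'programming', 'language'], 'a')
```

Skip-gram produces several training pairs per word occurrence, which is one intuition for why it copes better with rare words.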
The embeddings created encode semantic relationships between words, useful for tasks like:
- Document classification
- Sentiment analysis
- Named entity recognition
- Question answering
- Text summarization
Let's go through a Word2Vec example in Python with gensim:
# Import Word2Vec
from gensim.models import Word2Vec
# Sample sentences
sentences = [
"I love Python",
"Data science uses Python",
"Python is a programming language",
]
# Tokenize (lowercased so the "python" lookup below matches the vocabulary)
tokenized_sents = [sent.lower().split() for sent in sentences]
# Train Word2Vec
model = Word2Vec(sentences=tokenized_sents, vector_size=50,
window=2, min_count=1)
# Find similar words
similar_words = model.wv.most_similar("python")
print(similar_words)
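This outputs the most similar words to "python" with scores. On a three-sentence corpus the neighbours can only be words from those sentences, and the scores are essentially noise. With a model trained on a large corpus, the output looks more like this (illustrative values):
[('programming', 0.79959964752197266),
('language', 0.7734992504119873),
('code', 0.7659916877746582),
('Java', 0.76218009090423584),
('JavaScript', 0.7477637529373169)]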
The vectors capture semantic relationships. Word2Vec is extremely useful, but has some weaknesses:
- Requires large datasets for training.
- Uses only fixed-size local context windows, so long-range dependencies and corpus-wide statistics are not used directly.
GloVe attempts to overcome the second limitation.
GloVe
GloVe (Global Vectors) is another word embedding technique similar to Word2Vec. It combines the strengths of global matrix factorization and local context window methods.
Some commonly cited advantages of GloVe over Word2Vec:
- Trains on global word co-occurrence counts rather than one context window at a time.
- Is often said to need less training data than Word2Vec, although this depends on the corpus and hyperparameters.
- Performs well on word analogy tasks in its authors' reported experiments.
GloVe fits a log-bilinear regression model to global word-word co-occurrence counts.
The main idea is that ratios of co-occurrence probabilities can capture meaningful semantic relationships between words.
The word vectors are trained so that they encode these co-occurrence statistics. As a result, words with related meanings end up with similar vectors.
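To make "co-occurrence counts" and "ratios of co-occurrence probabilities" concrete, here is a small sketch on a made-up four-sentence corpus. It uses plain unweighted counts within a window, which is a simplification, and it does not train any vectors; the `cooccurrence_counts` and `probability` helpers are hypothetical names.

```python
from collections import defaultdict

# Toy corpus, made up for illustration
corpus = [
    "ice is a cold solid",
    "steam is a hot gas",
    "ice is frozen water",
    "steam is boiling water",
]

def cooccurrence_counts(sentences, window=4):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if i != j:
                    counts[(word, tokens[j])] += 1.0
    return counts

counts = cooccurrence_counts(corpus)

def probability(context, word):
    """P(context | word): share of word's co-occurrences involving context."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    return counts.get((word, context), 0.0) / total

# "solid" is related to ice but not to steam
print(probability("solid", "ice") > probability("solid", "steam"))  # True
# "water" is equally related to both, so the ratio is 1
print(probability("water", "ice") / probability("water", "steam"))  # 1.0
```

A ratio far from 1 signals a word that discriminates between "ice" and "steam"; a ratio near 1 signals a word that is equally related (or unrelated) to both.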
Let's walk through a simple example to fetch GloVe vectors in Python:
import gensim.downloader as api
# Load pre-trained GloVe model
model = api.load("glove-wiki-gigaword-300")
# Get vector for word
model["phone"]
This returns a 300-dimensional vector encoding information about the word "phone".
We can also find similar words:
model.most_similar("phone")
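Giving us output along these lines (illustrative; this model's vocabulary is lowercased, and exact neighbours and scores may differ):
[('telephone', 0.704238224029541),
('phones', 0.7025121458053589),
('smartphone', 0.6222217321395874),
('mobile', 0.544503915309906),
('device', 0.4820767641067505)]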
The vectors capture semantic similarity.
GloVe works well for tasks like:
- Named entity recognition
- Co-reference resolution
- Question answering
- Text classification
However, GloVe still works with a fixed vocabulary of whole words. This causes issues with out-of-vocabulary words not seen during training.
FastText overcomes this limitation by incorporating character n-gram information.
FastText
FastText is an efficient open-source library created by Facebook's AI Research lab for text classification and representation learning.
Some advantages of FastText include:
- Encodes subword information to handle unseen words.
- Can produce reasonable vectors for rare words from less data, because n-grams are shared across words.
- Provides pre-trained embeddings for 294 languages.
Rather than only using whole words, FastText represents each word as a bag of character n-grams. This allows encoding words not seen during training as the sum of their n-gram vectors.
For example, with n = 3, "language" is first wrapped in boundary markers as "<language>" and then split into:
<la lan ang ngu gua uag age ge>
Each n-gram has an associated vector. Summing these n-gram vectors produces the final word vector, so an unseen word still gets a representation from the n-grams it shares with known words.
This overcomes issues faced by Word2Vec and GloVe with rare and out-of-vocabulary words.
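The n-gram extraction itself is simple to sketch. The `char_ngrams` helper below is a hypothetical illustration, not the library's own code, and it uses only n = 3 to keep the output short (FastText normally uses a range of n-gram lengths and also keeps the whole word as its own unit).

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

seen = char_ngrams("language")
unseen = char_ngrams("languages")  # pretend this word was not in training

print(seen)
# ['<la', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age', 'ge>']

shared = [g for g in unseen if g in seen]
print(len(shared), "of", len(unseen))  # 7 of 9
```

Because most of the unseen word's n-grams already have trained vectors, their sum gives a usable representation even though the word itself was never observed.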
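Let's go through a simple text classification example with FastText in Python:
import fasttext
# Labeled training data: one example per line, each prefixed with __label__<name>
training_data = ["__label__cool fasttext is cool",
"__label__fun text classification is fun"]
# train_supervised reads from a file path, so write the examples to disk first
with open("train.txt", "w") as f:
    f.write("\n".join(training_data) + "\n")
# Train model
model = fasttext.train_supervised(input="train.txt")
# Predict
print(model.predict("i love fasttext"))
This returns the most likely label and its probability, in a form like the following (the exact label and value will vary on a dataset this small):
(('__label__cool',), array([0.5905291]))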
FastText works very well for text classification, and its models can be compressed (quantized) for memory-constrained settings. Some key applications include:
- Sentiment analysis
- Natural language understanding
- Document classification
- Predicting hashtags (recommender systems)
The subword approach also makes FastText suitable for morphologically rich languages such as Turkish and Finnish.
Overall, FastText is an efficient and flexible technique for text classification and representation learning.
Comparing Vectorization Methods
We've covered the most popular techniques for vectorizing text in NLP. Here's a quick comparison:
| Method | Description | Strengths | Weaknesses | Use Cases |
|---|---|---|---|---|
| Bag of Words | Counts word frequencies | Simple to implement | No word importance weighting or order | Text classification, information retrieval |
| TF-IDF | Weights terms by uniqueness | Accounts for word importance | No semantics | Text classification, search engines |
| Word2Vec | Neural embedding method | Captures word semantics | Requires large data, fixed windows | Sentiment analysis, chatbots, QA |
| GloVe | Log-bilinear regression | Leverages global co-occurrences | Relies solely on words | Named entity recognition, analogies |
| FastText | Extension of Word2Vec with character n-grams | Handles out-of-vocab words | More complex | Text classification, morphologically rich languages |
To summarize:
- BoW and TF-IDF work well for simpler text classification and search tasks.
- Word2Vec creates high-quality semantic embeddings but needs more data.
- GloVe uses global co-occurrence statistics and is reported to do well on analogy tasks.
- FastText incorporates character information to handle unseen words.
The optimal technique depends on your specific dataset, model capabilities, and end goals.
I hope this guide gave you a comprehensive understanding of these vectorization techniques! Let me know if you have any other questions.
We've only scratched the surface of NLP – stay tuned for more advanced natural language processing techniques in the next parts of this series. Happy learning!