Last year (before I vanished off the face of the planet), I took a Natural Language Processing course that had us working out of Jurafsky and Martin’s brilliant Speech and Language Processing textbook.
I love NLP, but NLP instruction can be (imo) opaque and unnecessarily math-heavy. This textbook is a happy medium, but I wanted to work through the exercises in code instead of equations, which I hate.
This post is based on material from chapter 6 of the 2019 edition.
So, tf is tf-idf?#
- tf: term frequency
- idf: inverse document frequency
That’s the acronym, but tf-idf is also used to describe a model for representing text - more specifically, documents.
tf-idf is best known for its role in information retrieval.
Early search engines used the tf-idf model to support user queries: given a query like “Ginger Spice”, which documents are relevant?
Once we’ve found relevant documents, how do we rank them? (Is the user looking for information about the Spice Girls, or information about the root?)
Here’s how we create a tf-idf model, in two simple steps:
- Index all documents. For each document, count how often each word appears. A document like “I really, really like cake.” is broken up into a map like `{'I': 1, 'really': 2, 'like': 1, 'cake': 1}`.
- Calculate `tf`, `idf`, and `tf-idf` for all documents. This is the creepy part o_o
In other words, this is how we create a term-document matrix. Let’s walk through it.
Creating the term-document matrix#
Indexing all documents#
Given a list of document file paths, we want a word-count map per document. First, a small helper that counts the words in a single string:
```python
import re
from collections import Counter

def get_cnt_string(text):
    """Count up word occurrences for a single string."""
    # Remove punctuation, then lowercase and split on whitespace
    regex = re.compile(r"[,\.!?]")
    words = regex.sub("", text.lower()).split()
    return Counter(words)
```
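As a quick sanity check, here’s what this helper returns for the cake sentence from earlier (note that everything gets lowercased):

```python
print(get_cnt_string("I really, really like cake."))
# Counter({'really': 2, 'i': 1, 'like': 1, 'cake': 1})
```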
Then, a function that runs this helper over every document:

```python
def get_per_doc_cnt(documents):
    """
    Count up word occurrences for each document.
    """
    doc_counters = {}
    for document in documents:
        with open(document, "r", encoding="utf-8") as f:
            text = f.read()
        counter = get_cnt_string(text)
        # Clean document name
        # (I put my documents under a sibling 'input' dir)
        doc_name = document.replace("input/", "").replace(".txt", "")
        doc_counters[doc_name] = counter
    return doc_counters
```
We store each document’s counter alongside its name.
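For example, given a (hypothetical) three-document corpus under the input/ dir:

```python
documents = ["input/hollywood.txt", "input/spice_girls.txt", "input/ginger.txt"]
doc_counters = get_per_doc_cnt(documents)
# {'hollywood': Counter({...}), 'spice_girls': Counter({...}), 'ginger': Counter({...})}
```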
Calculating tf, idf, and tf-idf for all documents#
Calculating tf#
Term frequency refers to the frequency of a word in the document. This isn’t quite the same as the raw word counts we calculated a second ago.

For a single document, given a single word and its count, we calculate the word’s tf with:

```python
tf = 1 + np.log10(count)
```

Put simply, we compute tf by taking the common logarithm of the word count and adding 1. The +1 matters because log10(1) = 0: without it, a word that appears exactly once would get a tf of 0. And since the common logarithm of 0 is undefined, this formula only applies when count > 0; for a word that doesn’t appear in the document at all, we set tf = 0.
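Here’s that rule as a tiny helper (a minimal sketch; calculate_tf is my name for it, not something from the textbook):

```python
import numpy as np

def calculate_tf(count):
    """Log-scaled term frequency: 1 + log10(count), or 0 for absent terms."""
    return 1 + np.log10(count) if count > 0 else 0
```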
Note
log10 appears in many “magical formulas” of natural language processing / machine learning / etc. Why?
log10 lets us translate counts into orders of magnitude.
If the word ‘the’ appears in my document 5000 times and ‘dog’ appears 5 times, then log10(5000) ≈ 3.7 while log10(5) ≈ 0.7. So, the raw count of ‘the’ is three orders of magnitude larger than ‘dog’, which log10 encodes as a difference of about 3 in term frequency.
You can think of log10 as a convenient way to compress large numbers and make radically different counts easier to compare across a dataset.
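To make that concrete (a quick check with numpy; values rounded):

```python
import numpy as np

tf_the = 1 + np.log10(5000)  # 1 + ~3.7 = ~4.7
tf_dog = 1 + np.log10(5)     # 1 + ~0.7 = ~1.7
```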
Calculating idf#
Document frequency of a term refers to the number of documents that contain a term. Inverse document frequency, then, refers to the rarity of a term across all documents, or across the corpus. If a term appears in many documents, then its inverse document frequency should be lower.
We calculate a term’s idf with:

\[ \text{idf} = \log_{10}\!\left(\frac{\text{number of documents}}{\text{number of documents containing the term}}\right) \]

(Again, we’re using log10 to compress the scale.) For example, in a ten-document corpus, a term that appears in just one document gets an idf of log10(10/1) = 1, while a term that appears in all ten documents gets an idf of log10(10/10) = 0.
Given the per-document word counts we just built, this function computes tf, idf, and tf-idf for each term in each document:

```python
import numpy as np
import pandas as pd

def calculate_tf_idf_for_docs(doc_counters):
    """
    Calculate tf, idf, and tf-idf for each term in each document.
    """
    data = []
    N = len(doc_counters)
    for doc_name, counter in doc_counters.items():
        for word, count in counter.items():
            # Calculate tf using log scaling
            tf = 1 + np.log10(count)
            # Calculate document frequency
            df = sum(1 for c in doc_counters.values() if word in c)
            # Avoid division by zero if the word doesn't appear in any document
            idf = np.log10(N / df) if df > 0 else 0
            tf_idf = tf * idf
            # Pandas dataframes will make it easier to query this data later
            data.append(
                {
                    "document": doc_name,
                    "word": word,
                    "count": count,
                    "tf": round(tf, 3),
                    "idf": round(idf, 3),
                    "tf_idf": round(tf_idf, 3),
                }
            )
    return pd.DataFrame(data)
```
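Once the dataframe exists, querying it is straightforward. For instance, pulling the highest-scoring terms for one of the (hypothetical) documents indexed earlier:

```python
df = calculate_tf_idf_for_docs(doc_counters)

# Top 10 terms for the 'hollywood' document, highest tf-idf first
top_terms = (
    df[df["document"] == "hollywood"]
    .sort_values("tf_idf", ascending=False)
    .head(10)
)
print(top_terms)
```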
tf-idf explained#
The above code snippet calculates tf-idf, and it’s almost laughably simple: `tf_idf = tf * idf`
What is this value actually supposed to represent?
If we think about how multiplication works:
- A term with high `tf` and high `idf` will have a high `tf-idf`. The term appears frequently in this document, but rarely appears in other documents. (e.g., the term `film` in the Wikipedia page for Hollywood, Los Angeles)
- A term with high `tf` and low `idf` will have a medium `tf-idf`. The term appears frequently in this document, but also appears frequently in other documents. (e.g., the term `the` in any random Wikipedia page)
- A term with low `tf` and high `idf` will have a medium `tf-idf`. The term appears infrequently in this document, and appears infrequently in other documents too. (e.g., the term `googolplex` in the Wikipedia page for Live at the Googolplex)
- A term with low `tf` and low `idf` will have a low `tf-idf`. The term appears infrequently in this document, and appears frequently in other documents. (e.g., the term `also` appearing once in any random Wikipedia page)
Remember, tf-idf is used in information retrieval.
If a term has a high tf-idf score in a document, then we can assume that the term is especially relevant to that document.
If a user’s query contains this term, then the document is probably relevant, and we should consider including it in results.
Preparing the query#
After creating the term-document matrix, we’ll effectively repeat our work for our query.
The query is expected to be a string, something like: `school supplies`.
First, we get counts for the query. We can use the get_cnt_string helper directly:
```python
def get_cnt_string(text):
    """Count up word occurrences for a single string."""
    # Remove punctuation, then lowercase and split on whitespace
    regex = re.compile(r"[,\.!?]")
    words = regex.sub("", text.lower()).split()
    return Counter(words)
```
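For the example query, that gives:

```python
print(get_cnt_string("school supplies"))
# Counter({'school': 1, 'supplies': 1})
```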
Then, calculate tf, idf, and tf-idf for the query:
```python
def calculate_tf_idf_for_query(query, doc_counters):
    """
    Calculate tf, idf, and tf-idf for each term in the query,
    referencing corpus counts.
    """
    data = []
    # Count the query's words, then wrap the counter in a dict
    # with 'query' as the key
    query_counters = {"query": get_cnt_string(query)}
    N = len(doc_counters)
    for doc_name, counter in query_counters.items():
        for word, count in counter.items():
            # Calculate tf using log scaling
            tf = 1 + np.log10(count)
            # Calculate idf based on the document corpus
            df = sum(1 for c in doc_counters.values() if word in c)
            # Avoid division by zero if the word doesn't appear in any document
            idf = np.log10(N / df) if df > 0 else 0
            tf_idf = tf * idf
            data.append(
                {
                    "document": doc_name,
                    "word": word,
                    "count": count,
                    "tf": round(tf, 3),
                    "idf": round(idf, 3),
                    "tf_idf": round(tf_idf, 3),
                }
            )
    return pd.DataFrame(data)
```
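And a quick usage sketch, scoring the example query against the hypothetical corpus from earlier:

```python
query_df = calculate_tf_idf_for_query("school supplies", doc_counters)
# One row per query term, scored against the corpus
print(query_df)
```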
This is very similar to the calculate_tf_idf_for_docs function that we defined earlier. (You could collapse both tf-idf calculations into a single function, but I split them out for the sake of clarity.)
Hopefully this helps to demystify the intuitions behind tf, idf, and tf-idf!
If we wanted to extend this script to support actual information retrieval and document ranking, we’d need to delve into word vectors, vector math, cosine similarity, and other lunacy. I’ll cover these topics in a follow-up post, since I think they deserve a focused explanation.
