We work through the original Bayesian Spam Classifier paper from Microsoft Research. And then make it practical by applying to the Enron spam dataset
Dec 7, 2025 — 23 min read

For me, this exploration of Applied Bayesian Networks started with a 1998 paper from Microsoft Research titled "A Bayesian Approach to Filtering Junk E-Mail".
Rule Based Systems Didn’t Work
It’s a historic document, showing how researchers and engineers began to recognize spam for what it is and devised ways to tame it. Previously, people relied on tons of rigid manual rules, but they were not the right tool for the job. These manual filters were often either too harsh or too lenient, but rarely "just right".
Judea Pearl’s Bayesian Networks to the Rescue
By 1998, Judea Pearl’s pioneering work in AI, particularly Bayesian networks—which described how a "chain of events" leads from Event A to Event B via a path of influences—was well-known enough to have occurred to the paper’s researchers as a valid option to combat spam.
Pearl’s framework is of profound value, as it deals with the larger topic of Causality at a higher level. For now, however, we’ll restrict ourselves to the "mundane" application of spam fighting.
Representing A Network of Events with Graphs (and a Table)
Bayesian Networks Represent Events Influencing Other Events
The underlying mathematical structure from which Bayesian spam filters were invented is called the Bayesian Network. Judea Pearl was the major figure in pushing the idea of Bayesian Networks forward. The simplest network looks something like:
A -> B
A is a random variable. B is a random variable. A (parent) influences B (child).
What is the point of Bayesian Networks? They help us represent "joint probabilities" in an easy-to-understand way. Joint probability means there are multiple connected events that need to be represented, and we are interested in the probability of each happening, based on the probabilities of the other related events. Bayesian Networks make it possible for us to put these events and probabilities into a graph (a Directed Acyclic Graph) and a table of probabilities.
In the real world - where many events happen, in all sorts of sequences - we end up with complex graph representations. How do Bayesian Network (BN) ideas help us cope?
The core idea in BN is that not every variable or node depends on every other variable. Most relationships in the real world are local. One node influences only the nearby nodes, and those nodes in turn affect further neighboring nodes - like ripples pushing through water.
DAG: Nodes, Parents, Arrows
A BN is a Directed Acyclic Graph (a DAG, as it is commonly known in CS):
Each node is a random variable.
Each arrow is an influence - that is, the parent helps "determine" the child.
Example:
Rain → WetGrass → Slip
This says: Rain affects whether the grass is wet, and wet grass affects whether someone slips.
Parents Explain the Node (if They’re Known)
BN is governed by the “local Markov Property” which says something like:
Once you know a node’s parents, nothing else in the network can help you predict that node any better.
This is the heart of the entire theory.
It tells us that all the information needed to determine a variable is already captured in its parents. Once the parents become known, there is no better way to determine the node than via the parents.
For example, if you know whether it rained and whether the sprinkler was on, then:
- the stock market,
- the temperature,
- or whether someone slipped
add no new information for predicting whether the grass is wet.
The direct causes are enough.
Conditional Probability Tables (CPT)
Another technical component of BN is the Conditional Probability Table (CPT).
For each node, the CPT lists the possible states of the node's parents and, for each combination of parent states, gives the probability of the node's own states.
A BN is concise: because each node has its own table, there is no need to look at all the other nodes and their concerns. The CPT is a focused and simple device at the node level.
How BN simplifies Joint Probability Calculation
By sticking to "parents explain the node", BNs simplify the calculation of the joint probability significantly:
P(X₁, X₂, …, Xₙ) = P(X₁ | Parents(X₁)) × P(X₂ | Parents(X₂)) × … × P(Xₙ | Parents(Xₙ))
This factorization:
- makes learning easier,
- makes inference faster,
- and lets us model complex systems with far fewer parameters.
Intuitive Representations of Real World Events - A Few Examples
Here - we can think of “weather” as the “main input”, traffic as an “intermediate step” and the “late for work” factor as the “output”.
Informal “Scenarios”
So one can imagine many scenarios happening - informally first.
1. If the weather is rainy, traffic is more likely to be slow, and one is more likely to be late to work.
2. If the weather is sunny, traffic is less likely to be slow, and one is more likely to arrive on time (or not be late).
The above two are the "main cases", but then there are also the "less likely" outcomes which we can imagine:
1. If the weather is rainy, although less likely, a person can still get to work on time (the rarer case).
2. If the weather is sunny, although less likely, a person can still be late to work.
A DAG
Now that we have these “related events”, their “relative probabilities” and “consequences” spelled out - we can try to represent them a bit more concretely with a DAG and some numbers.
We can represent the events (or variables) as the following DAG:
Weather → Traffic → LateToWork
It says: Weather influences Traffic, and Traffic influences Late to Work.
We can use the Bayesian network rule to represent all the "informal scenarios" we have mentioned above with a factorization:
P(Weather, Traffic, Late) = P(Weather) × P(Traffic | Weather) × P(Late | Traffic)
Getting the CPT (Conditional Probability Table)
Now we have the DAG represented, the high level formula defined, and so on. But we still need to fill up the CPT (Conditional Probability Table), which gives the probability of each event. We need to specify probabilities for:
1. Prior on Weather (sunny or rainy)
2. Traffic given Weather (light or heavy)
3. Late given Traffic (probability of being late)
There are ways to define these values:
1. Observation: For example, observe weather and traffic at the same time every day for 3 months. Record the values. Then count all the variations: rainy/light, rainy/heavy, sunny/light, sunny/heavy. These give numbers to rely on.
2. Domain Experts: In domains such as medicine, or experiential fields such as shipbuilding (getting more scientific by the day), one may ask experts to define the numbers based on their developed sense of the field.
3. Bayesian Learning: If a network and data exist, you can use certain algorithms to optimize the factors. Tools such as bnlearn (R), pgmpy (Python), Hugin, and Netica implement such algorithms.
In our case - we can come up with CPT with some reflection on our common experiences.
Now - we have the DAG and CPT, and so are ready for doing some calculations.
Example 1: Joint Probability of a “Full Assignment”
What's the probability of the following scenario: it is rainy, traffic is heavy, and you are late?
We use the factorization formula:
P(W=rainy, T=heavy, L=late) = P(W=rainy) × P(T=heavy | W=rainy) × P(L=late | T=heavy)
We plug in the values:
So there’s a 21.6% chance it is rainy, traffic is heavy, and you are late.
Formally, we call this sort of calculation a "joint probability of a full assignment". Full assignment roughly means that all the variables are used and filled with values.
Example 2: marginal probability P(L=late)
In marginal probability, we are interested in "looping through" all the scenarios, getting the "Late" probability for each, and then adding them up. As mentioned in the "informal analysis" section, there are 4 scenarios which need to be taken care of, and then we add them up.
The formula:
P(L=late) = Σ over W and T of P(W) × P(T | W) × P(L=late | T)
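As a quick sanity check, here is a minimal Python sketch of this Weather → Traffic → Late network. The CPT values below are assumptions for illustration (the post's own CPT isn't reproduced here); they are chosen so that the full-assignment joint matches the 21.6% figure from Example 1.

p_weather = {"rainy": 0.3, "sunny": 0.7}            # prior on Weather (assumed values)
p_traffic = {                                       # P(Traffic | Weather) (assumed values)
    "rainy": {"heavy": 0.8, "light": 0.2},
    "sunny": {"heavy": 0.2, "light": 0.8},
}
p_late = {                                          # P(Late | Traffic) (assumed values)
    "heavy": {"late": 0.9, "on_time": 0.1},
    "light": {"late": 0.1, "on_time": 0.9},
}

def joint(w, t, l):
    # Joint probability of a full assignment via the BN factorization
    return p_weather[w] * p_traffic[w][t] * p_late[t][l]

# Example 1: full assignment -> 0.3 * 0.8 * 0.9 = 0.216
print(joint("rainy", "heavy", "late"))

# Example 2: marginal P(Late = late), summing over all weather/traffic combinations
print(sum(joint(w, t, "late") for w in p_weather for t in ("heavy", "light")))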
Bayes Theorem and Naive Bayes Assumption
In more philosophical terms - Bayes’ theorem answers the question:
“Given evidence X - how much should I believe in hypothesis C?”
Here:
C = The Class (Spam/Not Spam)
X = Features (“email content contains the word FREE”, “has an email attachment”, etc.)
Bayes’ Equation (1) Components
Here's the equation (1) for Bayes' Theorem again:
P(ck | x) = P(x | ck) × P(ck) / P(x)     … (1)
At a high level, we can see 4 “components” to this formula.
(1) The Expectation: Our goal is to find the class ck, given the feature x. So P(ck | x) sits on the left-hand side. In the spam classification task, we are trying to find P(is_spam | has_free).
(2) On the right-hand side, in the numerator - and so proportional to the result, "strengthening the thesis" - is P(x | ck). It says how prevalent feature x is within class ck in the historical data available to us. In the spam classification task, this is the factor P(has_free | is_spam) estimated from the observation dataset.
(3) Also in the numerator, and also "strengthening the thesis", is P(ck). It says how prevalent class ck itself is. In our spam classification case, it is expressing P(is_spam) in the general dataset.
(4) Finally, in the denominator - inversely proportional to the result, "weakening the thesis" - is P(x). It says how prevalent feature x itself is. In our case, it is expressing P(has_free).
Some Questions
Time for a bit of intuitive exploration and explanation. These factors may look a bit cryptic at the outset, but all of them can easily be made sense of if we start by imagining a spam classification dataset.
Imagine a dataset like this:
We take a bunch of emails and manually grade each one: does it have the keyword "free" in it, and is it spam.
Looking at this table - we can figure out some “common-sensical” patterns out of it:
1. Total: How many rows/data items are there? Sample count.
2. Spam: How many of (1) Total are spam? Spam count.
3. "Free" in Spam: How many of (2) Spam have "free" in them?
4. "Free" in Total: How many of (1) Total have "free" in them?
From the equation (1), we can draw these conclusions. We are trying to find - given “free” in email content - what’s the probability of it being spam?
1. The chance of the above is directly proportional to (3) "Free" in Spam: the more often "Free" appears in spam, the stronger the signal.
2. The chance of the above is directly proportional to (2) Spam: the more spam we see in general, the more likely we are to see spam in the "free" case as well.
3. The chance of the above is inversely proportional to (4) "Free" in Total. This may seem surprising. But if we see "Free" everywhere, the word loses any distinguishing power, and it is unlikely to be effective in finding out whether an email is spam or not.
Overall, the formula goes like this:
Probability of class given features = (how likely the features are inside that class × how common that class is) ÷ how likely the features are overall.
How much evidence did this feature actually provide?
It’s important to understand why certain factors go on the numerator while one goes to the denominator.
We can think from the perspective of - “how special is this feature X to this class C?”
So the more "exclusive" feature X is to class C, the stronger an indicator it is.
Consider 2 cases for the email example.
Scenario 1:
P(has_free | is_spam) = 0.8
P(has_free) = 0.1
That means 0.8/0.1 = 8
That means the feature has_free is 8x more likely to appear in a spam email than in an average email overall.
Scenario 2
P(has_free | is_spam) = 0.8
P(has_free) = 0.75
That means 0.8/0.75 ≈ 1.066
That means the feature has_free is about as likely to appear in a spam email as it is to appear in a non-spam email. The feature doesn't give us any information!
Let’s look at this question from another angle. Consider the word “hello”. As you can imagine - this word probably appears in both spam and non-spam to the same degree. Even scammers start with a polite “hello” :)
So we get P(hello | spam) = 0.95 AND P(hello | not spam) = 0.95. Both cases - same behavior.
Numerator is - P(hello | spam) * P(spam).
The first factor is already 0.95 - so it’s a big push.
But consider the denominator, P(hello), which is also 0.95 (the word is equally common in both classes). The 0.95s cancel out, and only P(spam) remains in the result.
This shows mathematically that the word "hello" has no value in predicting spam or not spam - it's just common everywhere.
Essentially - the formula is looking for “surprise factors” or “distinctive” factors - and gets rid of “common factors”. We are trying to find - what’s strongly predictive and what’s just distraction.
A worked out numerical example
We want:
P(Spam | free)
Apply Bayes:
P(Spam | free) = P(free | Spam) × P(Spam) / P(free)
We know:
- P(free|Spam) = 0.8
- P(Spam) = 0.3
- P(free) = ???
Compute the denominator via the law of total probability:
P(free) = P(free | Spam) × P(Spam) + P(free | Ham) × P(Ham)
Insert values:
Now compute Bayes:
That is…
If an email contains “FREE”, then it is spam with about 77% probability, even though overall only 30% of all emails are spam.
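In code, the same calculation looks like this. Note that P(free | Ham) = 0.1 is an assumed value (it isn't stated in the text), chosen because it is consistent with the roughly 77% answer above.

p_free_given_spam = 0.8
p_spam = 0.3
p_free_given_ham = 0.1                 # assumption, consistent with the ~77% result
p_ham = 1 - p_spam

# Denominator via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham   # 0.24 + 0.07 = 0.31

# Bayes' theorem
print(p_free_given_spam * p_spam / p_free)   # ~0.774, i.e. about 77%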
Why Equation (1) is not sufficient for real world calculations?
Because for many features:
X = (x₁, x₂, x₃, …, xₙ),
the joint distribution
P(x₁, x₂, …, xₙ | ck)
becomes a huge table you'd have to estimate.
If you have 20 Boolean features, that joint table needs 2²⁰ = 1,048,576 probabilities.
Practically impossible to learn from real-world data.
This is why Naive Bayes exists.
Equation 2: Naive Bayes Independence Assumption
The Naive Bayes Independence Assumption states that - each feature is independent of the others once you know the class
This is obviously false in the real world, but useful.
Suppose we classify spam using:
X₁ = email contains “FREE”
X₂ = email has an attachment
Instead of estimating the full joint:
P(X₁, X₂ | C)
Naive Bayes says:
P(X₁, X₂ | C) = P(X₁ | C) × P(X₂ | C)
This collapses a giant 2-D table into two 1-D numbers.
A Toy Spam Classifier Implementation (In Simple Python Code)
Our goal is to find spam or ham, given the email contents:
P(spam | w₁, w₂, …, wₙ)
And to do that, we need "features" such as "contains the word free", "is from an .edu domain", etc. Once we have a feature, we can get the individual probability for that particular feature via Bayes' Theorem:
P(spam | wᵢ) = P(wᵢ | spam) × P(spam) / P(wᵢ)
And then we use the Naive Bayes assumption so that - we can simply multiply the probabilities of all these features into a unified yet sensible result. Simple enough!
Consider a toy dataset like this:
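The original table of toy emails isn't reproduced here, so the code sketches in this section use a hypothetical stand-in corpus with the same shape: 6 short emails, 4 spam and 2 ham, with 11 unique words in total (matching the vocabulary size mentioned later). The wording of the emails is invented.

# Hypothetical stand-in for the toy dataset (the actual emails are not reproduced)
emails = [
    ("win free money now", "spam"),
    ("free prize now", "spam"),
    ("win free prize", "spam"),
    ("free money prize now", "spam"),
    ("project meeting tomorrow", "ham"),
    ("team lunch meeting schedule", "ham"),
]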
There are 6 emails in the dataset. 4 are spam, 2 are ham. We have two main tasks based on this:
1. "Learn" how each word influences the outcome of spam/ham using Bayesian formulae.
2. Use the "learning" to label a few new messages.
Learning from the dataset
To learn from the dataset, we need a way to apply Bayes theorem to each word in the corpus. That is, for each of the words - we want to find how it influences the spam/ham result.
First, we need the right mathematical structure for it: a "binary vector space". So we will write some code to build a "binary vector space" for this toy dataset and see how it looks.
We go through each word and build a dict:
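A minimal sketch of this step, continuing from the stand-in corpus above (the notebook's actual code isn't reproduced here):

# Build a word -> index dict over the whole corpus
vocab = {}
for text, label in emails:
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)

print(vocab)   # 11 unique words mapped to indices 0..10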
We go through each word in each email, and tick which words are present in a 2-d array:
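Again as a sketch, continuing from the vocabulary above:

# Tick which vocabulary words are present in each email (one row per email)
X = [[0] * len(vocab) for _ in emails]
y = []
for row, (text, label) in enumerate(emails):
    for word in text.split():
        X[row][vocab[word]] = 1
    y.append(1 if label == "spam" else 0)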
We pretty-print the results, and we see that we get a "binary vector space": basically a 2-d array where the column count equals the number of unique words in the email corpus.
Training Naive Bayes
We have the binary vector space ready. Now we must apply the Bayes formula for every single word. Look at previous sections to recall the formula. Since we have to "scale up" the calculation to all the features, we do some "optimizations".
Our goal is to find the probability of spam given multiple words in an email:
P(spam | w₁, w₂, …, wₙ)
The true formula is:
P(spam | w₁, …, wₙ) = P(spam) × P(w₁ | spam) × … × P(wₙ | spam) / P(w₁, …, wₙ)
The denominator P(w₁, …, wₙ) is the same for both classes. So we compute relative (unnormalized) scores:
score(spam) = P(spam) × ∏ᵢ P(wᵢ | spam)
score(ham) = P(ham) × ∏ᵢ P(wᵢ | ham)
Then we normalize:
P(spam | email) = score(spam) / (score(spam) + score(ham))
So really, what we want to aim for is the calculation of the above scores. This avoids using P(x) again and again and simplifies the calculations overall.
The actual training code is in the notebook; its structure looks like this:
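Below is a minimal sketch of that structure, with the numbered comments (1)-(5) matching the explanation that follows. The function and variable names are illustrative, and the sketch builds on the X and y constructed above.

def train_naive_bayes(X, y):
    n = len(y)
    n_spam = sum(y)
    n_ham = n - n_spam

    # (1) Priors: what % of all emails is spam/ham
    p_spam = n_spam / n
    p_ham = n_ham / n

    # (2) Set up spam and ham counters for each word
    vocab_size = len(X[0])
    spam_counts = [0] * vocab_size
    ham_counts = [0] * vocab_size

    # (3) Update the counter for each word, on the spam side or the ham side
    for row, label in zip(X, y):
        for i, present in enumerate(row):
            if present:
                if label == 1:
                    spam_counts[i] += 1
                else:
                    ham_counts[i] += 1

    # (4) Laplace smoothing: +1 to every count, +2 to the denominator,
    #     so no word ever gets probability exactly 0
    p_word_spam = [(c + 1) / (n_spam + 2) for c in spam_counts]
    p_word_ham = [(c + 1) / (n_ham + 2) for c in ham_counts]

    # (5) Return everything needed to predict spam/ham for new emails
    return p_spam, p_ham, p_word_spam, p_word_ham

model = train_naive_bayes(X, y)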
Point (1) is simple - we just find what % of all emails is spam/ham.
Point (2) is where we set up spam and ham counters for each word.
In point (3) - we update the counter for each word on the spam side or the ham side. So at the end of this loop, we have "influence points" for each word, either towards spam or towards ham.
In point (4) - we perform a small mathematical trick called "Laplace Smoothing" for dealing with 0 values. Since we are using Naive Bayes, at the next step we will have to multiply these probability values, and multiplying by 0 makes the whole term 0, which is undesirable. So Laplace smoothing is needed.
Why “+2” in the denominator?
Because the binary variable wᵢ has two possible values:
1 = present
0 = absent
Laplace smoothing adds 1 fake count to each outcome. So the denominator gets +2.
Back to the training explanation - in point (5) - we return all the values required for predicting spam/ham for new values. Training is complete.
Using the trained model to predict spam/ham for new emails
Here's the structure of the prediction code (the full code and its results are in the notebook):
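The sketch below shows that structure, with comments (1)-(7) matching the explanation that follows. The test messages are invented, and model and vocab come from the training sketch above.

import math

def predict(text, model, vocab):
    p_spam, p_ham, p_word_spam, p_word_ham = model

    # (1) Vectorize the new email against the existing vocabulary
    #     (one row, one column per known word; unknown words are ignored)
    vec = [0] * len(vocab)
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] = 1

    # (2) Start from the log of the priors
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    # (3) Add log probabilities for each present/absent word
    for i, present in enumerate(vec):
        if present:
            log_spam += math.log(p_word_spam[i])
            log_ham += math.log(p_word_ham[i])
        else:
            log_spam += math.log(1 - p_word_spam[i])
            log_ham += math.log(1 - p_word_ham[i])

    # (4) Exponentiate back and normalize into probabilities
    s, h = math.exp(log_spam), math.exp(log_ham)
    return ("spam" if s > h else "ham"), s / (s + h)

# (5), (6), (7) Label a few new (invented) messages with the trained model
for msg in ["free prize now", "project meeting schedule"]:
    print(msg, "->", predict(msg, model, vocab))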
At the outset - in (5), (6) and (7) - you can see us giving a set of new messages and predicting their labels, and it seems like the algorithm is doing quite well!
In the prediction function, in (1) - we use the existing vocabulary to vectorize the new email. Essentially, this returns a 2-d array with 1 row and 11 columns, standing for the 11 unique words.
In (2), (3) and (4) we see some "log" manipulation. This is just a computational trick to convert multiplication of small numbers into logarithmic addition (and back with exponentiation).
We take logs so we can add small numbers instead of multiplying them, preventing numerical underflow and making Naive Bayes stable.
Example:
Imagine we want to multiply:
0.1 * 0.2 * 0.3
Actual multiplication:
0.1 * 0.2 = 0.02
0.02 * 0.3 = 0.006
Now, with logs:
log(0.1) = -2.302585
log(0.2) = -1.609438
log(0.3) = -1.203972
Sum:
-2.302585 + (-1.609438) + (-1.203972)
= -5.115995
Now exponentiate to get original:
exp(-5.115995) = 0.006
Works exactly.
If you multiply even 100 probabilities like 0.1, the result is 10⁻¹⁰⁰.
With a few hundred more factors, this drops below the smallest positive number a Python float can represent (roughly 10⁻³⁰⁸) and underflows to exactly 0.
So this log-based computational trick is needed.
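The same arithmetic, checked quickly in Python:

import math

probs = [0.1, 0.2, 0.3]
log_sum = sum(math.log(p) for p in probs)   # about -5.116
print(math.exp(log_sum))                    # 0.006 (recovered, up to float precision)

print(0.1 ** 500)                           # 0.0 -- direct multiplication underflows
print(500 * math.log(0.1))                  # about -1151.3 -- the log form stays finite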
Try it yourself
From the mathematical representation of emails in a binary vector space to turning Bayes' Theorem and the Naive Bayes assumption into computable functions, we've demonstrated a working toy spam classification model. Although it's a toy model, hopefully it sheds some light for those who'd like to understand Bayesian spam classifiers.
Try it for yourself in this notebook.
A More Realistic Implementation with Real-World Enron Spam Data
The toy Bayesian Spam Classifier demonstrated above used oversimplified examples such as:
3-4 word email placeholders
reusing existing words
representing sparse matrices in an inefficient way
etc
All the above things help one see the “big idea” fast - but probably won’t be able to deal with a more real-world dataset.
For a more realistic implementation, we’ll use the “Enron Spam Dataset”, which has a substantial number of examples.
You can find the dataset in shrsv/bayesian-spam-classifier along with a notebook exploring the dataset.
A Typical Message
A typical message in the dataset looks like this:
{
  "message_id": 31329,
  "text": "expande tu imagen ! ! ! ! ! ! ! ! ! si no puede ver este mail , entre a : http : / / www . supermedios . com / admin / mailing / proyecto . php ? id = 160\neste mensaje se enva bajo los artculos 2 y 4 de la ley\n19 . 628 y 28 b de la ley 19 . 955 de la constitucin de la repblica\nde chile actualizada el 14 de julio 2004 . su direccin ha sido extrada\nmanualmente por personal de nuestra compaa desde su sitio\nweb en internet , o ha sido introducida por usted al aceptar el envo\nde mensajes publicitarios al inscribirse en alguno de los sitios o foros\nde nuestra red de trabajo .\npara ser removido presione borrar",
  "label": 1,
  "label_text": "spam",
  "subject": "expande tu imagen ! ! ! ! ! ! ! ! !",
  "message": "si no puede ver este mail , entre a : http : / / www . supermedios . com / admin / mailing / proyecto . php ? id = 160\neste mensaje se enva bajo los artculos 2 y 4 de la ley\n19 . 628 y 28 b de la ley 19 . 955 de la constitucin de la repblica\nde chile actualizada el 14 de julio 2004 . su direccin ha sido extrada\nmanualmente por personal de nuestra compaa desde su sitio\nweb en internet , o ha sido introducida por usted al aceptar el envo\nde mensajes publicitarios al inscribirse en alguno de los sitios o foros\nde nuestra red de trabajo .\npara ser removido presione borrar",
  "date": "2005-01-19"
}
We are particularly interested in the "label_text" and "text" fields: label_text gives the ham/spam value, whereas text is a concatenation of the subject and message fields.
Dataset Size
The dataset keeps ham and spam almost perfectly balanced. The test set includes 2,000 messages, split about evenly between ham (49.6%) and spam (50.4%). The training set follows the same pattern across 31,716 messages, with 49.04% ham and 50.96% spam. When you combine everything, you get 33,716 messages with a near-even distribution: 49.07% ham and 50.93% spam. All label totals check out, so the dataset’s internal consistency holds up.
The Full Notebook
Find the full analysis in this notebook - here I’ll give an overview of the process.
Dataset Distribution
Implementation Overview
Let’s walk through how to turn raw email text into a functioning Bayesian spam classifier. The goal is to expose the underlying mechanics, not rely on libraries that abstract everything away.
1. Loading and Repairing JSONL Data
Real datasets are messy. A JSONL file may contain malformed JSON fragments, stray quotes, or truncated lines. We use json_repair to reconstruct valid JSON objects rather than discarding samples.
Code excerpt:
import json
from json_repair import repair_json

with open("test.jsonl") as f:
    for line in f:
        fixed = repair_json(line)
        obj = json.loads(fixed)
Why this matters: If you silently skip broken lines, your priors, token frequencies, and performance metrics all shift in unpredictable ways. Bayesian models depend heavily on correct counts, so defensive parsing is mandatory.
2. Inspecting Label Distribution
A Bayesian classifier needs priors. The simplest prior comes from the relative frequencies of spam vs ham in the dataset.
Code excerpt:
from collections import Counter
labels = Counter(item["label_text"] for item in dataset)
print(labels) # {'spam': 1008, 'ham': 992}
Math intuition: The prior P(spam) is simply:
count(spam) / total_messages
If spam dominates the dataset, the classifier should be biased toward predicting spam unless the evidence strongly contradicts it. This is built directly into Naive Bayes.
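Continuing from the Counter above, the priors are just relative frequencies:

total = sum(labels.values())            # 2,000 test messages
p_spam = labels["spam"] / total         # 1008 / 2000 = 0.504
p_ham = labels["ham"] / total           # 992 / 2000 = 0.496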
3. Building the Vocabulary
We convert text into a numerical space by assigning each unique token an index.
Code excerpt:
def build_vocab(corpus):
    vocab = {}
    for item in corpus:
        for token in item["text"].split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab
- This produces a word → integer ID mapping.
- The tokenizer here is intentionally naive. Sophisticated tokenization changes the model but not the underlying math.
- Vocabulary order must be deterministic; otherwise, training and prediction vectors won’t align.
Math idea:
Each email becomes a vector x where x[i] means “word i is present.”
This transforms text classification into a plain vector classification problem.
4. Sparse Binary Vectorization
Emails rarely contain more than a tiny fraction of the vocab. Storing full dense vectors wastes memory and compute.
Code excerpt:
from scipy.sparse import lil_matrix
def vectorize(item, vocab):
    vec = lil_matrix((1, len(vocab)), dtype=int)
    for token in item["text"].split():
        if token in vocab:
            vec[0, vocab[token]] = 1
    return vec.tocsr()
Why sparse matters: If the vocab has 50k words, but an email contains only ~120, then 99.7% of the entries are zero. Sparse matrices store only the non-zero entries.
Math concept: This corresponds to Bernoulli features:
x[i] = 1 if word i appears
x[i] = 0 otherwise
This is exactly what Bernoulli Naive Bayes expects.
5. Training the Naive Bayes Model
A Bernoulli Naive Bayes classifier needs:
- Prior probability of each class.
- Probability that each word appears in spam vs ham.
Code excerpt (core logic):
import math
import numpy as np

def train_nb(X, y):
    # X is a CSR sparse matrix, y holds the labels (0=ham, 1=spam)
    y = np.asarray(y)
    num_docs, vocab_size = X.shape
    spam_docs = int(y.sum())
    ham_docs = num_docs - spam_docs
    p_spam = spam_docs / num_docs
    p_ham = ham_docs / num_docs

    # Count word occurrences in each class
    spam_counts = X[y == 1].sum(axis=0)
    ham_counts = X[y == 0].sum(axis=0)

    # Laplace smoothing applied
    p_word_given_spam = (spam_counts + 1) / (spam_docs + 2)
    p_word_given_ham = (ham_counts + 1) / (ham_docs + 2)

    return {
        "p_spam": math.log(p_spam),
        "p_ham": math.log(p_ham),
        "log_p_spam_word": np.log(p_word_given_spam),
        "log_p_ham_word": np.log(p_word_given_ham),
        "log_p_not_spam_word": np.log(1 - p_word_given_spam),
        "log_p_not_ham_word": np.log(1 - p_word_given_ham),
    }
Math explanation
For each class (spam/ham), we compute:
P(word appears | class) = (count_in_class + 1) / (documents_in_class + 2)
This is Laplace smoothing, which prevents probabilities from becoming zero when a word never appears in one of the classes.
The model works in log-space:
log P(class | x) = log P(class) +
sum over i where x[i] = 1: log P(word_i | class) +
sum over i where x[i] = 0: log (1 - P(word_i | class))
Using logs avoids numerical underflow when multiplying many small probabilities.
6. Predicting with Sparse Vectors
Prediction loops only over present words, which is efficient with sparse matrices.
Code excerpt:
def predict(vec, model):
    score_spam = model["p_spam"]
    score_ham = model["p_ham"]
    indices = vec.indices  # only non-zero entries

    # Add log probabilities for word presence
    score_spam += model["log_p_spam_word"][0, indices].sum()
    score_ham += model["log_p_ham_word"][0, indices].sum()

    # Add log probabilities for word absence
    # This is the expensive part; usually approximated or precomputed
    score_spam += model["log_p_not_spam_word"].sum() - model["log_p_not_spam_word"][0, indices].sum()
    score_ham += model["log_p_not_ham_word"].sum() - model["log_p_not_ham_word"][0, indices].sum()

    return "spam" if score_spam > score_ham else "ham"
Math intuition: You compare two quantities:
log P(x | spam) + log P(spam)
vs
log P(x | ham) + log P(ham)
Whichever is larger wins.
7. Evaluating the Model
We compute standard confusion-matrix metrics:
TP, TN, FP, FN = ...
A spam model that flags legitimate mail (false positives) is worse than one that occasionally misses spam. You need all four counts to judge the classifier correctly.
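For reference, here is how those four counts turn into the standard metrics, using the test-set numbers reported further below (999 / 30 / 962 / 9):

# Confusion-matrix counts from the test-set results reported below
TP, FP, TN, FN = 999, 30, 962, 9

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.9805
precision = TP / (TP + FP)                            # 0.9708 -- reliability of spam predictions
recall = TP / (TP + FN)                               # 0.9911 -- spam catching rate
f1 = 2 * precision * recall / (precision + recall)    # ~0.981
false_positive_rate = FP / (FP + TN)                  # 0.0302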
Performance Analysis
Confusion Matrix 101 (Simple and Easy to Remember Explanation)
To understand performance, we need to make sense of the confusion matrix.
To judge the results, we need to understand two technical terms: precision and recall.
First, we must understand the idea of "relevance". Relevance in classification is a predefined concept. Say our goal is to retrieve all "only dog photos" from a collection of "only dog and only cat photos". Here, the definition of "relevant" is "only dog photos". As a human, I can go through each photo and perfectly create two sets: "only dog photos" and "only cat photos". Now the goal is to get a machine to do it perfectly as well. It's not easy to get right all the time, and there will be mistakes.
Now, how do we quantify the mistakes? First, we must know what the “ideal result” is.
# training dataset
ideal_dogs = [...]
ideal_cats = [...]
collection = ideal_dogs + ideal_cats

# run algorithm/model
retrieved = retrieve("dog", collection)

# calculate precision and recall
actual_dogs = match(ideal_dogs, retrieved)
precision = len(actual_dogs) / len(retrieved)
recall = len(actual_dogs) / len(ideal_dogs)
To summarize:
1. Precision: What % of retrieved images are of dogs only?
2. Recall: What % of all dogs-only images have been successfully retrieved?
Both of the above terms come from a confusion matrix. In the terminology of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the formulae are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Sometimes these 4 terms can become a bit confusing, so here’s a helpful “memory aid” to simplify remembering these terms.
Truth Precedes Prediction (TP2) Model
A memorable guide on how to interpret confusion matrix terms such as:
1. True Positive
2. True Negative
3. False Positive
4. False Negative
Each phrase has two words - as captured by the phrase "Truth Precedes Prediction":
1. Word 1 - True/False - whether the prediction matches the Ground Truth
2. Word 2 - Positive/Negative - The Prediction or Model Output
When the Ground Truth and Model Output match, the first word is "True"; when they don't, it is "False".
That is,
1. True Positive: the prediction is Positive (e.g. "this is a dog photo") and it matches the ground truth. A correct hit. We want more of this.
2. True Negative: the prediction is Negative and it matches the ground truth. A correct rejection. We want more of this.
3. False Positive: the prediction is Positive but the ground truth is negative. A false alarm. We want less of this.
4. False Negative: the prediction is Negative but the ground truth is positive. A miss. We want less of this.
A higher pattern emerges to aid recall (in this case, human recall :)):
1. Terms starting with "True" (True Positive, True Negative) are correct, desirable outputs.
2. Terms starting with "False" (False Positive, False Negative) are mistakes, undesirable outputs - precision punishes False Positives, and recall punishes False Negatives.
Analysing Result Graphs from the Real-World Enron Dataset
This is our overall performance summary:
============================================================
PERFORMANCE SUMMARY
============================================================
Overall Accuracy: 98.05%
→ Correctly classified: 1,961/2,000 emails
Spam Detection:
Precision: 97.08% (reliability of spam predictions)
Recall: 99.11% (spam catching rate)
F1 Score: 0.9809
Error Analysis:
False Positives: 30 (ham marked as spam)
False Negatives: 9 (spam that got through)
False Positive Rate: 3.02%
False Negative Rate: 0.89%
Not bad! Some example predictions:
============================================================
EVALUATING MODEL ON TEST SET
============================================================
Evaluating on Test Set (2,000 emails)...
Processing email 0/2,000...
Processing email 500/2,000...
Processing email 1,000/2,000...
Processing email 1,500/2,000...
============================================================
TEST SET RESULTS
============================================================
Total emails: 2,000
Confusion Matrix:
True Positives (Spam correctly identified): 999
False Positives (Ham wrongly marked as spam): 30
True Negatives (Ham correctly identified): 962
False Negatives (Spam missed): 9
Performance Metrics:
Accuracy: 98.05%
Precision: 97.08% (of predicted spam, how many were actually spam)
Recall: 99.11% (of actual spam, how many did we catch)
F1 Score: 0.9809 (harmonic mean of precision & recall)
============================================================
EXAMPLE PREDICTIONS (First 10 emails)
============================================================
Email 1: ✓
Subject: expande tu imagen ! ! ! ! ! ! ! ! !...
True label: spam
Predicted: spam
Probabilities: Spam=1.0000, Ham=0.0000
Email 2: ✓
Subject: paliourg learning for life...
True label: spam
Predicted: spam
Probabilities: Spam=1.0000, Ham=0.0000
Email 3: ✓
Subject: cure premature ejaculation...
True label: spam
Predicted: spam
Probabilities: Spam=1.0000, Ham=0.0000
Email 4: ✓
Subject: re : noms / actual flow for 3 / 19 / 01...
True label: ham
Predicted: ham
Probabilities: Spam=0.0000, Ham=1.0000
Email 5: ✓
Subject: ehronline web address change...
True label: ham
Predicted: ham
Probabilities: Spam=0.0000, Ham=1.0000
Email 6: ✓
Subject: re : cusip...
True label: ham
Predicted: ham
Probabilities: Spam=0.0000, Ham=1.0000
Email 7: ✓
Subject: energy : oil drilling : survey finds producers pla...
True label: ham
Predicted: ham
Probabilities: Spam=0.0000, Ham=1.0000
Email 8: ✓
Subject: supersavings on all pain medications - no prescrip...
True label: spam
Predicted: spam
Probabilities: Spam=1.0000, Ham=0.0000
Email 9: ✓
Subject: tw pnr activity thru december 12 th...
True label: ham
Predicted: ham
Probabilities: Spam=0.0000, Ham=1.0000
Email 10: ✓
Subject: talon...
True label: ham
Predicted: ham
Probabilities: Spam=0.0000, Ham=1.0000
Result Grand Overview
On the top left we have a confusion matrix, the numbers of which were explained earlier. All 4 quadrants together account for the 2,000 test emails. Of the "mistakes" the model made, 30 are "serious" because they send Ham -> Spam. These are false positives. Thankfully that is just 1.5% of the dataset. Overall, the filter works pretty well for the effort put into constructing it.
Full Code
Find the complete implementation on GitHub