The One-Line Summary: Entropy measures how "uncertain" or "mixed" a set is (high entropy = lots of uncertainty), and Information Gain measures how much a question reduces that uncertainty — so decision trees choose questions with the highest information gain to learn as efficiently as possible.
The Tale of Two Game Show Hosts
The hit TV show "Guess the Animal" had a simple format: contestants asked yes/no questions to identify a mystery animal from a list of 8 possibilities.
The show had two hosts with very different strategies.
Host #1: Random Randy
Randy asked whatever questions popped into his head:
THE ANIMALS: Dog, Cat, Eagle, Penguin, Shark, Goldfish, Snake, Frog
RANDY'S GAME (Mystery Animal: Snake)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Q1: "Does it have fur?"
A: NO
Remaining: Eagle, Penguin, Shark, Goldfish, Snake, Frog
(Eliminated only 2 of 8 = 25%)
Q2: "Can it fly?"
A: NO
Remaining: Penguin, Shark, Goldfish, Snake, Frog
(Eliminated only 1 more)
Q3: "Does it live in water?"
A: NO
Remaining: Penguin, Snake, Frog
(Hmm, penguin is tricky...)
Q4: "Is it a reptile?"
A: YES
Remaining: Snake
(Finally!)
Randy needed 4 questions. Not terrible, but not great.
Host #2: Efficient Emma
Emma had studied information theory. She asked questions strategically:
EMMA'S GAME (Mystery Animal: Snake)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE ANIMALS: Dog, Cat, Eagle, Penguin, Shark, Goldfish, Snake, Frog
Q1: "Is it warm-blooded?"
A: NO
YES group: Dog, Cat, Eagle, Penguin (4 animals)
NO group: Shark, Goldfish, Snake, Frog (4 animals) ✓
PERFECT 50-50 SPLIT! Eliminated exactly half.
Q2: "Does it have fins?"
A: NO
YES group: Shark, Goldfish (2 animals)
NO group: Snake, Frog (2 animals) ✓
PERFECT 50-50 SPLIT again! Eliminated half of remaining.
Q3: "Does it have legs?"
A: NO
YES group: Frog (1 animal)
NO group: Snake (1 animal) ✓
FOUND IT! Snake has no legs.
Emma needed only 3 questions!
The Secret: Emma’s Questions Split Evenly
EMMA'S STRATEGY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Each question splits the remaining options as close
to 50-50 as possible.
8 animals → 4 animals → 2 animals → 1 animal
          ↓           ↓           ↓
       Q1 (÷2)     Q2 (÷2)     Q3 (÷2)
With perfect splits: log₂(8) = 3 questions needed!
RANDY'S PROBLEM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
His questions had uneven splits:
"Does it have fur?"
YES: 2 animals (Dog, Cat) = 25%
NO: 6 animals (everything else) = 75%
This is a BAD question! A YES answer would eliminate 75%,
but only 2 of the 8 animals have fur, so YES is unlikely.
Most of the time the answer is NO, and you eliminate just 25%.
A 50-50 split GUARANTEES you eliminate 50% every time.
This is exactly what ENTROPY and INFORMATION GAIN measure!
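A quick sanity check makes this concrete. The snippet below is a minimal sketch (the helper name expected_remaining is made up for this example): assuming all 8 animals are equally likely, it computes how many animals you expect to still be in play after a single question.

def expected_remaining(yes_count, no_count):
    """Expected number of animals left after one yes/no question,
    assuming every animal is equally likely to be the mystery animal."""
    total = yes_count + no_count
    p_yes = yes_count / total   # probability the answer is YES
    p_no = no_count / total     # probability the answer is NO
    return p_yes * yes_count + p_no * no_count

print(expected_remaining(4, 4))  # Emma's 'warm-blooded?' (4 vs 4): 4.0 animals left on average
print(expected_remaining(2, 6))  # Randy's 'has fur?'     (2 vs 6): 5.0 animals left on average

Emma's even split always leaves exactly half the animals; Randy's fur question leaves 5 of 8 on average, so it removes less uncertainty per question.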
![Entropy and Information Gain Overview]
The complete picture: Emma’s strategy, entropy formula, information gain formula, and key takeaways
What is Entropy?
Entropy measures uncertainty or disorder in a set.
ENTROPY INTUITION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
HIGH ENTROPY = High uncertainty = Hard to predict
LOW ENTROPY = Low uncertainty = Easy to predict
EXAMPLE: Predicting tomorrow's weather
Location A: 50% sunny, 50% rainy
→ HIGH entropy (could go either way!)
→ Very uncertain
Location B: 99% sunny, 1% rainy
→ LOW entropy (almost certainly sunny)
→ Very predictable
Location C: 100% sunny, 0% rainy
→ ZERO entropy (definitely sunny)
→ No uncertainty at all!
The Entropy Formula
ENTROPY FORMULA:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
H(S) = -Σ pᵢ × log₂(pᵢ)
Where:
• S is the set (e.g., a node in a decision tree)
• pᵢ is the proportion of class i in the set
• log₂ is the base-2 logarithm
• The sum is over all classes
WHY log₂?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Because entropy is measured in "BITS" — the number
of yes/no questions needed to identify something.
8 equally likely options → log₂(8) = 3 bits
(Need 3 yes/no questions)
16 equally likely options → log₂(16) = 4 bits
(Need 4 yes/no questions)
2 equally likely options → log₂(2) = 1 bit
(Need 1 yes/no question)
Entropy Examples: Step by Step
import numpy as np
import matplotlib.pyplot as plt
def entropy(proportions):
    """Calculate entropy from a list of proportions."""
    # Filter out zeros (log(0) is undefined)
    proportions = np.array([p for p in proportions if p > 0])
    return -np.sum(proportions * np.log2(proportions))
print("ENTROPY EXAMPLES")
print("="*60)
examples = [
("Pure (all same class)", [1.0]),
("50-50 split", [0.5, 0.5]),
("75-25 split", [0.75, 0.25]),
("90-10 split", [0.9, 0.1]),
("99-1 split", [0.99, 0.01]),
("3-way equal split", [1/3, 1/3, 1/3]),
("4-way equal split", [0.25, 0.25, 0.25, 0.25]),
]
for name, props in examples:
    h = entropy(props)
    print(f"{name:<25} → Entropy = {h:.4f} bits")
print(f"""
INTERPRETATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Pure set (all one class): H = 0.00 bits
→ No uncertainty! We KNOW the answer.
• 50-50 binary split: H = 1.00 bit
→ Maximum uncertainty for 2 classes.
→ Need exactly 1 yes/no question.
• 4-way equal split: H = 2.00 bits
→ Need 2 yes/no questions (2² = 4).
• Skewed splits (90-10, 99-1): H < 1.00 bit
→ Less uncertain. Often can guess correctly.
""")
Output:
ENTROPY EXAMPLES
============================================================
Pure (all same class) → Entropy = 0.0000 bits
50-50 split → Entropy = 1.0000 bits
75-25 split → Entropy = 0.8113 bits
90-10 split → Entropy = 0.4690 bits
99-1 split → Entropy = 0.0808 bits
3-way equal split → Entropy = 1.5850 bits
4-way equal split → Entropy = 2.0000 bits
INTERPRETATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Pure set (all one class): H = 0.00 bits
→ No uncertainty! We KNOW the answer.
• 50-50 binary split: H = 1.00 bit
→ Maximum uncertainty for 2 classes.
→ Need exactly 1 yes/no question.
• 4-way equal split: H = 2.00 bits
→ Need 2 yes/no questions (2² = 4).
• Skewed splits (90-10, 99-1): H < 1.00 bit
→ Less uncertain. Often can guess correctly.
Visualizing Entropy
import numpy as np
import matplotlib.pyplot as plt
# Create entropy curve for binary classification
p = np.linspace(0.001, 0.999, 1000)
h = -p * np.log2(p) - (1-p) * np.log2(1-p)
plt.figure(figsize=(12, 6))
plt.plot(p, h, 'b-', linewidth=3)
plt.fill_between(p, 0, h, alpha=0.2)
# Mark key points (uses the entropy() helper defined in the earlier examples snippet)
plt.scatter([0.5], [1.0], s=200, c='red', zorder=5, marker='*')
plt.scatter([0.1, 0.9], [entropy([0.1, 0.9])]*2, s=100, c='orange', zorder=5)
plt.scatter([0.01, 0.99], [entropy([0.01, 0.99])]*2, s=100, c='green', zorder=5)
plt.xlabel('Proportion of Class 1 (p)', fontsize=12)
plt.ylabel('Entropy (bits)', fontsize=12)
plt.title('Entropy: Maximum Uncertainty at 50-50 Split', fontsize=14)
plt.xlim(0, 1)
plt.ylim(0, 1.1)
plt.grid(True, alpha=0.3)
plt.annotate('Maximum entropy!\n(most uncertain)', xy=(0.5, 1.0),
xytext=(0.65, 0.85), fontsize=10,
arrowprops=dict(arrowstyle='->', color='red'))
plt.annotate('Low entropy\n(fairly certain)', xy=(0.9, 0.47),
xytext=(0.75, 0.6), fontsize=10,
arrowprops=dict(arrowstyle='->', color='orange'))
plt.savefig('entropy_curve.png', dpi=150, bbox_inches='tight')
plt.show()
![Entropy Curve]
Entropy is maximized when the split is 50-50 and minimized (zero) when all samples belong to one class
The Story Behind the Formula
Let’s understand WHY entropy has this particular formula.
THE INTUITION BEHIND -p × log₂(p):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Imagine you need to communicate which animal was chosen.
SCENARIO 1: 8 equally likely animals
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Each has probability 1/8 = 0.125
To specify one of 8, you need log₂(8) = 3 bits.
(Like asking 3 yes/no questions)
Example binary codes:
Dog: 000
Cat: 001
Eagle: 010
Penguin: 011
Shark: 100
Goldfish: 101
Snake: 110
Frog: 111
3 bits to specify any animal.
Entropy = -8 × (1/8) × log₂(1/8) = 3 bits ✓
SCENARIO 2: Skewed probabilities
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What if Dog appears 50% of the time, and others split the rest?
Dog: 50% (1/2)
Others: ~7% each (1/14 each)
Now we can be CLEVER with our code:
Dog: 0 (1 bit - it's common!)
Cat: 100 (3 bits)
Eagle: 101 (3 bits)
...
On average, we need FEWER bits because Dog is common.
This is exactly what entropy calculates — the AVERAGE
number of bits needed, weighted by probability!
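To make "average bits, weighted by probability" concrete, here is a minimal sketch (the helper name entropy_bits is made up for this example) that evaluates the formula for both scenarios above:

import numpy as np

def entropy_bits(probabilities):
    """Average bits per outcome: -sum(p * log2(p)), skipping zero probabilities."""
    probabilities = np.array([p for p in probabilities if p > 0])
    return -np.sum(probabilities * np.log2(probabilities))

uniform = [1/8] * 8            # Scenario 1: 8 equally likely animals
skewed = [1/2] + [1/14] * 7    # Scenario 2: Dog 50%, the other 7 split the rest

print(f"Uniform entropy: {entropy_bits(uniform):.3f} bits")  # 3.000 - a full 3-bit code is needed
print(f"Skewed entropy:  {entropy_bits(skewed):.3f} bits")   # ~2.404 - a clever code averages under 3 bits

The skewed distribution needs only about 2.4 bits per animal on average, and that entropy value is the theoretical floor on the average code length (Shannon's source coding theorem).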
What is Information Gain?
Now we understand entropy (uncertainty). Information Gain is simply:
How much does a question REDUCE uncertainty?
INFORMATION GAIN FORMULA:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
IG(S, A) = H(S) - H(S|A)
= H(parent) - Weighted Average H(children)
= Entropy BEFORE - Entropy AFTER
Where:
• S is the set before splitting
• A is the attribute/feature we split on
• H(S|A) is the conditional entropy after splitting
IN PLAIN ENGLISH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Information Gain = How much UNCERTAINTY did we REMOVE?
High IG → Great question! (Removed lots of uncertainty)
Low IG → Bad question! (Didn't help much)
Zero IG → Useless question! (Learned nothing)
Back to the Game Show
Let’s calculate Information Gain for Emma’s and Randy’s questions:
import numpy as np
def entropy(labels):
    """Calculate entropy from a list of labels."""
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return -np.sum(probs * np.log2(probs))

def information_gain(parent_labels, children_labels_list):
    """Calculate information gain from a split."""
    parent_entropy = entropy(parent_labels)
    # Weighted average of children's entropy
    total = len(parent_labels)
    children_entropy = sum(
        (len(child) / total) * entropy(child)
        for child in children_labels_list
    )
    return parent_entropy - children_entropy
# The animals and their properties
animals = {
'Dog': {'warm_blooded': True, 'fur': True, 'fins': False, 'legs': True},
'Cat': {'warm_blooded': True, 'fur': True, 'fins': False, 'legs': True},
'Eagle': {'warm_blooded': True, 'fur': False, 'fins': False, 'legs': True},
'Penguin': {'warm_blooded': True, 'fur': False, 'fins': False, 'legs': True},
'Shark': {'warm_blooded': False, 'fur': False, 'fins': True, 'legs': False},
'Goldfish': {'warm_blooded': False, 'fur': False, 'fins': True, 'legs': False},
'Snake': {'warm_blooded': False, 'fur': False, 'fins': False, 'legs': False},
'Frog': {'warm_blooded': False, 'fur': False, 'fins': False, 'legs': True},
}
print("COMPARING QUESTIONS: INFORMATION GAIN")
print("="*60)
# All animals as parent
all_animals = list(animals.keys())
print(f"Starting with {len(all_animals)} animals (all equally likely)")
print(f"Parent entropy: {entropy(all_animals):.4f} bits")
print(f"(log₂(8) = 3 bits - need 3 yes/no questions ideally)\n")
# Compare different questions
questions = ['warm_blooded', 'fur', 'fins', 'legs']
print(f"{'Question':<20} {'YES group':<20} {'NO group':<25} {'IG (bits)':<10}")
print("-"*75)
for q in questions:
    yes_group = [a for a, props in animals.items() if props[q]]
    no_group = [a for a, props in animals.items() if not props[q]]
    ig = information_gain(all_animals, [yes_group, no_group])
    yes_str = f"{len(yes_group)} animals"
    no_str = f"{len(no_group)} animals"
    print(f"{q:<20} {yes_str:<20} {no_str:<25} {ig:.4f}")
print(f"""
ANALYSIS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
'warm_blooded': 4 YES, 4 NO → PERFECT 50-50 split!
IG = 1.0 bit (maximum possible!)
'fur': 2 YES, 6 NO → Uneven split
IG = 0.81 bits (good, but not optimal)
'fins': 2 YES, 6 NO → Same as fur
IG = 0.81 bits
'legs': 6 YES, 2 NO → Uneven split
IG = 0.81 bits
WINNER: 'Is it warm-blooded?'
This is EXACTLY what Emma asked first!
""")
Output:
COMPARING QUESTIONS: INFORMATION GAIN
============================================================
Starting with 8 animals (all equally likely)
Parent entropy: 3.0000 bits
(log₂(8) = 3 bits - need 3 yes/no questions ideally)
Question YES group NO group IG (bits)
---------------------------------------------------------------------------
warm_blooded 4 animals 4 animals 1.0000
fur 2 animals 6 animals 0.8113
fins 2 animals 6 animals 0.8113
legs 6 animals 2 animals 0.8113
ANALYSIS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
'warm_blooded': 4 YES, 4 NO → PERFECT 50-50 split!
IG = 1.0 bit (maximum possible!)
'fur': 2 YES, 6 NO → Uneven split
IG = 0.81 bits (good, but not optimal)
WINNER: 'Is it warm-blooded?'
This is EXACTLY what Emma asked first!
Information Gain for Decision Trees
Now let’s apply this to a real classification problem:
import numpy as np
import pandas as pd
# Classic "Play Tennis" dataset
data = {
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain',
'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast',
'Overcast', 'Rain'],
'Temperature':['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal',
'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High',
'Normal', 'High'],
'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong',
'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
'Weak', 'Strong'],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)
print("THE PLAY TENNIS DATASET")
print("="*60)
print(df.to_string(index=False))
print(f"\nTotal: {len(df)} days, {sum(df['PlayTennis']=='Yes')} Yes, {sum(df['PlayTennis']=='No')} No")
# Calculate parent entropy
parent_labels = df['PlayTennis'].values
parent_entropy = entropy(parent_labels)
print(f"Parent Entropy: {parent_entropy:.4f} bits")
THE PLAY TENNIS DATASET
============================================================
Outlook Temperature Humidity Wind PlayTennis
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
Total: 14 days, 9 Yes, 5 No
Parent Entropy: 0.9403 bits
Calculating Information Gain for Each Feature
def calc_ig_for_feature(df, feature, target='PlayTennis'):
    """Calculate information gain for a feature."""
    parent_entropy = entropy(df[target].values)
    # Get unique values
    values = df[feature].unique()
    # Calculate weighted entropy of children
    weighted_entropy = 0
    split_details = []
    for val in values:
        subset = df[df[feature] == val]
        weight = len(subset) / len(df)
        child_entropy = entropy(subset[target].values)
        weighted_entropy += weight * child_entropy
        # Count Yes/No
        yes_count = sum(subset[target] == 'Yes')
        no_count = sum(subset[target] == 'No')
        split_details.append((val, yes_count, no_count, child_entropy))
    ig = parent_entropy - weighted_entropy
    return ig, split_details
print("\nINFORMATION GAIN FOR EACH FEATURE")
print("="*60)
print(f"Parent Entropy: {entropy(df['PlayTennis'].values):.4f} bits")
print(f"(9 Yes, 5 No out of 14 samples)\n")
features = ['Outlook', 'Temperature', 'Humidity', 'Wind']
results = []
for feature in features:
    ig, details = calc_ig_for_feature(df, feature)
    results.append((feature, ig, details))
    print(f"\n{feature}: Information Gain = {ig:.4f} bits")
    print("-" * 50)
    for val, yes, no, h in details:
        print(f"  {val:<12}: {yes} Yes, {no} No (H = {h:.4f})")
# Find best feature
best_feature = max(results, key=lambda x: x[1])
print(f"\n{'='*60}")
print(f"🏆 BEST SPLIT: {best_feature[0]} (IG = {best_feature[1]:.4f} bits)")
print(f"{'='*60}")
Output:
INFORMATION GAIN FOR EACH FEATURE
============================================================
Parent Entropy: 0.9403 bits
(9 Yes, 5 No out of 14 samples)
Outlook: Information Gain = 0.2467 bits
--------------------------------------------------
Sunny : 2 Yes, 3 No (H = 0.9710)
Overcast : 4 Yes, 0 No (H = 0.0000)
Rain : 3 Yes, 2 No (H = 0.9710)
Temperature: Information Gain = 0.0292 bits
--------------------------------------------------
Hot : 2 Yes, 2 No (H = 1.0000)
Mild : 4 Yes, 2 No (H = 0.9183)
Cool : 3 Yes, 1 No (H = 0.8113)
Humidity: Information Gain = 0.1518 bits
--------------------------------------------------
High : 3 Yes, 4 No (H = 0.9852)
Normal : 6 Yes, 1 No (H = 0.5917)
Wind: Information Gain = 0.0481 bits
--------------------------------------------------
Weak : 6 Yes, 2 No (H = 0.8113)
Strong : 3 Yes, 3 No (H = 1.0000)
============================================================
🏆 BEST SPLIT: Outlook (IG = 0.2467 bits)
============================================================
![Information Gain Comparison]
Outlook has the highest information gain because the "Overcast" branch becomes completely pure (all Yes)
Why Outlook Wins: The Power of Pure Nodes
WHY OUTLOOK HAS THE HIGHEST INFORMATION GAIN:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
When we split on Outlook:
                [All 14 days]
                 9 Yes, 5 No
                  H = 0.940
                      │
               Split on Outlook
                      │
          ┌───────────┼───────────┐
          ↓           ↓           ↓
      [Sunny]    [Overcast]     [Rain]
      2Y, 3N      4Y, 0N        3Y, 2N
      H=0.971     H=0.000       H=0.971
                     ↑
                 PURE NODE!
               (Entropy = 0)
The "Overcast" branch is PERFECTLY PURE!
All 4 Overcast days resulted in "Yes" (play tennis).
This is gold! We can immediately say:
"If Overcast → Always play tennis"
The other features don't create any pure nodes,
so they have lower information gain.
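The numbers in that diagram are easy to verify by hand. The sketch below is self-contained (the helper name entropy_from_counts is made up for this example) and recomputes the Outlook split:

import numpy as np

def entropy_from_counts(yes, no):
    """Entropy in bits of a node containing `yes` Yes-days and `no` No-days."""
    total = yes + no
    return -sum((c / total) * np.log2(c / total) for c in (yes, no) if c > 0)

parent = entropy_from_counts(9, 5)        # all 14 days             -> 0.9403
sunny = entropy_from_counts(2, 3)         # 5 Sunny days            -> 0.9710
overcast = entropy_from_counts(4, 0)      # 4 Overcast days (pure!) -> 0.0000
rain = entropy_from_counts(3, 2)          # 5 Rain days             -> 0.9710

weighted = (5/14) * sunny + (4/14) * overcast + (5/14) * rain
print(f"Weighted child entropy: {weighted:.4f} bits")           # 0.6935
print(f"Information gain:       {parent - weighted:.4f} bits")  # 0.2467

The pure Overcast node contributes zero entropy to the weighted average, which is what pushes Outlook's information gain above the other features.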
Step-by-Step: Building the Tree
print("BUILDING THE DECISION TREE")
print("="*60)
print("""
STEP 1: Split on Outlook (highest IG = 0.247)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[Root: All 14 days]
9 Yes, 5 No
H = 0.940
│
Outlook = ?
┌───────────┼───────────┐
↓ ↓ ↓
[Sunny] [Overcast] [Rain]
2Y, 3N 4Y, 0N 3Y, 2N
H=0.971 H=0.000 H=0.971
│
DONE!
(Pure: Yes)
""")
# Now split Sunny branch
sunny_df = df[df['Outlook'] == 'Sunny']
print("\nSTEP 2: Split the Sunny branch")
print("-"*50)
for feature in ['Temperature', 'Humidity', 'Wind']:
    ig, details = calc_ig_for_feature(sunny_df, feature)
    print(f"{feature}: IG = {ig:.4f}")
print("\nHumidity wins! Split Sunny on Humidity:")
print("""
[Sunny]
2Y, 3N
│
Humidity = ?
┌─────────┴─────────┐
↓ ↓
[High] [Normal]
0Y, 3N 2Y, 0N
DONE! DONE!
(Pure: No) (Pure: Yes)
""")
# Split Rain branch
rain_df = df[df['Outlook'] == 'Rain']
print("\nSTEP 3: Split the Rain branch")
print("-"*50)
for feature in ['Temperature', 'Humidity', 'Wind']:
    ig, details = calc_ig_for_feature(rain_df, feature)
    print(f"{feature}: IG = {ig:.4f}")
print("\nWind wins! Split Rain on Wind:")
print("""
[Rain]
3Y, 2N
│
Wind = ?
┌─────────┴─────────┐
↓ ↓
[Weak] [Strong]
3Y, 0N 0Y, 2N
DONE! DONE!
(Pure: Yes) (Pure: No)
""")
The Final Tree
THE COMPLETE DECISION TREE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                [Outlook?]
               /    |     \
              /     |      \
         Sunny   Overcast   Rain
           /        |         \
    [Humidity?]    YES      [Wind?]
      /     \               /     \
   High    Normal        Weak    Strong
     |       |             |       |
    NO      YES           YES      NO
DECISION RULES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. If Outlook = Overcast → Play Tennis (YES)
2. If Outlook = Sunny AND Humidity = High → Don't Play (NO)
3. If Outlook = Sunny AND Humidity = Normal → Play Tennis (YES)
4. If Outlook = Rain AND Wind = Weak → Play Tennis (YES)
5. If Outlook = Rain AND Wind = Strong → Don't Play (NO)
These rules achieve 100% accuracy on the training data!
![Decision Tree for Tennis]
The final decision tree built using information gain to select the best splits
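As a quick sanity check on that 100% claim, the five rules can be replayed against the training data. This is a small illustrative sketch: it assumes the df DataFrame built earlier is still in scope, and predict_play_tennis is a made-up helper name.

def predict_play_tennis(row):
    """Apply the five hand-derived decision rules to one row of the dataset."""
    if row['Outlook'] == 'Overcast':
        return 'Yes'
    if row['Outlook'] == 'Sunny':
        return 'Yes' if row['Humidity'] == 'Normal' else 'No'
    # Remaining case: Outlook == 'Rain'
    return 'Yes' if row['Wind'] == 'Weak' else 'No'

predictions = df.apply(predict_play_tennis, axis=1)
accuracy = (predictions == df['PlayTennis']).mean()
print(f"Hand-built tree, training accuracy: {accuracy:.0%}")  # 100% on these 14 days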
Entropy vs Gini: A Comparison
Decision trees can use either entropy or Gini impurity. How do they compare?
import numpy as np
import matplotlib.pyplot as plt
# Compare entropy and Gini for binary classification
p = np.linspace(0.001, 0.999, 1000)
entropy_vals = -p * np.log2(p) - (1-p) * np.log2(1-p)
gini_vals = 2 * p * (1 - p)
# Normalize Gini to same scale for comparison
gini_scaled = gini_vals * 2 # Max Gini is 0.5, max entropy is 1
plt.figure(figsize=(12, 6))
plt.plot(p, entropy_vals, 'b-', linewidth=3, label='Entropy')
plt.plot(p, gini_vals, 'r--', linewidth=3, label='Gini Impurity')
plt.plot(p, gini_scaled, 'r:', linewidth=2, label='Gini (scaled 2x)')
plt.xlabel('Proportion of Class 1 (p)', fontsize=12)
plt.ylabel('Impurity', fontsize=12)
plt.title('Entropy vs Gini Impurity: Very Similar Shapes!', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(0, 1)
plt.savefig('entropy_vs_gini.png', dpi=150, bbox_inches='tight')
plt.show()
ENTROPY vs GINI IMPURITY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                 ENTROPY              GINI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Formula:         -Σ p log₂(p)         1 - Σ p²
Range:           [0, log₂(k)]         [0, 1 - 1/k]
For binary:      [0, 1]               [0, 0.5]
At 50-50:        1.0                  0.5
Computation:     Slower (uses log)    Faster (no log)
Origin:          Information theory   Statistical classification
WHICH TO USE?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
In practice: Results are nearly identical!
• Gini is DEFAULT in scikit-learn (slightly faster)
• Entropy has nice interpretation (bits of information)
• Both prefer balanced splits
• Both produce similar trees 99% of the time
Choose either — the difference rarely matters.
![Entropy vs Gini]
Entropy and Gini impurity have very similar shapes — both are maximized at 50-50 and zero when pure
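One quick way to see how little the choice matters here: rank the four Play Tennis features by entropy gain and by Gini gain and compare the orderings. The sketch below reuses the df DataFrame from earlier; the impurity and gain helpers are made up for this example.

import numpy as np

def impurity(labels, kind='entropy'):
    """Entropy (bits) or Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p)) if kind == 'entropy' else 1 - np.sum(p ** 2)

def gain(data, feature, kind, target='PlayTennis'):
    """Reduction in impurity from splitting `data` on `feature`."""
    parent = impurity(data[target], kind)
    children = sum((len(sub) / len(data)) * impurity(sub[target], kind)
                   for _, sub in data.groupby(feature))
    return parent - children

for kind in ('entropy', 'gini'):
    ranking = sorted(['Outlook', 'Temperature', 'Humidity', 'Wind'],
                     key=lambda f: gain(df, f, kind), reverse=True)
    print(f"{kind:>7}: {ranking}")
# Both criteria rank the features Outlook > Humidity > Wind > Temperature.

On this dataset the two criteria agree on every ranking, which is the typical outcome in practice.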
The Mathematics: Why This Works
THE INFORMATION-THEORETIC VIEW:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Claude Shannon (1948) asked:
"What's the minimum number of bits needed to
communicate a message?"
Answer: It depends on PROBABILITY!
If something is CERTAIN (p=1):
→ 0 bits needed (you already know!)
If two outcomes are EQUALLY LIKELY (p=0.5 each):
→ 1 bit needed (just say "yes" or "no")
If 8 outcomes are equally likely:
→ 3 bits needed (log₂(8) = 3)
ENTROPY IS THE AVERAGE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
H = Σ pᵢ × (bits needed for outcome i)
= Σ pᵢ × log₂(1/pᵢ)
= -Σ pᵢ × log₂(pᵢ)
Each outcome i:
- Occurs with probability pᵢ
- Needs log₂(1/pᵢ) bits to specify
- Contributes pᵢ × log₂(1/pᵢ) to the average
INFORMATION GAIN = BITS SAVED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before question: Need H(parent) bits to specify class
After question: Need H(children) bits on average
Information Gain: H(parent) - H(children) bits SAVED!
A question with IG = 0.5 bits saves you half a yes/no question on average!
Code: Complete Implementation
import numpy as np
from collections import Counter
class DecisionTreeWithEntropy:
    """Decision tree using information gain (entropy) for splits."""

    def __init__(self, max_depth=None, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def _entropy(self, y):
        """Calculate entropy of a label array."""
        if len(y) == 0:
            return 0
        counts = Counter(y)
        probs = [count / len(y) for count in counts.values()]
        return -sum(p * np.log2(p) for p in probs if p > 0)

    def _information_gain(self, y, y_left, y_right):
        """Calculate information gain from a split."""
        if len(y_left) == 0 or len(y_right) == 0:
            return 0
        parent_entropy = self._entropy(y)
        n = len(y)
        child_entropy = (
            (len(y_left) / n) * self._entropy(y_left) +
            (len(y_right) / n) * self._entropy(y_right)
        )
        return parent_entropy - child_entropy

    def _best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_ig = 0
        best_feature = None
        best_threshold = None
        for feature in range(X.shape[1]):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask
                if sum(left_mask) == 0 or sum(right_mask) == 0:
                    continue
                ig = self._information_gain(y, y[left_mask], y[right_mask])
                if ig > best_ig:
                    best_ig = ig
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold, best_ig

    def _build_tree(self, X, y, depth=0):
        """Recursively build the tree."""
        n_samples = len(y)
        n_classes = len(set(y))
        # Stopping conditions
        if (self.max_depth and depth >= self.max_depth) or \
           n_classes == 1 or \
           n_samples < self.min_samples_split:
            return {'leaf': True, 'class': Counter(y).most_common(1)[0][0],
                    'samples': n_samples, 'entropy': self._entropy(y)}
        # Find best split
        feature, threshold, ig = self._best_split(X, y)
        if feature is None or ig == 0:
            return {'leaf': True, 'class': Counter(y).most_common(1)[0][0],
                    'samples': n_samples, 'entropy': self._entropy(y)}
        # Split
        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask
        return {
            'leaf': False,
            'feature': feature,
            'threshold': threshold,
            'ig': ig,
            'entropy': self._entropy(y),
            'samples': n_samples,
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1)
        }

    def fit(self, X, y):
        self.tree = self._build_tree(np.array(X), np.array(y))
        return self

    def _predict_one(self, x, node):
        if node['leaf']:
            return node['class']
        if x[node['feature']] <= node['threshold']:
            return self._predict_one(x, node['left'])
        return self._predict_one(x, node['right'])

    def predict(self, X):
        return [self._predict_one(x, self.tree) for x in np.array(X)]

    def print_tree(self, node=None, indent="", feature_names=None):
        """Pretty print the tree with entropy and IG."""
        if node is None:
            node = self.tree
        if node['leaf']:
            print(f"{indent}🎯 Predict: {node['class']} "
                  f"(samples={node['samples']}, H={node['entropy']:.3f})")
        else:
            fname = feature_names[node['feature']] if feature_names else f"X[{node['feature']}]"
            print(f"{indent}📊 {fname} <= {node['threshold']}? "
                  f"(IG={node['ig']:.3f}, H={node['entropy']:.3f}, n={node['samples']})")
            print(f"{indent}├── Yes:")
            self.print_tree(node['left'], indent + "│   ", feature_names)
            print(f"{indent}└── No:")
            self.print_tree(node['right'], indent + "    ", feature_names)
# Test on iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=42
)
tree = DecisionTreeWithEntropy(max_depth=3)
tree.fit(X_train, y_train)
print("DECISION TREE WITH ENTROPY (INFORMATION GAIN)")
print("="*60)
print("\nTree Structure:")
tree.print_tree(feature_names=iris.feature_names)
# Accuracy
predictions = tree.predict(X_test)
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(f"\nTest Accuracy: {accuracy:.2%}")
Quick Reference
ENTROPY AND INFORMATION GAIN: CHEAT SHEET
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ENTROPY:
H(S) = -Σ pᵢ log₂(pᵢ)
• Measures uncertainty/disorder
• H = 0 → Pure (all one class)
• H = 1 → Maximum uncertainty (binary 50-50)
• Higher H → More mixed → Harder to predict
INFORMATION GAIN:
IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)
• Measures reduction in uncertainty
• IG = 0 → Question tells us nothing
• High IG → Great question!
• Decision trees pick highest IG at each split
COMPARISON WITH GINI:
• Both measure impurity
• Both prefer balanced splits
• Gini: 1 - Σ pᵢ² (faster, no log)
• Entropy: -Σ pᵢ log₂(pᵢ) (information theory)
• Results usually identical
SCIKIT-LEARN:
DecisionTreeClassifier(criterion='entropy') # Use entropy
DecisionTreeClassifier(criterion='gini') # Use Gini (default)
Key Takeaways
1. Entropy measures uncertainty — High entropy = mixed classes = hard to predict
2. Pure nodes have zero entropy — All samples belong to one class
3. Maximum entropy at 50-50 — Most uncertain when perfectly balanced
4. Information Gain = entropy reduction — How much does a question help?
5. Trees choose highest IG — Ask the most informative question first
6. 50-50 splits are ideal — They maximize information gain
7. Entropy uses bits — The number of yes/no questions needed
8. Gini and entropy are similar — Both work well, Gini is slightly faster
The One-Sentence Summary
Efficient Emma won the game show because she asked questions that split the possibilities in half, maximizing information gain, and decision trees choose their questions the same way: at each split they pick the feature that reduces entropy (uncertainty) the most, so every question separates the classes as effectively as possible.
What’s Next?
Now that you understand entropy and information gain, you’re ready for:
- Gini Impurity Deep Dive — The other splitting criterion
- Random Forests — Many trees voting together
- Pruning Strategies — Preventing overfitting
- Feature Importance — Which features matter most?
Follow me for the next article in the Tree Based Models series!
Let’s Connect!
If Emma’s game show strategy made entropy click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What’s your favorite "maximum information" question? Mine is "Is it bigger than a breadbox?" — a classic 20 Questions opener that splits the world roughly in half! 📦
The difference between Random Randy and Efficient Emma? Randy asked whatever came to mind; Emma asked questions that maximized information gain. Decision trees learned the same lesson — always ask the question that teaches you the most.
Share this with someone confused by entropy. After meeting Emma, they’ll never forget it!
Happy splitting! 🌲