12 min read · Oct 17, 2025
Every journey in cybersecurity, whether you’re a seasoned pentester or just starting out, leads you back to the fundamentals. And few fundamentals are as crucial or as misunderstood as hashing.
For decades, one algorithm reigned supreme as the digital fingerprint for files across the internet: the Message Digest Algorithm 5 (MD5). It was fast, it was simple, and it was everywhere. But just like a faulty lock, MD5 was eventually broken.
This is the story of MD5: what it is, how it works with a level of detail you can use, and why, for any security-critical task, it’s now a vulnerability waiting to happen.
Chapter 1: What is MD5? The Digital Fingerprint and the Avalanche
MD5 is a cryptographic hash function. Think of it as a one-way mathematical blender. You throw any input into it (a file, a password, a single line of text) and it spits out a fixed-size output that is meant to be practically unique to that input.
This output is a 128-bit hash value, which we typically see as 32 hexadecimal characters.
Input vs. Output (The Core Concept)
Input: "hello"
MD5 Hash (The Digest): 5d41402abc4b2a76b9719d911017c592
The crucial takeaway: whether your input is a 1 KB text file or a 10 GB video, the output is always 32 hex characters. This fixed-length output is what made MD5 so convenient for checking whether a file has changed: a single bit flip in the input should change roughly half of the bits in the output hash. This is the avalanche effect in action.
The limits of any hash function follow from the Pigeonhole Principle: since the input space is effectively unlimited (any file size) and the output space is finite (only 2^128 ≈ 3.4 × 10^38 possible hashes), collisions are mathematically guaranteed to exist. The goal of a cryptographic hash function is to make finding those collisions computationally infeasible. MD5 failed this test spectacularly.
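To see the fixed-length output and the avalanche effect for yourself, here is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_hex` and `differing_bits` are made up for this illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many of the 128 digest bits differ between two digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = b"hello"
# Flip the lowest bit of the first byte: "h" (0x68) becomes "i" (0x69).
flipped = bytes([original[0] ^ 0x01]) + original[1:]

digest_a = md5_hex(original)
digest_b = md5_hex(flipped)

print(digest_a, len(digest_a))
print(digest_b, len(digest_b))
print(differing_bits(digest_a, digest_b), "of 128 bits differ")
```

Both digests are 32 hex characters long, and a one-bit change in the input typically flips around half of the output bits.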
Chapter 2: The Historical Context: From Luhn to Rivest
To truly appreciate MD5, we must look at the lineage of hash functions that preceded it. MD5 didn’t appear in a vacuum; it was the culmination of decades of research driven by the need for efficient data integrity checks.
The Genesis of Hashing
The idea of using a short code to represent a large piece of data dates back to the 1950s.
**1953: Hans Peter Luhn** (IBM) suggested using a small code to represent data for faster searching, essentially inventing the concept of a hash table.
**1978: Rabin’s Hash.** The introduction of cryptographic properties to hashing began with the work of Michael O. Rabin, focusing on making the hash output unpredictable.
The MD Family: MD2, MD4, and the Birth of MD5
Ronald Rivest, a key figure in modern cryptography (of RSA fame), developed the Message Digest family specifically for digital signature applications.
The MD Family Lineage
**MD2 (1989):** 128-bit hash. Designed for 8-bit processors. Later found to have collision vulnerabilities.
**MD4 (1990):** 128-bit hash. Designed for speed in software. Weaknesses were found almost immediately, leading to its rapid deprecation.
**MD5 (1991):** 128-bit hash. A refinement of MD4, designed to be slightly slower but more secure. It was the standard for over a decade.
The evolution from MD4 to MD5 was a direct response to the discovery of weaknesses in MD4’s design. Rivest intentionally made MD5 more complex, hoping to fix the flaws of its predecessor. Ironically, this complexity only delayed the inevitable.
Chapter 3: The Inner Workings: The Six Steps to a Hash (The Deep Dive)
When you run an **md5sum** command, what exactly is the processor doing? It’s a beautifully complex process involving six distinct steps. This is the core of the algorithm, and understanding it is key to understanding its weakness.
Step 1: Convert Input to Binary
The first thing MD5 does is strip away the human-readable format. Your text, image, or file content is converted into a raw stream of binary (0s and 1s).
For the message "hello" (ASCII, lowercase):
h = 01101000
e = 01100101
l = 01101100
l = 01101100
o = 01101111
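As a quick check of the conversion, this small sketch prints each byte of the lowercase message "hello" as eight bits. The function name `to_bits` is an illustrative choice, not part of MD5 itself.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as an 8-bit binary string, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


message = "hello".encode("ascii")
for char, byte in zip("hello", message):
    print(char, "=", f"{byte:08b}")

print(to_bits(message))
```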
Step 2: Padding (The 512-Bit Rule)
MD5 cannot process data of arbitrary length. It requires the total length of the message to be a multiple of 512 bits.
Padding is always performed, even when the message length is already a multiple of 512 bits:
1. A single ‘1’ bit is appended.
2. Enough ‘0’ bits are added to bring the message length to a size that is congruent to 448 (mod 512). This means the length is exactly 64 bits less than the next multiple of 512.
3. The original message length in bits is appended as a 64-bit little-endian integer.
Appending the length (often called Merkle–Damgård strengthening) ensures that messages of different lengths are padded differently. It does not, however, protect against length extension attacks: because the digest is simply the final internal state, MD5 remains vulnerable to them, like other hashes built on the same construction.
For "hello" (40 bits), the padded 512-bit block is:
01101000 01100101 01101100 01101100 01101111 10000000
00000000 × 50 (fifty zero bytes)
00101000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
MD5 uses little-endian to store the length, which means the least significant byte comes first.
40 in binary (64-bit, little-endian) =
00101000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
- Notice the first byte is 00101000 → this is decimal 40 in binary.
- All the following bytes = 0 (because 40 is small).
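The padding rule can be sketched in a few lines of Python. This is a byte-oriented illustration (it assumes the message is a whole number of bytes, which is the common case), and `md5_pad` is a name invented for this example.

```python
import struct


def md5_pad(message: bytes) -> bytes:
    """Pad a message the way MD5 does, working on whole bytes."""
    bit_length = (len(message) * 8) % 2**64   # length field wraps at 64 bits
    padded = message + b"\x80"                # the single '1' bit, then seven '0' bits
    # Zero bytes until the length is 56 bytes (448 bits) modulo 64 bytes (512 bits).
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    # Original length in bits as a 64-bit little-endian integer.
    padded += struct.pack("<Q", bit_length)
    return padded


block = md5_pad(b"hello")
print(len(block) * 8, "bits")
print(block[:6].hex(" "))    # message bytes followed by 0x80
print(block[-8:].hex(" "))   # length field, least significant byte first
```

Note that a 56-byte message already exceeds the 448-bit mark once the ‘1’ bit is added, so it pads out to two blocks.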
Step 3: Divide Into 512-bit Blocks
If your file is large (and most are), MD5 splits the padded message into multiple 512-bit chunks. Each 512-bit block is then further divided into sixteen 32-bit words (M[0] to M[15]).
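From the previous padding step, the full block in binary is:
01101000 01100101 01101100 01101100 01101111 10000000 00000000 ... (407 zeros in total after the ‘1’ bit) 00101000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Total = 512 bits
Let’s break it into sixteen 32-bit words.
We’ll take 32 bits at a time (bytes shown in message order):
M[0]
01101000 01100101 01101100 01101100
M[1]
01101111 10000000 00000000 00000000
M[2–13]
00000000 00000000 00000000 00000000
M[14]
00101000 00000000 00000000 00000000
M[15]
00000000 00000000 00000000 00000000
Each word is read as a little-endian integer, so M[0] = 0x6C6C6568, M[1] = 0x0000806F, and M[14] = 0x00000028 (the message length, 40 bits).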
MD5 processes data in fixed-size blocks of 512 bits.
If your message (after padding) is larger than 512 bits, it is simply split into multiple 512-bit blocks.
For example:
Block 1 bits 0–511
Block 2 bits 512–1023
Block 3 bits 1024–1535 ……
Each block is processed separately using the MD5 algorithm, but the result of one block carries over to the next.
Imagine a file whose padded length = 1024 bits:
Number of blocks = 1024 ÷ 512 = 2 blocks
MD5 splits it into:
Block 1 → M[0..15] (16 × 32-bit words)
Block 2 → M[0..15] (next 16 × 32-bit words)
Each block is treated the same way: divided into 16 words, processed with the A, B, C, D registers, and the hash state is updated after each block.
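Here is a small sketch of the splitting step. It builds the padded "hello" block by hand so that it runs on its own; `split_blocks` and `block_to_words` are hypothetical helper names.

```python
import struct

# The padded "hello" block: 5 message bytes, 0x80, 50 zero bytes,
# then the bit length (40) as a 64-bit little-endian integer.
padded = b"hello" + b"\x80" + b"\x00" * 50 + struct.pack("<Q", 40)


def split_blocks(data: bytes) -> list:
    """Cut padded data into 64-byte (512-bit) blocks."""
    return [data[i:i + 64] for i in range(0, len(data), 64)]


def block_to_words(block: bytes) -> tuple:
    """Read one 64-byte block as sixteen little-endian 32-bit words."""
    return struct.unpack("<16I", block)


for number, block in enumerate(split_blocks(padded), start=1):
    words = block_to_words(block)
    print("Block", number)
    for index, word in enumerate(words):
        print(f"  M[{index}] = 0x{word:08X}")
```

The `<` in the format string is what makes each word little-endian, which is why the bytes 68 65 6c 6c come out as 0x6C6C6568.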
Step 4: Initialize 4 Registers (The IV)
MD5 uses four 32-bit variables, often called registers, which are initialized with fixed hexadecimal constants. Together they form the Initialization Vector (IV):
Register A: 0x67452301 (Binary: 01100111 01000101 00100011 00000001)
Register B: 0xEFCDAB89 (Binary: 11101111 11001101 10101011 10001001)
Register C: 0x98BADCFE (Binary: 10011000 10111010 11011100 11111110)
Register D: 0x10325476 (Binary: 00010000 00110010 01010100 01110110)
They are four fixed 32-bit constants chosen by MD5’s creator, Ron Rivest, and defined in **RFC 1321**, that serve as the initial state (A, B, C, D) for the hashing process. There is nothing secret or random about them: read as little-endian bytes, they are simply the counting sequence 01 23 45 67 89 AB CD EF followed by FE DC BA 98 76 54 32 10. This simple, public pattern gives every MD5 computation the same consistent starting point.
These values are constant for every single MD5 hash ever calculated. They are the starting state for the entire process, and the final hash is simply these registers after they have been thoroughly mixed with the data.
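The four initial values look arbitrary in hex, but they come from a simple byte pattern. The sketch below reads the byte sequence 01 23 45 ... as four little-endian words; the variable names are illustrative.

```python
import struct

# Counting up from 01 to EF, then counting back down from FE to 10.
seed_bytes = bytes.fromhex("0123456789abcdef" "fedcba9876543210")

# Interpret the 16 bytes as four little-endian 32-bit words.
A, B, C, D = struct.unpack("<4I", seed_bytes)

for name, value in zip("ABCD", (A, B, C, D)):
    print(f"Register {name}: 0x{value:08X}")
```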
Step 5: Main Processing (The 64-Round Compression Function)
This is the core of the algorithm, the **compression function**, where the 512-bit message block is mixed into the 128-bit state (A, B, C, D). It is an iterative process of 64 operations, grouped into four rounds of 16 operations each. A single operation can be summarized by the equation:
a = b + ((a + F(b, c, d) + M[i] + T[j]) <<<s)
Where:
a, b, c, d are the four registers.
F is a non-linear function specific to the round.
M[i] is a 32-bit word from the current 512-bit message block.
T[j] is a 32-bit constant, derived from the sine function.
<<< s is a left bitwise rotation by s bits.
All additions are modulo 2^32.
The Four Non-Linear Functions (F, G, H, I)
The non-linear functions are what give MD5 its scrambling power. They ensure that the relationship between the input and output is complex and non-linear, which is a requirement for a secure hash function.
MD5 Non-Linear Functions
Round 1 (Operations 1–16):
F(B, C, D) = (B & C) | (~B & D)
Explanation: A bitwise selector. Takes the bit from C where B is 1 and the bit from D where B is 0.
Round 2 (Operations 17–32):
G(B, C, D) = (B & D) | (C & ~D)
Explanation: The same selector, controlled by D. Takes the bit from B where D is 1 and the bit from C where D is 0.
Round 3 (Operations 33–48):
H(B, C, D) = B ^ C ^ D
Explanation: XOR of B, C, D. Each output bit is 1 where an odd number of the input bits are 1.
Round 4 (Operations 49–64):
I(B, C, D) = C ^ (B | ~D)
Explanation: Combines OR, NOT, and XOR to mix the bits in the final round.
These functions are designed to maximize the avalanche effect. In the first round, the function F acts like an “if-then-else” operation (if B then C else D), introducing a high degree of non-linearity. The other functions, G, H, and I, continue this mixing process, ensuring that the final hash depends on a complex interplay of all input bits.
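The four functions translate almost directly into code. This is a minimal sketch; the only assumption beyond the formulas above is the explicit `MASK`, which is needed because Python integers are not limited to 32 bits.

```python
MASK = 0xFFFFFFFF  # keep every result within 32 bits


def F(b, c, d):
    """Round 1: take bits from c where b is 1, from d where b is 0."""
    return ((b & c) | (~b & d)) & MASK


def G(b, c, d):
    """Round 2: take bits from b where d is 1, from c where d is 0."""
    return ((b & d) | (c & ~d)) & MASK


def H(b, c, d):
    """Round 3: bitwise parity of the three inputs."""
    return (b ^ c ^ d) & MASK


def I(b, c, d):
    """Round 4: c XOR (b OR NOT d)."""
    return (c ^ (b | (~d & MASK))) & MASK


B, C, D = 0xEFCDAB89, 0x98BADCFE, 0x10325476
print(f"F(B, C, D) = 0x{F(B, C, D):08X}")
print(f"G(B, C, D) = 0x{G(B, C, D):08X}")
print(f"H(B, C, D) = 0x{H(B, C, D):08X}")
print(f"I(B, C, D) = 0x{I(B, C, D):08X}")
```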
What is s in MD5?
In MD5, s is the number of bits by which the 32-bit sum is rotated left in each operation.
This is called a left rotation (circular shift): bits shifted out from the left come back on the right.
Rotation helps mix the bits so that small changes in the input drastically change the hash (avalanche effect).
The values of s are fixed constants specified in the MD5 algorithm (RFC 1321).
MD5 shift amounts per round (each set of four values repeats every 4 steps):
Round 1: 7, 12, 17, 22
Round 2: 5, 9, 14, 20
Round 3: 4, 11, 16, 23
Round 4: 6, 10, 15, 21
So, Round 1 Step 1 → s = 7, Step 2 → s = 12, and so on.
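A left rotation and the full 64-entry shift schedule can be sketched like this. The names `rotl32` and `SHIFTS` are illustrative choices; the values come from the table above.

```python
MASK = 0xFFFFFFFF

# Each round's four shift amounts, repeated four times (16 steps per round).
SHIFTS = (
    [7, 12, 17, 22] * 4
    + [5, 9, 14, 20] * 4
    + [4, 11, 16, 23] * 4
    + [6, 10, 15, 21] * 4
)


def rotl32(value: int, s: int) -> int:
    """Rotate a 32-bit value left by s bits; bits leaving the top re-enter at the bottom."""
    value &= MASK
    return ((value << s) | (value >> (32 - s))) & MASK


print(f"0x{rotl32(0x12345678, 4):08X}")   # the leading hex digit wraps around to the end
print(f"0x{rotl32(0x80000000, 1):08X}")   # the top bit wraps around to bit 0
print(SHIFTS[:4], SHIFTS[16:20])
```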
Worked Example: The First Two Steps of Round 1
**Message = "hello"** After padding to 512 bits and splitting into 16 words (M[0]–M[15]):
M[0] = 0x6C6C6568
M[1] = 0x0000806F
...
Initial registers (IVs):
A = 0x67452301
B = 0xEFCDAB89
C = 0x98BADCFE
D = 0x10325476
Constants (T[1], T[2]):
T[1] = 0xD76AA478
T[2] = 0xE8C7B756
Step 1:
a = b + ((a + F(b, c, d) + M[i] + T[j]) <<< s)
F(B, C, D) = (B AND C) OR ((NOT B) AND D)
s = 7 (first value from the Round 1 table)
Step-by-step (mod 2³²):
Compute F(B, C, D) → 0x98BADCFE
Add: A + F + M[0] + T[1] = 0x43D709DF
Rotate left 7 bits → 0xEB84EFA1
Add B → 0xDB529B2A
Update registers (rotate):
A ← D, D ← C, C ← B, B ← new value
Registers after step 1:
A = 0x10325476
B = 0xDB529B2A
C = 0xEFCDAB89
D = 0x98BADCFE
Step 2:
M[1] = 0x0000806F
T[2] = 0xE8C7B756
s = 12 (next value from the Round 1 table)
Compute F(B, C, D) → 0xCBE8CFDC
Add: A + F + M[1] + T[2] = 0xC4E35C17
Rotate left 12 bits → 0x35C17C4E
Add B → 0x11141778
Registers after 2 steps:
A = 0x98BADCFE
B = 0x11141778
C = 0xDB529B2A
D = 0xEFCDAB89
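If you want to reproduce the hand calculation, the sketch below applies the step equation twice to the "hello" block. It uses one common way to organise the registers: the new value goes into B, and the other three shift along. `round1_step` is a hypothetical helper name.

```python
MASK = 0xFFFFFFFF


def rotl32(value, s):
    """Rotate a 32-bit value left by s bits."""
    return ((value << s) | (value >> (32 - s))) & MASK


def F(b, c, d):
    """Round 1 function: bits of c where b is 1, bits of d where b is 0."""
    return ((b & c) | (~b & d)) & MASK


def round1_step(state, word, constant, shift):
    """Apply one Round 1 operation and return the rotated register tuple."""
    a, b, c, d = state
    total = (a + F(b, c, d) + word + constant) & MASK
    new_b = (b + rotl32(total, shift)) & MASK
    # Rotate roles: A <- D, B <- new value, C <- B, D <- C.
    return (d, new_b, b, c)


state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)   # the IVs
state = round1_step(state, 0x6C6C6568, 0xD76AA478, 7)      # M[0], T[1], s = 7
print("after step 1:", [f"0x{r:08X}" for r in state])
state = round1_step(state, 0x0000806F, 0xE8C7B756, 12)     # M[1], T[2], s = 12
print("after step 2:", [f"0x{r:08X}" for r in state])
```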
The 64 Constants (T[i])
The MD5 algorithm uses 64 pre-calculated 32-bit constants, **T[1]** through **T[64]**, which are derived from the sine function. Specifically, T[i] is the integer part of 2^32 × |sin(i)|, where **i** is in radians. These constants are added in each step to break up any potential symmetries in the data, further enhancing the randomness of the output.
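The sine-based definition is short enough to compute directly. This sketch rebuilds the table with Python's `math.sin`; note that the list is zero-indexed, so `T[0]` here corresponds to T[1] in the text.

```python
import math

# T[i] = integer part of 2^32 * |sin(i)|, for i = 1..64 (i in radians).
T = [int(abs(math.sin(i)) * 2**32) for i in range(1, 65)]

print(f"T[1]  = 0x{T[0]:08X}")
print(f"T[2]  = 0x{T[1]:08X}")
print(f"T[64] = 0x{T[63]:08X}")
```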
This level of detail (the non-linear functions, the constants, and the rotations) is precisely where the algorithm was eventually broken, as researchers found “differential paths” through these rounds that allowed them to predict and control the output.
Step 6: Combine the Results
After all 64 operations are complete for the current 512-bit block, the final values of the four registers (A, B, C, D) are added (mod 2^32) to the values they held when the block started. For the first block, those are the IVs from Step 4. This result then becomes the new initial state for the next 512-bit block. Once the last block is processed, the final concatenated values of A, B, C, and D form the 128-bit message digest.
A = A + A_initial
B = B + B_initial
C = C + C_initial
D = D + D_initial
This ensures that the hash depends on both the current block and all previous blocks.
Example: Single Block ("hello")
Initial IVs:
A_init = 0x67452301
B_init = 0xEFCDAB89
C_init = 0x98BADCFE
D_init = 0x10325476
After 64 operations (values implied by the known digest of "hello"):
A = 0xC2FB1E5C
B = 0x865CA033
C = 0xF8E294BB
D = 0x8292C29A
Add initial IVs:
A_final = 0xC2FB1E5C + 0x67452301 = 0x2A40415D (mod 2^32)
B_final = 0x865CA033 + 0xEFCDAB89 = 0x762A4BBC (mod 2^32)
C_final = 0xF8E294BB + 0x98BADCFE = 0x919D71B9 (mod 2^32)
D_final = 0x8292C29A + 0x10325476 = 0x92C51710 (mod 2^32)
MD5 hash = concatenation of
A_final || B_final || C_final || D_final, with each register written in little-endian byte order: 5d41402a bc4b2a76 b9719d91 1017c592.
Multi-Block Messages
If the message is larger than 512 bits, it is split into multiple 512-bit blocks:
First block: Process 64 operations → add IVs → update A, B, C, D.
Second block: Use the updated A, B, C, D from the previous block as the new initial state.
Repeat for all remaining blocks.
Effect:
Every block depends on all previous blocks.
Changing even one bit in the first block changes the final hash completely.
Final Step After Last Block
After the last 512-bit block is processed:
Take the final A, B, C, D registers.
Concatenate them (little-endian) → 128-bit MD5 hash.
Represent as hexadecimal → standard MD5 digest.
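Putting the six steps together, here is a compact educational sketch of the whole algorithm in Python. It is not meant to replace `hashlib`. One detail not covered above is which message word each operation uses after Round 1; the index formulas for `g` below follow the schedule in RFC 1321. The name `md5_sketch` is invented for this example.

```python
import math
import struct

MASK = 0xFFFFFFFF
SHIFTS = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]  # zero-indexed
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)


def rotl32(value, s):
    """Rotate a 32-bit value left by s bits."""
    return ((value << s) | (value >> (32 - s))) & MASK


def md5_sketch(message: bytes) -> str:
    # Steps 1-2: work on raw bytes, append 0x80, zeros, and the 64-bit length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack("<Q", (len(message) * 8) % 2**64)

    state = IV  # Step 4: initial registers
    for offset in range(0, len(padded), 64):  # Step 3: 512-bit blocks
        m = struct.unpack("<16I", padded[offset:offset + 64])
        a, b, c, d = state
        for i in range(64):  # Step 5: 64 operations in four rounds
            if i < 16:
                f, g = (b & c) | (~b & MASK & d), i
            elif i < 32:
                f, g = (b & d) | (c & ~d & MASK), (5 * i + 1) % 16
            elif i < 48:
                f, g = b ^ c ^ d, (3 * i + 5) % 16
            else:
                f, g = c ^ (b | (~d & MASK)), (7 * i) % 16
            total = (a + f + m[g] + T[i]) & MASK
            a, b, c, d = d, (b + rotl32(total, SHIFTS[i])) & MASK, b, c
        # Step 6: add this block's result to the state it started from.
        state = tuple((old + new) & MASK for old, new in zip(state, (a, b, c, d)))

    # Final step: concatenate A, B, C, D in little-endian byte order.
    return struct.pack("<4I", *state).hex()


print(md5_sketch(b"hello"))
```

A simple way to gain confidence in a sketch like this is to compare its output with `hashlib.md5(data).hexdigest()` for a range of input lengths, especially around 55, 56, and 64 bytes where the padding changes the block count.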
Chapter 4: The Final Lesson: Security is a Moving Target
The story of MD5 is a vital lesson: **security is a moving target**. We must constantly audit our tools and replace them when they are no longer fit for the job. MD5 was a pioneer, but its time in the cryptographic spotlight is over.
The Modern Fix: What You Should Be Using
If you are building a system today, you need to move past MD5. The replacements are stronger, in some cases faster, and designed to withstand the collision attacks that killed MD5.
For Cryptographic Security (Passwords, Signatures, Authentication)
*You need strong hashes, and for passwords, deliberately slow ones.*
Recommended Cryptographic Hashes
SHA-2 (SHA-256, SHA-512): Strong collision resistance, industry standard for modern digital signatures and TLS certificates.
SHA-3 (Keccak): A completely new design, offering a strong alternative to the SHA-2 family.
Argon2 (Password Hashing): Intentionally slow and memory-hard. Designed specifically to resist brute-force attacks on passwords, making it the current gold standard.
BLAKE2/BLAKE3: Very fast, yet cryptographically secure. Often outperforms SHA-3 while maintaining strong security.
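Several of these replacements ship with Python's standard `hashlib` module, so switching away from MD5 is often a one-line change. The sketch below covers only the general-purpose hashes; Argon2 and BLAKE3 are not in the standard library and need a third-party package, so they are left out here.

```python
import hashlib

data = b"hello"

# General-purpose cryptographic hashes available in the standard library.
digests = {
    "sha256": hashlib.sha256(data).hexdigest(),
    "sha512": hashlib.sha512(data).hexdigest(),
    "sha3_256": hashlib.sha3_256(data).hexdigest(),
    "blake2b": hashlib.blake2b(data).hexdigest(),
}

for name, digest in digests.items():
    print(f"{name:9} {len(digest) * 4:4} bits  {digest}")
```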
For Fast Data Integrity (Non-Cryptographic, Speed-Critical)
You need fast and reliable hashes.
Recommended Fast Integrity Hashes
**xxHash:** Blazing fast, often significantly quicker than MD5, making it the superior choice for high-speed integrity checks where cryptographic strength is not required [3].
If you have any questions or require further clarification, don’t hesitate to reach out. Additionally, you can stay connected for more advanced cybersecurity insights and updates:
🔹 GitHub: @0xEhab 🔹 Instagram: @pjo_ 🔹 LinkedIn: https://www.linkedin.com/in/ehxb/
Stay tuned for more comprehensive write-ups and tutorials to deepen your cybersecurity expertise. 🚀