Challenges, takeaways, and findings of coding up a famous text compression algorithm.
2018-01-15
programming
I hadn’t put much thought into how text compression works until I watched Tom Scott’s video on YouTube. He always does a great job explaining all sorts of computer-related topics.
Inspired by the video, I built my own working text compression program using an approach called Huffman Coding.
The algorithm’s roots go all the way back to 1951, which is pretty amazing. David A. Huffman was a graduate student at MIT at the time. One of his professors allowed students to choose between solving a difficult problem and taking the final exam. Huffman opted for the problem: finding the most efficient way to represent pieces of data. The solution he came up with proved more efficient than any method known at the time¹. He later published a paper, and Huffman Coding was born.
To help illustrate the value of Huffman Coding, take this question I ran across on Stack Overflow the other day. Someone asked what the real-world applications of Huffman Coding were. They were under the impression that there were more modern ways of doing lossless text compression. Someone replied that their question was like asking, “Give me an example of a car made out of steel.”
From what I understand, Huffman Coding is still widely used in text compression. It’s typically used in conjunction with other techniques to be even more efficient (for example, the DEFLATE format behind gzip and ZIP combines LZ77 with Huffman Coding), but it still stands as a valuable algorithm in its own right.
Fair warning: My implementation, described below, is very basic compared to popular text compression tools.
I will describe how the algorithm works from a high-level perspective, and also share a few things I learned while building my program. My code will be in Elixir, but the concepts can be applied to any programming language.
How digital text is represented
An important “gotcha” to remember is that a computer cannot actually store letters, numbers, pictures, or anything else. Computers store only bits. We think of bits as “0” and “1”, but a bit is really just some electricity that is either on or off. Some sort of context needs to exist in order to make any sense of the bits. ASCII is an agreed-upon standard for representing text electronically.
When dealing with ASCII, a single character always takes up one byte, or eight bits, on a computer. Creating a plain text file with just one character of content would result in a file that is one byte in size.
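We can verify this directly in Elixir, the language my implementation uses. This is just a quick sketch in an IEx-style session, nothing specific to my program:

```elixir
# ?a is the integer code point of the character literal "a".
IO.inspect(?a)              # 97 — the ASCII code for "a"

# A single ASCII character occupies exactly one byte.
IO.inspect(byte_size("a"))  # 1

# Elixir strings are binaries, so "a" is literally the single byte 97.
IO.inspect("a" == <<97>>)   # true

# The eight bits behind that one byte:
IO.inspect(?a |> Integer.to_string(2) |> String.pad_leading(8, "0"))
# "01100001"
```

Each ASCII character costs those same eight bits no matter how common it is, and that fixed cost is exactly what Huffman Coding attacks.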