For years, I wrote code, built APIs, parsed JSON, printed logs, stored strings in databases…
…and I had absolutely no idea how characters actually work.
Until one day, I decided to learn it. So I sat down, searched the internet, and finally understood how text works in computers.
This article is that whole journey.
Let’s start
Everyone knows computers understand only 0s and 1s. If you want to store a number like 7, it simply becomes:
7 → 111 (binary)
Easy.
But what about characters?
Note: The number 7 (as a value) is different from the character ‘7’ (as text).
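To see that difference for yourself, here's a tiny Python sketch (just an illustration using built-ins, nothing special):

```python
# The number 7 as a value: stored as the binary number 111
print(bin(7))         # 0b111

# The character '7' as text: stored as the code assigned to the symbol '7'
print(ord('7'))       # 55
print(bin(ord('7')))  # 0b110111
```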
So I thought…
“What if we just assign every character a number and store that number in binary?”
That’s it.
This is exactly what ASCII does.
**ASCII (American Standard Code for Information Interchange)**
ASCII defines both a character set (mapping characters to values) and an encoding format.
It maps 128 characters to integer values (code points) ranging from 0 to 127. It includes the uppercase letters (A–Z), lowercase letters (a–z), digits (0–9), punctuation, and control codes.
It uses a fixed-width encoding that occupies 1 byte per character. The maximum value fits within 7 bits, but it is conventionally stored in an 8-bit byte.
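Here's a small Python sketch of that mapping, going from a character to its ASCII code and back:

```python
# Character -> ASCII code point
print(ord('A'))   # 65
print(ord('a'))   # 97

# Code point -> character
print(chr(65))    # 'A'

# ASCII is fixed-width: every character takes exactly 1 byte
print('Hello'.encode('ascii'))        # b'Hello'
print(len('Hello'.encode('ascii')))   # 5 bytes for 5 characters
```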
But what about other languages, emojis, etc.?
ASCII couldn’t help.
We needed something bigger.
And that's where Unicode comes into the picture.
Unicode provides a vast, universal map of nearly all written languages, assigning each character a unique identifier called a code point.
Examples:
- A → U+0041
- क → U+0915
- 你 → U+4F60
- 💙 → U+1F499
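You can inspect these code points yourself in Python; this little loop just reproduces the list above:

```python
# Print the Unicode code point for each example character
for ch in ['A', 'क', '你', '💙']:
    print(ch, '→', 'U+{:04X}'.format(ord(ch)))

# A → U+0041
# क → U+0915
# 你 → U+4F60
# 💙 → U+1F499
```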
Unicode itself provides the mapping, but it relies on specific Unicode Transformation Formats (UTFs) to define how those code points are encoded into bytes.
UTF-8
It's a variable-width encoding that is backward compatible with ASCII (ASCII characters take 1 byte) but uses up to **4 bytes** for other characters.
UTF-8 is perfect for networks: binary parsing is fast.
1 byte for ASCII = small. 1–4 bytes, allocated dynamically = compact.
That's why it is the dominant encoding on the internet.
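To make that concrete, here's a small Python sketch (my own illustration) of the ASCII compatibility and the variable widths:

```python
# Backward compatibility: plain ASCII text is byte-for-byte identical in UTF-8
print('hello'.encode('ascii') == 'hello'.encode('utf-8'))  # True

# Other characters take 2 to 4 bytes
print(len('é'.encode('utf-8')))   # 2
print(len('क'.encode('utf-8')))   # 3
print(len('💙'.encode('utf-8')))  # 4
```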
It is streaming-friendly!
If one byte gets messed up, you lose only that character, not the whole string.
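Here's a quick sketch of that resilience (the corrupted character may show up as a replacement mark or two, but the rest of the string survives):

```python
# Corrupt one byte in the middle of a UTF-8 string
data = 'héllo world'.encode('utf-8')       # b'h\xc3\xa9llo world'
corrupted = data[:1] + b'\xff' + data[2:]  # overwrite one byte of 'é'

# Only the damaged character is lost; everything after it still decodes
print(corrupted.decode('utf-8', errors='replace'))  # h??llo world (? = replacement char)
```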
This is why:
- JSON is UTF-8
- HTML is UTF-8
- Linux terminals default to UTF-8
- APIs expect UTF-8
- Databases encourage UTF-8
UTF-8 can store virtually every major language and script used globally today, covering a massive range of character sets:
- Global Languages: It includes all European languages (English, French, Spanish, German, Greek, Russian, etc.), major Asian languages (Chinese, Japanese, Korean, Hindi, Thai, Vietnamese, etc.), and Middle Eastern languages (Arabic, Hebrew, Persian, etc.).
- Specialised Content: It also handles mathematical symbols, musical notation, comprehensive emojis, and historical scripts.
UTF-16
It’s a variable-width encoding using 2 bytes or 4 bytes per character.
Back when Unicode was new, developers thought:
“We have 65,536 possible characters. Let’s store each one in 2 bytes. Done.”
Simple idea, right?
A character = 1 unit = 2 bytes. No complexity. No variable length.
Except… Unicode didn't stop at **65k** characters. Languages kept adding scripts. Emoji came. Historic texts came. Symbols came.
Suddenly 65,536 wasn’t enough.
UTF-16 Had to Extend… and Everything Changed
When Unicode ran out of space, engineers introduced:
**Surrogate Pairs**, a fancy way of saying:
“Some characters need 4 bytes now.”
So UTF-16 ended up like this:
- Most characters → 2 bytes
- Rare ones / emojis → 4 bytes
Just like UTF-8 has 1–4 bytes… UTF-16 also has 2 or 4 bytes.
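Here's a small Python sketch of a surrogate pair in action (using little-endian UTF-16 without a byte-order mark, purely for illustration):

```python
ch = '💙'  # U+1F499, outside the original 65,536-character range

# In UTF-16 this character needs two 2-byte code units: a surrogate pair
encoded = ch.encode('utf-16-le')
print(len(encoded))      # 4 bytes
print(encoded.hex(' '))  # 3d d8 99 dc -> high surrogate D83D + low surrogate DC99

# A character inside the 2-byte range needs only one unit
print(len('क'.encode('utf-16-le')))  # 2 bytes
```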
UTF-16 ended up doing this:
- English letter → 2 bytes (wasteful)
- Hindi → 2 bytes (efficient)
- Chinese → 2 bytes (efficient)
- Emoji → 4 bytes (painful)
UTF-8 did this:
- English → 1 byte
- Hindi → 3 bytes
- Chinese → 3 bytes
- Emoji → 4 bytes
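If you want to check those numbers yourself, here's a quick Python comparison (the exact counts depend on the specific character, of course):

```python
samples = {'English': 'a', 'Hindi': 'क', 'Chinese': '你', 'Emoji': '💙'}

# Compare how many bytes each character needs in UTF-8 vs UTF-16
for name, ch in samples.items():
    utf8 = len(ch.encode('utf-8'))
    utf16 = len(ch.encode('utf-16-le'))
    print(f'{name}: {ch} → UTF-8: {utf8} bytes, UTF-16: {utf16} bytes')

# English: a → UTF-8: 1 bytes, UTF-16: 2 bytes
# Hindi: क → UTF-8: 3 bytes, UTF-16: 2 bytes
# Chinese: 你 → UTF-8: 3 bytes, UTF-16: 2 bytes
# Emoji: 💙 → UTF-8: 4 bytes, UTF-16: 4 bytes
```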
And because the web was English-first, UTF-8 won out over **UTF-16**.
UTF-16 excels in Asian-language-heavy systems because loads of Chinese/Japanese text fits beautifully in **2 bytes** while UTF-8 uses 3 bytes.
Java & JavaScript still use UTF-16.
This article is long enough, so I'll end it here.
My Final Thought
Unicode is one of those small things that touches your work every single day.
And once you understand it, you'll be more confident and write safer code.