When counting characters in JavaScript, the surrogate pair issue (where emoji and certain CJK characters count as 2) is fairly well-known. You’ve probably seen advice like "just use [...str].length" to solve it.
However, when I implemented a character counter for a textarea on the frontend, I discovered there are other causes of mismatches beyond surrogate pairs. It turned out to be deeper than I expected, so I’d like to share what I learned. I might have some things wrong, so please take this as a reference rather than absolute truth.
What Happened
I was implementing a character counter for a <textarea>. I was already using [...str].length as a surrogate pair countermeasure, but Laravel’s max validation was still rejecting some inputs.
After investigating, I foun…
When counting characters in JavaScript, the surrogate pair issue (where emoji and certain CJK characters count as 2) is fairly well-known. You’ve probably seen advice like "just use [...str].length" to solve it.
However, when I implemented a character counter for a textarea on the frontend, I discovered there are other causes of mismatches beyond surrogate pairs. It turned out to be deeper than I expected, so I’d like to share what I learned. I might have some things wrong, so please take this as a reference rather than absolute truth.
What Happened
I was implementing a character counter for a <textarea>. I was already using [...str].length as a surrogate pair countermeasure, but Laravel’s max validation was still rejecting some inputs.
After investigating, I found the culprit: newline normalization.
Causes Beyond Surrogate Pairs
| Cause | JS Side | PHP/MySQL Side | Discrepancy |
|---|---|---|---|
| Emoji (surrogate pairs) | "🎉".length → 2 | mb_strlen("🎉") → 1 | JS counts more |
| Newline normalization | "a\nb".length → 3 | After HTTP: "a\r\nb" → 4 | PHP counts more |
| Lone surrogates | Counted | MySQL error or warning | Possible save failure |
| NFD form (older macOS) | "é".length → 2 | mb_strlen("é") → 2 | Matches, but unintentionally 2 chars |
Surrogate pairs can be handled with [...str].length, but the newline issue isn’t solved by that alone. The more newlines in the text, the bigger the discrepancy—for example, 10 newlines means a +10 character difference.
About Newline Normalization
This is where I got stuck the most.
Browser → Server
When you submit the contents of a <textarea> via HTTP, newlines are normalized from \n to \r\n.
According to MDN, for wrap="soft" (the default):
the browser ensures that all line breaks in the entered value are a CR+LF pair
The WHATWG Blog also explains this in detail—newlines are normalized to CRLF during form submission per the spec.
// In browser
textarea.value // "a\nb\nc" → 5 characters
// Received by PHP
$_POST['field'] // "a\r\nb\r\nc" → 7 characters
Laravel’s max rule uses mb_strlen(), so it checks against the character count after normalization. This was why text with many newlines kept getting rejected.
The Round-Trip Problem During Editing
This issue doesn’t just occur with new input—it also happens on edit screens.
- Initial input: Browser (
\n) → HTTP submission converts to\r\n→ Saved to DB (\r\n) - Edit screen display: DB (
\r\n) → Displayed in textarea →textarea.valueis normalized to\n - JS count: Counts with
\n→ "5 characters" - Re-submission: Normalized to
\r\nagain → PHP/MySQL → "7 characters"
The textarea’s .value property is called the "API value," and it’s normalized to \n (LF) when retrieved. So even if \r\n is stored in the database, when you get it via JavaScript, it becomes \n.
This round-trip causes a constant mismatch between JS and server-side counting, which is why you need to normalize to \r\n on the JS side before counting.
Database Storage
Major databases like MySQL, PostgreSQL, and SQL Server store \r\n (CRLF) as-is as 2 bytes of data.
- Text created in Windows environments is stored as
\r\n - Input from Linux/Mac environments (or modern web browsers) uses
\n(LF)
In other words, databases don’t automatically convert newline codes, so the data varies depending on the input source.
Database-Specific Notes
MySQL / MariaDB
Data is basically stored as-is. However, when importing CSV files via LOAD DATA INFILE from the command line, if the default newline code setting doesn’t match, \r can remain as garbage at the end of data.
SQL Server
Due to its Windows affinity, \r\n is treated as standard. When copying and pasting results in SSMS (management tool), newline codes might get changed automatically, so be careful.
Other Causes
Lone Surrogates
Corrupted data can sometimes get mixed in through copy-paste. MySQL treats these as invalid UTF-8 and may throw an error or truncate with a warning.
The behavior varies depending on sql_mode settings—STRICT mode causes an error and stops, while non-STRICT mode issues a warning and performs truncation or replacement.
NFD Form
This is a decomposed form that was used in filenames on older macOS (HFS+). Characters like "é" or "ñ" can have different internal representations while looking the same.
// NFC (composed) vs NFD (decomposed)
"é".length // → 1 (NFC: single code point U+00E9)
"é".normalize("NFD").length // → 2 (NFD: "e" + combining acute accent)
"ñ".length // → 1 (NFC: single code point U+00F1)
"ñ".normalize("NFD").length // → 2 (NFD: "n" + combining tilde)
This also affects other languages with combining marks—Vietnamese, Korean Hangul, Japanese kana, and many others.
According to Apple’s official documentation, APFS (since macOS High Sierra in 2017) stores the normalization form as-is, unlike HFS+. However, normalization may still occur at the macOS layer, so NFD issues haven’t completely disappeared.
In practice, NFD coming through textarea input or normal web input is quite rare—it mostly becomes an issue when dealing with filenames or copy-paste.
MySQL doesn’t normalize by default, so it stores data in whatever form it receives. Depending on the collation, MySQL can search and compare NFD and NFC as "the same thing," but the stored byte count and CHAR_LENGTH results differ, so you still need to be careful from a character counting perspective.
Comparing JS Character Counting Methods
Using [...str].length for surrogate pair handling is common advice, but let me organize the differences between various methods.
Basic Comparison
| Method | Unit | "🎉" | "👨👩👧" | "\r\n" |
|---|---|---|---|---|
str.length | UTF-16 code units | 2 | 8 | 2 |
[...str].length | Code points | 1 | 5 | 2 |
Intl.Segmenter | Grapheme clusters | 1 | 1 | 1 |
TextEncoder | Byte count | 4 | 18 | 2 |
MySQL CHAR_LENGTH() | Code points | 1 | 5 | 2 |
MySQL’s CHAR_LENGTH() counts \r\n as 2 characters, so using Intl.Segmenter (which counts it as 1) will cause a mismatch.
Intl.Segmenter vs Spread Syntax
Here’s a more detailed comparison:
| Character Type | Intl.Segmenter | Spread [...] |
|---|---|---|
Windows newline (\r\n) | 1 character | Split into 2 |
| CJK variant selectors | 1 character | Split into 2 |
| Private Use Area (U+E000, etc.) | 1 character | 1 character |
| ZWJ emoji sequences (👩👩👧) | 1 character | Split apart |
The Intl.Segmenter Trap
Intl.Segmenter returns "the number of characters as humans visually perceive them."
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
[...segmenter.segment("👨👩👧")].length // → 1 (as it looks)
[...segmenter.segment("\r\n")].length // → 1 (CRLF as 1 character)
This seems good at first, but it can’t be used for MySQL VARCHAR limits. 👨👩👧 is 1 character in Intl.Segmenter, but MySQL’s CHAR_LENGTH() counts it as 5, so even if you’re within limits according to Segmenter, you might overflow in the database.
Interestingly, X (formerly Twitter) uses its own method. According to Emojipedia, in 2018 they changed to count all emoji uniformly as 2 characters. This differs from both Intl.Segmenter and [...str].length.
Private Use Area (PUA)
For the area commonly used for "custom characters" in Windows (Private Use Area: E000-F8FF), both Intl.Segmenter and spread syntax can handle them as 1 character without issues.
However, they only recognize it as "one code point"—Intl.Segmenter doesn’t know what the character actually looks like.
Lone Surrogate Handling
The handling of lone surrogates also differs by method:
| Method | Lone Surrogates | Combining Characters |
|---|---|---|
[...str] | Kept | Split |
str.match(/./gu) | Excluded | Split |
Intl.Segmenter | Kept | Kept as 1 character |
TextEncoder
Returns the UTF-8 byte count. Useful for checking MySQL’s LENGTH() or index byte limits (like 3072 bytes).
new TextEncoder().encode("abc").length // → 3 (1 byte × 3 characters)
new TextEncoder().encode("café").length // → 5 (c, a, f each 1 byte + é is 2 bytes in UTF-8)
new TextEncoder().encode("你好").length // → 6 (CJK characters are 3 bytes each in UTF-8)
By Use Case
| Use Case | Method |
|---|---|
| MySQL VARCHAR(N) limit | [...str].length (excluding lone surrogates, etc.) |
| MySQL index byte limit | TextEncoder().encode(str).length |
| Visual character count | Intl.Segmenter |
Common Mistakes
Being Satisfied with Just Surrogate Pair Handling
// Surrogate pairs are handled, but newlines cause mismatch
const count = [...str].length;
Using str.length Directly
// Emoji counted as 2 characters
"🎉".length // → 2
Looping by Index
// Surrogate pairs get split
for (let i = 0; i < str.length; i++) {
const char = str[i]; // 🎉 gets broken
}
// for...of works by code point
for (const char of str) {
console.log(char); // 🎉 stays intact
}
Using Intl.Segmenter for VARCHAR Limits
As mentioned above, it doesn’t match MySQL’s counting.
Using strlen() in PHP
// Returns byte count
strlen("café") // → 5 (é is 2 bytes in UTF-8)
strlen("你好") // → 6 (CJK characters are 3 bytes each)
// Use this instead
mb_strlen("café", 'UTF-8') // → 4
mb_strlen("你好", 'UTF-8') // → 2
Example Solution
Here’s a function I use to sync with MySQL’s VARCHAR(N) limit. There might be better approaches, though.
export function countForMySQL(str: string, charLimit: number) {
// 1. Exclude lone surrogates from count (MySQL treats them as invalid UTF-8)
const validChars = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/gu) ?? [];
// 2. Normalize newlines to CRLF for counting (to match post-HTTP-submission state)
const normalized = validChars.join('').replace(/\r?\n/g, "\r\n");
// 3. Count code points (= MySQL CHAR_LENGTH())
const charLength = [...normalized].length;
// 4. Byte count (for index limits)
const byteLength = new TextEncoder().encode(normalized).length;
return {
isValid: charLength <= charLimit,
charLength,
byteLength,
};
}
The key point is that this returns only "how MySQL would count it" without modifying the original data. Lone surrogates are excluded from the count because MySQL treats them as invalid UTF-8, but the input data itself is not modified.
Whether to reject input containing lone surrogates via validation can be decided by the calling code.
Reference Table
| Concept | MySQL | PHP | JavaScript |
|---|---|---|---|
| Character count (code points) | CHAR_LENGTH(col) | mb_strlen($str, 'UTF-8') | [...str].length |
| Byte count | LENGTH(col) | strlen($str) | new TextEncoder().encode(str).length |
| Visual character count | - | grapheme_strlen($str) | Intl.Segmenter |
Conclusion
Using [...str].length for surrogate pair handling is well-known, but there are cases where it’s not enough.
Intl.Segmenter is correct for visual character count, but it’s a bit tricky that it can’t be used for database limit checks.
NFD normalization might be needed in some cases, but it seems rare enough that it’s probably okay to ignore for now.