What Is Unicode?
Unicode is the universal standard that assigns a unique number, called a code point, to every character used in human writing systems: Latin letters, Chinese ideographs, Arabic script, mathematical symbols, and emoji alike. Before Unicode, dozens of incompatible encodings made it nearly impossible to mix scripts in one document. Unicode solved this by giving every character one canonical identity. Code points are written as U+ followed by a hexadecimal value, such as U+0041 for the capital letter A or U+1F600 for the grinning face emoji. The standard reserves the range U+0000 through U+10FFFF, more than 1.1 million positions, of which roughly 155,000 were assigned to characters as of Unicode 16.0.
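The U+ notation is just the code point in uppercase hexadecimal, padded to at least four digits. A minimal sketch of converting in both directions follows; the helper names toUPlus and fromUPlus are made up for this illustration and are not part of any standard API.

```javascript
// Format a numeric code point in U+XXXX notation (at least four hex digits).
function toUPlus(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be an integer in 0..0x10FFFF");
  }
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Parse U+XXXX notation back into the character it names.
function fromUPlus(notation) {
  const match = /^U\+([0-9A-Fa-f]{4,6})$/.exec(notation);
  if (!match) {
    throw new SyntaxError("expected U+ followed by 4 to 6 hex digits");
  }
  // String.fromCodePoint throws a RangeError for values above 0x10FFFF.
  return String.fromCodePoint(parseInt(match[1], 16));
}

console.log(toUPlus("A".codePointAt(0))); // U+0041
console.log(toUPlus(0x1F600));            // U+1F600
console.log(fromUPlus("U+0041"));         // A
```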
Code Points vs. Code Units
The single most important distinction in Unicode is between a code point and a code unit. A code point is the abstract number of a character. A code unit is the fixed-size chunk that a particular encoding uses to store that character in memory. They are not the same thing, and confusing them is the source of countless bugs. In UTF-8 a code unit is one byte (8 bits); in UTF-16 a code unit is two bytes (16 bits). A single code point may need several code units. The emoji U+1F600 is one code point, but it occupies four UTF-8 code units or two UTF-16 code units. This is exactly why JavaScript reports a string length of 2 for one emoji: the language counts UTF-16 code units, not code points.
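The difference is easy to observe directly. The sketch below counts the same string three ways; countUnits is a hypothetical helper name, and the code assumes an environment that provides the standard TextEncoder global (modern browsers and recent Node.js).

```javascript
// Count one string as code points, UTF-16 code units, and UTF-8 code units.
function countUnits(text) {
  return {
    // Spreading a string iterates by code point, so surrogate pairs count once.
    codePoints: [...text].length,
    // .length counts 16-bit UTF-16 code units.
    utf16Units: text.length,
    // TextEncoder always encodes to UTF-8; each byte is one UTF-8 code unit.
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}

console.log(countUnits("A"));          // { codePoints: 1, utf16Units: 1, utf8Bytes: 1 }
console.log(countUnits("\u{1F600}"));  // { codePoints: 1, utf16Units: 2, utf8Bytes: 4 }
```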
UTF-8 vs. UTF-16
UTF-8 is a variable-width encoding that uses one to four bytes per code point. ASCII characters (U+0000–U+007F) take a single byte, which makes UTF-8 extremely compact for English and source code, and backwards compatible with ASCII. Accented Latin, Greek, and Cyrillic take two bytes; most CJK ideographs take three; characters above U+FFFF, including most emoji and rarer scripts, take four. UTF-8 dominates the web: over 98% of pages use it.
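To see where the one-to-four-byte widths come from, here is a hand-rolled sketch of the UTF-8 bit layout for a single code point. In practice TextEncoder does this for you; encodeUtf8 is a hypothetical name used only for illustration.

```javascript
// Encode one code point into its UTF-8 bytes by hand.
function encodeUtf8(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("code point out of range");
  }
  if (cp >= 0xD800 && cp <= 0xDFFF) {
    throw new RangeError("surrogates cannot be encoded in UTF-8");
  }
  if (cp <= 0x7F) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7FF) {
    // 110xxxxx 10xxxxxx
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp <= 0xFFFF) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeUtf8(0x41)));    // 41
console.log(toHex(encodeUtf8(0xE9)));    // C3 A9
console.log(toHex(encodeUtf8(0x4E2D)));  // E4 B8 AD
console.log(toHex(encodeUtf8(0x1F600))); // F0 9F 98 80
```

The leading byte announces the sequence length and every continuation byte starts with the bits 10, which is what lets a decoder resynchronize from any position in a byte stream.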
UTF-16 uses two or four bytes per code point. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) fit in a single 16-bit unit, while characters above U+FFFF are stored as a surrogate pair: a high surrogate in the range U+D800–U+DBFF followed by a low surrogate in the range U+DC00–U+DFFF, which together encode one code point. Those 2,048 values are reserved for this purpose and never stand for characters on their own. UTF-16 is the in-memory format for JavaScript strings, Java, and Windows APIs, which is why so many string-length surprises trace back to it.
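The surrogate-pair arithmetic is short enough to write out. This sketch subtracts 0x10000 from the code point, then splits the remaining 20 bits into two 10-bit halves; the function names are hypothetical, and JavaScript performs the same computation internally whenever it stores such a character.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  if (!Number.isInteger(cp) || cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("only U+10000..U+10FFFF need a surrogate pair");
  }
  const offset = cp - 0x10000;             // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Recombine a high and low surrogate into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [hi, lo] = toSurrogatePair(0x1F600);
console.log(hi.toString(16), lo.toString(16));       // d83d de00
console.log(fromSurrogatePair(hi, lo).toString(16)); // 1f600
```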
What This Tool Shows
Paste any text and this inspector iterates it by code point — using for...of so that surrogate pairs and emoji are treated as one character rather than two. For each code point it shows the rendered glyph, the U+XXXX notation, the decimal value, a best-effort category and name, the UTF-8 byte sequence in hex, the UTF-16 code units in hex, and the HTML entity (numeric, plus the named entity when a common one exists). The summary at the top reports four counts that frequently disagree: the code point count, the UTF-16 length (JavaScript's .length), the grapheme count from Intl.Segmenter when available, and the total UTF-8 byte size. Click any code point or entity value to copy it to your clipboard.
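A simplified sketch of that per-code-point loop is shown here. It is not this tool's actual source: the function name inspect and the row field names are assumptions, and the category, name, and entity columns are left out.

```javascript
// Build one row of encoding details per code point in the input.
function inspect(text) {
  const encoder = new TextEncoder();
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");
  const rows = [];
  // for...of walks the string by code point, keeping surrogate pairs together.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    rows.push({
      glyph: ch,
      notation: "U+" + hex(cp, 4),
      decimal: cp,
      // Array.from with a mapper yields a plain array of hex strings.
      utf8: Array.from(encoder.encode(ch), (b) => hex(b, 2)).join(" "),
      // ch.length is 1 or 2, the number of UTF-16 code units.
      utf16: Array.from({ length: ch.length }, (_, i) => hex(ch.charCodeAt(i), 4)).join(" "),
    });
  }
  return rows;
}

console.log(inspect("A\u{1F600}"));
```

One caveat of this sketch: TextEncoder replaces a lone (unpaired) surrogate with the bytes for U+FFFD, so malformed input would need separate handling.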
Why Emoji Count as Multiple Units
Emoji routinely break naive character counting for two separate reasons. First, most emoji sit above U+FFFF, so each one is a surrogate pair (two UTF-16 code units) even though it is a single code point. Second, many emoji you see as one symbol are actually grapheme clusters built from several code points joined by Zero Width Joiners (U+200D). The family emoji can be four people plus three joiners: seven code points, eleven UTF-16 code units, and 25 UTF-8 bytes, all rendered as one glyph. Skin-tone modifiers, flag sequences (pairs of regional indicator letters), and combining accent marks also form multi-code-point clusters, though without a joiner. That is why the four counts in the summary diverge: graphemes count what a human sees, code points count Unicode characters, UTF-16 length counts 16-bit storage units, and UTF-8 size counts bytes.
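The divergence can be reproduced with a four-person family sequence. The sketch below assumes Intl.Segmenter may be missing and returns null in that case; countGraphemes is a hypothetical helper name, and the exact grapheme result depends on the Unicode data shipped with the runtime.

```javascript
// Count user-perceived characters, or return null when Intl.Segmenter is unavailable.
function countGraphemes(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

// Man, woman, girl, boy joined by three Zero Width Joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log([...family].length);                       // 7 code points
console.log(family.length);                            // 11 UTF-16 code units
console.log(new TextEncoder().encode(family).length);  // 25 UTF-8 bytes
console.log(countGraphemes(family));                   // grapheme count, or null
```

On runtimes with current Unicode data the last line is expected to report a single grapheme, matching what the reader sees.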
Frequently Asked Questions
What is a Unicode code point?
A Unicode code point is the unique number assigned to each character, written as U+ followed by a hexadecimal value (U+0041 for 'A', U+1F600 for the grinning emoji). Code points range from U+0000 to U+10FFFF. A code point identifies a character abstractly and says nothing about how it is stored — that is the job of an encoding like UTF-8 or UTF-16.
What is the difference between a code point and a code unit?
A code point is the abstract number of a character. A code unit is the fixed-size building block an encoding uses to store it: 8 bits in UTF-8, 16 bits in UTF-16. One code point can take several code units. The emoji U+1F600 is one code point but four UTF-8 bytes or two UTF-16 units (a surrogate pair), which is why JavaScript's .length reports 2 for one emoji.
Why does an emoji count as two characters in JavaScript?
JavaScript strings are UTF-16, and .length counts code units. Characters above U+FFFF are stored as two-unit surrogate pairs, so .length returns 2 for most emoji. To count code points, use [...str].length or for...of. To count what a human sees as one character, use Intl.Segmenter, since ZWJ sequences combine several code points into one grapheme.
What is a grapheme cluster?
A grapheme cluster is what a reader perceives as one character, possibly made of several code points. A family emoji is one grapheme built from several emoji joined by Zero Width Joiners. A base letter plus a combining accent is two code points but one grapheme. This tool reports a grapheme count via Intl.Segmenter when your browser supports it.
How many bytes does a character take in UTF-8?
ASCII (U+0000–U+007F) uses 1 byte. U+0080–U+07FF (accented Latin, Greek, Cyrillic, Arabic) uses 2 bytes. U+0800–U+FFFF (most CJK and symbols) uses 3 bytes. U+10000 and above, including most emoji, uses 4 bytes. This tool shows the exact UTF-8 byte sequence in hex for every code point.
What is an HTML entity?
An HTML entity writes a character using a code rather than the literal glyph. Numeric entities use the code point: &#128512; (decimal) or &#x1F600; (hex). Named entities use a friendly name like &amp; for & or &copy; for the copyright sign. They are essential for the reserved characters &, <, >, and quotes. This inspector shows the numeric entity for every character and a named entity for common ones.
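Numeric entities can be derived mechanically from the code point, while named entities need a lookup table. The sketch below is illustrative only: NAMED_ENTITIES is a hypothetical table holding five well-known names (the HTML specification defines many more), and toEntities and escapeHtml are made-up helper names.

```javascript
// A small, hypothetical subset of HTML named character references.
const NAMED_ENTITIES = new Map([
  ["&", "&amp;"],
  ["<", "&lt;"],
  [">", "&gt;"],
  ['"', "&quot;"],
  ["\u00A9", "&copy;"],
]);

// Return the decimal, hex, and (if known) named entity for one character.
function toEntities(ch) {
  const cp = ch.codePointAt(0);
  return {
    decimal: `&#${cp};`,
    hex: `&#x${cp.toString(16).toUpperCase()};`,
    named: NAMED_ENTITIES.get(String.fromCodePoint(cp)) ?? null,
  };
}

// Escape the reserved characters so text can be placed safely in HTML content.
function escapeHtml(text) {
  return text.replace(/[&<>"]/g, (ch) => NAMED_ENTITIES.get(ch));
}

console.log(toEntities("\u{1F600}")); // { decimal: '&#128512;', hex: '&#x1F600;', named: null }
console.log(toEntities("&").named);   // &amp;
console.log(escapeHtml('a < b & "c"')); // a &lt; b &amp; &quot;c&quot;
```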