Question 1

What is a Unicode code point?

Accepted Answer

A Unicode code point is the unique number assigned to each character in the Unicode standard, written as U+ followed by a hexadecimal value, like U+0041 for 'A' or U+1F600 for the grinning face emoji. Code points range from U+0000 to U+10FFFF, covering over 1.1 million possible positions. A code point identifies a character abstractly; it says nothing about how that character is stored in memory, which is the job of an encoding like UTF-8 or UTF-16.
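
As a rough illustration of the U+ notation, the sketch below reads the first code point of a string and formats it in hexadecimal. The helper name `toCodePointLabel` is made up for this example and assumes a non-empty input string; `codePointAt` and `String.fromCodePoint` are standard JavaScript.

```javascript
// Format the first code point of a non-empty string in U+XXXX notation.
// `toCodePointLabel` is a hypothetical helper name, not a library function.
function toCodePointLabel(char) {
  const cp = char.codePointAt(0); // numeric code point, e.g. 65 for "A"
  // By convention the hex value is uppercase and padded to at least 4 digits.
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toCodePointLabel("A"));         // U+0041
console.log(toCodePointLabel("\u{1F600}")); // U+1F600

// The reverse direction: build a string from a code point number.
console.log(String.fromCodePoint(0x41));    // A
```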

Question 2

What is the difference between a code point and a code unit?

Accepted Answer

A code point is the abstract number of a character (U+1F600). A code unit is the fixed-size building block an encoding uses to store that character: 8 bits in UTF-8, 16 bits in UTF-16. A single code point can require multiple code units. The emoji U+1F600 is one code point but is stored as four UTF-8 code units (bytes) or two UTF-16 code units (a surrogate pair). This is why JavaScript's string .length, which counts UTF-16 code units, reports 2 for a single emoji.
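
To make the distinction concrete, here is a small sketch that lists the code units of U+1F600 in both encodings. It assumes an environment where `TextEncoder` is available as a global (modern browsers and recent Node.js versions); the variable names are only illustrative.

```javascript
const grinning = "\u{1F600}"; // one code point, U+1F600

// UTF-16 code units: the 16-bit values a JavaScript string is made of.
const utf16Units = [];
for (let i = 0; i < grinning.length; i++) {
  utf16Units.push(grinning.charCodeAt(i).toString(16).toUpperCase());
}

// UTF-8 code units (bytes), produced by the standard TextEncoder.
const utf8Units = Array.from(new TextEncoder().encode(grinning), (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
);

console.log(utf16Units.join(" ")); // D83D DE00 (a surrogate pair)
console.log(utf8Units.join(" "));  // F0 9F 98 80
```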

Question 3

Why does an emoji count as two characters in JavaScript?

Accepted Answer

JavaScript strings are sequences of UTF-16 code units, and .length counts those units. Characters above U+FFFF, that is, outside the Basic Multilingual Plane, include most emoji and are stored as two 16-bit units called a surrogate pair, so .length returns 2 for each of them. To count actual code points, spread the string ([...str].length) or iterate with for...of, both of which step by code point. To count what a human sees as one character, use Intl.Segmenter, since emoji ZWJ sequences combine several code points into one grapheme.
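
The difference between the two counts can be seen in a short sketch. The sample string and variable names are chosen only for illustration.

```javascript
const text = "a\u{1F600}b"; // "a", grinning face, "b"

console.log(text.length);      // 4 (UTF-16 code units: 1 + 2 + 1)
console.log([...text].length); // 3 (code points)

// for...of also steps by code point, so the emoji arrives as one item
// instead of two separate surrogate halves.
const seen = [];
for (const ch of text) {
  seen.push(ch.codePointAt(0).toString(16));
}
console.log(seen.join(" ")); // 61 1f600 62
```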

Question 4

What is a grapheme cluster?

Accepted Answer

A grapheme cluster is what a reader perceives as a single character, which may be made of several code points. The family emoji is one grapheme built from several emoji joined by Zero Width Joiners (U+200D). An accented letter typed as a base letter plus a combining mark is two code points but one grapheme. This inspector reports a grapheme count using Intl.Segmenter when your browser supports it, alongside the raw code point and UTF-16 counts.
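
One possible way to count graphemes is sketched below. This is not necessarily how the inspector does it; `countGraphemes` is a hypothetical helper, and the fallback to a code point count when Intl.Segmenter is missing is an assumption made for this example.

```javascript
// Count grapheme clusters with Intl.Segmenter, falling back to a code point
// count when the API is unavailable. `countGraphemes` is a hypothetical name.
function countGraphemes(str) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return [...str].length; // fallback: code points, which may overcount
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

const accented = "e\u0301"; // "e" + COMBINING ACUTE ACCENT
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man, ZWJ, woman, ZWJ, girl

// Code point counts are the same in every environment.
console.log([...accented].length); // 2
console.log([...family].length);   // 5

// Where Intl.Segmenter is supported, each string is reported as one grapheme.
console.log(countGraphemes(accented));
console.log(countGraphemes(family));
```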

Question 5

How many bytes does a character take in UTF-8?

Accepted Answer

In UTF-8, ASCII characters (U+0000 to U+007F) use 1 byte. Characters from U+0080 to U+07FF (accented Latin, Greek, Cyrillic, Hebrew, Arabic) use 2 bytes. Characters from U+0800 to U+FFFF (most CJK ideographs and symbols) use 3 bytes. Characters from U+10000 to U+10FFFF, including emoji and rare scripts, use 4 bytes. This tool shows the exact UTF-8 byte sequence in hexadecimal for every code point you enter.
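
The ranges above can be written as a small lookup and checked against a real encoder. In this sketch `utf8Length` and `utf8Hex` are hypothetical helper names, and `TextEncoder` is assumed to be available as a global; it is not a description of the tool's own code.

```javascript
// Number of UTF-8 bytes for one code point, following the ranges above.
// `utf8Length` and `utf8Hex` are hypothetical helper names.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4; // U+10000 to U+10FFFF
}

// Hex byte sequence of a string, via the standard TextEncoder (always UTF-8).
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// One sample per range: "A", e-acute, a CJK ideograph, the grinning face.
for (const ch of ["A", "\u00E9", "\u4E2D", "\u{1F600}"]) {
  console.log(utf8Length(ch.codePointAt(0)), utf8Hex(ch));
}
// 1 41
// 2 C3 A9
// 3 E4 B8 AD
// 4 F0 9F 98 80
```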

Question 6

What is an HTML entity?

Accepted Answer

An HTML entity is a way to write a character in HTML using a code instead of the literal glyph. Numeric entities use the code point, like &#128512; (decimal) or &#x1F600; (hexadecimal) for the grinning emoji. Named entities use a friendly name, like &amp; for & or &copy; for the copyright sign. Entities are essential for the reserved characters &, <, >, and quotes, and for inserting symbols that are hard to type. This inspector shows the numeric entity for every character and the named entity for common ones.
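
A minimal sketch of both kinds of entity follows. The helper names `toNumericEntities` and `escapeHtml` are invented for this example, and the choice of &#39; for the single quote is one common convention rather than the only option.

```javascript
// Build a numeric entity for every code point in a string.
// `toNumericEntities` is a hypothetical helper name.
function toNumericEntities(str, hex = false) {
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    return hex ? `&#x${cp.toString(16).toUpperCase()};` : `&#${cp};`;
  }).join("");
}

// Escape only the reserved characters, mostly with their named entities.
// `escapeHtml` is a hypothetical helper name.
function escapeHtml(str) {
  const named = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return str.replace(/[&<>"']/g, (ch) => named[ch]);
}

console.log(toNumericEntities("\u{1F600}"));       // &#128512;
console.log(toNumericEntities("\u{1F600}", true)); // &#x1F600;
console.log(escapeHtml("R&D <b>"));                // R&amp;D &lt;b&gt;
```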

Unicode Inspector
