Understanding Unicode
Unicode is the universal character encoding standard that assigns a unique "code point" to every character it covers, across virtually every writing system in use today. From English letters to Chinese characters, from emoji to ancient scripts, Unicode covers them all, with over 149,000 characters as of Unicode 15.
Unicode Format Examples
| Character | Code point | Description |
|---|---|---|
| A | U+0041 | Latin Capital Letter A |
| 中 | U+4E2D | CJK Unified Ideograph (middle) |
| 😀 | U+1F600 | Grinning Face Emoji |
| ♠ | U+2660 | Black Spade Suit |
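
The code points in this table can be checked from any language with Unicode-aware strings. Here is a minimal Python sketch using only the standard library; `describe` is a hypothetical helper name chosen for illustration, and the names it prints come from the Unicode database bundled with the Python interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    code_point = ord(ch)  # the character's integer code point
    # Code points are conventionally written with at least four hex digits.
    return f"U+{code_point:04X} {unicodedata.name(ch)}"


for ch in "A中😀♠":
    print(ch, describe(ch))
```

Going the other way, `chr(0x2660)` turns a code point back into its character.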
🌍 Unicode Fun Fact
Unicode includes ancient scripts like Egyptian hieroglyphs, and proposals have even been made for the scripts of fictional languages such as Klingon (rejected) and Elvish (under consideration). It truly aims to encode ALL human writing!
Unicode Notation Formats
- U+XXXX: Standard Unicode notation (U+0041)
- \uXXXX: JavaScript/JSON escape (\u0041); code points above U+FFFF are written as two such escapes, a surrogate pair
- &#xXXXX;: HTML hex entity (&#x41;)
- &#NNNN;: HTML decimal entity (&#65;)
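
The four notations are different spellings of the same number, so they can all be derived from one code point. The sketch below shows one way to do that in Python. The function name `notations` and the dictionary keys are hypothetical and chosen for illustration. The surrogate-pair branch covers the JSON and classic JavaScript escape for code points above U+FFFF.

```python
def notations(ch: str) -> dict[str, str]:
    """Spell one character in each of the four notations listed above."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # \uXXXX holds exactly four hex digits, so larger code points are
        # split into a UTF-16 surrogate pair (high half, then low half).
        offset = cp - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        escape = f"\\u{high:04X}\\u{low:04X}"
    else:
        escape = f"\\u{cp:04X}"
    return {
        "unicode": f"U+{cp:04X}",
        "js_json": escape,
        "html_hex": f"&#x{cp:X};",
        "html_dec": f"&#{cp};",
    }


print(notations("A"))
print(notations("😀"))
```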
Why Unicode Matters
Internationalization
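Before Unicode, different regions used incompatible encodings. Text in Japanese Shift-JIS could not share a document with text in Russian Windows-1251, because the same byte values stand for different characters in each. Unicode replaced this patchwork with one shared character set.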
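
The incompatibility is easy to reproduce with Python's built-in codecs. This is a small sketch, and the sample strings and variable names are illustrative assumptions. Bytes written as Windows-1251 (`cp1251`) are misread when decoded as Shift-JIS, Windows-1251 cannot encode Japanese at all, and UTF-8 handles both in one string.

```python
russian = "Привет"
japanese = "日本語"

# Bytes written by a program that uses Windows-1251...
raw = russian.encode("cp1251")
# ...turn into different text when a reader assumes Shift-JIS.
misread = raw.decode("shift_jis", errors="replace")
print(misread == russian)  # False

# Windows-1251 has no code for Japanese characters at all.
try:
    japanese.encode("cp1251")
    cp1251_handles_japanese = True
except UnicodeEncodeError:
    cp1251_handles_japanese = False

# A Unicode encoding such as UTF-8 round-trips both scripts together.
mixed = f"{russian} {japanese}"
round_tripped = mixed.encode("utf-8").decode("utf-8")
print(round_tripped == mixed)  # True
```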
Emoji Support
Emojis are Unicode characters! When you send 👍, you're actually sending U+1F44D. The Unicode Consortium regularly adds new emoji.
Frequently Asked Questions
What's the difference between Unicode and UTF-8?
Unicode is the standard defining what characters exist and which code point each one has. UTF-8, UTF-16, and UTF-32 are encodings that describe how to store those code points as bytes.
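
To make the distinction concrete, the sketch below encodes the same two code points with each encoding and prints the resulting bytes. The helper name `encoded_hex` is hypothetical. The big-endian codec variants (`utf-16-be`, `utf-32-be`) are an assumption made here so that Python does not prepend a byte order mark.

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Return the encoded bytes of one character as space-separated hex."""
    return ch.encode(encoding).hex(" ")


for ch in ("A", "😀"):
    print(f"U+{ord(ch):04X}")
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"  {encoding:<9} {encoded_hex(ch, encoding)}")
```

The code point stays the same in every row. Only the byte sequence changes with the encoding.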
Why do some characters need more bytes?
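UTF-8 uses 1 to 4 bytes per code point, and larger code points need more bytes. Basic Latin (ASCII) uses 1 byte, most other scripts use 2 or 3, and most emoji use 4 (a few older symbols, such as ♠, fit in 3). This compactness for ASCII-heavy text, together with backward compatibility with ASCII, is a large part of why UTF-8 dominates the web.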
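
The byte count depends only on the size of the code point. The sketch below spells out the UTF-8 range boundaries in a hypothetical helper, `utf8_length`, and checks it against Python's own encoder for a few sample characters.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a given code point."""
    if code_point < 0x80:      # U+0000..U+007F: ASCII
        return 1
    if code_point < 0x800:     # U+0080..U+07FF: e.g. Greek, Cyrillic, Hebrew
        return 2
    if code_point < 0x10000:   # U+0800..U+FFFF: rest of the Basic Multilingual Plane
        return 3
    return 4                   # U+10000..U+10FFFF: e.g. most emoji


for ch in ("A", "Я", "中", "♠", "😀"):
    # Cross-check the range table against the real encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"{ch}  U+{ord(ch):04X}  {utf8_length(ord(ch))} byte(s)")
```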