ASCII and Character Encodings Explained — ASCII, Unicode, and UTF-8
Character encodings define how text is stored as bytes. Learn how ASCII works, why Unicode was created, what UTF-8 and UTF-16 are, and why encoding mismatches cause garbled text.
The core problem
Computers store everything as numbers. A character encoding is the agreed-upon rule that says “number 65 means the letter A”. Without a shared encoding, the same sequence of bytes looks like completely different text depending on which rule you apply.
ASCII — where it started
ASCII (American Standard Code for Information Interchange) was defined in 1963. It maps 128 characters to numbers 0–127 using 7 bits. That covers:
- Uppercase letters A–Z (65–90)
- Lowercase letters a–z (97–122)
- Digits 0–9 (48–57)
- Punctuation: space, ! " # $ % & ' ( ) * + , - . / and more
- Control characters (0–31): newline, tab, carriage return, null, etc.
ASCII was designed for English. It has no accented characters, no non-Latin scripts, no emoji. The 8th bit (values 128–255) was left undefined — different systems used it for different things, which created chaos when exchanging text internationally.
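The mapping can be checked directly in Python. This is a minimal sketch; `fits_in_ascii` is a hypothetical helper name introduced here for illustration and is not part of any library.

```python
# ASCII assigns each character a number in the range 0-127.
samples = {ch: ord(ch) for ch in ("A", "Z", "a", "z", "0", "9", " ")}
print(samples)

def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised for any character above code 127, such as an accented letter.
        return False

print(fits_in_ascii("cafe"))       # plain English letters
print(fits_in_ascii("caf\u00e9"))  # contains e-acute, which ASCII cannot represent
```

The second call fails the check because the accented letter has no ASCII code.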
The code page era and why it failed
To handle non-English text, vendors created code pages — extended versions of ASCII that used the 128 extra values (128–255) for language-specific characters. Windows-1252 used them for Western European accents. ISO-8859-5 used them for Cyrillic. Shift-JIS used an entirely different scheme for Japanese.
The problem: the same byte value meant different characters in different code pages. Byte 0xE9 is “é” in Windows-1252, “й” in Windows-1251, and “щ” in ISO-8859-5 (both Cyrillic). Files had to carry metadata about which code page they used, and when that metadata was lost or ignored, text became garbled. This is where mojibake (garbled text from an encoding mismatch) comes from.
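A short sketch of the same ambiguity, using Python's built-in codecs. The codec names `cp1252`, `cp1251`, and `iso8859_5` are Python's names for Windows-1252, Windows-1251, and ISO-8859-5.

```python
raw = bytes([0xE9])  # a single byte with value 233

# Decode the identical byte under three different code pages.
code_pages = ("cp1252", "cp1251", "iso8859_5")
decoded = {name: raw.decode(name) for name in code_pages}

for name, ch in decoded.items():
    # ascii() output keeps the printout safe on any terminal encoding.
    print(f"{name}: {ch!a} (U+{ord(ch):04X})")
```

The byte never changes; only the decoding rule does, and each rule yields a different character.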
Unicode — one standard for every character
Unicode was created to solve this once and for all. It assigns a unique code point (a number) to every character in every writing system — Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, emoji, mathematical symbols, ancient scripts, and more. The current version covers over 149,000 characters.
A Unicode code point is written as U+ followed by a hexadecimal number. For example, U+0041 is A, U+00E9 is é, U+20AC is €, and U+1F600 is 😀.
Unicode defines code points — the numbers. It does not specify how those numbers are stored as bytes. That is what encodings like UTF-8 and UTF-16 do.
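To make the split between code points and bytes concrete, here is a small Python sketch. `code_point_label` is a hypothetical helper written for this example.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(ch), code_point_label(ch))

# One code point (the euro sign), three byte sequences, one per encoding.
euro = "\u20ac"
euro_bytes = {enc: euro.encode(enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
print(euro_bytes)
```

The code point U+20AC is the same in every case; the bytes differ because each encoding stores that number differently.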
UTF-8 — the dominant encoding
UTF-8 encodes Unicode code points using 1 to 4 bytes per character, depending on the code point value:
| Code point range | Bytes used | Examples |
|---|---|---|
| U+0000 – U+007F | 1 byte | A, B, 0–9, punctuation (ASCII) |
| U+0080 – U+07FF | 2 bytes | é, ñ, ü, Arabic, Hebrew basics |
| U+0800 – U+FFFF | 3 bytes | Chinese, Japanese, Korean |
| U+10000 – U+10FFFF | 4 bytes | Emoji, rare scripts, math symbols |
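The ranges in the table follow from how UTF-8 packs a code point's bits into lead and continuation bytes. The sketch below is a hand-written illustration of that layout; `utf8_encode` is a hypothetical helper, and real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# One sample character from each row of the table.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Each sample should match what Python's built-in encoder produces for the same character.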
The key insight: UTF-8 is backward-compatible with ASCII. Any valid ASCII text is also valid UTF-8. This is why UTF-8 won — existing ASCII files required no conversion, and English text stayed compact at 1 byte per character.
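A quick way to see the compatibility, sketched in Python with an arbitrary sample string:

```python
text = "Plain ASCII text stays byte-for-byte identical."

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

same_bytes = as_ascii == as_utf8            # identical byte sequences
one_byte_each = len(as_utf8) == len(text)   # 1 byte per character

# Non-ASCII characters only produce bytes with the high bit set,
# so their bytes can never be mistaken for ASCII characters.
high_only = all(b >= 0x80 for b in "\u00e9\u20ac".encode("utf-8"))

print(same_bytes, one_byte_each, high_only)
```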
UTF-8 is used by over 98% of web pages today.
UTF-16 and UTF-32
UTF-16 uses 2 bytes for most characters and 4 bytes for characters above U+FFFF (like most emoji). It is used internally by Windows, Java, and JavaScript. It is not backward-compatible with ASCII: every ASCII character takes 2 bytes, one of which is a zero byte. For mostly-ASCII text that doubles the size, which makes UTF-16 a poor choice for text storage and transmission.
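The 4-byte case works by splitting the code point into two 16-bit units, known as a surrogate pair. The sketch below illustrates that arithmetic; `surrogate_pair` is a hypothetical helper written for this example.

```python
from typing import Tuple

def surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

# An ASCII letter takes 2 bytes in UTF-16, one of them zero.
ascii_a = "A".encode("utf-16-le")

# An emoji above U+FFFF takes 4 bytes: two 16-bit surrogates.
emoji = "\U0001F600"
emoji_utf16 = emoji.encode("utf-16-be")
high, low = surrogate_pair(ord(emoji))

print(ascii_a, emoji_utf16.hex(), f"{high:04X} {low:04X}")
```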
UTF-32 uses exactly 4 bytes for every character, regardless of the code point. It is simple to work with programmatically (code point N is always at byte offset N×4) but wastes space: an English sentence takes four times as many bytes as it does in ASCII.
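The fixed-width property can be sketched as follows. `char_at` is a hypothetical helper, and the sample string is arbitrary.

```python
def char_at(buf: bytes, n: int) -> str:
    """Return code point number n from a UTF-32 (little-endian) buffer."""
    start = 4 * n  # fixed width: code point n starts at byte offset n * 4
    return buf[start:start + 4].decode("utf-32-le")

text = "h\u00e9llo \U0001F600"   # 7 code points, mixed widths in UTF-8
buf = text.encode("utf-32-le")

print(len(text), len(buf))          # code points vs. bytes
print(ascii(char_at(buf, 1)))       # the accented letter
print(ascii(char_at(buf, 6)))       # the emoji
```

The same lookup on UTF-8 or UTF-16 data would require scanning from the start, because characters there have varying widths.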
Why garbled text happens
Garbled text (mojibake) occurs when bytes written in one encoding are read with a different one. Common real-world examples:
- UTF-8 read as Windows-1252: The euro sign € is E2 82 AC in UTF-8, but Windows-1252 reads those bytes as “â‚¬”.
- Missing charset declaration: A web page saved as UTF-8 but served without Content-Type: text/html; charset=utf-8. Old browsers may guess wrong.
- Database charset mismatch: A column declared as latin1 storing UTF-8 bytes. The data survives but is interpreted incorrectly on read.
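The first case can be reproduced, and sometimes reversed, in a few lines of Python. This is a sketch: the repair step only works when the wrong decoding lost no information, which is an assumption that does not hold for every byte sequence.

```python
original = "\u20ac"                      # the euro sign

utf8_bytes = original.encode("utf-8")    # b'\xe2\x82\xac'
garbled = utf8_bytes.decode("cp1252")    # wrong rule applied: three characters

# Undo the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(), ascii(garbled), ascii(repaired))
```

One character became three because each UTF-8 byte was treated as a separate Windows-1252 character.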
The fix is to use UTF-8 everywhere: in your database, your HTTP headers, your <meta charset="utf-8"> tag, and your file saves. Consistency eliminates the problem.
HTML entities
Before UTF-8 was universal, HTML used character entities to safely represent characters that might not survive encoding: &amp; for &, &lt; for <, &eacute; for é.
In modern UTF-8 HTML you can type most characters directly — entities are only strictly required for characters that have meaning in HTML syntax (<, >, &, ").
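Python's standard `html` module handles both directions. A minimal sketch with made-up sample strings:

```python
import html

# Escape the characters that have meaning in HTML syntax.
raw = '<a href="x">Tom & Jerry</a>'
escaped = html.escape(raw)   # quote=True by default, so " is escaped too
print(escaped)

# Convert named entities back into the characters they stand for.
entity_text = "caf&eacute; &amp; cr&egrave;me"
unescaped = html.unescape(entity_text)
print(ascii(unescaped))
```

In a UTF-8 page the accented letters could be typed directly; only the syntax characters still need escaping.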