ASCII and Character Encodings Explained — ASCII, Unicode, and UTF-8 | DataToolkit

The core problem

Computers store everything as numbers. A character encoding is the agreed-upon rule that says “number 65 means the letter A”. Without a shared encoding, the same sequence of bytes looks like completely different text depending on which rule you apply.

ASCII — where it started

ASCII (American Standard Code for Information Interchange) was first published in 1963; lowercase letters were added in the 1967 revision. It maps 128 characters to the numbers 0–127 using 7 bits. That covers:

Uppercase letters A–Z (65–90)
Lowercase letters a–z (97–122)
Digits 0–9 (48–57)
Punctuation: space, ! " # $ % & ' ( ) * + , - . / and more
Control characters (0–31 and 127): newline (line feed), tab, carriage return, null, delete, etc.

ASCII was designed for English. It has no accented characters, no non-Latin scripts, no emoji. The 8th bit (values 128–255) was left undefined — different systems used it for different things, which created chaos when exchanging text internationally.
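The ranges listed above can be checked directly in Python with the built-in ord() and chr(). This is a minimal sketch; the helper name fits_in_ascii is made up for illustration and is not part of any library.

```python
# Code points for the ASCII ranges listed above, checked with ord().
assert ord("A") == 65 and ord("Z") == 90
assert ord("a") == 97 and ord("z") == 122
assert ord("0") == 48 and ord("9") == 57
assert ord("\n") == 10 and ord("\t") == 9   # two of the control characters

ascii_bytes = "Hello".encode("ascii")       # every byte is below 128
print(list(ascii_bytes))                    # [72, 101, 108, 108, 111]


def fits_in_ascii(text: str) -> bool:
    """Hypothetical helper: True if text can be encoded as plain ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("cafe"))        # True
print(fits_in_ascii("caf\u00e9"))   # False: the accented e is outside ASCII
```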


The code page era and why it failed

To handle non-English text, vendors created code pages — extended versions of ASCII that used the 128 extra values (128–255) for language-specific characters. Windows-1252 used them for Western European accents. ISO-8859-5 used them for Cyrillic. Shift-JIS used an entirely different scheme for Japanese.

The problem: the same byte value meant different characters in different code pages. Byte 0xE9 is “é” in Windows-1252 but “й” in Windows-1251 (Cyrillic) and “щ” in ISO-8859-5. Files had to carry metadata about which code page they used, and when that metadata was lost or ignored, text became garbled. This is where mojibake (garbled text from encoding mismatch) comes from.
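To see the ambiguity concretely, the sketch below decodes the single byte 0xE9 under three code pages. The codec names (cp1252, cp1251, iso8859_5) are Python's standard-library names for Windows-1252, Windows-1251, and ISO-8859-5; the variable names are illustrative only.

```python
raw = b"\xe9"  # one byte, with no encoding information attached

# Decode the same byte under three different code pages.
decoded = {codec: raw.decode(codec) for codec in ("cp1252", "cp1251", "iso8859_5")}

for codec, char in decoded.items():
    print(f"{codec:10} -> {char}  (U+{ord(char):04X})")
```

Each codec yields a different character from the identical byte, which is why the code page had to travel with the file.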

Unicode — one standard for every character

Unicode was created to solve this once and for all. It aims to assign a unique code point (a number) to every character in every writing system: Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, emoji, mathematical symbols, ancient scripts, and more. Unicode 15.1 (2023) defines over 149,000 characters, and later versions add more.

A Unicode code point is written as U+ followed by a hex number. For example:

Code point	Character	Name
U+0041	A	Latin capital letter A
U+00E9	é	Latin small letter e with acute
U+4E2D	中	CJK unified ideograph (middle/China)
U+1F600	😀	Grinning face emoji

Unicode defines code points — the numbers. It does not specify how those numbers are stored as bytes. That is what encodings like UTF-8 and UTF-16 do.
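As a small illustration, Python exposes code points through ord() and chr(). The function code_point_label below is a hypothetical helper written for this sketch, not a standard API.

```python
def code_point_label(char: str) -> str:
    """Format one character's code point in U+XXXX notation (hypothetical helper)."""
    return f"U+{ord(char):04X}"


for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(code_point_label(ch), ch)

# chr() goes the other way: from the number back to the character.
assert chr(0x41) == "A"
assert chr(0x4E2D) == "\u4e2d"
```

Note that nothing here involves bytes yet: a code point is just an integer until an encoding is chosen.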

UTF-8 — the dominant encoding

UTF-8 encodes Unicode code points using 1 to 4 bytes per character, depending on the code point value:

Code point range	Bytes used	Examples
U+0000 – U+007F	1 byte	A, B, 0–9, punctuation (ASCII)
U+0080 – U+07FF	2 bytes	é, ñ, ü, Arabic, Hebrew basics
U+0800 – U+FFFF	3 bytes	Chinese, Japanese, Korean
U+10000 – U+10FFFF	4 bytes	Most emoji, rare and historic scripts, mathematical alphanumeric symbols

The key insight: UTF-8 is backward-compatible with ASCII. Any valid ASCII text is also valid UTF-8. This is why UTF-8 won — existing ASCII files required no conversion, and English text stayed compact at 1 byte per character.
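The byte counts in the table and the ASCII compatibility claim can both be checked with str.encode(). This is a short sketch with illustrative variable names.

```python
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, e-acute, CJK "middle", grinning face

# Number of UTF-8 bytes needed for each sample character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# ASCII text produces byte-for-byte identical output in both encodings.
text = "Plain ASCII text"
assert text.encode("ascii") == text.encode("utf-8")

# The three bytes of U+4E2D, shown in hex.
print("\u4e2d".encode("utf-8").hex(" "))  # e4 b8 ad
```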

According to W3Techs usage surveys, over 98% of websites with a known character encoding use UTF-8.

UTF-16 and UTF-32

UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for characters above U+FFFF, such as most emoji. It is used internally by Windows, Java, and JavaScript. It is not backward-compatible with ASCII: every ASCII character takes 2 bytes, one of which is a zero byte. That doubles the size of mostly-ASCII text and makes UTF-16 a poor choice for text storage and transmission.
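A quick sketch of both behaviours in Python follows. It uses the utf-16-le codec (little-endian, no byte order mark) so the output contains only the character's own bytes; the plain utf-16 codec would prepend a 2-byte BOM.

```python
ascii_char = "A".encode("utf-16-le")
print(ascii_char.hex(" "))      # 41 00  (the ASCII value plus a zero byte)

emoji = "\U0001F600".encode("utf-16-le")
print(len(emoji))               # 4
print(emoji.hex(" "))           # 3d d8 00 de  (surrogates D83D and DE00, little-endian)

# A plain ASCII string doubles in size compared with UTF-8.
text = "Hello"
print(len(text.encode("utf-8")), len(text.encode("utf-16-le")))  # 5 10
```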

UTF-32 uses exactly 4 bytes for every code point. It is simple to work with programmatically (code point N is always at byte offset N×4) but wastes space: an English sentence takes four times as many bytes as it does in ASCII.
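The fixed-width property can be sketched as follows. The function char_at is a hypothetical helper for this example, and utf-32-le is used so that no byte order mark shifts the offsets.

```python
text = "Hi \U0001F600!"                 # 5 code points: H, i, space, emoji, !
data = text.encode("utf-32-le")

# Every code point occupies exactly 4 bytes.
assert len(data) == 4 * len(text)


def char_at(buf: bytes, n: int) -> str:
    """Hypothetical helper: return code point n by slicing at byte offset n * 4."""
    return buf[n * 4:(n + 1) * 4].decode("utf-32-le")


print(char_at(data, 0))  # H
print(char_at(data, 3))  # the emoji, found without scanning earlier characters
```

The same lookup in UTF-8 or UTF-16 requires walking the bytes from the start, because earlier characters may have different widths.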

Why garbled text happens

Garbled text (mojibake) occurs when bytes written in one encoding are read with a different one. Common real-world examples:

UTF-8 read as Windows-1252: The euro sign € is E2 82 AC in UTF-8 but Windows-1252 reads those bytes as “â‚¬”.
Missing charset declaration: A web page saved as UTF-8 but served without Content-Type: text/html; charset=utf-8 — old browsers may guess wrong.
Database charset mismatch: A column declared as latin1 storing UTF-8 bytes — the data survives but is interpreted incorrectly on read.
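The first example above can be reproduced, and in this simple case reversed, with a few lines of Python. This is a sketch only: the reverse step works when every garbled character maps back to a byte in Windows-1252, which is not guaranteed for arbitrary damaged text.

```python
original = "\u20ac"                         # the euro sign
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" "))                  # e2 82 ac

# Reading UTF-8 bytes with the wrong codec produces mojibake.
garbled = utf8_bytes.decode("cp1252")
print(garbled)                              # three characters instead of one

# Undo the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == original
```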

Fix: Use UTF-8 everywhere — in your editor, your database, your HTTP headers, your HTML <meta charset="utf-8"> tag, and your file saves. Consistency eliminates the problem.
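In code, "UTF-8 everywhere" mostly means naming the encoding explicitly rather than relying on a platform default. A minimal Python sketch, using a temporary directory and a made-up file name:

```python
import os
import tempfile

text = "Prix : 5 \u20ac (caf\u00e9)"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")   # hypothetical file name

    # Always state the encoding when writing...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and state the same one when reading.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

assert round_tripped == text
```

Omitting the encoding argument lets Python fall back to a locale-dependent default on some platforms, which is one common way mismatches creep in.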

HTML entities

Before UTF-8 was universal, HTML used character entities to safely represent characters that might not survive encoding: &amp; for &, &lt; for <, &eacute; for é.

In modern UTF-8 HTML you can type most characters directly — entities are only strictly required for characters that have meaning in HTML syntax (<, >, &, ").
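Python's standard html module handles both directions, which makes for a quick illustration. The sample markup is invented for this sketch.

```python
import html

# Escape the characters that have meaning in HTML syntax.
escaped = html.escape('<a href="x">Tom & Jerry</a>')
print(escaped)
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;

# Turn named entities back into the characters they stand for.
unescaped = html.unescape("&eacute; &amp; &lt;")
print(unescaped)  # é & <
```

Characters such as é pass through html.escape() unchanged, since they need no entity in a UTF-8 document.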

