Understanding Text Encoding: The Missing Manual

Why special characters break things and how to fix them—a practical guide to character encoding for developers.



The Problem You've Definitely Encountered

You've seen it before: garbled text like Ã© instead of é, or mysterious characters appearing in your database. Users complain that their names display wrong. Your CSV import breaks on the 500th row. An API returns ??? instead of emoji.

These aren't random bugs—they're encoding mismatches. Understanding encoding eliminates an entire category of head-scratching issues.

Characters Are Not Bytes

Here's the fundamental insight: text is an abstraction. Computers store bytes, not characters. An encoding is the mapping between the two.

The letter A isn't inherently byte 65. That's ASCII's decision. In EBCDIC (used on IBM mainframes), A is byte 193. Same letter, different byte representation.

This means the same bytes can represent completely different text depending on which encoding you use to interpret them:

| Bytes (hex) | UTF-8                | Latin-1           | Windows-1252 |
|-------------|----------------------|-------------------|--------------|
| C3 A9       | é                    | Ã©                | Ã©           |
| E2 82 AC    | €                    | â + (control) + ¬ | â‚¬          |
| C2 A0       | (non-breaking space) | Â + NBSP          | Â + NBSP     |

When you see Ã© instead of é, someone interpreted UTF-8 bytes as Latin-1. The bytes are correct—the interpretation is wrong.
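The ambiguity is easy to demonstrate in Python: decode the same two bytes with two different codecs and you get two different strings.

```python
raw = b"\xc3\xa9"  # the two bytes C3 A9

# Interpreted as UTF-8, the pair forms a single character
print(raw.decode("utf-8"))    # é

# Interpreted as Latin-1, each byte is its own character
print(raw.decode("latin-1"))  # Ã©
```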

The Encodings You Need to Know

ASCII (1963)

The grandfather of encodings. Maps 128 characters (bytes 0-127) including English letters, numbers, punctuation, and control characters.

Limitations: Only covers English. The byte range 128-255 is undefined.

Latin-1 / ISO-8859-1 (1985)

Extended ASCII to 256 characters (bytes 0-255), adding Western European characters like é, ñ, ü.

Still used in: Legacy systems, some HTTP headers, older databases.

Windows-1252 (1990s)

Microsoft's extension of Latin-1, slightly different in the 128-159 range. Contains smart quotes, em dashes, and € that Latin-1 lacks.

The bug factory: When systems assume Latin-1 but receive Windows-1252, bytes like 0x92 (the curly quote ’) fall into Latin-1's control-code range and render as blanks, boxes, or �.

UTF-8 (1993)

The modern standard. Can represent every Unicode character (over 140,000) using 1-4 bytes:

  • ASCII characters (0-127): 1 byte (same as ASCII—clever backwards compatibility)
  • Accented Latin, Greek, Cyrillic, and similar alphabetic scripts: 2 bytes
  • Most CJK (Chinese, Japanese, Korean) characters: 3 bytes
  • Emoji and rare scripts: 4 bytes

This is what you should use. UTF-8 is the default for HTML5, JSON, most programming languages, and modern databases.
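You can verify the byte counts in Python, since `str.encode` returns the raw UTF-8 bytes:

```python
# Each character's UTF-8 length depends on its code point
for ch in ["A", "é", "語", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")  # 1, 2, 3, 4 respectively
```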

UTF-16 (1996)

Uses 2 bytes for common characters, 4 bytes for rare ones. JavaScript strings and Windows internals use UTF-16.

Gotcha: "String length" in JavaScript is UTF-16 code units, not characters. "😀".length equals 2, not 1.
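Python's `len()` counts code points rather than UTF-16 code units, but you can reproduce the JavaScript number by encoding to UTF-16 and counting 2-byte units:

```python
s = "😀"
print(len(s))  # 1: Python's len() counts code points

# JavaScript's .length counts UTF-16 code units instead;
# this emoji is a surrogate pair, i.e. two 2-byte units
units = len(s.encode("utf-16-le")) // 2
print(units)  # 2
```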

Real-World Encoding Scenarios

Web Pages

<!-- Always declare encoding in HTML -->
<meta charset="UTF-8">

If missing, browsers guess—often wrong. This is why some old sites show garbage characters.

HTTP headers also matter:

Content-Type: text/html; charset=UTF-8

If the header says Latin-1 but the content is UTF-8, the browser follows the header (wrong interpretation).

Databases

-- MySQL: Set database default
CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- MySQL: Set table default
CREATE TABLE users (
  name VARCHAR(100)
) CHARACTER SET utf8mb4;

-- PostgreSQL: Set database encoding
CREATE DATABASE myapp ENCODING 'UTF8';

Critical MySQL note: utf8 in MySQL is an alias for the legacy utf8mb3: it stores at most 3 bytes per character, so emoji and other 4-byte characters fail. Always use utf8mb4.

Files

# Python: Always specify encoding for file operations
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Without encoding specified, Python uses system default
# Windows: cp1252, Mac/Linux: usually UTF-8
# This causes "works on my machine" bugs

// Node.js
const fs = require('fs');
const content = fs.readFileSync('data.txt', { encoding: 'utf-8' });

// Browser FileReader
reader.readAsText(file, 'UTF-8');

APIs and JSON

JSON text exchanged between systems must be UTF-8 (RFC 8259; older specs also allowed UTF-16 and UTF-32). In practice:

// Requests
fetch('/api/data', {
  headers: {
    'Content-Type': 'application/json; charset=UTF-8'
  },
  body: JSON.stringify(data)
});

// JSON itself handles Unicode
{"name": "José", "emoji": "👍"} // This is valid JSON
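In Python, `json.dumps` escapes non-ASCII characters by default; pass `ensure_ascii=False` to emit them as UTF-8 (a small sketch of both behaviors):

```python
import json

data = {"name": "José", "emoji": "👍"}

ascii_safe = json.dumps(data)               # é becomes \u00e9, emoji becomes surrogate escapes
raw = json.dumps(data, ensure_ascii=False)  # characters kept as-is

print(ascii_safe)
print(raw)  # {"name": "José", "emoji": "👍"}
```

Both forms parse back to the same data; the escaped form is simply safe to ship over ASCII-only channels.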

URLs

URLs can only contain ASCII characters. Everything else must be encoded:

Original:  https://example.com/search?q=café résumé
Encoded:   https://example.com/search?q=caf%C3%A9%20r%C3%A9sum%C3%A9

The %C3%A9 is é encoded as UTF-8 (C3 A9) then percent-encoded.
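Python's standard library does the same UTF-8-then-percent-encode dance via `urllib.parse`:

```python
from urllib.parse import quote, unquote

print(quote("café"))         # caf%C3%A9
print(unquote("caf%C3%A9"))  # café

# quote() leaves "/" unescaped by default; pass safe="" to encode it too
print(quote("a/b", safe=""))  # a%2Fb
```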

Two different URL encoding functions:

// encodeURIComponent: Encode special characters
encodeURIComponent("café") // "caf%C3%A9"

// encodeURI: Keep URL structure intact
encodeURI("https://example.com/path?q=café")
// "https://example.com/path?q=caf%C3%A9"


Email

Email headers (RFC 2822, since superseded by RFC 5322) only support ASCII. Non-ASCII requires MIME encoded-word syntax:

Subject: =?UTF-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?=

That's Base64-encoded UTF-8. Your email library handles this—until it doesn't.
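If you ever need to decode such a header yourself, Python's `email.header` module handles RFC 2047 encoded-words (the subject above decodes to Japanese text):

```python
from email.header import decode_header, make_header

subject = "=?UTF-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?="

# decode_header() splits the encoded-words; make_header() reassembles them
decoded = str(make_header(decode_header(subject)))
print(decoded)  # 日本語のメール
```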

CSV Files

CSV has no standard encoding. Excel on Windows typically writes Windows-1252 (or another locale-dependent code page), while newer Excel versions and most other tools write UTF-8. When exchanging CSV files:

# Specify encoding when reading/writing CSV
import csv

# Reading: try UTF-8 first, fall back to Latin-1.
# Decode errors surface while iterating, so read inside the `with`.
def read_rows(path):
    for enc in ('utf-8', 'latin-1'):
        try:
            with open(path, newline='', encoding=enc) as f:
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue

rows = read_rows('data.csv')

# Writing: use UTF-8 with BOM for Excel compatibility
with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

The utf-8-sig encoding adds a BOM (Byte Order Mark) that tells Excel "this is UTF-8."
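You can inspect the BOM directly: `utf-8-sig` prepends the three bytes EF BB BF when encoding and strips them again when decoding.

```python
with_bom = "café".encode("utf-8-sig")

print(with_bom[:3])                  # b'\xef\xbb\xbf', the UTF-8 BOM
print(with_bom.decode("utf-8-sig"))  # café, BOM stripped on decode
```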

Base64: When You Need Binary-Safe Text

Sometimes you need to transmit binary data through text-only channels (email attachments, JSON payloads, URLs). Base64 converts binary to a 64-character alphabet:

A-Z (26) + a-z (26) + 0-9 (10) + +/ (2) = 64 characters

Every 3 bytes become 4 characters (33% size increase).

// Encoding
btoa("Hello") // "SGVsbG8="

// Decoding
atob("SGVsbG8=") // "Hello"

// For Unicode text, encode to UTF-8 first
// (classic escape/unescape trick; TextEncoder/TextDecoder is the modern route)
const encoded = btoa(unescape(encodeURIComponent("Héllo")));
const decoded = decodeURIComponent(escape(atob(encoded)));

URL-safe Base64: Replaces + with - and / with _ to avoid URL encoding issues.
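Python's `base64` module exposes both alphabets, and shows the 3-bytes-to-4-characters expansion:

```python
import base64

print(base64.b64encode(b"Hello"))  # b'SGVsbG8=' (5 bytes -> 8 chars, padded with '=')

# Bytes whose standard encoding contains '+' or '/'
raw = b"\xfb\xff\x00"
print(base64.b64encode(raw))          # b'+/8A'
print(base64.urlsafe_b64encode(raw))  # b'-_8A'
```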

Common uses:

  • Data URIs: data:image/png;base64,iVBORw0KGgo...
  • JWT tokens: Header and payload are Base64-encoded JSON
  • Basic auth headers: Authorization: Basic dXNlcjpwYXNz
  • Embedding binary in JSON (images, files)


Diagnosing Encoding Problems

Symptom 1: Garbled Text (Mojibake)

Ã© instead of é = UTF-8 interpreted as Latin-1

Fix: Ensure consistent UTF-8 throughout:

-- Check database encoding
SHOW VARIABLES LIKE 'character_set%';

-- Check table encoding
SHOW CREATE TABLE users;
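If the mojibake has already been stored, you can sometimes repair it by reversing the wrong interpretation, assuming the bytes survived the round trip intact:

```python
garbled = "Ã©"  # UTF-8 bytes of "é" that were mis-decoded as Latin-1

# Re-encode with the codec that was wrongly applied to recover the
# original bytes, then decode them as what they really are: UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # é
```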

Symptom 2: Question Marks

??? or � = Character doesn't exist in target encoding

Fix: The receiving system can't represent the character. Use UTF-8.

Symptom 3: Empty or Truncated Text

Data disappears after certain characters = NULL byte or encoding error

Fix: Check for \x00 in your data, ensure binary-safe handling.

Symptom 4: Different Results on Different Machines

"Works on my machine" = Locale-dependent default encoding

Fix: Explicitly specify encoding everywhere:

# Reconfigure standard output to UTF-8 (Python 3.7+)
import sys
sys.stdout.reconfigure(encoding='utf-8')

# Or set an environment variable before launching Python
export PYTHONIOENCODING=utf-8

Debugging Tools

Hex Dump

See the actual bytes:

# Linux/Mac
echo -n "café" | xxd
# 63 61 66 c3 a9 (UTF-8: c3 a9 = é)

echo -n "café" | iconv -t LATIN1 | xxd
# 63 61 66 e9 (Latin-1: e9 = é)

Identify Encoding

# Python chardet library
import chardet
with open('mystery.txt', 'rb') as f:
    result = chardet.detect(f.read())
    print(result)  # {'encoding': 'utf-8', 'confidence': 0.99}

Convert Encoding

# Using iconv
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt

# List available encodings
iconv -l

Best Practices Summary

  1. Default to UTF-8 for everything: files, databases, APIs, HTML

  2. Declare encoding explicitly—never rely on defaults:

    • HTML: <meta charset="UTF-8">
    • HTTP: Content-Type: text/html; charset=UTF-8
    • Files: open(file, encoding='utf-8')
    • Database: utf8mb4 in MySQL, UTF8 in PostgreSQL
  3. Validate at boundaries—when data enters your system (user input, file uploads, API calls), validate and normalize encoding

  4. Use encoding-aware tools—your JSON formatter, text editor, and database client should all be configured for UTF-8

  5. Test with real internationalized data—include names like José García, 田中太郎, and مريم in your test data, plus emoji 🎉

  6. Handle encoding errors gracefully:

    # Replace undecodable bytes instead of crashing
    text = data.decode('utf-8', errors='replace')
    

Encoding issues are frustrating because they're invisible until they cause problems. But once you understand the underlying principles—bytes versus characters, encoding as mapping, UTF-8 as the universal standard—they become entirely predictable and fixable.


Last updated: January 2025
