Creating a database is often treated as a trivial step. In reality, the decisions made here—especially encoding and collation—tend to stay with a system for years.
One of the most common sources of long-term issues in MySQL-based systems is choosing the wrong character set early on.
Options at a glance
| Option | Example | When to use | Notes |
|---|---|---|---|
| Modern default | CREATE DATABASE mydb; |
Quick setups, dev environments, trusted server defaults | Works well on modern MySQL, but defaults may vary. |
| Modern explicit (recommended) |
CHARACTER SET utf8mb4COLLATE utf8mb4_0900_ai_ci
|
Production systems, teams, multi-service environments | Most predictable and future-proof choice. |
| Legacy explicit (deprecated) |
CHARACTER SET utf8COLLATE utf8_general_ci
|
Compatibility with legacy systems only | utf8 is not full Unicode (3-byte limit). |
| Safe provisioning | CREATE DATABASE IF NOT EXISTS mydb; |
CI/CD pipelines, automation scripts | Prevents errors if the database already exists. |
| Reset (destructive) | DROP DATABASE IF EXISTS mydb; |
Local dev and disposable CI environments | Never use in production without safeguards. |
Modern approach (recommended)
Use utf8mb4 for full Unicode support:
CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
Why utf8mb4 matters
- Supports all Unicode characters (including emojis)
- Prevents subtle data and comparison issues
- Works better with modern frameworks and integrations
- Avoids costly migrations later
Legacy approach (not recommended)
Historically, many MySQL databases were created using utf8:
CREATE DATABASE mydb
CHARACTER SET utf8
COLLATE utf8_general_ci;
At first glance, this looks correct. The problem is that MySQL’s utf8 is not full UTF-8. It only supports up to 3 bytes per character, excluding many valid Unicode characters.
A real-world problem
A system can run for years without issues—until a real-world input arrives:
- A user enters their name with an emoji
- A third-party API sends modern Unicode characters
- Content includes symbols outside the 3-byte range
Then inserts fail, or data gets corrupted/truncated. At that point, fixing encoding becomes costly.
Perspective: Encoding decisions made on day one often survive multiple rewrites and teams. Choosing the modern standard early is cheaper than fixing it later.
.png)