IceBearSoft

Josué Pérez

Creating a database is often treated as a trivial step. In reality, the decisions made here—especially encoding and collation—tend to stay with a system for years.

One of the most common sources of long-term issues in MySQL-based systems is choosing the wrong character set early on.

Options at a glance

Option	Example	When to use	Notes
Modern default	`CREATE DATABASE mydb;`	Quick setups, dev environments, trusted server defaults	Works well on modern MySQL, but defaults may vary.
Modern explicit (recommended)	`CHARACTER SET utf8mb4` `COLLATE utf8mb4_0900_ai_ci`	Production systems, teams, multi-service environments	Most predictable and future-proof choice.
Legacy explicit (deprecated)	`CHARACTER SET utf8` `COLLATE utf8_general_ci`	Compatibility with legacy systems only	`utf8` is not full Unicode (3-byte limit).
Safe provisioning	`CREATE DATABASE IF NOT EXISTS mydb;`	CI/CD pipelines, automation scripts	Prevents errors if the database already exists.
Reset (destructive)	`DROP DATABASE IF EXISTS mydb;`	Local dev and disposable CI environments	Never use in production without safeguards.

Modern approach (recommended)

Use utf8mb4 for full Unicode support:

                            CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;

Why utf8mb4 matters

Supports all Unicode characters (including emojis)
Prevents subtle data and comparison issues
Works better with modern frameworks and integrations
Avoids costly migrations later

Legacy approach (not recommended)

Historically, many MySQL databases were created using utf8:

                            CREATE DATABASE mydb
CHARACTER SET utf8
COLLATE utf8_general_ci;

At first glance, this looks correct. The problem is that MySQL’s utf8 is not full UTF-8. It only supports up to 3 bytes per character, excluding many valid Unicode characters.

A real-world problem

A system can run for years without issues—until a real-world input arrives:

A user enters their name with an emoji
A third-party API sends modern Unicode characters
Content includes symbols outside the 3-byte range

Then inserts fail, or data gets corrupted/truncated. At that point, fixing encoding becomes costly.

Perspective: Encoding decisions made on day one often survive multiple rewrites and teams. Choosing the modern standard early is cheaper than fixing it later.

Database Creation: Legacy vs Modern Approach

Options at a glance

Modern approach (recommended)

Why utf8mb4 matters

Legacy approach (not recommended)

A real-world problem