Database Creation: Legacy vs Modern Approach

Josué Pérez

Creating a database is often treated as a trivial step. In reality, the decisions made here—especially encoding and collation—tend to stay with a system for years.

One of the most common sources of long-term issues in MySQL-based systems is choosing the wrong character set early on.

Options at a glance

Option Example When to use Notes
Modern default CREATE DATABASE mydb; Quick setups, dev environments, trusted server defaults Works well on modern MySQL, but defaults may vary.
Modern explicit (recommended) CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci
Production systems, teams, multi-service environments Most predictable and future-proof choice.
Legacy explicit (deprecated) CHARACTER SET utf8
COLLATE utf8_general_ci
Compatibility with legacy systems only utf8 is not full Unicode (3-byte limit).
Safe provisioning CREATE DATABASE IF NOT EXISTS mydb; CI/CD pipelines, automation scripts Prevents errors if the database already exists.
Reset (destructive) DROP DATABASE IF EXISTS mydb; Local dev and disposable CI environments Never use in production without safeguards.

Modern approach (recommended)

Use utf8mb4 for full Unicode support:

                            CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
                        

Why utf8mb4 matters

  • Supports all Unicode characters (including emojis)
  • Prevents subtle data and comparison issues
  • Works better with modern frameworks and integrations
  • Avoids costly migrations later

Legacy approach (not recommended)

Historically, many MySQL databases were created using utf8:

                            CREATE DATABASE mydb
CHARACTER SET utf8
COLLATE utf8_general_ci;
                        

At first glance, this looks correct. The problem is that MySQL’s utf8 is not full UTF-8. It only supports up to 3 bytes per character, excluding many valid Unicode characters.

A real-world problem

A system can run for years without issues—until a real-world input arrives:

  • A user enters their name with an emoji
  • A third-party API sends modern Unicode characters
  • Content includes symbols outside the 3-byte range

Then inserts fail, or data gets corrupted/truncated. At that point, fixing encoding becomes costly.

Perspective: Encoding decisions made on day one often survive multiple rewrites and teams. Choosing the modern standard early is cheaper than fixing it later.