Optimizing MySQL Queries for Billion-Scale Datasets: A 3-Level Architecture Approach

CodeGram Team
January 15, 2025

Article Summary

Master the art of querying 5 billion records with millisecond response times. Learn three powerful optimization levels: precision indexing with composite keys, hot-cold data separation, and horizontal database sharding. A complete guide to thinking like a database architect.

📋 Interview Scenario:

An e-commerce platform's core orders table contains 5 billion records. When users check their "shipped orders" on the "My Orders" page, response time exceeds 3 seconds, severely impacting user experience.

How would you optimize this query to achieve millisecond-level response times?

Original SQL Query:

-- Original query
SELECT *
FROM orders
WHERE user_id = 10086
  AND status = 'shipped'
ORDER BY create_time DESC
LIMIT 10;

Candidate: "Hello interviewer, my answer is: 'add an index'."

If your answer stops here, this interview will likely stop here too.

🎯 What Interviewers Really Want

In modern tech interviews, "high-performance queries with massive datasets" is practically the gold standard for measuring backend engineering capabilities.

When an interviewer presents this concrete and challenging scenario, they're not testing whether you can write SQL—they're evaluating whether you possess the architectural thinking required to build large-scale systems from the ground up.
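
For reference, the rest of this article assumes an orders table roughly like the sketch below. The column names are taken from the queries in this article; the types and remaining details are assumptions, not a definitive schema:

-- Hypothetical orders table for illustration (types and constraints are assumptions)
CREATE TABLE orders (
  order_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  order_no     VARCHAR(32)     NOT NULL,
  user_id      BIGINT UNSIGNED NOT NULL,
  status       VARCHAR(20)     NOT NULL,  -- e.g., 'pending', 'shipped', 'completed', 'cancelled'
  total_amount DECIMAL(12, 2)  NOT NULL,
  create_time  DATETIME        NOT NULL,
  PRIMARY KEY (order_id)
) ENGINE = InnoDB;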

🎯 Level 1: Precision Indexing—Surgical Optimization

Any performance problem starts with indexing. But there's a world of difference between using this "scalpel" well and using it poorly.

❌ Wrong Approach: KEY(user_id)

A junior engineer's first instinct might be: "The query has user_id in the WHERE clause, so let's add a single-column index on user_id."

This is a classic "index trap." Here's what happens with KEY(user_id):

  1. The user_id index quickly locates all orders belonging to that user (potentially hundreds or thousands of records).
  2. These order records are loaded into memory.
  3. A "filesort" operation is performed in memory, sorting these hundreds or thousands of records by create_time.
  4. The first 10 records from the sorted results are returned.

⚠️ The Bottleneck:

Steps 2 and 3 are the culprits. When a user has a large number of orders, the in-memory sorting operation consumes significant CPU and time, naturally resulting in slow queries.
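
You can verify this behavior with EXPLAIN. A minimal sketch, assuming the single-column index below (the index name is purely illustrative): the Extra column will typically report "Using filesort", which is exactly the in-memory sort described above.

-- The "index trap": a single-column index on user_id
ALTER TABLE orders ADD KEY idx_user (user_id);

-- EXPLAIN typically shows "Using where; Using filesort" in the Extra column,
-- because this index cannot satisfy ORDER BY create_time DESC
EXPLAIN
SELECT *
FROM orders
WHERE user_id = 10086
  AND status = 'shipped'
ORDER BY create_time DESC
LIMIT 10;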

✅ The Golden Index: KEY(user_id, status, create_time)

A true "scalpel" cuts straight to the point, eliminating unnecessary operations. The correct approach is to create a "golden" composite index.

🔍 Why is this order critical?

This perfectly leverages MySQL's "leftmost prefix matching principle" and the index's natural ordering. With this composite index, the database execution flow becomes:

  1. Locate the user (user_id): The composite index first locates the data region based on user_id.
  2. Filter by status (status): Within the data block where user_id matches, continue using the index to find records where status is 'shipped'.
  3. Leverage natural ordering (create_time): Because the index entries are stored in (user_id, status, create_time) order, once user_id and status are fixed, the matching entries are already sorted by create_time in the index tree. MySQL simply reads them in reverse to satisfy ORDER BY create_time DESC (or uses an explicit descending key part on MySQL 8.0+).

🚀 The Result:

The database doesn't need to perform any additional sorting operations. It simply "scans" the first 10 records in order from the index tree. This fundamentally eliminates the "filesort" performance black hole.
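
As a concrete sketch, the basic composite index looks like this (the covering-index version in the next section extends it, so in practice you would create only one of the two):

-- The "golden" composite index
ALTER TABLE orders
ADD KEY idx_user_status_time (user_id, status, create_time);

-- On MySQL 8.0+, the trailing key part can also be declared descending
-- to match ORDER BY create_time DESC exactly:
-- ADD KEY idx_user_status_time_desc (user_id, status, create_time DESC);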

🚀 Professional Level: Achieving "Covering Index"

Professional engineers go one step further. They notice the SELECT * in the original SQL. It means that even when the index is hit, the database still has to perform a "table lookup" (a back-to-table operation): for every matching entry, it takes the primary key stored in the secondary index and goes back to the clustered index to fetch the remaining columns.

To avoid the extra I/O of these table lookups, we can rewrite the SQL to select only the fields we actually need and include all of those fields in the composite index, turning it into a "covering index".

-- Optimized SQL
SELECT order_id, order_no, total_amount, create_time
FROM orders
WHERE user_id = 10086
  AND status = 'shipped'
ORDER BY create_time DESC
LIMIT 10;

-- Ultimate index
ALTER TABLE orders
ADD KEY `idx_user_status_time` (
  user_id, status, create_time, 
  order_id, order_no, total_amount
);

⚡ Ultimate Performance:

Now all data required by the query can be obtained directly from the index tree, achieving ultimate query performance.
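
A quick way to check this (a sketch, assuming the index above is in place): run EXPLAIN on the rewritten query and look for "Using index" in the Extra column, which means the result is served entirely from the index tree with no table lookups.

-- Verification sketch: "Using index" in Extra indicates a covering-index hit
EXPLAIN
SELECT order_id, order_no, total_amount, create_time
FROM orders
WHERE user_id = 10086
  AND status = 'shipped'
ORDER BY create_time DESC
LIMIT 10;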

❄️🔥 Level 2: Hot-Cold Data Separation—Dimensional Reduction Attack

Index optimization can improve performance by 10x, but when facing a base of 5 billion records, achieving millisecond-level response requires a "dimensional reduction attack"—reducing the size of the queried dataset itself.

🔍 Business Perspective: Understanding Data Temperatures

From a business standpoint, a user's orders can be divided into "hot data" and "cold data":

🔥 Hot Data

Orders users care about recently: "pending payment," "shipped," "in progress."

  • Small volume but high query frequency
  • Recent orders (last 3-6 months)
  • Active status orders

❄️ Cold Data

Orders from months or even years ago: "completed," "cancelled."

  • Accounts for 99%+ of total volume
  • Rarely queried
  • Historical/archived orders

⚠️ The Problem: Mixing these two data types in one table forces every query to search through a "haystack" of 5 billion records, which is clearly inefficient.

✅ The Solution: Archive Strategy

  1. Create an archive table, orders_archive, with a structure identical to the main orders table.
  2. Set up a scheduled task (e.g., a nightly batch job) to move orders from the orders table to orders_archive where:
    • create_time is older than 3 months
    • status is "completed" or "cancelled"

🎯 The Impact:

Through this approach, the main orders table's data volume can dramatically drop from 5 billion to 100 million or even tens of millions.

Querying this "small and focused" hot-data table is in a completely different performance dimension than querying a 5-billion-record table.

-- Example: Archive old orders
INSERT INTO orders_archive 
SELECT * FROM orders
WHERE create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH)
  AND status IN ('completed', 'cancelled');

-- Then delete from main table
DELETE FROM orders
WHERE create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH)
  AND status IN ('completed', 'cancelled');
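
On a 5-billion-record table, moving rows in a single pair of statements like this would hold locks for a long time and generate enormous transactions. A hedged sketch of a safer approach, assuming order_id is the primary key, is to archive in small, repeated batches:

-- Batched archival sketch: run repeatedly from the scheduled job until 0 rows are affected
START TRANSACTION;

INSERT INTO orders_archive
SELECT * FROM orders
WHERE create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH)
  AND status IN ('completed', 'cancelled')
ORDER BY order_id
LIMIT 1000;

DELETE FROM orders
WHERE create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH)
  AND status IN ('completed', 'cancelled')
ORDER BY order_id
LIMIT 1000;

COMMIT;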

⚔️ Level 3: Database Sharding—The Ultimate Weapon

As business continues to grow, even a hot data table with 100 million records will eventually hit throughput bottlenecks (TPS) for single-table write operations. At this point, you must deploy the ultimate weapon of distributed databases—database sharding.

🎯 Core Principle: Choosing the Right Sharding Key

Our query's core is user_id—all operations revolve around "a specific user."

Therefore, the sharding key must be user_id

📋 Implementation Plan:

1️⃣ Plan Your Shards

Let's say we plan for 1,024 database shards (logical databases, which can initially be hosted on far fewer physical instances).

2️⃣ Routing Rules

When a request arrives, the routing layer (such as Sharding-Sphere middleware or custom routing) performs a hash calculation on user_id and takes the modulo with the total number of shards to determine which database the request should be routed to.

shard_db_index = hash(user_id) % 1024
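
In practice this calculation lives in the routing middleware or application code, but it is simple enough to sketch in SQL, assuming CRC32 as the hash function:

-- Routing sketch: compute the shard index for user_id = 10086
-- (assumes CRC32 as the hash; real systems do this in Sharding-Sphere or application code)
SELECT CRC32(10086) % 1024 AS shard_db_index;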

3️⃣ Data Isolation

Through this rule, ensure that all order data for the same user always resides in the same physical database.

🚀 The Result:

Now, a query for "shipped orders for user_id = 10086" will be precisely routed by the routing layer to the database storing that user's data, for example, db_512.

✨ The Transformation:

The grand problem of "querying within 5 billion records" is cleverly transformed into the simple problem of "querying within db_512 (which may only contain a few hundred thousand records)".

At a scale of hundreds of thousands of records, combined with our Level 1 optimized "golden index," achieving millisecond-level response is naturally straightforward.

⚠️ Important Considerations:

  • Cross-shard queries: Queries not based on user_id become complex—avoid if possible or use alternative solutions (e.g., separate reporting databases)
  • Distributed transactions: Require careful design and implementation
  • Data migration: Plan shard count carefully; resharding later is extremely painful
  • Consistent hashing: Consider using consistent hashing for better scalability

🎓 Conclusion: The Architect's Mental Leap

When facing the challenge of "querying 5 billion orders with millisecond response times," a complete answer should demonstrate clear layered thinking and architectural mindset:

1️⃣ Index Layer (Micro-optimization)

Design the (user_id, status, create_time) composite index and extend it into a covering index to solve SQL execution efficiency issues, eliminating sorting and table lookups. This is the foundation—the "scalpel."

2️⃣ Data Layer (Meso-optimization)

Implement hot-cold data separation to dramatically reduce the target dataset for core queries. This is the game-changing "dimensional reduction attack."

3️⃣ Architecture Layer (Macro-design)

Shard horizontally by user_id to distribute massive pressure across countless small units, providing virtually unlimited horizontal scalability. This is the decisive "ultimate weapon."

🎯 The Final Verdict:

With this combination punch, the interviewer will no longer see just an engineer who can write SQL, but someone who can:

  • Understand business requirements deeply
  • Master data management at scale
  • Design high-concurrency systems
  • Think like a future architect

This is the true "perfect answer."

📊 Performance Comparison Summary

Optimization Level | Technique                         | Data Scale         | Query Time  | Complexity
Baseline           | No optimization                   | 5 billion          | 3+ seconds  | Low
Level 1            | Composite + Covering Index        | 5 billion          | 100-300 ms  | Low
Level 2            | Hot-Cold Separation               | 100 million (hot)  | 10-50 ms    | Medium
Level 3            | Database Sharding (1,024 shards)  | ~100K per shard    | 1-10 ms     | High

* Times are approximate and depend on hardware, network conditions, and specific query patterns. The key insight is the relative improvement at each optimization level.

💡 Key Takeaways

✅ Best Practices

  • Always start with index optimization
  • Use composite indexes in the right order
  • Separate hot and cold data early
  • Plan for sharding before you need it
  • Monitor query performance continuously

❌ Common Mistakes

  • Adding single-column indexes blindly
  • Ignoring index order in composite indexes
  • Using SELECT * unnecessarily
  • Waiting too long to archive old data
  • Sharding without considering query patterns
