I found, the hard way of course, a very important difference between SQL Server and MySQL. Among my large list of favorite things to do is tuning queries for better performance — the bigger and uglier, the better. I frequently replace correlated subqueries (a.k.a. 1+n queries) with joins to derived tables. Here’s an example using a fictitious table named users that has a one-to-many relationship with another table named rewards in which we want to return an email address and the total number of “rewards points” for users in California during Jan, Feb, and Mar 2009:
select
email,
(select sum(points) from rewards where user_id = users.id
and award_date between '1/1/2009' and '1/31/2009') jan_09_points,
(select sum(points) from rewards where user_id = users.id
and award_date between '2/1/2009' and '2/28/2009') feb_09_points,
(select sum(points) from rewards where user_id = users.id
and award_date between '3/1/2009' and '3/31/2009') mar_09_points
from users
where state = 'CA';
If we have 1000 users in California, then this “query” will really be 3001 queries (1 query on the users table, 1000 sum queries on each of the subqueries to the rewards table).
A better way to write this query is to use derived tables:
select email, tp_jan_09.total_points jan_09_points, tp_feb_09.total_points feb_09_points, tp_mar_09.total_points mar_09_points from users inner join (select user_id, sum(points) total_points from rewards where tp_jan_09.award_date between '1/1/2009' and '1/31/2009' group by user_id ) tp_jan_09 on tp_jan_09.user_id = users.id inner join (select user_id, sum(points) total_points from rewards where tp_feb_09.award_date between '2/1/2009' and '2/28/2009' group by user_id ) tp_feb_09 on tp_feb_09.user_id = users.id inner join (select user_id, sum(points) total_points from rewards where tp_mar_09.award_date between '3/1/2009' and '3/31/2009' group by user_id ) tp_mar_09 on tp_mar_09.user_id = users.id where state = 'CA';
Believe it or not, you get an enormous improvement in performance by replacing the correlated subquery with the derived table — in SQL Server. I am talking about production data — tens of thousands of records in the users table and hundreds of thousands or millions of records in the rewards table; not a development environment with a dozen records.
SQL Server’s Treatment of Derived Tables
SQL Server’s optimizer will examine the above query, and find that it is best to first query the users table to find everyone in California, then query the rewards table for only those users and return the results as expected.
MySQL’s Treatment of Derived Tables
MySQL will first select all records in the rewards table, summing the points column, grouping by user_id. Then MySQL will select all users in California, then join the two results together, returning only users in California, albeit after some delay.
So What Is The Difference?
Both database servers return the expected result. SQL Server, however, is faster because it has determined that it is best to start with the users table and join the rewards table onto it — which is the shortest path to a result. MySQL makes no such determination, and instead summarizes the entire rewards table without regard for the fact that there might be a shorter path. If there are a million records in the rewards table, this query can take several seconds instead of a few milliseconds in MySQL.
Solution: Helper Fields
One solution is to use what I refer to this as the umbilical cord approach, which is a way of denormalizing the database. This is where fields are added to child tables (e.g. rewards) that already exist on parent tables (e.g. users). This means that a column named state will be added to the rewards table:
-- 20 chars in case "state" is not a state in the USA alter table rewards add state varchar(20);
And that state column will be set accordingly:
update rewards inner join users on rewards.user_id = users.id set rewards.state = users.state;
And now our query is optimized for MySQL:
select
email,
tp_jan_09.total_points jan_09_points,
tp_feb_09.total_points feb_09_points,
tp_mar_09.total_points mar_09_points
from users
inner join (select user_id, sum(points) total_points
from rewards
where tp_jan_09.state = 'CA'
and tp_jan_09.award_date between '1/1/2009' and '1/31/2009'
group by user_id
) tp_jan_09 on tp_jan_09.user_id = users.id
inner join (select user_id, sum(points) total_points
from rewards
where tp_feb_09.state = 'CA'
and tp_feb_09.award_date between '2/1/2009' and '2/28/2009'
group by user_id
) tp_feb_09 on tp_feb_09.user_id = users.id
inner join (select user_id, sum(points) total_points
from rewards
where tp_mar_09.state = 'CA'
and tp_mar_09.award_date between '3/1/2009' and '3/31/2009'
group by user_id
) tp_mar_09 on tp_mar_09.user_id = users.id
where users.state = 'CA';
Final Thoughts
This is intentionally simplified, perhaps to the point where you are thinking that query isn’t written right in the first place! You should inner join rewards onto users and group by month/year! I agree, but the example above is made up. It is, however, based on a real world case in which nine different correlated subqueries were run against a 70,000 record table (that is 630,001 individual queries). The execution time of that query was improved from 52 seconds to 5 seconds by doing what was described above (and what I will attempt to describe in a future post).
The update query above will not solve the problem forever because new users will register and existing users will move, so I devised a way to keep rewards.state in sync with users.state (fortunately, the real-world application has Ruby on Rails for a foundation):
class User < ActiveRecord::Base
def after_save
# better performance by using SQL
# 1 update statement instead of n where n is # of rewards
# it's anti-Rails, for sure; but this blog post is about performance
sql = "update rewards set state = ? where user_id = ? and state <> ?"
sql = sanitize_sql([sql,self.state,self.id,self.state])
connection.execute sql
end
end