Archive for March, 2009

MySQL vs. MSSQL – Derived Table Performance

Thursday, March 12th, 2009

I found, the hard way of course, a very important difference between SQL Server and MySQL. Among my large list of favorite things to do is tuning queries for better performance — the bigger and uglier, the better. I frequently replace correlated subqueries (a.k.a. 1+n queries) with joins to derived tables. Here’s an example using a fictitious table named users that has a one-to-many relationship with another table named rewards in which we want to return an email address and the total number of “rewards points” for users in California during Jan, Feb, and Mar 2009:

select
  email,
  (select sum(points) from rewards where user_id = users.id
     and award_date between '1/1/2009' and '1/31/2009') jan_09_points,
  (select sum(points) from rewards where user_id = users.id
     and award_date between '2/1/2009' and '2/28/2009') feb_09_points,
  (select sum(points) from rewards where user_id = users.id
     and award_date between '3/1/2009' and '3/31/2009') mar_09_points
from users
where state = 'CA';

If we have 1000 users in California, then this “query” will really be 3001 queries (1 query on the users table, 1000 sum queries on each of the subqueries to the rewards table).

A better way to write this query is to use derived tables:

select
  email,
  tp_jan_09.total_points jan_09_points,
  tp_feb_09.total_points feb_09_points,
  tp_mar_09.total_points mar_09_points
from users
inner join (select user_id, sum(points) total_points
  from rewards
  where tp_jan_09.award_date between '1/1/2009' and '1/31/2009'
  group by user_id
  ) tp_jan_09 on tp_jan_09.user_id = users.id
inner join (select user_id, sum(points) total_points
  from rewards
  where tp_feb_09.award_date between '2/1/2009' and '2/28/2009'
  group by user_id
  ) tp_feb_09 on tp_feb_09.user_id = users.id
inner join (select user_id, sum(points) total_points
  from rewards
  where tp_mar_09.award_date between '3/1/2009' and '3/31/2009'
  group by user_id
  ) tp_mar_09 on tp_mar_09.user_id = users.id
where state = 'CA';

Believe it or not, you get an enormous improvement in performance by replacing the correlated subquery with the derived table — in SQL Server. I am talking about production data — tens of thousands of records in the users table and hundreds of thousands or millions of records in the rewards table; not a development environment with a dozen records.

SQL Server’s Treatment of Derived Tables
SQL Server’s optimizer will examine the above query, and find that it is best to first query the users table to find everyone in California, then query the rewards table for only those users and return the results as expected.

MySQL’s Treatment of Derived Tables
MySQL will first select all records in the rewards table, summing the points column, grouping by user_id. Then MySQL will select all users in California, then join the two results together, returning only users in California, albeit after some delay.

So What Is The Difference?
Both database servers return the expected result. SQL Server, however, is faster because it has determined that it is best to start with the users table and join the rewards table onto it — which is the shortest path to a result. MySQL makes no such determination, and instead summarizes the entire rewards table without regard for the fact that there might be a shorter path. If there are a million records in the rewards table, this query can take several seconds instead of a few milliseconds in MySQL.

Solution: Helper Fields
One solution is to use what I refer to this as the umbilical cord approach, which is a way of denormalizing the database. This is where fields are added to child tables (e.g. rewards) that already exist on parent tables (e.g. users). This means that a column named state will be added to the rewards table:

-- 20 chars in case "state" is not a state in the USA
alter table rewards add state varchar(20);

And that state column will be set accordingly:

update rewards
inner join users on rewards.user_id = users.id
set rewards.state = users.state;

And now our query is optimized for MySQL:

select
  email,
  tp_jan_09.total_points jan_09_points,
  tp_feb_09.total_points feb_09_points,
  tp_mar_09.total_points mar_09_points
from users
inner join (select user_id, sum(points) total_points
  from rewards
  where tp_jan_09.state = 'CA'
    and tp_jan_09.award_date between '1/1/2009' and '1/31/2009'
  group by user_id
  ) tp_jan_09 on tp_jan_09.user_id = users.id
inner join (select user_id, sum(points) total_points
  from rewards
  where tp_feb_09.state = 'CA'
    and tp_feb_09.award_date between '2/1/2009' and '2/28/2009'
  group by user_id
  ) tp_feb_09 on tp_feb_09.user_id = users.id
inner join (select user_id, sum(points) total_points
  from rewards
  where tp_mar_09.state = 'CA'
    and tp_mar_09.award_date between '3/1/2009' and '3/31/2009'
  group by user_id
  ) tp_mar_09 on tp_mar_09.user_id = users.id
where users.state = 'CA';

Final Thoughts
This is intentionally simplified, perhaps to the point where you are thinking that query isn’t written right in the first place! You should inner join rewards onto users and group by month/year! I agree, but the example above is made up. It is, however, based on a real world case in which nine different correlated subqueries were run against a 70,000 record table (that is 630,001 individual queries). The execution time of that query was improved from 52 seconds to 5 seconds by doing what was described above (and what I will attempt to describe in a future post).

The update query above will not solve the problem forever because new users will register and existing users will move, so I devised a way to keep rewards.state in sync with users.state (fortunately, the real-world application has Ruby on Rails for a foundation):

class User < ActiveRecord::Base
  def after_save
    # better performance by using SQL
    # 1 update statement instead of n where n is # of rewards
    # it's anti-Rails, for sure; but this blog post is about performance
    sql = "update rewards set state = ? where user_id = ? and state <> ?"
    sql = sanitize_sql([sql,self.state,self.id,self.state])
    connection.execute sql
  end
end

Making code readable inside a blog

Friday, March 6th, 2009

Which of these is more readable?

<a href="javascript:alert('hello world'); return false;">

or

<a href="javascript:alert('hello world'); return false;">

The first is more readable thanks to Mr. Gorbatchev who created the SyntaxHighlighter, which is a very easy to use Javascript and CSS library. It’s this easy:

<link rel="stylesheet" href="shCore.css" type="text/css" />
<link rel="stylesheet" href="shThemeDefault.css" type="text/css" />

<script type="text/javascript" src="shCore.js"></script>
<script type="text/javascript" src="shLegacy.js"></script>
<script type="text/javascript" src="shBrushXml.js"></script>

<pre class="brush: html">
   <a href="http://johnstanfield.com">This is easy!</a>
</pre>

<script type="text/javascript">
SyntaxHighlighter.all();
</script>

There are different brushes for many popular languages, including my most used: Ruby, HTML, and CSS.

Remember to return false after onclick

Friday, March 6th, 2009

There are many ways of implementing Javascript within anchor tags (i.e. hyperlinks). One such way is to insert the Javascript into the href attribute, such as:

<a href="javascript:alert('hello world'); return false;">

At my company, standard practice is to design Javascript hyperlinks like this (where the href is # and the Javascript is in the onclick attribute):

<a href="#" onclick="alert('hello world'); return false;">

One thing I noticed today is that one of our forms uses really easy field validation (which I highly recommend to anyone implementing client-side validation) on submit, but did not return false. This was not a problem except in IE6. Here’s what would happen:

  • The form was very long, with perhaps 20 inputs
  • The invalid field and the submit button were near the bottom of the page
  • When clicking Submit, IE6 would scroll to the top of the page, leaving the user wondering what went wrong
  • The user would have to scroll down and look for something wrong

The problem was in this anchor tag:

<a href="#" onclick="if (valid.validate()) {$('registration').submit()}">

Really easy field validation did exactly what it was supposed to do, which is to put a red error message next to invalid fields (e.g. leaving a required field blank). However, because the onclick event did not return false, IE6 proceeded to treat the click as valid, and send the user to #, which is the top of the screen.

This was the solution:

<a href="#" onclick="if (valid.validate()) {$('registration').submit()}">

I’ll also add that the real cause of the problem is this:

  • User leaves first name empty and clicks Register
  • Really easy field validation puts a red message next to first name and puts the focus on first name
  • User clicks Register again
    • Because the focus is already on first name, IE doesn’t scroll to first name, and instead scrolls to the top of the screen
    • For whatever reason, FireFox does not do this.