The Resplendent Developer

Software Development and Software Quality Assurance

SQL: The Quick and Dirty Answer to Clustered vs. Non-Clustered Indexes!

While it may not be as necessary to know the real difference between the concepts of Clustered and Non-Clustered indexes in databases as you move away from the Database Administrator  role toward development and QA, it is a useful concept to understand. For the more technical minded, please forgive me, I’m going to take some liberties in order to get the concept across. It might hurt

In reality, the difference is fairly easy to define.

In a Non-Clustered index, you are essentially creating a look-up reference for a series of records based on whichever field (piece of data) you index. Think of an index as an address book: You want to find out where your friend “Joe” lives, so you look up “Joe” in your address book. Now you know where he lives and you can go RIGHT to his house and not have to knock on every door in your town until you find him. 

Great so what is a Clustered index?

To further extend our analogy, not only do you have an address book to find Joe, but everyone in the neighborhood has been sorted alphabetically. In short, Joe’s house is right next to John’s house, which is right next to Jose’s house. This means that if you needed to visit all your friends who’s name begins with “J”, you can do it more quickly because they all live in the same neighborhood.

In the non-clustered index, John and Joe could live in opposite ends of town. In a clustered index, they are “clustered together” in the same neighborhood.

Why would you choose one over the other?

Clustered:
Well, in a database, you can only have 1 field be clustered (actually you can, but that makes this harder to understand, so I’m glossing over it). Using our previous analogy of names, say our friends are named like this…

Joe Bob
John Zimmer
Jose Smith

You can tell them that all the people with first names beginning with “J” need to live in the same neighborhood. But you can’t also say that “Mary Barber” lives near “Joe Bob” because her first name begins with M. She lives in the “M” neighborhood.


Non-Clustered:
Lets face it, using our analogy, really, could also have a second address book that has everyone sorted by lastname first. This means, it’s easy for you to get “Mary Barber’s” address and then “Joe Bob’s” address, but it’s longer to visit them both because they are in different parts of town. The Non-Clustered Index, is just the address book.

Hopefully, this clears some things up for non-technical readers who hear their Database Administrator say “Yeah, I’ll just put a non-clustered index on the last name”. Then you can say, “Wouldn’t it make more sense to have last name be clustered?”

Are you the Database Administrator reading this? Let me know how miserable I’ve made your life by commenting below!

One Comment

  1. It's a good analogy. The main thing I see left out is the "why" one should care, what problem is one trying to solve.

    The problem being solved is that it is possible for a query to find records quickly, but still take a substantial amount of time reading the data. Usually, this is a scaling problem. It doesn't matter when there are only 200 people in town (in your analogy), but it can cause the process to bog down and fail if one is really dealing with 300 million people across the country. Most db users won't see this effect except on the largest tables (or another field that should be indexed isn't indexed, and thus clustering makes the table scan go faster).

    But it isn't just the table size that's of concern, here. The "query size" also matters. There is no benefit from clustering if the result set of a query is very small (e.g., 1 to 10 records) or if the result set is very large (i.e., the query is stuck reading most of the table anyway).

    In a real, live work case, I've dealt with a 5 terabyte (TERAbyte, not merely gigabyte) table. It contains important audit/transaction records of user accounts over 10 years. It was (quite sensibly) clustered by date. In fact, "by date" is often the best answer of which index to cluster on a table, because people usually aren't looking for things from 2 days ago AND 2 years ago, but usually a set of things that happened around the same time. Trouble is, that assumption turned out to be wrong due to how a 3rd-party app (which provided analytics based on the transaction records) needed to query this data. The 3rd-party app needed ALL of a users records, and then cached the results in memory to process them. So, this app was ALWAYS getting records from 2 days ago and 2 years ago (and 10 years ago). Worse, this app was on the first page a user would see upon logging in, because it made pretty graphics for the user to see and played a key role in the user's dashboard. This arrangement was cratering the product whenever lots of users logged in at once to see their accounts.

    In the case of this scenario, the ideal clustering would be BY USER. Each user probably only had a few kB of data to read (which is why the 3rd-party app regarded it as reasonable to load it all up at once and cache it). So putting it all on the same cluster(s) of the table would make the reads very fast.

    An interesting twist on this is that because the table was 5 TB, changing which index was clustered was an nigh-impossible task. It would take too long to add/remove an index. (Days. I'm not kidding.) Worse, the table was really 10 tables, each containing a particular years' data … the clustering was "hard-coded" in this manner. The solution to this was to add "included columns" to the index tailored for this query. Including columns in an index actually copies the columns' data over to the index itself, which does effectively the same thing as clustering – upon finding the desired records within the index, the db read can get the needed data from that spot in the index, without having to hop around the table to get it. Included columns on an index is the way to "cluster" for multiple cases/sortings of data.