Introduction
There was a time where many have not realized the importance of indexes in database world. Indexes were predominantly used to decrease the execution time of a query without understanding the side effects it would cause on DML statements. However, in this contemporary world where lots and lots of information are available in the net, many have realized that indexes are to be used carefully and used only when a need arises. This article even though provides information on the basics, we will also delve into the internals on how these indexes react under the cover. People who travel to the world of SQL Server from other databases will obviously have to confine themselves with only two types of indexes viz Clustered Indexes and Non-Clustered indexes. The design and architecture of SQL Server allows the user to play with these indexes alone (even though SQL Server 2008 has provided a new feature called FILTERED INDEXES, which we will see in next article). We will work on the different scenarios, observe the results and provide a logical conclusion towards the end. Please note that this is not a universal conclusion and results may vary depending on the data size, underlying OS, hardware configuration, number of concurrent users etc. I will be using SQL Server 2008 on Windows XP. As and when needed, I will also provide the information on the data size and other variable factors.
Definitions
Before we dive deep into this article, it would be better if we understand the terminologies and their definitions as it would be easy to understand the contents in this article.
Fragmentation | A process by which the spaces occur due to frequent updates and deletes in the page. More this fragmentation more the performance issues or maintenance will be. |
Index Level | This value provides the number of levels from the bottom up of an index. Please note that Leaf Node will always have 0. |
Clustered Index Seek | Index seek uses a clustered index to look up the rows to be returned directly |
Index Depth | Number of levels we need to pass to get into the index. |
Record Count | Number of records available in the page. In general this value at index level 0 will be the number of records in the table. But in case of any forwarding count, then there will be some differences |
Extent | An extent is the basic unit of storage used to allocate space for tables and indexes within a given data file. It is made up of eight contiguous 64KB data pages |
Page | A page is the unit of allocation within a specific extent. There are eight pages in every extent. Please note that a page is the last level you reach before hitting at the actual data row. |
Page Splits | Whenever a page becomes full, half of the data is shifted to the newly created page and pointers are kept in the old page pointing to the new location of the data. This is called Page Split. In case of a clustered index the inserted row which does not have enough space alone is shifted to the newly created page. |
Row Offset | For every row that we insert into the database, SQL Server will place the “Row Offset” located at the end of the page, to actually indicate where in that page the row data starts. Indirectly it tells us how many spaces we need to leave to read the data from the top. |
Index Seek | Index seek uses a non-clustered index to look up the rows to be returned directly |
Nested Loops | Takes the inputs from two data sets and performs the scanning of the outer set for each inner set. Very efficient operation if the data sets are small. |
RID Lookup | The query optimizer performs a RID LookUp, a type of bookmark lookup that occurs on a heap table and uses a row identifier to find the rows to return. |
Key Lookup | A Key Lookup is a bookmark lookup on a clustered table and which essentially means that the optimizer cannot retrieve the rows in a single operation, and has to use a clustered key to return the corresponding rows from a clustered index (or from the table itself). |
AVERAGE_FRAGMENTATION_ IN_PERCENT | The degree of fragmentation in the index tree which is based on pages or extents that are not in sync or out of order (i.e.) the pointer to the logical next page is not the same as the physical next page. Please note that we need to be looking for a low number here, though how low depends on the the scenario and requirements |
Clustered Index
Clustered index, as the name suggests, the indexes are clustered in one place with the data. The major advantage (in some cases disadvantage) with clustered index is the arrangement of data in one particular order. Since the data is arranged in one order, we can have only one clustered index in a table. By default, as soon as a primary key for a table is created, internally SQL Server creates a clustered index. However, when we create a unique key, we can specify either clustered or non-clustered index. Creating a clustered index is a time consuming task and in case if this is not defined properly then it may lead to lot of performance issues. In this section, we will see where this can be a gain and where this can lead to performance issues.
Scenario 1: Well defined Cluster Key
Let us create a table called “Accounts” having information related to a company accounts as follows:
CREATE TABLE ACCOUNTS (ACCOUNTNUMBER NUMBER(6), ACCOUNTNAME VARCHAR(50), ACCOUNTDESCRIPTION VARCHAR(50), ACCOUNT_TYPE VARCHAR(1))
In order to test various scenarios, let me load 50,000 records into this table using the following script:
BEGIN DECLARE @COUNTER AS INT SET @COUNTER=1
WHILE (@COUNTER <= 50000) INSERT INTO ACCOUNTS VALUES (@COUNTER, ('ACCOUNT ' + CAST(@COUNTER AS VARCHAR)), ('ACCOUNT DESC ' + CAST(@COUNTER AS VARCHAR)), (CASE (CAST(@COUNTER AS VARCHAR) % 4) WHEN 1 THEN 'A' WHEN 2 THEN 'L' WHEN 3 THEN 'E' WHEN 4 THEN 'I' ELSE NULL END)) END
We have four different types through which we classify the account as Asset (A), Liability (L), Expense (E) and Income (I). Now to insert these many records using SQL Server Management Studio (SSMS), it took around 23 seconds. The following query is a simple select statement filtering the data on accountnumber. SELECT * FROM ACCOUNTS WHERE ACCOUNTNUMBER < 1000 The above query got executed in no time and returned the 999 rows. The actual execution plan showed that the query took the CLUSTERED INDEX SEEK plan, which actually used the index and other parameters (or some authors term as attributes) are also well within the bounds. Please also note that since the key what we have chosen to index is well organized, whatever be the filter on the clustered key it would use the index. Since this clustered key is well organized, even the page splits and fragmentation should be well within the limits. We can look at this in three different angles. First let us truncate the existing records in this table and load the table with three sets of data. First data will be from 1 to 150,000. Second will be from 300,000 and 500,000. And final set will be between 150,000 and 300,000. At the initial stage before starting this experiment the page splits per second is 18,562 as calculated through the following query: SELECT * FROM SYS.SYSPERFINFO WHERE COUNTER_NAME = ‘PAGE SPLITS/SEC’ AND OBJECT_NAME LIKE ‘%ACCESS METHODS%’ Now the fragmentation for the index on this table is provided below:
Now let us start loading the data using the following script.
It took 1 minute to execute the above script and Page Splits per second increased from 18,562 to 19,604. It took just 1042 page splits per second to load 150,000 records. Fragment details are provided below: Now let us start working on the second set of data and observe the page splits and fragmentation.
It took 1 minute and 39 seconds to execute the above script and Page Splits per second increased from 19,604 to 21,030. It took just 1426 page splits per second to load 200,000 records. Fragment details are provided below: The most interesting part is the third set, which actually inserts the data in between the existing data. This will increase the page splits further and cause the cascading effects on the remaining data as well. This should not cause any problem if the clustering key is defined properly. Let us execute the following script to insert the missing data.
It took 1 minute to execute the above script and Page Splits per second increased from 21,030 to 22,099. It took just 1069 page splits per second to load 150,000 records. Fragment details are provided below:
Scenario 2: Not So Well Defined Clustered Key
Now let us change the scenario and use a calculated column and index the same. By calculated column, I mean a column that is derived from another column or a combination of values that can be put in one column. For this, I will create another table called PurchaseMaster as follows:
A clustered index was also created on the column purchasenumber as follows:
The column PURCHASENUMBER is derived on the basis of date of purchase. For example, the following is the sample data available in the PURCHASENUMBER column.
Please note that each purchase number is appended with the financial year. The purchase number 100 has the financial year 2011-2012 and the same for the other purchase numbers. As soon as the person who is raising the purchase order looks at this, the person will know the financial year details. This is indeed very useful and infact very handy as well. Also, when we issue the following statement, we did not see any anomaly.
The execution plan for the above select statement is provided below: Now, if we see a range filter based query as follows, the results were not proper.

Inserting 150,000 records took just 1 minute and 3 seconds. The page splits/second and fragmentation after this are provided below: From 9325, page splits/second has increased to 11523 which caused 2198 page splits per second during this insert operation. Also the fragment details are provided below: Please note that the fragmentaion is observed to be high in this case. Now, let us consider the next set, which is inserting next set of 200,000 records from 300,000 to 500,000.
Inserting 200,000 records took just 1 minute and 18 seconds. The page splits/second and fragmentation after this are provided below: From 11,523, page splits/second has increased to 14,947 which caused 3,424 page splits per second during this insert operation. Also the fragment details are provided below: Wow, that is a great increase in fragmentation. Now the final is what more important to observe both Page Splits and Fragmentation. We are now going to insert the missing records between 150,000 and 300,000 and observe the page splits. I will also explain diagramatically how this would happen. Now let us execute the same script by changing the values.
The third batch took just 57 seconds to complete. Page Splits per second was observed to be 17,520 which caused 2,573 page splits per second during this insert operation. Also the fragment details are provided below: Now having said this on page splits and fragmentation, how do we know if this is good or bad? Let us compare the scenario 1 and 2 in the following table and observe the results.
Description | Execution Time (mm:ss) | Page Splits | Fragmentation (AVG_FRAGMENTATION_ IN_PERCENT FOR INDEX_LEVEL=0) | |
Scenario 1 | 1-150,000 | 01:00 | 1,042 | 0.68 |
300,000-500,000 | 01:39 | 1,426 | 0.57 | |
150,001-299,999 | 01:00 | 1,069 | 0.62 | |
Scenario 2 | 1-150,000 | 01:03 | 2,198 | 35.21 |
300,000-500,000 | 01:18 | 3,424 | 41.87 | |
150,001-299,999 | 00:57 | 2,573 | 43.36 |
Non-Clustered Index
What we saw till now is a very ideal situation where we have a well defined key and can go ahead to create a clustered index. But what if the data is not so organized but we still need to have an index on the column. Enter the world of Non-Clustered Index. Where actually the data is not organized or ordered in any pre-existing condition, but still satisfy the purposes. The major difference is leaf level does not contain data. It includes pointers to the relevant data and the pointer is a simple RID column. RID columns include the information on extent, page and row offset. Using this RID column it will go and fetch the data from different pages. Another issue with with non-clustered index is it behaves differently for heap and clustered table. Here again we will see two scenarios. One Non-clustered index created on a heap and then created on a clustered table.
Scenario 3: Non-Clustered Index on Heap Table
Let us create a table called PRODUCTS as follows:
Now let us create a non-clustered index as follows:
The following script will insert the data into this table.
Now let us use a simple select clause and observe the execution plan.
Now the execution plan is changed, no RID lookup because we have selected only the indexed columns. What if we wanted to avoid RID lookup and at the same time most of the columns in the table to be selected? Use Include option along with the definition of the index creation as follows:
Please note that we did not get RID lookup, which is because of the columns included in the index creation. However, please be careful with INCLUDE column option. Please do benchmark before using it as it occupies more space in LEAF level and there on reducing the number of rows and increasing the number of IOs.
Scenario 4: Non-Clustered Index on Clustered Table
Now, in the same PRODUCTS table I have dropped the indexes and created two indexes as follows:

Further Reading | Please don’t forget to visit this site to read my next articles on
|
Wow, that is a great increase in fragmentation.
Now the final is what more important to observe both Page Splits and Fragmentation. We are now going to insert the missing records between 150,000 and 300,000 and observe the page splits. I will also explain diagramatically how this would happen. Now let us execute the same script by changing the values.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=150001
WHILE (@COUNTER <= 299999)
BEGIN
INSERT INTO PURCHASEMASTER VALUES
(CAST(@COUNTER AS VARCHAR) + ‘\’ + cast(YEAR(GETDATE()) as varchar) + ‘-‘ + cast(YEAR(GETDATE())+1 as varchar),
‘PURCHASE ‘ + CAST(@COUNTER AS VARCHAR),
‘VENDOR ‘ + CAST(@COUNTER AS VARCHAR),
GETDATE())
SET @COUNTER=@COUNTER+1
END
END
The third batch took just 57 seconds to complete. Page Splits per second was observed to be 17,520 which caused 2,573 page splits per second during this insert operation. Also the fragment details are provided below:
Now having said this on page splits and fragmentation, how do we know if this is good or bad? Let us compare the scenario 1 and 2 in the following table and observe the results.
Description |
Execution Time (mm:ss) |
Page Splits |
Fragmentation (AVG_FRAGMENTATION_ IN_PERCENT FOR INDEX_LEVEL=0) |
|
Scenario 1 |
1-150,000 |
01:00 |
1,042 |
0.68 |
300,000-500,000 |
01:39 |
1,426 |
0.57 |
|
150,001-299,999 |
01:00 |
1,069 |
0.62 |
|
Scenario 2 |
1-150,000 |
01:03 |
2,198 |
35.21 |
300,000-500,000 |
01:18 |
3,424 |
41.87 |
|
150,001-299,999 |
00:57 |
2,573 |
43.36 |
This shows in Scenario 2, as the cluster key is not well defined, the inserts to fill the missing gap (150,001 to 299,999) is observed to be high. Also, the fragmentation and Page Splits are quite high since the cluster key is not well defined. Whenever we need to define a clustered index on a key, please make sure the following:
a) Key should be well defined, satisfying the business requirements and as well as the providing a greater selectivity.
b) Key should not be defined on derived table and calculated column
c) Other columns in the table should be completely dependent on the cluster key
Non-Clustered Index
What we saw till now is a very ideal situtation where we have a well defined key and can go ahead to create a clustered index. But what if the data is not so organized but we still need to have an index on the column. Enter the world of Non-Clustered Index. Where actually the data is not organized or ordered in any pre-existing condition, but still satisfy the purposes. The major difference is leaf level does not contain data. It includes pointers to the relevant data and the pointer is a simple RID column. RID columns include the information on extent, page and row offset. Using this RID column it will go and fetch the data from different pages.
Another issue with with non-clustered index is it behaves differently for heap and clustered table. Here again we will see two scenarios. One Non-clustered index created on a heap and then created on a clustered table.
Scenario 3: Non-Clustered Index on Heap Table
Let us create a table called PRODUCTS as follows:
CREATE TABLE PRODUCTS (PRODUCTNUMBER NUMERIC(6), PRODUCTNAME VARCHAR(30), PRODUCTDESCRIPTION VARCHAR(50), PRODUCTTYPE VARCHAR(1))
Now let us create a non-clustered index as follows:
CREATE NONCLUSTERED INDEX PRODIDX ON PRODUCTS(PRODUCTNUMBER, PRODUCTNAME)
The following script will insert the data into this table.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=1
WHILE (@COUNTER <= 50000)
BEGIN
INSERT INTO PRODUCTS VALUES
(CAST(@COUNTER AS VARCHAR),
‘PRODUCTS ‘ + CAST(@COUNTER AS VARCHAR),
‘DESCRIPTION ‘ + CAST(@COUNTER AS VARCHAR),
(case (@COUNTER % 2) when 1 then ‘I’ when 0 then ‘R’ else NULL end))
SET @COUNTER=@COUNTER+1
END
END
Now let us use a simple select clause and observe the execution plan.
SELECT * FROM PRODUCTS WHERE PRODUCTNUMBER=2089
If you see this execution plan carefully, the following can be observed:
a) Since we do not have clustered index in this table, it has used RID lookup to retrieve other two columns PRODUCTDESCRIPTION and PRODUCTTYPE
b) It has used Index Seek to retrieve indexed columns PRODUCTNUMBER and PRODUCTNAME
c) Internally what happens is the optimizer uses NESTED LOOPS to join the data from the results of the two operations viz Index Seek and RID Lookup.
Is this kind of execution plan is bad? Depends on the business needs and execution time. The above query took couple of seconds to execute. Let us now select only the indexed columns and see what is the execution plan for the same query.
SELECT PRODUCTNUMBER, PRODUCTNAME FROM PRODUCTS WHERE PRODUCTNUMBER=2089
Now the execution plan is changed, no RID lookup because we have selected only the indexed columns. What if we wanted to avoid RID lookup and at the same time most of the columns in the table to be selected? Use Include option along with the definition of the index creation as follows:
DROP INDEX PRODUCTS.PRODIDX
CREATE NONCLUSTERED INDEX PRODIDX ON PRODUCTS(PRODUCTNUMBER, PRODUCTNAME) INCLUDE PRODUCTDESCRIPTION, PRODUCTTYPE
The following select statement provides an execution plan
SELECT PRODUCTNUMBER, PRODUCTNAME FROM PRODUCTS WHERE PRODUCTNUMBER=2089
Please note that we did not get RID lookup, which is because of the columns included in the index creation. However, please be careful with INCLUDE column option. Please do benchmark before using it as it occupies more space in LEAF level and there on reducing the number of rows and increasing the number of IOs.
Scenario 4: Non-Clustered Index on Clustered Table
Now, in the same PRODUCTS table I have dropped the indexes and created two indexes as follows:
DROP INDEX PRODUCTS.PRODIDX
CREATE CLUSTERED INDEX PRODCLUSTIDX ON PRODUCTS(PRODUCTNUMBER)
CREATE NONCLUSTERED INDEX PRODIDX ON PRODUCTS(PRODUCTNAME)
Again we have the same number of records viz 50,000 records in the table. The following select statement produces the below execution plan:
Select * from products where productname=’PRODUCTS 9201’
If you read the execution plan, the following can be observed:
a) Since we have clustered index in this table, it has used Key Lookup to retrieve other two columns PRODUCTDESCRIPTION and PRODUCTTYPE
b) It has used Index Seek to retrieve indexed columns PRODUCTNUMBER and PRODUCTNAME
c) Internally what happens is the optimizer uses NESTED LOOPS to join the data from the results of the two operations viz Index Seek and Key Lookup.
Further please note that in case if the KEY LOOKUP is to be avoided, use INCLUDE option while creating the index.
Non-Clustered index can be used as part of heap table or as part of clustered table. There is no hard and fast rule that it has to be used only with heap or clustered table. Before using the same we need to benchmark and test it completely.
Further Reading |
Please don’t forget to visit this site to read my next articles on Ø Joins and Sets Ø Hierarchical Queries |