Database – Page 2 – Sharing Experiences

FILESTREAM – What and When

If you have to store files (BLOB) in MS SQL Server database, you might consider using FILESTREAM (started from MS SQL Server 2008). This is essentially enabling FILESTREAM attribute of varbinary (max) type column. Some other setups are necessary prior to that.

FILESTREAM combines MS SQL Server with NTFS file system. Database is good at reading/writing/processing small pieces (rows and columns) of information scattered around in the database files while file system is good at reading/writing files as a whole. File system start performing better when the files are larger than 1 MB.

By combining the two you can get the best of the both worlds, like getting past the low max size limit of database BLOB and use max NTFS size, good streaming performance of file system, integration with structured data (transactional consistency), better fragmentation management, manageability (backups etc can be done together) etc.

However if you are doing frequent small updates FILESTREAM performs poor. This is because here it has to work more – guarantee transactional consistency as well write to file system that essentially creates/writes a new copy and delete the old one. It also takes more time to take backup.

FILESTREAM data can be accessed in two ways: standard Transact-SQL and by using Win32 APIs. If you want the best performance of File System you need to use Win32 APIs. There are provisions for the Win32 APIs to work within the context of database transaction.

FILESTREAM BLOBS use file system cache and does not use database buffer pool. This improves query processing and overall database performance.

However do note that certain features like Transparent Data Encryption (TDE) are not supported with FILESTREAM.

To summarize, if you are to save files 1 MB or larger in database that will be read a lot with few small updates and that a middle tier is available where you can use Win32 APIs for accessing them then FILESTREAM is a good choice.

Reference

Index

Optimizing Inserts

If you are inserting a lot of rows in a big table one by one then it will take long time to do so given that the table has a number of indexes.

Try to rewrite the code so that you can do bulk insert. That will be faster.

For example, if you are inserting 2, 000 rows in a heavily indexed table, it might take 20 mins. If you can change the logic and get it done in a single bulk insert you can do it in say, 5 seconds.

If you cannot do so, then create a temp table equivalent to the one where the data should have been inserted. Once all the data are inserted into the temp table one by one, at the end at one go you can move them to the original table. You might need to use transaction and put in some more logic depending on the situations.

Another thing that you can do along with the above or alone is temporary disable the (non-clustered) indexes.

When you insert rows in a table, it is not only the data that MS SQL Server writes to disk but also the indexes on the table. The latter can take up a lot of time. So disable the indexes before you start inserting any data and rebuild it at the end of all insertions. That way you create the indexes at one go.

ALTER INDEX [Index_Name] ON Table DISABLE

ALTER INDEX [Index_Name] ON Table REBUILD

Index

Replacing Subqueries With Join Might Drastically Boost Query Performance

Let’s have a look at the following query:

SELECT TransactionNumber
FROM TransactionTable
WHERE TransactionNumber NOT IN
(SELECT TransactionNumber
FROM ClosedTransactionTable)

The following query can be re-written by using JOIN as the following way:

SELECT TransactionNumber
FROM TransactionTable
LEFT JOIN 
(SELECT DISTINCT TransactionNumber AS DistinctTransactionNumber 
FROM ClosedTransactionTable) AS T2
ON TransactionNumber = T2.DistinctTransactionNumber
WHERE T2.DistinctTransactionNumber IS NULL

Writing subqueries makes it easier to understand and at times works better than joins. However, at times a query using JOIN can execute in a few seconds instead few minutes for the same using IN.

Suppose the outer query is a JOIN of several tables including some large ones that produces several thousand or more rows. And the subquery would produce a maximum of several thousand rows every time you execute it. You know this because you have the domain knowledge of the business. In that case, if you can produce all the rows of the subquery by executing just once and create a HASH TABLE in the memory. Then using a HASH JOIN on it becomes quite fast. You can achieve this by giving the query hint to use HASH JOIN. The above query would then look like the below:

SELECT TransactionNumber
FROM TransactionTable
LEFT HASH JOIN 
(SELECT DISTINCT TransactionNumber AS DistinctTransactionNumber 
FROM ClosedTransactionTable) AS T2
ON TransactionNumber = T2.DistinctTransactionNumber
WHERE T2.DistinctTransactionNumber IS NULL

This query would execute really fast. It is true in general for all kinds of subqueries in MS SQL Server.

Index

Are You Blindly Trusting Plans Generated by MS SQL Server?

Imagine the following situation:

Your query (SELECT statement) involves a number of big tables including T (a big table itself). Suppose the execution usually takes 1 second.

After each time you execute the query you also insert some data into T.

Suddenly in some cases the same query taking 60 seconds to execute! And this is happening even though the overall data volume involved in this query did not change significantly (apart from some data you inserted into T, that is quite small compare to the total data).

If you have to call this stored procedure say 1000 times in the who processing that it is involved in, you will end up being really slow.

Why this is happening?

Well, if you compare the two plans, you might see a small change in the two plans (the one taking 1 second and the one taking 60 seconds). Say, a Nested Loop Join has replaced a Hash Join. This is possible. Since you were inserting some data into T, statistics got changed for that table. Due to the inherent weakness of the cost model, a wrong plan generation is entirely possible. Since you are dealing with big tables, instead of creating hash table used by a Hash Join, a Nested Loop might cost you severely.

What can you do?

Use your business domain knowledge to use query hints. For example, one might know the right side usually produces 5K to 10K rows (say, some special transactions that this right side is supposed to produce after a lot of joins). In that case, you specify query (join) hint to use Hash Join. That way, a small change in statistics won’t make Query Plan Generator to produce a small change in the plan (Nested Loop selection instead of Hash Join) that will cost you significantly.

Index

Searching Just One Record Taking Several Seconds?

You need just one record from a database. Say, the record for a particular transaction number. You need this information quite frequently (say, from your web application). However, every time you go and search it, it is taking several seconds.

There is something wrong. Just to pick one row should be done quite fast, even if your database is substantially big.

Did you look at the query plan? Is it using an index?

Not using an index? If you do have an index on this column, give hints to the query to use it, re-run (do not forget to execute DBCC DROPCLEANBUFFERS before every time you run a query, especially when you are doing performance testing) and check the time. If it is doing fine, save the query that way to force it to use the index every time you call this query later.

You don’t have any index on that column? Create a nonclustered index on it (assuming you have some clustered index on something more important). It should be fine now.

Index