Author: Gopal Das

Heading Data Science and Innovation Team @ CrimsonLogic, Singapore; BS in CSE from Khulna University; ME in Internet Science & Engineering from Indian Institute of Science (IISc); Publications on Query Optimization in RDBMS in ACM SIGMOD, IEEE ICDE etc.; Founding team member and VP Engineering of iTwin, a spinoff from A*STAR; Software engineer/data scientist for 21 years; Software, Database, ML; Father of 3 (two @ NUS High School and one is too little!); www.linkedin.com/in/dasgopal; https://github.com/gopalcdas;

FILESTREAM – Setup

To use FILESTREAM feature both Windows administrator and SQL Server administrator have to enable it. While enabling from Windows there are three options to choose from:

Enable FILESTREAM for Transact-SQL Access check box.
Enable FILESTREAM for file I/O streaming access. This is to read and write FILESTREAM data from Windows. Enter the name of the Windows share in the Windows Share Name box.
Select Allow remote clients to have streaming access to FILESTREAM data. This is to let remote clients access the FILESTREAM data stored on this share.

For SQL Server instance there are three options to choose from:

0 – Disable FILESTREAM support for this instance
1 – enable FILESTREAM for Transact-SQL access only
2 – enable FILESTREAM for Transact-SQL and Win32 streaming access

Note that FILESTREAM would perform worse for 1MB+ size files for full read/write if Transact-SQL is chosen (and Win32 streaming is not chosen) than no FILESTREAM BLOB.

Now that FILESTREAM is enabled, we can go ahead to do the rest of the setups.

First, create a file group to keep the FILESTREAM data. There is a 1:1 mapping between data containers and file groups.

       ALTER DATABASE MyDatabase ADD
       FILEGROUP MyFileStreamGroup1 CONTAINS FILESTREAM;

Second, define data container. In the below example, “D:\Production” must exist already. “D:\Production\FileStreamDataContainerRoot” is going to be the root of the data container.

       ALTER DATABASE MyDatabase ADD FILE (
       NAME = FileStreamGroup1File,
       FILENAME = 'D:\Production\FileStreamDataContainerRoot')
       TO FILEGROUP MyFileStreamGroup1 ;

Third, create a table with a FILESTREAM column. All FILESTREAM columns of a table must go to a single FILESTREAM group that can be specified, defaulting to default FILESTREAM filegroup. Table must have a column of the non-null uniqueidentifier data type with ROWGUIDCOL attribute and either a UNIQUE or PRIMARY KEY single-column constraint. FILESTREAM data container will have a directory for a table with subdirectory corresponding to the FILESTREAM column of the table.

       USE MyDatabase ;
       GO

       CREATE TABLE TableWithFileStreamColumn (
       DocumentID INT IDENTITY PRIMARY KEY,
       Document VARBINARY (MAX) FILESTREAM NULL,
       DocGUID UNIQUEIDENTIFIER NOT NULL ROWGUIDCOL UNIQUE DEFAULT NEWID ())
       FILESTREAM_ON MyFileStreamGroup1;

Now it is time to use FILESTREAM using standard Transact-SQL or Win32 APIs.

Reference

Prev Next

Index

Online technical stuffs I like/read, not necessarily in the order listed and not full

Raymond Chen, Paul Randal, Dr. Joseph M. Newcomer, stackoverflow, CodeProject, Wikipedia, GeeksforGeeks, MSDN, Joel Spolsky, Quora

FILESTREAM – Hardware/OS Consideration

There should be some considerations regarding hardware selection/configuration and OS configuration like:

Hard disk – SCSI is better for reliability and performance
RAID level – RAID 10 is recommended even though at least 50% space is redundant depending on configuration
Storage layout – all data in one volume may not be the right choice if load is high
Cluster size – 64KB and above to reduce fragmentation, also defragment periodically
Compression – cannot be enabled if cluster size is 4KB or less, may not be effective for all kinds of data, comes at the expense of CPU
Anti-virus configuration – AV can need to be setup properly so that it does not delete FILESTREAM files
Disable 8.3 naming for NTFS – if number of files are high then NTFS has to work harder to get a unique name
Turn off last access time update – since files are not accessed by explorer rather SQL Server turning off this feature is OK and it saves a lot of time
Security – SQL server need to be configured to use integrated security (Win32 file APIs to work that uses Windows token associated with the client’s thread accessing files)

Reference

Prev Next

Index

FILESTREAM – What and When

If you have to store files (BLOB) in MS SQL Server database, you might consider using FILESTREAM (started from MS SQL Server 2008). This is essentially enabling FILESTREAM attribute of varbinary (max) type column. Some other setups are necessary prior to that.

FILESTREAM combines MS SQL Server with NTFS file system. Database is good at reading/writing/processing small pieces (rows and columns) of information scattered around in the database files while file system is good at reading/writing files as a whole. File system start performing better when the files are larger than 1 MB.

By combining the two you can get the best of the both worlds, like getting past the low max size limit of database BLOB and use max NTFS size, good streaming performance of file system, integration with structured data (transactional consistency), better fragmentation management, manageability (backups etc can be done together) etc.

However if you are doing frequent small updates FILESTREAM performs poor. This is because here it has to work more – guarantee transactional consistency as well write to file system that essentially creates/writes a new copy and delete the old one. It also takes more time to take backup.

FILESTREAM data can be accessed in two ways: standard Transact-SQL and by using Win32 APIs. If you want the best performance of File System you need to use Win32 APIs. There are provisions for the Win32 APIs to work within the context of database transaction.

FILESTREAM BLOBS use file system cache and does not use database buffer pool. This improves query processing and overall database performance.

However do note that certain features like Transparent Data Encryption (TDE) are not supported with FILESTREAM.

To summarize, if you are to save files 1 MB or larger in database that will be read a lot with few small updates and that a middle tier is available where you can use Win32 APIs for accessing them then FILESTREAM is a good choice.

Reference

Index

Optimizing Inserts

If you are inserting a lot of rows in a big table one by one then it will take long time to do so given that the table has a number of indexes.

Try to rewrite the code so that you can do bulk insert. That will be faster.

For example, if you are inserting 2, 000 rows in a heavily indexed table, it might take 20 mins. If you can change the logic and get it done in a single bulk insert you can do it in say, 5 seconds.

If you cannot do so, then create a temp table equivalent to the one where the data should have been inserted. Once all the data are inserted into the temp table one by one, at the end at one go you can move them to the original table. You might need to use transaction and put in some more logic depending on the situations.

Another thing that you can do along with the above or alone is temporary disable the (non-clustered) indexes.

When you insert rows in a table, it is not only the data that MS SQL Server writes to disk but also the indexes on the table. The latter can take up a lot of time. So disable the indexes before you start inserting any data and rebuild it at the end of all insertions. That way you create the indexes at one go.

ALTER INDEX [Index_Name] ON Table DISABLE

ALTER INDEX [Index_Name] ON Table REBUILD

Index

Replacing Subqueries With Join Might Drastically Boost Query Performance

Let’s have a look at the following query:

SELECT TransactionNumber
FROM TransactionTable
WHERE TransactionNumber NOT IN
(SELECT TransactionNumber
FROM ClosedTransactionTable)

The following query can be re-written by using JOIN as the following way:

SELECT TransactionNumber
FROM TransactionTable
LEFT JOIN 
(SELECT DISTINCT TransactionNumber AS DistinctTransactionNumber 
FROM ClosedTransactionTable) AS T2
ON TransactionNumber = T2.DistinctTransactionNumber
WHERE T2.DistinctTransactionNumber IS NULL

Writing subqueries makes it easier to understand and at times works better than joins. However, at times a query using JOIN can execute in a few seconds instead few minutes for the same using IN.

Suppose the outer query is a JOIN of several tables including some large ones that produces several thousand or more rows. And the subquery would produce a maximum of several thousand rows every time you execute it. You know this because you have the domain knowledge of the business. In that case, if you can produce all the rows of the subquery by executing just once and create a HASH TABLE in the memory. Then using a HASH JOIN on it becomes quite fast. You can achieve this by giving the query hint to use HASH JOIN. The above query would then look like the below:

SELECT TransactionNumber
FROM TransactionTable
LEFT HASH JOIN 
(SELECT DISTINCT TransactionNumber AS DistinctTransactionNumber 
FROM ClosedTransactionTable) AS T2
ON TransactionNumber = T2.DistinctTransactionNumber
WHERE T2.DistinctTransactionNumber IS NULL

This query would execute really fast. It is true in general for all kinds of subqueries in MS SQL Server.

Index

Are You Blindly Trusting Plans Generated by MS SQL Server?

Imagine the following situation:

Your query (SELECT statement) involves a number of big tables including T (a big table itself). Suppose the execution usually takes 1 second.

After each time you execute the query you also insert some data into T.

Suddenly in some cases the same query taking 60 seconds to execute! And this is happening even though the overall data volume involved in this query did not change significantly (apart from some data you inserted into T, that is quite small compare to the total data).

If you have to call this stored procedure say 1000 times in the who processing that it is involved in, you will end up being really slow.

Why this is happening?

Well, if you compare the two plans, you might see a small change in the two plans (the one taking 1 second and the one taking 60 seconds). Say, a Nested Loop Join has replaced a Hash Join. This is possible. Since you were inserting some data into T, statistics got changed for that table. Due to the inherent weakness of the cost model, a wrong plan generation is entirely possible. Since you are dealing with big tables, instead of creating hash table used by a Hash Join, a Nested Loop might cost you severely.

What can you do?

Use your business domain knowledge to use query hints. For example, one might know the right side usually produces 5K to 10K rows (say, some special transactions that this right side is supposed to produce after a lot of joins). In that case, you specify query (join) hint to use Hash Join. That way, a small change in statistics won’t make Query Plan Generator to produce a small change in the plan (Nested Loop selection instead of Hash Join) that will cost you significantly.

Index

Searching Just One Record Taking Several Seconds?

You need just one record from a database. Say, the record for a particular transaction number. You need this information quite frequently (say, from your web application). However, every time you go and search it, it is taking several seconds.

There is something wrong. Just to pick one row should be done quite fast, even if your database is substantially big.

Did you look at the query plan? Is it using an index?

Not using an index? If you do have an index on this column, give hints to the query to use it, re-run (do not forget to execute DBCC DROPCLEANBUFFERS before every time you run a query, especially when you are doing performance testing) and check the time. If it is doing fine, save the query that way to force it to use the index every time you call this query later.

You don’t have any index on that column? Create a nonclustered index on it (assuming you have some clustered index on something more important). It should be fine now.

Index