Replacing Subqueries With Join Might Drastically Boost Query Performance

Let’s have a look at the following query:

SELECT TransactionNumber
FROM TransactionTable
WHERE TransactionNumber NOT IN
(SELECT TransactionNumber
FROM ClosedTransactionTable)

The following query can be re-written by using JOIN as the following way:

SELECT TransactionNumber
FROM TransactionTable
LEFT JOIN 
(SELECT DISTINCT TransactionNumber AS DistinctTransactionNumber 
FROM ClosedTransactionTable) AS T2
ON TransactionNumber = T2.DistinctTransactionNumber
WHERE T2.DistinctTransactionNumber IS NULL

Writing subqueries makes it easier to understand and at times works better than joins. However, at times a query using JOIN can execute in a few seconds instead few minutes for the same using IN.

Suppose the outer query is a JOIN of several tables including some large ones that produces several thousand or more rows. And the subquery would produce a maximum of several thousand rows every time you execute it. You know this because you have the domain knowledge of the business. In that case, if you can produce all the rows of the subquery by executing just once and create a HASH TABLE in the memory. Then using a HASH JOIN on it becomes quite fast. You can achieve this by giving the query hint to use HASH JOIN. The above query would then look like the below:

SELECT TransactionNumber
FROM TransactionTable
LEFT HASH JOIN 
(SELECT DISTINCT TransactionNumber AS DistinctTransactionNumber 
FROM ClosedTransactionTable) AS T2
ON TransactionNumber = T2.DistinctTransactionNumber
WHERE T2.DistinctTransactionNumber IS NULL

This query would execute really fast.  It is true in general for all kinds of subqueries in MS SQL Server.

Index