How does JOIN involving database across 2 servers store the intermediate data?

Posted on

Question :

In case of a SQL JOIN on 2 databases across servers.

For example:

Select *
From server1.db1.dbo.tbl1 a
Join server2.db2.dbo.tbl2 b
On a.common_id=b.common_id
Where a.some_col = 1000

Suppose this query is run on server 2 via SSMS, then where does the query processing work? For example:

  1. Data for server1.db1.dbo.tbl1 needs fetching

  2. Then data for server2.db2.dbo.tbl2 needs fetching

  3. Then join to be performed.

  4. Then a filter

Where (on which server) does each of the above operation happen?

Answer :

As Dan points out, it’s dependent on the execution plan that gets generated, as the processing can occur on either end (the remote server or local server which executed the query). Usually you can determine which server it was actually processed on by seeing if there’s a Remote Query vs a Remote Scan operation in your query’s execution plan.

A Remote Query usually means the query is sent to the remote server to be processed, and the predicates will be performed remotely, resulting in less data being return back locally.

A Remote Scan generally results in all the data from the remote entities (e.g. the entire tables) to be returned to the local server first, then the query is processed, predicates are applied, etc, on the local server.

Evidently, Remote Query is usually a more efficient operator to have in your query’s execution plan as opposed to a Remote Scan operator.

By the way, this is a great website with a quick description of the different query plan operators.

I’ve seen a lot of adhoc answers on query re-writes to try and fix plans with Remote Scan and force a Remote Query operation instead. I’m not positive on exactly what the SQL Engine is doing under the hood when it chooses one over the other, but I have consistently seen that when your query uses the primary key (and more likely resultantly the clustered index) field of the remote entities, as part of the predicates, it results in a Remote Query operation.

To expand a bit on Dan’s and J.D.’s comments:

Note that SQL Server doesn’t have a distributed optimizer.

In my early database days, some 3 decades ago, I came in contact with IBM’s Distributed Relational Database Architecture, DRDA. This included, as I recall, a protocol for optimizers where they could share info and use a holistic view on the query execution.

Imagine you join two tables, one local (A) and one remote (B). You have a predicate on A making it return only three rows. The ideal here would be to ship the three rows from A to B, let B do the query to combine the data, and ship back the result to A.

SQL Server won’t do this, since it doesn’t have a distributed optimizer. SQL Server B isn’t “aware” of SQL Server A. To B, server A is just a client app. Sure, B can expose meta-data about its tables and indexes and stuff to server A (to what extent depends on the capacity of the OLEDB provider used to talk to the remote DBMS), but that is not the same as a distributed optimizer.

So, what can SQL Server do?
For above example, A can retrieve the three rows that satisfies the predicate, and do three remote queries. One for each row.

Also, note that a join doesn’t necessarily mean that there has to be some temp data stored anywhere. Only one of the three join types does that (hash joins, the other join types being loop and merge).

Below is an example:

I have linked server named TKB. I use Adventureworks, and have the same database on both A and B. There’s an index on the local table for the WHERE clause, so only three rows satisfy that condition.

When the query uses a linked server, you see the same query being executed three times on the remote server.

Then I instead join to the same table but as a local table, you instead of those three query execution see one clustered index seek.

SELECT AVG(h.SalesPersonID),  AVG(UnitPriceDiscount)
FROM [TKB].Adventureworks.Sales.SalesOrderHeader AS h
  INNER JOIN Sales.SalesOrderDetail AS d ON d.SalesOrderID = h.SalesOrderID
WHERE d.OrderQty = 34

SELECT AVG(h.SalesPersonID),  AVG(UnitPriceDiscount)
FROM Adventureworks.Sales.SalesOrderHeader AS h
  INNER JOIN Sales.SalesOrderDetail AS d ON d.SalesOrderID = h.SalesOrderID
WHERE d.OrderQty = 34

Here’s the plan.

Leave a Reply

Your email address will not be published. Required fields are marked *