Let’s assume that I have a table with 30-50 millions records. It has ~40 columns. Currently when I request data it always requests all columns because of legacy ORM that we use.
I know that if we specify fewer result columns in
SELECT query, then it will be easier for SQL Server as it doesn’t need to load them from disk and return to client. But what about SQL Server buffer cache?
Lets say my application executes such query:
SELECT Column1,Column2,...,Column40 FROM ... WHERE OwnerID = 42
So it will load all columns for these records, cache them in memory, and return to client.
And next query after it will be
SELECT Column1,Column2 FROM ... WHERE OwnerID = 42
So it won’t load other columns and return much less data, but will it matter if anyway they are loaded from cache and not from disk, as we loaded all columns in previous query. Will it help SQL Server or it won’t matter at all because of SQL cache? How much will be difference in terms of complexity for SQL Server?
I can’t remove first query for now, so I’m trying to optimize next queries for the same entity and trying to figure out a good ways.
SQL Server stores all the columns for one row together in a single disk page. (It’s more complex than this but for
char etc. this is more-or-less true.) To retrieve any column’s value the whole page is read into RAM. So once you have any column available for a given row, all of the columns for that row are available.
There are several buffers / caches involved in responding to a query. One is the great big page bufferpool where pages read from disk are held in memory so execution plans can act on them. And yes, if one query causes a page to be loaded into the bufferpool all subsequent queries will use the same page from the bufferpool and not have to suffer the IO overhead. This is an incidental performance gain, however, and shouldn’t be relied upon because SQL Server may choose to evict any page from the bufferpool at any time it needs the space.
Another set of buffering occurs when the output from your query gets sent over your connection to your application. This is (mostly) on a per-connetion basis so duplicating the work duplicates the resources required (more or less).
Another cost is the optimisations you miss out on when you return all columns. Covering indexes and “INCLUDES” columns are meaningless. Every read will have to retrieve the underlying clustered index (or heap) page. It’s a fair bet that SQL Server will not be using specific indexes it might otherwise have chosen because a scan-the-cluster plan is cheaper than an index-and-cluster plan.