How can I speed up this query on a table with 65 million records?

Question :

select hash_id, hash_value,
            substring(T.cri, 1, 7),
            substring(T.cri, 9, D2.Pos-9),        
            substring(T.cri, D2.Pos + 1, D3.Pos - D2.Pos - 1),
            substring(T.cri, D3.Pos + 1, D4.Pos - D3.Pos - 1),
            substring(T.cri, D4.Pos+1, len(T.cri) - 6 - D4.Pos)
from hash_key as T WITH(index(IX_hashkey_hashid_covering_refcount),nolock)
    cross apply (select charindex('?', T.cri, 31)) as D2(Pos)
    cross apply (select charindex('?', T.cri, D2.Pos+1)) as D3(Pos)
    cross apply (select charindex('?', T.cri, D3.Pos+1)) as D4(Pos)                 
where hash_id between 1 and 10000

Create table:

create table hash_key_binary (
  hash_id bigint not null identity, 
  hash_value nvarchar(100) not null, 
  cri nvarchar(100) not null, 
  ref_count_checksum int not null default(0), 
  compartment_id int , 
  cg_id int , 
  cri_version_id int, 
  primary key (hash_value, cg_id), 
  constraint fk_hash_compartment foreign key (compartment_id) references compartment(compartment_id) on delete no action, 
  constraint fk_hash_compartmentgroup foreign key (cg_id) references compartment_group(group_id) on delete no action
)

These are the indexes present on this table :

CREATE NONCLUSTERED INDEX IX_hashkey_hashid_covering_refcount 
  ON hash_key (hash_id ASC) 
  INCLUDE (ref_count) 

CREATE NONCLUSTERED INDEX … -- index name missing from the original post
  ON hash_key (cri ASC)

Answer :

I think your biggest issue is the schema.

  • You should try to reduce the row footprint. For example, is nvarchar really needed? Using varchar will cut the string storage in half, and may cut your processing time roughly in half too. Is bigint really needed for hash_id? With only 65M rows, a regular int is more than enough, again at half the size. Every byte you save per row saves roughly 65 MB across the table.
  • See whether you can use separate columns instead of text manipulation; that would simplify the query, improve integrity, and most probably improve speed.
  • A primary key on an nvarchar(100) is not a great idea; it should be on your identity column instead, and made clustered. I’m sure that will remove the need for the index hint.
  • Instead of making (hash_value, cg_id) the primary key, enforce it with a unique index.
  • Make sure you have a good maintenance plan to reduce index fragmentation.
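Taken together, the schema changes above might look something like this. This is a sketch only: the constraint and index names are my own, and the switch from nvarchar to varchar assumes your data does not need Unicode.

```sql
-- Sketch of the suggested schema (names and sizes are assumptions):
CREATE TABLE hash_key (
    hash_id            int IDENTITY NOT NULL,  -- int, not bigint: 65M rows fit easily
    hash_value         varchar(100) NOT NULL,  -- varchar if the data is not Unicode
    cri                varchar(100) NOT NULL,
    ref_count_checksum int NOT NULL DEFAULT (0),
    compartment_id     int,
    cg_id              int,
    cri_version_id     int,
    -- clustered primary key on the identity column:
    CONSTRAINT pk_hash_key PRIMARY KEY CLUSTERED (hash_id)
);

-- Enforce the old composite primary key as a unique index instead:
CREATE UNIQUE NONCLUSTERED INDEX ux_hashkey_hashvalue_cgid
    ON hash_key (hash_value, cg_id);
```

With hash_id as the clustered key, the `where hash_id between 1 and 10000` predicate becomes a clustered range seek, so the index hint in the original query should no longer be needed.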

If this query is the main concern for your system, you should be clustering on hash_id and use a secondary index to enforce the primary key.

However, that is a sub-optimisation. More likely, you are burning a lot of CPU cycles on all those string operations. There might be a way to pre-calculate the results of the different values of cri if the input values all fall into a small subset.

Also, depending on the distribution of the hash key, you may not be getting perfect parallelism on a query like this. A simple rewrite could dramatically change how many cores you can utilise in parallel.
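One simple rewrite along those lines is to slice the hash_id range into fixed-size, non-overlapping batches, so the work can be spread across several sessions (or simply kept in smaller, cache-friendly chunks). This is a hypothetical sketch, not the original query; the batch size of 2500 is an arbitrary assumption.

```sql
-- Hypothetical batching sketch: each loop iteration (or each session,
-- if you hand out ranges to multiple workers) scans one slice of hash_id.
DECLARE @start bigint = 1,
        @batch bigint = 2500;   -- assumed batch size, tune to your workload

WHILE @start <= 10000
BEGIN
    SELECT hash_id, hash_value, cri
    FROM hash_key
    WHERE hash_id >= @start
      AND hash_id <  @start + @batch;

    SET @start += @batch;
END;
```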

If you have the opportunity to add columns to the table, why not create persisted, computed columns containing the values for D2, D3 and D4? They can be calculated very cheaply at INSERT time, which will help you amortise the cost of the SELECT batch request.
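As a sketch of that idea, the delimiter positions could become persisted computed columns (column names here are my own; note that in SQL Server a computed column cannot reference another computed column, so each expression must be written out in full):

```sql
-- Hypothetical persisted computed columns that pre-compute the '?'
-- delimiter positions used by the original query's CROSS APPLYs.
-- charindex is deterministic, so PERSISTED is allowed.
ALTER TABLE hash_key ADD
    d2_pos AS charindex('?', cri, 31) PERSISTED,
    d3_pos AS charindex('?', cri, charindex('?', cri, 31) + 1) PERSISTED;
```

The SELECT can then read d2_pos and d3_pos directly instead of recomputing charindex for every row on every run.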

Please add more context to my questions above and I will add details on potential rewrites of this query.
