What index should I use for a read-only database?

Posted on

Question :

The database looks like this:

http://i.imgur.com/GLWohG3.png

Parameters has 126000 rows, which link an ExperimentID to the starting parameters for that ExperimentID. Simulations has ~107M rows, but I really only query against Xa (a floating point value with mostly unique entries). I’m interested in the other data, but for now it’s just there. The current indices I have are the Primary Key index on Simulations, which is clustered, and a nonclustered columnstore index on the following columns:

http://i.imgur.com/MpoKJS5.png

Here’s an example query I might be running against this database:

USE IMTsimulations
SELECT DISTINCT p.ControlRange, p.SensorError,
    PERCENTILE_DISC(0) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_MIN,
    PERCENTILE_DISC(.25) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Quartile25,
    PERCENTILE_DISC(.5) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Median,
    PERCENTILE_DISC(.75) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Quartile75,
    PERCENTILE_DISC(1) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_MAX,
    AVG(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Mean,
    STDEV(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_STDEV,
    COUNT(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as TotalEvents

FROM dbo.Simulations s JOIN Parameters p ON s.ExperimentID = p.ExperimentID
WHERE s.TimeElapsed <= 7200 and p.ExogDexCurve =  4
ORDER BY p.ControlRange, p.SensorError

As you can see, I’ll be mostly looking at aggregates of Xa according to certain starting parameters. I feel like my primary key shouldn’t be cluster, and the columnstore index should be clustered, and it should only have Xa, but I don’t know for sure. The dataset is complete, so there will be no INSERTs, UPDATEs, DELETEs, or anything like that on either table.

What indexes should I be building for this thing? What’s going to let me efficiently query aggregates on Xa that are partitioned according to things in the Parameters table?

I’d appreciate any depth you go into, as I’m doing this on my own, and don’t have much technical background with databases.

Answer :

You said, that TimeElapsed is not really filtering much, so there is no point in indexing it.

Just looking at the query structure, I’d add an index to Parameters table on
(ExogDexCurve, ControlRange, SensorError). The order of columns is important.

I would also make sure that all indexes have Fill Factor=100, because the data is static.

But, overall, the main bottleneck is reading all Xa values for the needed ControlRange and SensorError.

The server has to read all this data. Storing it in columnstore index should help.


I would also consider denormalizing the data and making just one table (persisting the join). In this case I’d create a covering index. The order of columns is important.

(ExogDexCurve, ControlRange, SensorError, Xa) INCLUDE (TimeElapsed)

Xa is part of an index rather than included, because you order by Xa.

With one joined table and this covering index the query should scan the index without extra sorts.

Also, to “hide” the persisted joined table it may be possible to create an indexed view of the join instead of explicit table, but I’m not sure.

Leave a Reply

Your email address will not be published. Required fields are marked *