Say I wish to store when staff are on holiday (
ToDate) and then I wish to find all staff that are on holiday between two given dates (
Now assume that I have many such records (more than will fit in the server’s RAM) and need to do this query often.
Now assume that I also have
pay_rate tables, etc – all that have
ToDate with the requirement to join them based on overlapping dates.
How should I store the date ranges and how can the queries be written to run fast?
(The choice of RDBMS is not fixed, but being able to run on any “standard” RDBMS is of value unless doing so has a large adverse effect.)
I have posted some answers that I have considered, but don’t like! However they may help other people.
You can generate a customized index table, for example, that has a row for all year&week pairs that the date range (a certain vacation etc) encompasses. Then you can join dateranges by going through that index table. It will be large, but does avoid large scans as you can just list all vacations that have any weeks that are in common with another vacation, as an inner join.
create table rangeindex ( vacation_id bigint, year int, week int, primary key (year,week,vacation_id), index (vacation_id)) select v2.* from vacation v1 join rangeindex r1 on r1.vacation_id = v1.vacation_id join rangeindex r2 on r2.year = r1.year and r2.week = r1.week join vacation v2 on v2.vacation_id = r2.vacation_id where v1.vacation_user = ? -- and the usual start/end comparisons to filter unwanted pseudo hits
Similar things are used for example spatial coordinate stuff, dividing linear coordinates to grid cells that can be indexed and scanned more easily.
Update: fixed primary key
Overlapping date ranges can be found thus
WHERE FromDate <= QToDate AND ToDate >= QFromDate
You will get one row returned for each partially or fully overlapping range. For example, if you had separate pay rates for morning, afternoon and evening shifts and someone worked all three, there would be three rows returned.
The same basic pattern applies whether your columns are dates, times, datetime or, say, day-of-week. You will have to use the the appropriate
DATEPART() calculation on the columns.
Make sure you know whether you require open or closed intervals. In other words, should you use “<” or “<=” for comparison. You may miss or double-count if you make the wrong decision.
As for storage, obviously you must use the types that match your need. Public holidays can be held as
date alone since they will never be partial days. Shifts I would guess should be
time since they will be the same whatever day they are worked on, or (day of week, time) if that’s what your system has to cater for. Holidays? Most employers allow for fractional days off so
datetime may be appropriate.
I would recommend storing both “from” and “to” values for each row rather than just “from” and trying to work out when a value ended from the other rows in the table. You may need some magic values or some NULLs but your SQL will be simpler overall.
Performance will be improved by including the To and From columns in the index(es). Where they should appear in the key order will depend on each table’s primary key and the usage pattern. Be aware that DBMSs may refuse to use indexes on columns wrapped in a function. They become “non-sargable” in the jargon. You may need to pre-process your query predicates to match the columns’ types rather than the other way around.
There is a field of research around temporal databases. Even if they’re outside of your customers’ comfort zones you will gain insight from how these problems have been solved elsewhere.
The problem with a FromDate / ToDate format is that the query generates an index scan.
For example, a staff member takes vacation and returns two weeks later.
ID FromDate ToDate 15 2011-01-03 2011-01-17
Imagine many staff members, some with 20 or 30 years of such work history, maybe taking only a day or two at a time several times each year.
select ... from ... where StaffID = :Id and FromDate >= :AsOf and ToDate <= :AsOf;
This must examine every record of the specified staff member even though we know that the first hit found will end up being the only result. In some DBMSs, you can specify “stop at the first record found”, but this is very platform dependent and not even available in all platforms. And it still might end up performing an index scan if it starts at the “wrong” end of the index. Either way, every entry must also be examined if the “as of” date is not within any range because they were not on vacation at that time.
Plus there is one slight problem. The small but still existing doubt that an empty result can mean either the staff member was not on vacation or the query failed in some way.
A method that is platform independent and can obtain the result using an index seek, is to enter a record whenever the staff’s vacation status changes with only the one date of when the change took place.
For example, when first employed, a staff starts out “not on vacation” which we can designate ‘O’ for “on duty.”
ID Status EffDate 15 O 2010-01-04
A year later he takes vacation and returns two weeks later.
ID Status EffDate 15 O 2010-01-04 15 V 2011-01-03 15 O 2011-01-17
Now you have a date and you want to know if staff #15 was on duty or on vacation at that time.
select sv1.ID, sv1.Status from StaffVacation sv1 where sv1.StaffID = :Id and sv1.EffDate =( select Max( sv2.EffDate ) from StaffVacation sv2 where sv2.StaffID = sv1.StaffID and sv2.EffDate <= :AsOf );
The subquery may look like a time-wasting complication but, assuming an index on (StaffID, EffDate), it finds the one result with an index seek. Returning that value to perform a straight comparison, the outer
where clause also performs an index seek, but it seeks to the same record found by the subquery so it really doesn’t have to seek very far. That record is still in cache somewhere.
If the “as of” date is the current timestamp, the current status of the staff is shown. Omit the staff id in the
where clause and you get the current status of all staff.
There is also the (admittedly small) benefit that, as long as the date was anywhere in the employment span of the staff member, the query will always return something, either ‘O’ or ‘V’. If the index is also defined as unique (no two dates can be the same for any staff — an obvious requirement), it is then not possible to have gaps or overlaps between status changes.
One additional feature that may or may not count as a benefit. If the vacation times are planned in advance (as most are), the “on vacation” and “return to duty” entries may be entered in advance. As soon as the staff member’s application for vacation is approved, the entries are made and no followup is needed — as long as the vacation goes as planned, of course…
Generate a calendar table (that’s googleable) and join your other tables to it. This makes it easy to do overlaps, etc. in an efficient way.
In case it is not clear this is the standard way. I haven’t gone into example usage because it is too involved for an answer here. Search for “calendar table” will give lots of examples and a full explanation to work from.
You don’t need to re-solve the tablescan problem because it is already covered by indexing both of the date fields.