Finding all date ranges that overlap a target date range

Posted on

Question :

Say I wish to store when staff are on holiday (FromDate, ToDate) and then I wish to find all staff that are on holiday between two given dates (QFromDate, QToDate).

Now assume that I have many such records (more than will fit in the server’s RAM) and need to do this query often.

Now assume that I also have sick_leave tables, shift_pattern tables, pay_rate tables, etc – all that have FromDate and ToDate with the requirement to join them based on overlapping dates.

How should I store the date ranges and how can the queries be written to run fast?

(The choice of RDBMS is not fixed, but being able to run on any “standard” RDBMS is of value unless doing so has a large adverse effect.)

I have posted some answers that I have considered, but don’t like! However they may help other people.

Answer :

You can generate a customized index table, for example, that has a row for all year&week pairs that the date range (a certain vacation etc) encompasses. Then you can join dateranges by going through that index table. It will be large, but does avoid large scans as you can just list all vacations that have any weeks that are in common with another vacation, as an inner join.

pseudo example:

create table rangeindex (
  vacation_id bigint, 
  year int, 
  week int, 
  primary key (year,week,vacation_id), 
  index (vacation_id))

select v2.* 
    from vacation v1 
    join rangeindex r1 on r1.vacation_id = v1.vacation_id 
    join rangeindex r2 on r2.year = r1.year and r2.week = r1.week
    join vacation v2 on v2.vacation_id = r2.vacation_id
where v1.vacation_user = ?
    -- and the usual start/end comparisons to filter unwanted pseudo hits

Similar things are used for example spatial coordinate stuff, dividing linear coordinates to grid cells that can be indexed and scanned more easily.

Update: fixed primary key

Overlapping date ranges can be found thus

WHERE FromDate <= QToDate
AND   ToDate   >= QFromDate

You will get one row returned for each partially or fully overlapping range. For example, if you had separate pay rates for morning, afternoon and evening shifts and someone worked all three, there would be three rows returned.

The same basic pattern applies whether your columns are dates, times, datetime or, say, day-of-week. You will have to use the the appropriate DATEPART() calculation on the columns.

Make sure you know whether you require open or closed intervals. In other words, should you use “<” or “<=” for comparison. You may miss or double-count if you make the wrong decision.

As for storage, obviously you must use the types that match your need. Public holidays can be held as date alone since they will never be partial days. Shifts I would guess should be time since they will be the same whatever day they are worked on, or (day of week, time) if that’s what your system has to cater for. Holidays? Most employers allow for fractional days off so datetime may be appropriate.

I would recommend storing both “from” and “to” values for each row rather than just “from” and trying to work out when a value ended from the other rows in the table. You may need some magic values or some NULLs but your SQL will be simpler overall.

Performance will be improved by including the To and From columns in the index(es). Where they should appear in the key order will depend on each table’s primary key and the usage pattern. Be aware that DBMSs may refuse to use indexes on columns wrapped in a function. They become “non-sargable” in the jargon. You may need to pre-process your query predicates to match the columns’ types rather than the other way around.

There is a field of research around temporal databases. Even if they’re outside of your customers’ comfort zones you will gain insight from how these problems have been solved elsewhere.

The problem with a FromDate / ToDate format is that the query generates an index scan.

For example, a staff member takes vacation and returns two weeks later.

ID   FromDate    ToDate
15   2011-01-03  2011-01-17

Imagine many staff members, some with 20 or 30 years of such work history, maybe taking only a day or two at a time several times each year.

select  ...
from    ...
where   StaffID = :Id
   and  FromDate >= :AsOf
   and  ToDate <= :AsOf;

This must examine every record of the specified staff member even though we know that the first hit found will end up being the only result. In some DBMSs, you can specify “stop at the first record found”, but this is very platform dependent and not even available in all platforms. And it still might end up performing an index scan if it starts at the “wrong” end of the index. Either way, every entry must also be examined if the “as of” date is not within any range because they were not on vacation at that time.

Plus there is one slight problem. The small but still existing doubt that an empty result can mean either the staff member was not on vacation or the query failed in some way.

A method that is platform independent and can obtain the result using an index seek, is to enter a record whenever the staff’s vacation status changes with only the one date of when the change took place.

For example, when first employed, a staff starts out “not on vacation” which we can designate ‘O’ for “on duty.”

ID  Status EffDate
15       O 2010-01-04

A year later he takes vacation and returns two weeks later.

ID  Status EffDate
15       O 2010-01-04
15       V 2011-01-03
15       O 2011-01-17

Now you have a date and you want to know if staff #15 was on duty or on vacation at that time.

select  sv1.ID, sv1.Status
from    StaffVacation sv1
where   sv1.StaffID = :Id
   and  sv1.EffDate =(
        select  Max( sv2.EffDate )
        from    StaffVacation sv2
        where   sv2.StaffID = sv1.StaffID
            and  sv2.EffDate <= :AsOf );

The subquery may look like a time-wasting complication but, assuming an index on (StaffID, EffDate), it finds the one result with an index seek. Returning that value to perform a straight comparison, the outer where clause also performs an index seek, but it seeks to the same record found by the subquery so it really doesn’t have to seek very far. That record is still in cache somewhere.

If the “as of” date is the current timestamp, the current status of the staff is shown. Omit the staff id in the where clause and you get the current status of all staff.

There is also the (admittedly small) benefit that, as long as the date was anywhere in the employment span of the staff member, the query will always return something, either ‘O’ or ‘V’. If the index is also defined as unique (no two dates can be the same for any staff — an obvious requirement), it is then not possible to have gaps or overlaps between status changes.

One additional feature that may or may not count as a benefit. If the vacation times are planned in advance (as most are), the “on vacation” and “return to duty” entries may be entered in advance. As soon as the staff member’s application for vacation is approved, the entries are made and no followup is needed — as long as the vacation goes as planned, of course…

Generate a calendar table (that’s googleable) and join your other tables to it. This makes it easy to do overlaps, etc. in an efficient way.

In case it is not clear this is the standard way. I haven’t gone into example usage because it is too involved for an answer here. Search for “calendar table” will give lots of examples and a full explanation to work from.

You don’t need to re-solve the tablescan problem because it is already covered by indexing both of the date fields.

Leave a Reply

Your email address will not be published. Required fields are marked *