Simplifying the “repetitive” code of a query involving UNION ALL, WHERE and LIMIT

Posted on

Question :

How can I simplify the code of the query shown below? Requirements are always the same as I describe as follows:

(select * from questions where points = 3 and type = 1 order by rand() asc limit 6)
union all
(select * from questions where points = 2 and type = 1 order by rand() asc limit 4)
union all
(select * from questions where points = 1 and type = 1 order by rand() asc limit 2)
union all
(select * from questions where points = 3 and type = 0 order by rand() asc limit 10)
union all
(select * from questions where points = 2 and type = 0 order by rand() asc limit 6)
union all
(select * from questions where points = 1 and type = 0 order by rand() asc limit 4)

Is there any way to do it without UNION ?

I am just curious if there is a faster/more efficient way of doing it. SQL Server or MySQL, I haven’t decided yet, because I am at design stage right now.

Answer :

You can use the ROW_NUMBER() analytical function to generate row numbers for each partition of (points, type):

SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY points, type ORDER BY rand ()) AS rownum
FROM
  questions

And then you can compare rownum to a certain value that depends on points and type:

WHERE
  rownum <= some_expression

Of course, you will need to use nesting in order to be able to reference rownum like that:

SELECT
  ...
FROM
  (
    SELECT
      ...
      ROW_NUMBER() ... AS rownum
    FROM
      ...
  ) AS derived
WHERE
  rownum <= ...

And you need to keep in mind that because of the nesting the rownum column becomes part of the derived table’s column set – so, specifying * as the main SELECT’s column list will include the rownum column as well. If you want it out of the result, you will have to list each questions column explicitly:

SELECT
  question_id,  /* or whatever the PK column is going to be called */
  points,
  type,
  ...  /* other "questions" columns */
FROM
  (
    ...

Taking all of the above points into account (the last one in particular), I am unsure if the resulting query can be called a simplification of the original, but here is my attempt anyway:

SELECT
  question_id,
  points,
  type,
  ...
FROM
  (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY points, type ORDER BY rand ()) AS rownum
    FROM
      questions
  ) AS derived
WHERE
  rownum <= CASE
              WHEN points = 3 AND type = 1 THEN 6
              WHEN points = 2 AND type = 1 THEN 4
              WHEN points = 3 AND type = 1 THEN 2
              WHEN points = 3 AND type = 0 THEN 10
              WHEN points = 2 AND type = 0 THEN 6
              WHEN points = 1 AND type = 0 THEN 4
            END
;

The CASE expression returns a row limit for each required partition and a NULL for any other, thus effectively providing a filter to eliminate all irrelevant rows (the rows where points and type have values other than those mentioned in the conditions). This kind of filter, however, is not pushed down to the underlying questions table and, therefore, does not prevent the query from generating row numbers for all combinations of (points, type) before filtering out those that are not needed. You can avoid the redundant work by adding an explicit filter to the questions table like this:

  ...
FROM
  (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY points, type ORDER BY rand ()) AS rownum
    FROM
      questions
    WHERE
      points IN (1, 2, 3)
      AND type IN (0, 1)
  ) AS derived
WHERE
  ...

Alternative

Alternatively you can rewrite the entire query in a different way, such that rows are filtered and row limits assigned at the same time. If you use a virtual table like this:

(
  SELECT 3, 1,  6 UNION ALL
  SELECT 2, 1,  4 UNION ALL
  SELECT 1, 1,  2 UNION ALL
  SELECT 3, 0, 10 UNION ALL
  SELECT 2, 0,  6 UNION ALL
  SELECT 1, 0,  4
) AS limits (points, type, rowlimit)

you can join questions to it and thus have both a filter and a way to assign row limits to each partition. The full query will then look like this:

SELECT
  question_id,
  points,
  type,
  ...
FROM
  (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY points, type ORDER BY rand ()) AS rownum
    FROM
      questions AS q
      INNER JOIN
      (
        SELECT 3, 1,  6 UNION ALL
        SELECT 2, 1,  4 UNION ALL
        SELECT 1, 1,  2 UNION ALL
        SELECT 3, 0, 10 UNION ALL
        SELECT 2, 0,  6 UNION ALL
        SELECT 1, 0,  4
      ) AS limits (points, type, rowlimit)
        USING (points, type)
  ) AS derived
WHERE
  rownum <= rowlimit
;

You can then go further and make the limits dataset an actual table in your database. You have mentioned that requirements are not going to alter, but perhaps knowing that you can configure the requirements in the table will change that perspective.

Product-related notes

You mention that you have not decided in which SQL product this is going to be used. Here are some product-related notes to consider.

  1. Both variations of the solution will work in MySQL 8.0 or newer version. Version 8.0 is the minimum because that is where support for ROW_NUMBER() was first introduced in MySQL.

  2. The other product you mentioned, SQL Server, has supported ROW_NUMBER() since version 2005 but other aspects of the solution will require changes in order to make it work in SQL Server.

    2.1. You are using ORDER BY RAND() to randomise the rows selected. In SQL Server, the RAND() function is a runtime constant, therefore you will need to use a different method; ORDER BY NEWID() could be one:

      ROW_NUMBER() OVER (PARTITION BY points, type ORDER BY NEWID()) AS rownum

    2.2. The JOIN ... USING syntax is not supported in SQL Server. You will need to replace it with the JOIN ... ON one. Because of this switch, you will also need to rewrite the nested SELECT column list. The reason is, USING automatically suppresses duplication of the same-name columns used for joining, points and type, when you are using * in the SELECT clause. The ON operator has no such effect, therefore you will need to rewrite the derived query like this:

      (
        SELECT
          q.*,
          limits.rowlimit,
          ROW_NUMBER() OVER (PARTITION BY points, type ORDER BY NEWID()) AS rownum
        FROM
          questions AS q
          INNER JOIN ( ... ) AS limits (points, type, rowlimit)
            ON q.points = limits.points AND q.type = limits.type
      ) AS derived
    

    2.3. While not a required change per se, you can take advantage of SQL Server’s support of the VALUES constructor as a means of defining a derived table to make the limits definition more compact:

      (
        VALUES
          (3, 1,  6),
          (2, 1,  4),
          (1, 1,  2),
          (3, 0, 10),
          (2, 0,  6),
          (1, 0,  4)
      ) AS limits (points, type, rowlimit)
    

The result of the function will not change for the duration of the SELECT statement’s execution.

Leave a Reply

Your email address will not be published. Required fields are marked *