For an example data set with rows fulfilling both subqueries:
SELECT h.batch, li.id as line_item, d.prusr as price, 0 as advertised FROM schema.lineitem li INNER JOIN schema.detail d on d.line = li.line and d.item = li.item INNER JOIN schema.header h on d.batch = h.batch LEFT JOIN schema.specialoffer sp ON sp.LINE = d.LINE AND sp.ITEM = d.ITEM WHERE h.status = 'T' AND sp.batch IS NULL UNION ALL SELECT h.batch, li.id as line_item, d.prusr as price, 1 as advertised FROM schema.lineitem li INNER JOIN schema.detail d on d.line = li.line and d.item = li.item INNER JOIN schema.header h on d.batch = h.batch INNER JOIN schema.specialoffer sp ON sp.LINE = d.LINE AND sp.ITEM = d.ITEM WHERE h.status = 'T'
What RDBMS will attempt to parallelize this type of query? I’d also be interested in any other “under the hood” sorts of optimizations you may know of, but parallelization is the main focus of this question.
References I have found already:
Can each section of
UNION ALL run concurrently?
YES (oracle 12c+)
NO (per OP’s question)
No for Postgres 10 or earlier
Yes for Postgres 11 currently in Beta
Microsoft SQL Server
Mostly no, when
UNION ALL is implemented by a Concatenation operator. Although not documented (or guaranteed), Concatenation has been observed to always read from its inputs in sequence. Each input may be processed using multiple threads of execution. There may be a period of time where threads from different instances are reading from different inputs as processing moves from one input to the next.
Somewhat yes, when
UNION ALL is implemented by a Concatenation operator and the inputs reside on different servers (
Open() calls are performed asynchronously – Async Concat operator). How much processing occurs during
Open() depends on plan shape.
Mostly yes when
UNION ALL is implemented by a Merge Join Concatenation operator, since each instance (thread) reads from any input as needed to perform an order-preserving union.
Are there other “under the hood” sorts of optimizations?
There are several approaches found in literature regarding common subexpression identification: detecting the same expression in two or more union subqueries. Having done so, you can either use the following equivelence:
(A join B join C) union (A join B join D) -> A join B join (C union D) to push union before joins when you estimate that this plan will be cheaper, or use methods from an area known as multi-query optimization (here obviously we have a single query, but we can regard different subqueries as separate queries) in order to reuse an expression. Again the decision to reuse an expression is not trivial, as doing so often requires materializing the intermediate result and can lead to much worse execution times. Also, in case that several complex common expressions exist in the subqueries, deciding which one should be materialized (or even which subexpression of an expression) affects the decisions for the rest, making the search space during optimization doubly exponential. In any case, examining these issues heavily affects optimization time, and although there are promising research prototypes, I believe that most commercial products do not consider this kind of optimization. Also note that using such methods results in the whole query being “less-parallelizable”, as you can no longer execute each subquery completely independently.