Fastest Aggregate function when you have series of duplicate values

Posted on

Question :

Frequently, I’ll need to crunch down a table into a series of aggregates.

And frequently, it will look like this.

John, Smith, Houston, Texas, Order 55, Balloon, 99 dollars
John, Smith, Houston, Texas, Order 55, Bicycle, 400 dollars

In other words, there’s the “key column” to group by (order ID in this case).

There’s then columns that should be “summed” or “counted” or properly aggregated.

Then there’s a bunch of columns that are just duplicated values. This happens a lot and is not necessarily a normalization problem (or you’re querying a database like this anyway that you don’t admin).

Seems like there’s a few choices here. Max(blah), max(ha) max(whatver).

Or min(every column). Obviously since they’re all the same, it’ll return it.

There’s also “group by” each of those duplicate columns. Naturally, this can provide different results depending on your assumptions/ certainty that the columns will be duplicated.

Then there’s also, in MySQL only (not MS or Oracle) — putting nada, and it’ll just take the first row found.

Which is typically the fastest? I suppose the best way is to test it. Just wondering though.

Answer :

Given the current design, I don’t know if there will be one approach that is noticeably more efficient than any other. You may be able to minimize the pain by creating an index on just OrderID and Price, making it relatively efficient to group by OrderID and SUM(Price), but then to get back the repeated information like name and address, you’re still going to have to go back to the whole table. Perhaps an indexed view on the duplicates (OrderID + demographics), but really you should only be storing that information once in the first place. Then there’s the question of what to do with values that can’t be aggregated and also are absolutely not duplicated, like Balloon and Bicycle. When you have these two rows for the same order, you can collapse the name and address, and SUM() the price, but what do you expect to show for the product?

Why isn’t your design more like a traditional store schema?

  • Customers

    • CustomerID (PK)
    • Name (once!)
    • e-mail
    • PasswordHash
    • etc. etc.
  • Addresses (so customers can have more than one address)

    • AddressID (PK)
    • CustomerID (FK)
    • Street
    • City
    • Region
    • Postcode
    • etc. etc.
  • Products

    • ProductID (PK)
    • Description
    • Price
    • etc. etc.
  • Orders

    • OrderID (PK)
    • CustomerID (FK)
    • Date
    • etc. etc.
  • OrderDetails

    • OrderID (FK)
    • ProductID (FK)
    • Quantity
    • etc. etc.

Leave a Reply

Your email address will not be published.