Design star schema for many-many relationship

Posted on

Question :

What are the steps/rules to build a DW star schema design from a production database; specifically, how do you handle many-to-many relationships.

I understand how to take basic data including a many-to-many relationship, and get to a normalized production database:

For example:

If I want to handle sales transactions, given tables Product Entity, Promotion, and Employee, the initial step is to construct a table, SaleTransaction:

SaleTransaction
- TransactionID
- ProductID
- EmployeeID
- SellingDateID
- Quantity
- SaleAmount
- PromotionID

The Promotion entity will be:

Promotion
- PromotionID
- ProductID
- DiscountAmount

However, that would only allow 1 product and 1 promotion per sale transaction. Since we want to allow one or more products, and zero or more promotions:

  • I remove PromotionID, ProductID, Quantity, and (unless required by some business rule) SaleAmount from SaleTransaction

  • I create SaleTransactionDetail:

    - DetailID
    - TransactionID
    - ProductID
    - Quantity
    - SaleAmount
    
  • I will create SaleTransactionPromotion:

    - TransPromoID
    - PromotionID
    

What would be the equivalent steps to take a production DB design to a DW star schema design?

Answer :

Many-to-many tables are typically handled with a “bridge table.” This is really just another name for a many-to-many table like you’d use in a transactional database: one column for an FK to the first table, one column for an FK to the second table, and probably a column or two for housekeeping fields.

You may also want a Weight field in the bridge table, with a value equal to one divided by the number of entries for each sale. This lets you calculate SUM(Weight) instead of COUNT(*), when you want to avoid double-counting orders which had multiple promotions.

Your proposed schema is good as-is, but I think it could be improved slightly; two things come to mind. One is that I would generally collapse SalesTransactions and SalesTransactionDetails into a single table. This introduces redundancy for things like customer IDs, which ought to make your OLTP instincts flinch (transitioning from normalized databases to data warehousing is a paradigm shift!), but in this case it’s worth it. By putting all of your sales into a single table, you simplify joins and, in many databases, signal that you want to use bitmap indices. In a warehouse, you often end up having to do table scans, but that’s OK because your fact tables are very narrow, and fit in memory.

So we have three tables Sales, Promotions, and SalesPromotions. You might consider “order” rather than “sale,” because the latter implies a finality which may not apply. If your system includes orders which have been placed but not shipped or not closed, the more general term makes sense. That’s quibbling, but it never hurts to think carefully about what you name things.

CREATE TABLE Orders
    -- IDs/dimensions:
    OrderID
    EmployeeID
    ProductID
    -- Date dimensions: date placed, date contract signed, date shipped, etc.
    -- Metrics: quantity, gross price, net price, extended price
    -- Attributes: is taxable, currency code, business key

CREATE TABLE Promotions
    PromotionID
    -- Date dimensions: date effective, date expires
    -- Metrics: fixed amount, percentage amount
    -- Attributes: name, type

CREATE TABLE OrderPromotions
    OrderID
    PromotionID
    Weight

ExtendedPrice can be a computed column rather than storing a value. If attributes start to bloat your Orders table, consider a “junk dimension.”

You could include ProductID in Promotions, but only if promotions will only ever apply to a single product. For example, do you have some promotions which are applied to a whole order rather than specific line items? Might a promotion apply to product code ABC1 today, but ABC2 tomorrow, after a price change?

The second caveat I had in mind is IDs. I strongly recommend that you use synthetic keys (SKs) in a data warehouse, rather than relying on IDs from the source system. There are a number of reasons, and this is discussed elsewhere on DBA.StackExchange, but offhand:

  1. A 32- or 64-bit integer for an SK will often be narrower than source system order IDs or customer IDs (e.g., a 144-bit Salesforce ID or other long alphanumeric IDs). This keeps your tables narrow, your secondary indices narrow, and your scans fast.
  2. You can avoid collisions in the source system. If Marketing some day decides to use product key ABC123 for something unrelated to what it was last year, you can give them two different synthetic keys.
  3. In particular, if your company acquires another business, you may find that their IDs don’t match the data types you’ve been using. You’ll need to map them to a common set, and handle collisions. Best to have the infrastructure for that from day one.
  4. You can use magic values for specific meanings which are relevant for reporting, but which aren’t handled in the transactional system. For example, I always use SKs -1 for “N/A” and 0 for “Unknown”. E.g., the Customers.FirstOrderSK field will be populated with -1 for customers who’ve had no orders, and 0 for customers who’ve had an order but for which no more specific info is available yet. Your reports can reflect this without having to put in special logic to handle NULL values.

You can take advantage of synthetic keys in your bridge table by including a “dummy record” for orders without promotions. Populate it with a PromotionID of -1, to indicate that no promotion applied; now consumers of your data can filter for orders sold at full price.

Leave a Reply

Your email address will not be published. Required fields are marked *