How can I define and declare a good PRIMARY KEY for my dataset?

Posted on

Question :

I have a large historical dataset that has the following attributes:

  • productCode: Alphabetic code used to identify a product name, e.g., ‘LE’ (limited edition), ‘LI’ (limited inclusion)
  • productMonth: The month of the product manufacturing (multiple products can be made per month)
  • productPrice: the price level for a particular product
  • dateSold: the day the product was sold

Note: A productCode can have multiple productMonths and multiple productPrice levels. For example, for a productCode ‘LE’:

Note 2: Once a product (e.g., ‘LE’, ‘Mar 16’) is sold at a certain priceLevel (e.g., ‘2.0’) on a certain day, it is no longer available at that priceLevel for the rest of such day. It will be available at the priceLevel again the next day.

This data will be pulled for analysis and the database will also be updated daily. As a beginner I want to make sure I employ best practices. From my research, I have been told that every table should have a primary key. In my case (my dataset below), I’m not sure how to best construct one.

Sample data

In order to exemplify the pieces of information in tabular form, I have provided the following sample data:

 productCode dateSold productMonth price randomData
 ----------- -------- ------------ ----- -----------
 LI            Dec 16       Mar 16  1.65        0.05
 LE            Dec 16       Mar 16  1.65        0.05
 LE            Nov 16       Mar 16  1.65        0.04
 LE            Nov 16       Apr 16  2.00        0.03
 LE            Nov 16       Apr 16  5.00        0.01


I believe I need a PRIMARY KEY (PK) for the case of UPDATE/INSERT into the database, I’d only want to INSERT to the database if there wasn’t an existing row already (having duplicate rows would skew analysis of data). I believe a PK can help identify duplicates if I construct it based on the existing data and check as I INSERT into the database.

A unique ID can be made for every item if I combine the productCode, dateSold, productMonth and price columns. This is because once the product (productcode, productmonth, price) is sold on a particular day (dateSold), it can not be sold again for the rest of the day.

However, I’m not sure if this is the proper way of building a PK or if there is a better key for my particular dataset. Any guidance would be greatly appreciated.

Answer :

I’ve asked some questions in the comments, but in general:

If you have a combination of columns that would make a row unique according to the business rules of the data, you can add those columns to a unique index/constraint and the DBMS will take care of no duplicates. Concatenating the columns and putting them in a PK would not add any more uniqueness and it would impact performance in a negative way by taking up more space and potentially being a wider than needed composite key.

Leave a Reply

Your email address will not be published. Required fields are marked *