SQL Server one-to-many and index fragmentation

Posted on

Question :

I’m currently writing code to update an entities child collection (one-to-many relationship), and while thinking about how to write code to determine which entities have been added/removed/modified, I realized that not only is it easier to just recreate the whole list, but probably better for SQL server performance?

Let’s say I have a table Student with a one-to-many relationship to Course, and I have there 2 courses Math and Physics, with primary keys 1 and 2 respectively, and obviously a foreign key to Student which is 1.

If I one day decide to update the first course, remove the other, and add a new one, I would end up with the following entities

  • Math (upated), primary key is: 1
  • Physics (deleted), primary key was: 2
  • History (added), primary key is: 5000

Now the 2 courses are no longer next to each other, and this probably causes big performance problems in the long run since my keys are now fragmented, is this correct?

A related question is whether I should have a primary key on the Course table at all? I usually add it by default to all of my (except association) tables, but in this case I never query courses individually, but always in the context of a student. Does it make sense to keep the primary key, even if I know I’ll never reference it from anywhere? Would the lack of a primary key hurt performance?

EDIT: I’ve decided to keep the primary key, in-case in the future I do need to query this child table without the parent.

My primary keys are integers and have a clustered index (default on SQL Server) and I’m using SQL Server 2016, and Entity Framework 6 as my ORM.

UPDATE:

Here’s my CREATE TABLE from SSMS. I’m using EF Code-First so I didn’t write any of this by hand.

CREATE TABLE [dbo].[Courses](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Description] [nvarchar](max) NOT NULL,
    [StudentId] [uniqueidentifier] NOT NULL,
    [Fk1Id] [int] NOT NULL,
    [Fk2Id] [int] NOT NULL,
    [Fk3Id] [int] NULL,
 CONSTRAINT [PK_dbo.Courses] PRIMARY KEY CLUSTERED 
 (
     [Id] ASC
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Courses]  WITH CHECK ADD  CONSTRAINT [FK_dbo.Courses_dbo.Students_StudentId] FOREIGN KEY([StudentId])
REFERENCES [dbo].[Students] ([Id])
GO

ALTER TABLE [dbo].[Courses] CHECK CONSTRAINT [FK_dbo.Courses_dbo.Students_StudentId]
GO
--Omitted other alter tables for Fk1Id,Fk2Id, and Fk3Id

StudentId is the foreign key that I will always use to fetch these rows, they are not relevant in any other context (at least for now).

I have renamed the table and column names for privacy reasons, and removed 3 foreign key columns for simplicity! I hope they’re not too relevant.

Answer :

Let’s suppose student 1 takes Math. We get a Course.Id value of 1. What if student 2 also wants to become more numerate? Should she study “Math”, “Maths” or “Mathematics”? Is there a list somewhere of the agreed available subjects? Conversely if the last student stops studying History does your history department disappear? I believe this is a hole in your analysis. The tables should be

Student
  StudentId
  StudentName
  .. other values

Subject
  SubjectId,
  SubjectName,
  .. other values

Enrolment
  EnrolmentId,
  StudentId,
  SubjectId,
  .. others

Enrolment is the intersection of Student and Subject. It is what you have called Course(s). As an intersection it could use the two foreign keys (StudentId and SubjectId) as the primary key. There is no problem with having multi-column primary keys. There are a lot of good reasons to design the key this way. It would be my preferred implementation.

As with any table, it could stick with it’s natural key or also be given a surrogate key, which is what EnrolmentId is. While useful for many reasons it is not a requirement to have a surrogate primary key on any table. It is certainly not a requirement for the surrogate key to be the clustered key.

For performance reasons you do, indeed, want to keep together on disk those rows which are read together. Since you’re using SQL Server you can achieve this by a clustered index. So which rows do you read together? From your question it would seem that the most important query is to list courses for a student. Therefore clustering Enrolment by StudentId will be the best idea. When a student drops or adds a course the new row will be beside that student’s existing rows, because that’s how a clusterd index works.

Maybe in future it will become important to list the students which are in a given course. Although the clustered index has been assigned, a second, covering index could be defined.

create unique nonclustered index IX_by_subject on Enrolment
(
  SubjectId,
  StudentId
);

Although not defined as clustered it holds all the values required to satisfy a large class of important queries. As students take up and drop subjects SQL Server will keep together on disk all the rows associated with one subject. This will be a different bit of disk than where the base table data (i.e. the clustered index) is held.

For more complex tables SQL Server supports the INCLUDE clause. Using this one can create what are effectively further clustered indexes that are updated by SQL Server to remain in sync with the base data.

With students studying a few subjects, and the tables indexed as above, fragmentation will not be a concern. Each student’s rows will most likely be on one page, and never more than on two pages. Performing sensible, measured index maintenance is good practice.

A related question is whether I should have a primary key on the
Course table at all? I usually add it by default to all of my (except
association) tables, but in this case I never query courses
individually, but always in the context of a student. Does it make
sense to keep the primary key, even if I know I’ll never reference it
from anywhere? Would the lack of a primary index hurt performance?

Every table should have Primary Key/Clustered Index.Since ID is seldom use in join condition.You can drop Clustered index from ID column.

Make Studentid +courseID in Course table as Clustered index.

In Future or Present, Studentid will be most frequently use in join condition and in all important query.On ther hand Course ID column will be very less use and not so important query.

I think [StudentId] [uniqueidentifier] should have been INT or Bigint.

Now the 2 courses are no longer next to each other, and this probably
causes big performance problems in the long run since my keys are now
fragmented, is this correct?

You can rebuild index.It is very normal in database.So when index is rebuild then id=1 and next id=5000 will be very much together.

First of all you can reconsider [Description] [nvarchar](max) NOT NULL

If it is possible to make it VARchar (2000),(3000) etc.

Rebuilding an index drops and re-creates the index. This removes fragmentation, reclaims disk space by compacting the pages based on the specified or existing fill factor setting, and reorders the index rows in contiguous pages.

Is manually keeping the Id's together bad practice? 

Just because there is gap in ids that doesn’t mean it is Index Fragmentation or it will hurt performance in any way.Gap in id is ok.Even if ids are continuous there will be index fragmentation.

Index Fragmentation

Leave a Reply

Your email address will not be published. Required fields are marked *