Let’s say there’s a composite primary key on (field1, field2)
Also, let be X a theoretical set of n numbers
Inserts look like this:
INSERT INTO table(field1, field2) VALUES([a number picked at random from X], [an incrementing number]);
Every m inserts, a number is randomly removed from X, and another one is
Let’s say n=1000 and m=500. In other words, every 500 inserts, one element in set X changes, but set X always has 1000 elements.
How badly will the B+tree become fragmented?
In the case of an index, fragmentation can vary greatly. Yet, nothing can be worse that entering data into an index already sorted.
To illustrate, let’s take the worse case BTREE, a balanced binary tree. A balanced binary tree is a binary tree that monitors the height of its child nodes and will only allow for a difference in heights of :
-1(Left TreeNode Taller Than the Right TreeNode)
0(Left TreeNode and Right TreeNode Same Height)
+1(Left TreeNode Shorter Than the Right TreeNode)
- See Definition of AVLTrees
Over 25 years ago, I learned that entering already ordered data into a balanced binary tree results in 45% of the tree nodes requiring rotation and rebalancing.
Common sense would have it that if I enter the numbers 1-15 ordered as follows:
into a balanced binary tree, absolutely no tree rebalancing would ever occur.
This being the case, entering random data into a balanced binary tree could never be as bad as entering already sort data. On average, random ordering would require less tree rebalancing than sequential ordering.
Now, let’s get away from balanced binary trees into something more real-world, BTREEs. Treenodes in BTREE indexes usually hold some power of two index page entries + pointers to a lower BTREE node. Page splitting would occur in the event pages become full. Consequently, many early BTREE pages would end up half full due to page splitting (The execption to this would be with bulk inserts and in-memory sorting of keys and writing out more compact index pages with as little fragmentation as possible). Other than that exception, insertion of ordered data in a BTREE page would cause page splits with a very small degree of regularity. Random inserts would at the very least wreak less havoc than loading ordered data.
In terms of InnoDB, you have the gen_clust_index. It has hooks to another type of index used internally for tracking row IDs. Never think about the gen_clust_index as just any other BTREE. It is what the Oracle world would call an index-organized table. Column ordering for InnoDB is handled so differently from MyISAM that can you detect a notable difference if you attempt to reorder row data in a specific key order.
According to pages 148,149 of MySQL Database Design and Tuning
it shows that doing
ALTER TABLE tblname ORDER BY column-list; would greatly improve retrieval for MyISAM tables by physically ordering rows. In contrast, doing
ALTER TABLE tblname ORDER BY column-list; would have no effect on InnoDB tables because data is always ordered by RowID within the clustered key.
MORAL OF THE STORY
When it comes to InnoDB, do not worry about randomness of data. No matter what machinations you perform on tables indexes, the clustered key has the final say on access. Whatever order you load the data, gen_clust_index will abide by it. No questions asked.