This question is more about performance.
Assume that there are two tables in a DB like MySQL, one with millions of records and one with some thousands of records. With this example I am trying to show a very unbalanced situation. So the question is: how expensive would the (inner) join operation be? (A join on, for example, user_goup from the first table and id from the second.) Do you think that this architecture can be scaled to more records in the tables?
… one with millions of records and one with some thousands of records … a very unbalanced situation … how expensive would be the join operation?
It depends entirely on the Query, how “good” the Indexes are on the fields that you’re joining on, and what else is going on in the database at the same time.
Say you have a banking application.
You can join from the “Accounts” table (with thousands of rows) to the “Transactions” table (millions of rows) using the Account Number field. As long as that’s indexed on both tables (PK in Account, FK in Transactions) then the query will run [very] well. Databases all over the world run like this every day.
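The Accounts-to-Transactions join above can be sketched end to end. This is a minimal illustration (SQLite here rather than MySQL, and the table and column names are assumptions, not a real banking schema): the PK side is indexed automatically, and the FK side gets an explicit index so the join can seek rather than scan.

```python
import sqlite3

# Hypothetical banking schema illustrating the Accounts -> Transactions join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_number INTEGER PRIMARY KEY,   -- PK: indexed automatically
        holder TEXT
    );
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        account_number INTEGER REFERENCES accounts(account_number),
        amount REAL
    );
    -- Index the FK side so the join can seek instead of scanning.
    CREATE INDEX idx_tx_account ON transactions(account_number);
""")
conn.execute("INSERT INTO accounts VALUES (1, 'Alice'), (2, 'Bob')")
conn.executemany(
    "INSERT INTO transactions(account_number, amount) VALUES (?, ?)",
    [(1, 10.0), (1, -5.0), (2, 99.0)])

rows = conn.execute("""
    SELECT a.holder, t.amount
    FROM accounts a
    INNER JOIN transactions t ON t.account_number = a.account_number
    WHERE a.account_number = 1
""").fetchall()
print(rows)  # Alice's two transactions
```

With both sides indexed, the database resolves the join with index lookups, which is why this pattern stays fast even as the Transactions table grows.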
The trouble comes when you try to join on fields that aren’t indexed or that aren’t good choices for indexes (say an “Account Active” flag that only has two values – it’s essential, yes, but on its own, it’s a very poor discriminator to choose between records).
The trouble can also come from a [poorly-written] query, where something about it “tricks” the database into doing things the “wrong” way around – putting some value in the Transactions table through a function might force the database to Table Scan that first, before joining to the other.
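The “putting a value through a function” trap can be demonstrated with the optimizer’s own plan output. This sketch uses SQLite’s EXPLAIN QUERY PLAN (MySQL’s EXPLAIN serves the same purpose); the schema is a made-up example. Wrapping the indexed column in a function hides it from the index, so the plan flips from an index SEARCH to a full SCAN.

```python
import sqlite3

# Minimal sketch: a function applied to an indexed column defeats the index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (account_number INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_tx_account ON transactions(account_number)")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the plan detail text.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Bare column: the optimizer can use the index (a SEARCH).
print(plan("SELECT * FROM transactions WHERE account_number = 1"))
# Column wrapped in a function: the index no longer applies (a SCAN).
print(plan("SELECT * FROM transactions WHERE abs(account_number) = 1"))
```

The same query shape against a millions-of-rows table turns that SCAN into the dominant cost.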
The only real way to find out how things are going to work is to Test them.
Run your queries against a representative database, preferably under a representative workload.
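In that spirit, a rough test harness can be as simple as timing the same join before and after adding the index. This is a sketch with invented sizes and schema (SQLite in memory, far smaller than a real workload); the point is the method, not the numbers.

```python
import random
import sqlite3
import time

# Rough harness: time one join query before and after indexing the FK side.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE transactions (account_number INTEGER, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(random.randrange(1000), 1.0) for _ in range(200_000)])

QUERY = """SELECT count(*) FROM accounts a
           JOIN transactions t ON t.account_number = a.account_number
           WHERE a.account_number = 42"""

def timed(sql):
    start = time.perf_counter()
    result = conn.execute(sql).fetchone()[0]
    return result, time.perf_counter() - start

before, t_before = timed(QUERY)
conn.execute("CREATE INDEX idx_tx_account ON transactions(account_number)")
after, t_after = timed(QUERY)
print(f"without index: {t_before:.4f}s  with index: {t_after:.4f}s")
assert before == after  # same answer either way; only the cost changes
```

On a representative copy of your real data, the gap between the two timings tells you what the index is actually buying you.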
It would depend more on the distribution of data in both tables. The total number of records in each table is much less of a factor here. If Table1, which has millions of rows, only had 1 row that matched on the JOIN conditions to Table2 (the table with thousands of rows), and an INNER JOIN was used, then only 1 row would be returned in the result set. (This is known as the cardinality – the number of rows that match your predicates.) With proper indexing this should be achievable in milliseconds. This is true for almost any number of total rows in both tables (e.g. even if the table had billions of rows).
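That cardinality point can be shown concretely. In this sketch (SQLite, with a deliberately small “large” table standing in for millions of rows), Table1 has many rows but only one of them satisfies the join predicate, so the indexed INNER JOIN returns a single row no matter how big Table1 gets.

```python
import sqlite3

# Cardinality sketch: a big table where only one row matches the join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, user_goup INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY)")
# Many rows in table1, but only row 0 points at a group present in table2.
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(i, 1 if i == 0 else 999) for i in range(100_000)])
conn.execute("INSERT INTO table2 VALUES (1)")
conn.execute("CREATE INDEX idx_t1_goup ON table1(user_goup)")

rows = conn.execute("""
    SELECT t1.id FROM table1 t1
    INNER JOIN table2 t2 ON t1.user_goup = t2.id
""").fetchall()
print(rows)  # only the single matching row comes back
```

The index on user_goup lets the database find that one matching row directly instead of inspecting every row in table1.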
But there are a lot of factors that influence the outcome above which your question leaves out, so that’s just one example. If the above tables weren’t properly indexed (e.g. on user_goup and on id for each table respectively) then you’d possibly run into a full table scan, which would be a lot slower on the table with millions of records. The type of JOIN being done would also result in a different outcome, which could exhibit different performance from my above example; and again, depending on how the data is distributed and how much data is actually being returned, that would also impact the total runtime.
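To make the last point concrete: the JOIN type alone changes how many rows come back, and therefore how much work the query does. A tiny sketch (invented data, SQLite) comparing INNER JOIN against LEFT JOIN on the same tables:

```python
import sqlite3

# Same tables, same predicate - different JOIN types, different row counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, user_goup INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
conn.execute("INSERT INTO table2 VALUES (10)")

inner = conn.execute("""SELECT count(*) FROM table1 t1
                        JOIN table2 t2 ON t1.user_goup = t2.id""").fetchone()[0]
left = conn.execute("""SELECT count(*) FROM table1 t1
                       LEFT JOIN table2 t2 ON t1.user_goup = t2.id""").fetchone()[0]
print(inner, left)  # 1 3 -- LEFT JOIN keeps the unmatched table1 rows
```

An INNER JOIN returns only matching rows, while a LEFT JOIN preserves every row from the left table, so the result-set size (and the runtime that goes with it) can differ substantially for the same data.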