I’m scraping data from a website and they have pages for Users and Groups. The pages are pretty similar; id, name, description, created date, number of elements and custom background.
They also have a Friendlist for Users and a Memberlist for Groups, which besides names, are pretty much the same. I have already normalized that bit.
But the User page also contains other snippets.
I started sketching out the tables Users, Groups and Page. But now I’m starting to think if I’m over-normalizing this. Because truly, it’s not the same data that is being stored.
Should I go for 2 tables, each containing fields like page_id, page_name or should I add pivot table that contains all page data (with a side-effect of having some fields NULLable for the Groups pages)
Answering strictly in terms of MySQL, you do not want to overnormalize.
Here is why ;
REASON #1 : JOIN BUFFER
There is an in-memory buffer called the join buffer. Its size is regulated in RAM by the join_buffer_size option per DB Connection. The more normalized the data are, the more
JOIN clauses. In turn, the more
JOIN clauses, the more join buffers have to be allocated. If a join buffer is too small, it gets migrated to disk. This slows down the query’s execution by
- pausing the query to write the join buffer to disk
- generating more disk I/O while executing the join
REASON #2 : QUERY EVALUATION
Because MySQL executes SQL through external storage engines, query evaluation for WHERE clauses and JOIN clauses are virtually identical. I wrote about this back on March 11, 2013 (Is there an execution difference between a JOIN condition and a WHERE condition?). The algorithm for JOINs are also in the MySQL Documentation. The JOIN algorithm can get rather wonky with multiple JOIN clauses.
REASON #3 : STORAGE ENGINE PERFORMANCE
Any storage engine in MySQL is basically comprised of 14 operations. Those operations, including JOINs, are layered in such a way that performance issues can occur in two layers
- Query Evaluation (See Reason #2)
- Storage Engine
- If you JOIN using large columns
- If you use the wrong Storage Engine
- If you mix Storage Engines in the JOIN
It is strictly up to you how deep you want to normalize your data. If you must normalize and do multiple JOINs, please join only integer values only. Joining on character values will have negative effects on the three reasons I just mentioned.
Please someone make this a comment:
I am not sure exactly what the definition of the tables would be that you are proposing. And I also do not have a full understanding of the data you are describing. Would you please add some DDL or table/column lists to your question with the two schemas you are considering?