I’m using MariaDB.
I’m designing a simple application that can send notifications to subscribers for when certain websites they’ve subscribed to have been updated.
A user can be subscribed to several different sites and multiple users can subscribe to same site.
So both tables are independent in a vacuum.
Simple schema for both, just a
- sites table containing ids and urls
- subscribers table containing ids and phone#s
My tentative flow for efficiency and scalability is something like the following
- Cron job every hour that gets all Sites from sites table
- Makes parallel web requests for all sites
- Detect sites that have been changed (this will be compared against to another column in sites table)
- Alert users of changed sites which sites got updated (prob also send alerts in parallel)
Is introducing a junction table even necessary for this? I feel like it’d introduce complexity bc I’d have to join on that everytime and update that everytime a change has been made to either subscribers table.
Or.. should I just denormalize..
Seems like sites is kind of the “common denominator” here (i.e. I’ll always check if EVERY site has changed, but I don’t necessarily need all subscribers)
So given ^, I was thinking of maybe just adding a Sites.subsriber_ids field..
Or..should I just go for a NoSQL approach?
Any thoughts and recommendations ?
CREATE TABLE provsub ( provider_id int(11) NOT NULL, subscriber_id int(11) NOT NULL, created timestamp NOT NULL DEFAULT current_timestamp(), updated timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(), PRIMARY KEY (provider_id,subscriber_id), KEY subscriber_id (subscriber_id), CONSTRAINT provsub_ibfk_1 FOREIGN KEY (provider_id) REFERENCES providers (id), CONSTRAINT provsub_ibfk_2 FOREIGN KEY (subscriber_id) REFERENCES subscribers (id) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 |
A junction table isn’t going to affect your performance here, especially with a process that only runs at a frequency of every hour.
NoSQL doesn’t buy you any advantage here either, and IMO should only be used when there’s a use case for it that can’t be accomplished in a regular RDBMS.
And as far as why I think denormalization is unnecessary here: Data integrity is the most important thing in a database, IMO. Denormalizing typically leads to data redundancy that is at higher potential risk against integrity. It also leads to heavier tables which can have their own performance implications as well. I don’t really see any reason for you to denormalize away a single simple junction table here. The extra work to maintain it should be trivial and not a performance bottleneck.
An example of where data integrity can come into question is when you redundantly store data such as a
phoneNumber multiple times, because you would have the same record multiple times (for each
site they’re subscribed to) in the denormalized table. When their
phoneNumber changes, you need a transactionally consistent way to update all instances of their
phoneNumber in the table, otherwise you lose data integrity.
Sure, in your simple example, the easy answer is to write an update statement by the
subscriberId, and you’ll usually be covered. (This is just one simple example, because your use case is pretty simple.) But now you’re also updating many records instead of one record to change the
phoneNumber which means more rows need to be located, loaded off disk, locked (which may result in lock escalation on the entire table – not sure if this happens in MariaDB), updated, and transactionally written back to disk. This is one example of performance implications of a denormalized table.