Question:
I’m planning to build a chat application. The problem is that there could be billions of records, since there will be lots of chat messages.
I have come up with the idea of making a separate table for each conversation, but that would mean millions of tables. Or should I put everything into a single table, which would mean tens of billions of records?
I’m also thinking about distributing conversations across different servers.
Which of these would be an appropriate solution? Please feel free to guide me if I’m doing this wrong! I can consider different database methodologies as well (like NoSQL).
Answer:
If this is an actual service that you would like to create, and not just an academic exercise, then it would be best to start with some realistic expectations and grow from there. Many years ago I implemented something like this at my day job, where employees could communicate with one another in a Microsoft Teams-like fashion (though before Teams was a thing). The system is still in use today and sees anywhere between 5 and 25 messages per minute between the hours of 7:30am and 2:45am local time. It is used by about 18,500 employees scattered across the globe.
Current statistics (as of 00:00:00 UTC today):
| Item | Number | Note |
|---|---|---|
| Posts | 35,641,804 | As per HR and Legal requirements, no records are deleted; they are instead “soft deleted”. |
| Files | 2,613,017 | Same as above |
| Channels | 74,331 | |
| Accounts | 24,459 | 6,000+ idle/expired accounts |
Each one of these items has its own table, and there are further tables that enable additional functionality, such as `ChannelMember`, which specifies which Accounts are members of a Channel and what level of permission they have (Read-Only / Read-Write).
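The actual schema isn’t shown here, but a minimal sketch of what a `ChannelMember` table like that could look like is below; the column names and types are my own guesses, and only the relationships (Account ↔ Channel plus a permission level) come from the description:

```sql
-- Hypothetical sketch of the ChannelMember join table described above.
-- Column names and types are assumptions, not the real schema.
CREATE TABLE ChannelMember (
    ChannelID  BIGINT UNSIGNED NOT NULL,
    AccountID  BIGINT UNSIGNED NOT NULL,
    CanWrite   TINYINT(1)      NOT NULL DEFAULT 0,  -- 0 = Read-Only, 1 = Read-Write
    PRIMARY KEY (ChannelID, AccountID),
    KEY idx_member_account (AccountID)  -- supports "channels a person belongs to"
);
```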
Posts are saved in Markdown format and have no realistic size limit. People are free to use the space as they see fit and, so long as nobody reports an Account or Channel as being inappropriate, no communications are monitored or shared with management.
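Likewise, a `Post` table consistent with that description (Markdown body with no practical size cap, rows soft-deleted rather than removed) might look roughly like this; again, every name here is invented for illustration:

```sql
-- Hypothetical Post table; names and types are assumptions.
CREATE TABLE Post (
    ID         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ChannelID  BIGINT UNSIGNED NOT NULL,
    AccountID  BIGINT UNSIGNED NOT NULL,
    Body       LONGTEXT        NOT NULL,  -- Markdown, no realistic size limit
    CreatedAt  DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP,
    DeletedAt  DATETIME        NULL,      -- NULL = live; non-NULL = "soft deleted"
    PRIMARY KEY (ID),
    KEY idx_channel_created (ChannelID, CreatedAt)  -- recent posts per channel
);
```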
Search is handled by splitting posts apart on a word-by-word basis and storing unique words in a `PostSearch` table, which is then used for somewhat faster lookups. Because the words are split apart, search results can be “weighed” for their relevance before being returned to the requester.
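A bare-bones version of that approach, with invented names, could be a (Word, PostID) table plus a query that counts how many of the search terms each post matched:

```sql
-- Hypothetical PostSearch table: one row per unique word per post.
CREATE TABLE PostSearch (
    Word    VARCHAR(64)     NOT NULL,
    PostID  BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (Word, PostID)  -- word-first, so term lookups are index range scans
);

-- "Weigh" results: posts matching more of the search terms rank higher.
SELECT PostID, COUNT(*) AS Relevance
  FROM PostSearch
 WHERE Word IN ('chat', 'scaling', 'mysql')  -- placeholder search terms
 GROUP BY PostID
 ORDER BY Relevance DESC
 LIMIT 50;
```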
This runs on a MySQL database with a PHP-based API handling all communication between the front-end applications and the database.
Now the fun stuff:
| Action | Average Response Time |
|---|---|
| Loading Channels a Person Belongs To | 0.18 seconds |
| Loading a Timeline view of recent Posts | 0.21 seconds |
| Loading the most recent 250 Posts in a Channel | 0.13 seconds |
| Publishing a new Post | 0.51 seconds |
| Searching for Posts with 3 terms | 1.33 seconds |
| Searching for Posts with 5+ terms | 1.61 seconds |
These are API response times, meaning each request is first authenticated, then processed, then returned in JSON format.
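For a sense of what could sit behind numbers like these, the “most recent 250 Posts in a Channel” case might be little more than an indexed range scan, using the hypothetical `Post` sketch above:

```sql
-- Hypothetical query behind the "most recent 250 Posts" timing above.
SELECT ID, AccountID, Body, CreatedAt
  FROM Post
 WHERE ChannelID = 42          -- placeholder channel id
   AND DeletedAt IS NULL       -- hide soft-deleted rows
 ORDER BY CreatedAt DESC
 LIMIT 250;
```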
The database server is running MySQL 8.0.23 and Ubuntu 20.04 LTS on a `db.m6g.xlarge` instance on AWS. There are three web servers, all running Apache and Ubuntu 20.04 LTS on `t4g.medium` instances.
Unless you have tens of millions of dollars to market a hot new service that will supplant the existing big players, you may not need to overthink the structure just yet. Build something specific and see whether it catches on. If it does, then you can look at getting it to scale.