Single table for all chat conversations or separate tables? [closed]

Question:

I’m planning to make a chat application. The problem is that there can be billions of records, since there will be lots of chat messages.

One idea I have come up with is to create a separate table for each conversation, which would mean millions of tables. Or should I put every message into a single table, which would mean tens of billions of records in that one table?

I’m also thinking about distributing conversations across different servers.

Which of these would be an appropriate solution? Please feel free to guide me if I’m doing this wrong! I’m open to different database technologies as well (like NoSQL).

Answer:

If this is an actual service you would like to create, and not just an academic exercise, it is best to start with some realistic expectations and grow from there. Many years ago I implemented something similar at my day job, letting employees communicate with one another in a Microsoft Teams-like fashion (though before Teams was a thing). The system is still in use today and sees anywhere between 5 and 25 messages per minute between the hours of 7:30am and 2:45am local time. It is used by about 18,500 employees scattered across the globe.

Current statistics (as of 00:00:00 UTC today):

Item     | Count      | Note
Posts    | 35,641,804 | No records are deleted; per HR and Legal requirements they are “soft deleted” instead.
Files    | 2,613,017  | Same as above
Channels | 74,331     |
Accounts | 24,459     | Includes 6,000+ idle/expired accounts

Each one of these items has its own table, plus there are supporting tables that enable extra functionality, such as ChannelMember, which specifies which Accounts are members of a Channel and what level of permission they have (Read-Only / Read-Write).

Posts are saved in Markdown format and have no realistic size limit. People are free to use the space as they see fit and, so long as nobody reports an Account or Channel as being inappropriate, no communications are monitored or shared with management.
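To make the layout concrete, here is a minimal MySQL sketch of that kind of schema. The table names mirror the entities described above and the body column holds the Markdown text, but every column name, type, and index is an illustrative assumption rather than the actual schema.

```sql
-- Illustrative sketch only: table names follow the entities described in the
-- answer, but all columns, types, and indexes are assumptions.
CREATE TABLE Account (
    id           BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    display_name VARCHAR(255) NOT NULL,
    is_deleted   TINYINT(1) NOT NULL DEFAULT 0      -- "soft delete" flag
);

CREATE TABLE Channel (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(255) NOT NULL,
    is_deleted TINYINT(1) NOT NULL DEFAULT 0
);

-- Which Accounts belong to which Channels, and with what permission.
CREATE TABLE ChannelMember (
    channel_id BIGINT UNSIGNED NOT NULL,
    account_id BIGINT UNSIGNED NOT NULL,
    permission ENUM('read-only', 'read-write') NOT NULL DEFAULT 'read-write',
    PRIMARY KEY (channel_id, account_id),
    FOREIGN KEY (channel_id) REFERENCES Channel (id),
    FOREIGN KEY (account_id) REFERENCES Account (id)
);

CREATE TABLE Post (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    channel_id BIGINT UNSIGNED NOT NULL,
    account_id BIGINT UNSIGNED NOT NULL,
    body       MEDIUMTEXT NOT NULL,                 -- Markdown source of the post
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    is_deleted TINYINT(1) NOT NULL DEFAULT 0,
    FOREIGN KEY (channel_id) REFERENCES Channel (id),
    FOREIGN KEY (account_id) REFERENCES Account (id),
    KEY idx_channel_created (channel_id, created_at)
);
```

A single Post table with a composite index along the lines of (channel_id, created_at) is what keeps “most recent posts in a Channel” queries fast even at tens of millions of rows, without splitting anything into per-conversation tables.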

Search is handled by splitting posts apart on a word-by-word basis and storing the unique words in a PostSearch table, which is then used for somewhat faster lookups. Because the words are split apart, search results can be weighted for their relevance before being returned to the requester.
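Continuing the sketch above, the word index and a relevance-weighted lookup could look something like this; PostSearch is the name used in the answer, while the columns and the query itself are assumptions for illustration.

```sql
-- Hypothetical word index: one row per unique (word, post) pair.
CREATE TABLE PostSearch (
    word    VARCHAR(64)     NOT NULL,
    post_id BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (word, post_id),
    FOREIGN KEY (post_id) REFERENCES Post (id)
);

-- Posts matching more of the search terms are weighted higher.
SELECT post_id, COUNT(*) AS matched_terms
FROM PostSearch
WHERE word IN ('chat', 'schema', 'scaling')   -- example search terms
GROUP BY post_id
ORDER BY matched_terms DESC
LIMIT 250;
```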

This runs on a MySQL database with a PHP-based API handling all communication between the front-end applications and the database.

Now the fun stuff:

Action                                         | Average Response Time
Loading Channels a Person Belongs To           | 0.18 seconds
Loading a Timeline view of recent Posts        | 0.21 seconds
Loading the most recent 250 Posts in a Channel | 0.13 seconds
Publishing a new Post                          | 0.51 seconds
Searching for Posts with 3 terms               | 1.33 seconds
Searching for Posts with 5+ terms              | 1.61 seconds

These are API response times, meaning each request is first authenticated, then processed, and the result is returned as JSON.

The database server is running MySQL 8.0.23 and Ubuntu 20.04 LTS on a db.m6g.xlarge instance on AWS. There are three web servers, all running Apache and Ubuntu 20.04 LTS on t4g.medium instances.

Unless you have tens of millions of dollars to market a hot new service that will supplant the existing big players, you probably don't need to overthink the structure just yet. Build something specific and see whether it catches on or not. If it does, then you can look at getting it to scale.
