Totally aware that there are very similar questions, went through each of them, but those are pretty simple/basic solutions from where i took some things and ideas to create below entity relationship diagram, according to my needs. What i’m up to is to create database schema (postgresql) for chat/conversation history, where i need to support couple of tricky things, such as:
- Chat have a name and initial creator (
- Chat message created by
customer_idcan either be a plain text message or a file, while
those two are sharing some of the columns, but in a sense those are
different in a way that file can have
external_pathlike location where it is saved (e.g file system),
and file can be marked as being
relevantor not, by anybody in
conversation. Due to many differences between the two, to me at least, seems
more natural to divide the two and to have “common parent” table,
chat_lines. Other approach would be to have single table with many NULL values or so. Highly interested to see your opinion about it.
- One of the biggest worries that i have is that chat participant being
reply_toexisting chat line or start a thread on existing
chat line. I have no better idea than to make
chat_linestable having two separate
references to itself. I do not see any better way of how to model threads and replies.
- Each message can be
seenby anybody in discussion, hence
seenstable. Not quite sure is
- Finally anybody in conversation/chat can react to each
thumb down, no more than that, so ENUM seems like
a nice approach.
After hours and hours of thinking i came to this:
One of the worries here for me is that
chat lines table have become like a
God table, everything is concentrated around that table and probably every future table will be related to it. Further, each
SELECT query must
JOIN literally everything in order to rebuild chat history. Worrying about performances a lot, would expect that
chat_lines table will become huge over time, just imagine that one single chat between 5 to 10 participants can have around 1k-5k chat lines, with few file uploads along the way. Looking for proper
indexes on those table. Each
SELECT query that i can imagine now will be probably using heavily
customer ids (like participants of the chat). So thinking to create some
indexes on those.
Of course, whatever participant can search for mentioned things. When it comes to plain text messages, seems like way to go is postgresql
text column in
messages table. Participant will be able to search against
file_name when it comes to file uploads, so seems reasonable to have index on
Please sorry for a little bit “opinionated question”, highly interested in your opinion about how much this solution will scale in the future, would highly appreciated any advice what could be changed. Any other idea how replies and threads could be modeled? Where you would add indexes? Whatever i do, i guess database partions based on something from this schema should be created in future when data starts to grow very much.
Some loose opinions which may be useful but may be completely off:
- The model looks OK and should be scalable as far as I can tell.
- Relational model (3NF) is the king of storage size, and data consistency. It’s not the king of “easy querying”. See
- In actual long term business, maintainability is often more important than performance.
- Don’t worry about partitioning too soon. You may never see this first 0.5 TB of customer data in your database. But consider making
customer_idthe part of PK in all tables if this is going to be a multi-tenant app. It will make scaling up (think: sharding / separating tenants) easier.
threadID might be as well synonymous to
idof top level item.
- tsearch is OK, but for large scale, fancy fulltext search you might need to employ some outside tools like ElasticSearch / Sphinx / etc.