Design A Chat System
Propose a high-level design and get buy-in
Select a protocol
We can choose among several protocols for delivering messages to our clients.
In conclusion, WebSocket should be the protocol for the main chat application, whereas Server-Sent Events (SSE) could be used for real-time notifications.
[!note]
Here, real-time notification is different from push notification. Push notifications can also work while the user is offline.
Service segregation
WebSocket connections are stateful, but ordinary services such as login and sign-up don't need to run over WebSocket and can stay stateless. Therefore we can separate the stateful chat service from the stateless services.
Storage
We can choose between SQL vs NoSQL.
For normal data (user registration, login, etc.):
- SQL is a better choice since we want the data to be consistent
For chat data:
- A NoSQL key-value store is a better choice since we need fast random access (for example, when people search for a past message)
- SQL doesn't work well with long-tail data as the index grows bigger (people mostly read recent messages and rarely search for old ones)
Data model
One-on-one message
CREATE TABLE message (
    id BIGINT UNIQUE,
    from_user_id BIGINT,
    to_user_id BIGINT,
    content TEXT,
    created_date TIMESTAMP
);
Group message
CREATE TABLE group_message (
    id BIGINT UNIQUE,
    from_user_id BIGINT,
    channel_id BIGINT,
    content TEXT,
    created_date TIMESTAMP
);
Message Id
To keep message IDs unique, we can use the auto_increment feature if we're using SQL. However, NoSQL stores generally don't support it.
We can adapt a technique from Design Unique ID generator in distributed system.
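As one possible illustration, here is a minimal sketch of a timestamp-based ID generator. The bit layout (41 bits of milliseconds, 10 bits of server ID, 12 bits of sequence), the custom epoch and the class name `MessageIdGenerator` are assumptions made for this example, not something these notes prescribe.

```python
import threading
import time

# Hypothetical layout (an assumption for this sketch):
# [ milliseconds since custom epoch | 10-bit server id | 12-bit sequence ]
CUSTOM_EPOCH_MS = 1_600_000_000_000
SERVER_ID_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class MessageIdGenerator:
    """Generates IDs that are unique per server and increase over time."""

    def __init__(self, server_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= server_id < (1 << SERVER_ID_BITS):
            raise ValueError("server_id out of range")
        self.server_id = server_id
        self.clock = clock          # injectable so the sketch can be tested
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock moved backwards: reuse the last timestamp so IDs keep increasing.
                now = self.last_ms
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (
                (timestamp << (SERVER_ID_BITS + SEQUENCE_BITS))
                | (self.server_id << SEQUENCE_BITS)
                | self.sequence
            )
```

Because the timestamp occupies the high bits, IDs from one server sort by creation time, which is useful when messages need to be ordered by ID.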
Design deep dive
Service discovery
Service discovery lets the user find the best chat server to connect to. The flow is as follows:
- The user authenticates
- The user asks the service discovery component (Zookeeper) over HTTP for the best chat server to connect to
- Service discovery returns the best server to the user
- The user connects to that server using WebSocket
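The notes don't define what "best" means. The sketch below assumes it means the healthy server with the lowest connection load; `ChatServer` and `pick_chat_server` are hypothetical names, and a real setup would read this state from the discovery service rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class ChatServer:
    host: str
    active_connections: int
    max_connections: int
    healthy: bool = True


def pick_chat_server(servers):
    """Return the healthy server with the lowest load ratio, or None if none has capacity."""
    candidates = [
        s for s in servers
        if s.healthy and s.active_connections < s.max_connections
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.active_connections / s.max_connections)
```

Other criteria, such as geographic proximity to the user, could be folded into the `key` function in the same way.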
One-on-one message flow
There are multiple ways for two chat servers to talk to each other. For example, when User A talks to User B, the flow is as follows:
- User A sends the message "hello" to server A
- Server A queries our server map database to see which server user B is connected to (in this case, server B)
- Server A then pushes the message to server B. This could be done through a simple gRPC or REST endpoint
- After receiving the message, server B pushes it to user B
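The flow above can be sketched in memory as follows. The dictionary `server_map` stands in for the server map database, and a direct method call stands in for the gRPC or REST hop; `ChatServerNode` and its methods are hypothetical names used only for illustration.

```python
class ChatServerNode:
    def __init__(self, name, server_map, cluster):
        self.name = name
        self.server_map = server_map  # user_id -> server name (stand-in for the server map database)
        self.cluster = cluster        # server name -> node (stand-in for gRPC/REST calls)
        self.delivered = []           # messages pushed to locally connected users
        cluster[name] = self

    def connect(self, user_id):
        """Record that this user holds a connection to this server."""
        self.server_map[user_id] = self.name

    def send(self, from_user, to_user, content):
        """Look up the recipient's server and forward the message to it."""
        target = self.server_map.get(to_user)
        if target is None:
            return False  # recipient is not connected to any server
        self.cluster[target].receive(from_user, to_user, content)
        return True

    def receive(self, from_user, to_user, content):
        """Push the message to the locally connected recipient."""
        self.delivered.append((to_user, from_user, content))
```

For example, after `a.connect("user_a")` and `b.connect("user_b")`, calling `a.send("user_a", "user_b", "hello")` ends up in `b.delivered`.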
Message queue
For the communication between server A and server B, we can use a plain REST call as above. However, what if:
- Many people push messages to server B at the same time? Server B would become overloaded.
- User B is offline? How do we persist the message to send it later?
- User A sends multiple messages at the same time? What would the message order be?
- Who saves the messages to the database? If we let server B save them, it will hit our database heavily.
Instead, we can deliver the messages to a queue / topic. Each server has a corresponding queue and pushes messages from it to the relevant receivers.
This solves the problems above:
- If many people push messages to server B at the same time, the queue buffers them and server B can process each message at its own pace.
- If user B is offline, we can delegate the message to our push notification system.
- If user A sends multiple messages to server B, they are persisted in the queue and consumed in FIFO order.
- A group of workers can save the messages from the queue to the database once server B acknowledges them.
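A minimal in-memory sketch of the per-server queue idea is shown below. A real deployment would use a message broker; here `MessageBroker`, `drain` and the `deliver` / `notify` / `persist` callbacks are hypothetical stand-ins for the chat server, the push notification system and the storage workers.

```python
from collections import defaultdict, deque


class MessageBroker:
    """One FIFO queue per chat server (in-memory stand-in for a real broker)."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, server_name, message):
        self.queues[server_name].append(message)

    def consume(self, server_name):
        queue = self.queues[server_name]
        return queue.popleft() if queue else None


def drain(broker, server_name, online_users, deliver, notify, persist):
    """Process queued messages in FIFO order; returns how many were handled."""
    handled = 0
    while (message := broker.consume(server_name)) is not None:
        if message["to"] in online_users:
            deliver(message)   # recipient is connected: push over its connection
        else:
            notify(message)    # recipient is offline: hand over to push notifications
        persist(message)       # storage step runs after the message was handled
        handled += 1
    return handled
```

The sender only appends to the queue, so a burst of senders does not block on server B, and the order in which one sender's messages were published is the order in which they are consumed.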
Group chat flow
For group chat we can use a system similar to the one above, duplicating the message to each chat server's queue.
However, when the group is large, these write requests take up a huge amount of resources.
Another way is for the client to send an HTTP request to our server and query for all the messages it missed. However, this might not be ideal for real-time delivery.
We can implement a hybrid system with the following logic: if the group has more than 500 members and more than 500 active users, we let users periodically pull new messages. Otherwise, we push the messages to the message queue.
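The hybrid rule can be sketched as a small decision function plus a fan-out helper. The thresholds come from the text above; the function names and the `publish` callback are hypothetical, and the sketch writes one copy per distinct server, which is one reading of "duplicate the message to different chat server queues".

```python
LARGE_GROUP_MEMBERS = 500  # threshold taken from the notes
LARGE_GROUP_ACTIVE = 500   # threshold taken from the notes


def delivery_mode(member_count, active_count):
    """Return "pull" for large, busy groups and "push" for everything else."""
    if member_count > LARGE_GROUP_MEMBERS and active_count > LARGE_GROUP_ACTIVE:
        return "pull"
    return "push"


def fan_out(message, recipient_servers, publish):
    """Copy the message to the queue of every server hosting a recipient.

    recipient_servers maps user_id -> server name. Returns the number of queue
    writes, which grows with the number of servers the group is spread across.
    """
    targets = sorted(set(recipient_servers.values()))
    for server in targets:
        publish(server, message)
    return len(targets)
```

In push mode the sender's server would call `fan_out`; in pull mode it would only store the message and let clients fetch it on their next poll.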
Message Synchronisation
For example, a user may use two different devices, a phone and a laptop.
To synchronise messages, each device can keep a cur_max_message_id and query our database when the client comes online to fetch the missing messages. This means we can potentially index our database by message_id.
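A minimal sketch of this synchronisation follows, assuming message IDs increase over time. `MessageStore` is a hypothetical in-memory stand-in for the database indexed by message_id, and `Device` models one client.

```python
import bisect


class MessageStore:
    """In-memory stand-in for a message table indexed by message_id."""

    def __init__(self):
        self._ids = []       # kept sorted, so lookups can use binary search
        self._contents = []

    def append(self, message_id, content):
        if self._ids and message_id <= self._ids[-1]:
            raise ValueError("message ids must be strictly increasing")
        self._ids.append(message_id)
        self._contents.append(content)

    def fetch_after(self, cur_max_message_id):
        """Return (id, content) pairs with id greater than cur_max_message_id."""
        start = bisect.bisect_right(self._ids, cur_max_message_id)
        return list(zip(self._ids[start:], self._contents[start:]))


class Device:
    def __init__(self):
        self.cur_max_message_id = 0
        self.inbox = []

    def sync(self, store):
        """Fetch everything this device has not seen yet; returns the number fetched."""
        missed = store.fetch_after(self.cur_max_message_id)
        for message_id, content in missed:
            self.inbox.append(content)
            self.cur_max_message_id = message_id
        return len(missed)
```

Each device tracks its own cursor, so a phone that is already up to date fetches nothing while a laptop that was offline catches up from where it left off.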
Online presence
Online presence behaves the same way as our message flow because, essentially, an online presence update is a type of message.
However, to make sure that the user is actually offline, we need to track a heartbeat and only mark the user as offline after a certain time without one. This way, if the user has a brief disconnection, we don't mark them as offline.
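The heartbeat rule can be sketched as below. The 30-second timeout is an assumed value chosen for illustration, and `PresenceTracker` is a hypothetical name; timestamps are passed in explicitly so the logic is easy to test.

```python
class PresenceTracker:
    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds  # assumed value, tune per product
        self.last_heartbeat = {}                # user_id -> time of the last heartbeat

    def heartbeat(self, user_id, now):
        """Record a heartbeat received from the user's client."""
        self.last_heartbeat[user_id] = now

    def is_online(self, user_id, now):
        """A user is online if a heartbeat arrived within the timeout window."""
        last = self.last_heartbeat.get(user_id)
        return last is not None and now - last <= self.timeout_seconds
```

With a 30-second timeout and heartbeats every few seconds, a connection drop of a few seconds never flips the user to offline, because the next heartbeat arrives before the window expires.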