Design A Chat System

Propose a high-level design and get buy-in

Select a protocol

We can choose among several protocols for delivering messages to our clients.

In conclusion, WebSocket should be the protocol for the main chat application, whereas Server-Sent Events (SSE) could be used for real-time notifications.

[!note]
Here, real-time notification is different from push notification. Push notifications also work when the user is offline.

Service segregation

WebSocket is a stateful service, whereas ordinary services such as login and sign-up don't need to run over WebSocket and can be stateless. We can therefore separate our services as follows:

Pasted image 20230918094528.png

Storage

We can choose between SQL and NoSQL.

For normal data (user registration, login, etc.):

  • SQL is a better choice since we want this data to be consistent

For chat data:

  • A NoSQL key-value store would be a better choice since we need fast random access (for example, when people search for a message from the past)
  • SQL doesn't work well with long-tail data as the index grows bigger (people mostly read recent messages and rarely search for old ones)

Data model

One-to-one message

CREATE TABLE message (
	id BIGINT PRIMARY KEY,     -- unique message id
	from_user_id BIGINT,       -- sender
	to_user_id BIGINT,         -- recipient
	content TEXT,
	created_date TIMESTAMP
);

Group message

CREATE TABLE group_message (
	id BIGINT PRIMARY KEY,     -- unique message id
	from_user_id BIGINT,       -- sender
	channel_id BIGINT,         -- group (channel) the message belongs to
	content TEXT,
	created_date TIMESTAMP
);

Message Id

For message ids to be unique, if we're using SQL, we can use the auto_increment feature. However, NoSQL stores generally don't support it.

We can adapt the technique from Design Unique ID generator in distributed system.
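
As an illustration of what such a generator could look like, here is a minimal sketch assuming a Snowflake-style layout. The bit widths, the `EPOCH_MS` value, and the `MessageIdGenerator` name are assumptions made for this example and do not come from the referenced note. Ids generated this way are unique per worker and increase with time, which is useful for ordering messages.

```python
import threading
import time

# Assumed layout (hypothetical): timestamp in ms since a custom epoch,
# then 10 bits of worker id, then 12 bits of per-millisecond sequence.
EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class MessageIdGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock  # injectable so the sketch can be tested
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp so ids keep increasing.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Each chat server (or id service instance) would get its own `worker_id`, so no coordination is needed between instances when generating ids.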

Design deep dive

Service discovery

Pasted image 20230918124850.png

Service discovery lets the user find the best possible chat server to connect to. The flow is as follows:

  1. The user authenticates
  2. The user asks the chat server discovery service (Zookeeper) over HTTP for the best server to connect to
  3. The discovery service returns the best server to the user
  4. The user connects to that server using WebSocket
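
The selection in step 3 could be sketched as below. This is only an illustration: it assumes "best" means the healthy server with the lowest load ratio, and it uses an in-memory list instead of a real Zookeeper client. `ChatServer` and `pick_chat_server` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ChatServer:
    host: str
    active_connections: int
    capacity: int
    healthy: bool = True


def pick_chat_server(servers):
    """Return the healthy server with the lowest load ratio (assumed meaning of "best")."""
    candidates = [
        s for s in servers if s.healthy and s.active_connections < s.capacity
    ]
    if not candidates:
        raise RuntimeError("no chat server available")
    return min(candidates, key=lambda s: s.active_connections / s.capacity)


servers = [
    ChatServer("chat-1", active_connections=900, capacity=1000),
    ChatServer("chat-2", active_connections=200, capacity=1000),
    ChatServer("chat-3", active_connections=10, capacity=1000, healthy=False),
]
best = pick_chat_server(servers)  # chat-2: chat-3 is less loaded but unhealthy
```

Other criteria, such as geographic proximity, could be added to the key function in the same way.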

One-to-one message flow

There are multiple ways for two servers to talk to each other. For example, consider the flow where User A talks to User B:

Pasted image 20230919223827.png

The flow is as follows:

  1. User A sends the message "hello" to server A
  2. Server A queries our server map database to see which server user B is connected to (in this case, server B)
  3. Server A then pushes the message to server B. This could be done through a simple gRPC or REST endpoint
  4. Server B, after receiving the message, pushes it to user B
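
The four steps above can be sketched in memory as follows. This is a simplified illustration, not a real implementation: a plain dict stands in for the server map database, and a direct method call stands in for the gRPC or REST endpoint. `ChatServerNode` and its methods are hypothetical names.

```python
class ChatServerNode:
    def __init__(self, name, server_map, cluster):
        self.name = name
        self.server_map = server_map  # user_id -> server name (the "server map database")
        self.cluster = cluster        # server name -> node (stands in for gRPC/REST)
        self.delivered = []           # messages pushed to locally connected users
        cluster[name] = self

    def connect(self, user_id):
        # Record which server this user's WebSocket is attached to.
        self.server_map[user_id] = self.name

    def send(self, from_user, to_user, content):
        # Step 2: look up the recipient's server.
        target = self.server_map.get(to_user)
        if target is None:
            return False  # recipient is not connected anywhere
        # Step 3: forward the message to that server.
        self.cluster[target].receive(from_user, to_user, content)
        return True

    def receive(self, from_user, to_user, content):
        # Step 4: push the message to the locally connected recipient.
        self.delivered.append((from_user, to_user, content))


server_map, cluster = {}, {}
server_a = ChatServerNode("server-a", server_map, cluster)
server_b = ChatServerNode("server-b", server_map, cluster)
server_a.connect("user_a")
server_b.connect("user_b")
server_a.send("user_a", "user_b", "hello")  # step 1 triggers steps 2 to 4
```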

Message queue

For the communication between server A and server B, we can use a plain REST call like the one above. However, what if:

  • Multiple people try to push messages to server B at the same time? Server B would become overloaded
  • User B is offline? How do we persist the message to send it later?
  • User A sends multiple messages at the same time? What would the message order be?
  • Who saves the messages into the database? If we let server B save the messages, it will hit our database heavily

Instead, we can deliver the messages to a queue / topic. Each server has a corresponding queue from which it pushes messages to the relevant receiver.

This solves the problems above:

  • If multiple people push messages to server B at the same time, server B can take its time to process each message
  • If user B is offline, we can delegate the message to our push notification system
  • If user A sends multiple messages to server B, they are persisted in the queue and consumed in FIFO order
  • A group of workers can start saving the messages from the queue once server B acknowledges them
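
A minimal in-memory sketch of this idea is shown below. It assumes a single FIFO queue per chat server and uses `collections.deque` in place of a real message broker. `ServerQueue`, `drain`, and the callback parameters are hypothetical names chosen for the example.

```python
from collections import deque


class ServerQueue:
    """FIFO queue in front of one chat server (stands in for a real broker)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        return self._messages.popleft() if self._messages else None


def drain(queue, online_users, deliver, notify, persist):
    """Consume messages in FIFO order at the server's own pace."""
    while (message := queue.consume()) is not None:
        if message["to"] in online_users:
            deliver(message)   # push over the recipient's WebSocket
        else:
            notify(message)    # hand over to the push notification system
        persist(message)       # workers save the message once it is acknowledged


queue_b = ServerQueue()
queue_b.publish({"from": "user_a", "to": "user_b", "content": "hi"})
queue_b.publish({"from": "user_a", "to": "user_b", "content": "how are you"})
queue_b.publish({"from": "user_a", "to": "user_c", "content": "ping"})

delivered, notified, stored = [], [], []
drain(queue_b, {"user_b"}, delivered.append, notified.append, stored.append)
```

In this run `user_b` is online and `user_c` is not, so the first two messages are delivered in the order they were sent and the third goes to the notification path.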

Pasted image 20230919231218.png

Group chat flow

For group chat, we can use a system similar to the one above, where we duplicate the message into the queue of each relevant chat server.

However, when the group is large, these write requests will take up a huge amount of resources.

Another way is for the client to send an HTTP request to our server and query for all the messages it missed. However, this might not be ideal for real-time delivery.

Pasted image 20230919232032.png

We can implement a hybrid system with the following logic. If the group has more than 500 members with more than 500 active users, we let the users periodically pull new messages. Otherwise, we push the messages to the message queue.
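
The hybrid rule can be written as a small decision function. The thresholds come from the text above; the function and constant names are hypothetical.

```python
GROUP_SIZE_THRESHOLD = 500
ACTIVE_USERS_THRESHOLD = 500


def delivery_mode(group_size, active_users):
    """Return "pull" for large, busy groups and "push" otherwise."""
    if group_size > GROUP_SIZE_THRESHOLD and active_users > ACTIVE_USERS_THRESHOLD:
        return "pull"  # clients periodically poll for new messages
    return "push"      # fan the message out to the chat server queues
```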

Message Synchronisation

For example, a user may use 2 different devices, such as a phone and a laptop.

To synchronise these messages, we can keep a cur_max_message_id on each device and query our database when the client comes online to fetch the missing messages. This means that we can potentially index our database by message_id.
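
A sketch of the synchronisation query is shown below. It assumes message ids increase over time and that the stored messages are sorted by id, which is what an index on message_id would give us. `missing_messages` is a hypothetical helper, and the list stands in for the database.

```python
from bisect import bisect_right


def missing_messages(messages_sorted_by_id, cur_max_message_id):
    """Return the messages whose id is greater than the device's cur_max_message_id."""
    ids = [m["id"] for m in messages_sorted_by_id]
    # Binary search mimics an index lookup on message_id.
    return messages_sorted_by_id[bisect_right(ids, cur_max_message_id):]


stored_messages = [
    {"id": 101, "content": "hi"},
    {"id": 102, "content": "are you there?"},
    {"id": 105, "content": "call me"},
]

phone_cur_max = 105   # the phone has seen everything
laptop_cur_max = 101  # the laptop was offline for a while

laptop_missing = missing_messages(stored_messages, laptop_cur_max)
if laptop_missing:
    laptop_cur_max = laptop_missing[-1]["id"]  # advance after syncing
```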

Pasted image 20230919232437.png

Online presence

Online presence behaves the same way as our message flows because, essentially, online presence is a type of message.

Pasted image 20230919233415.png

However, to make sure that the user is actually offline, we need to keep track of a heartbeat and only mark the user as offline after a certain time without one. This way, a brief disconnection doesn't cause us to mark the user as offline.
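
The heartbeat check can be sketched as follows. The 30-second timeout is an arbitrary value picked for illustration, and `PresenceTracker` is a hypothetical name. Timestamps are passed in explicitly so the logic is easy to follow and test.

```python
class PresenceTracker:
    def __init__(self, timeout_seconds=30.0):
        self.timeout = timeout_seconds  # assumed value, tune per product needs
        self.last_heartbeat = {}        # user_id -> time of last heartbeat

    def heartbeat(self, user_id, now):
        self.last_heartbeat[user_id] = now

    def is_online(self, user_id, now):
        last = self.last_heartbeat.get(user_id)
        # A user stays online until the timeout passes without a heartbeat.
        return last is not None and now - last <= self.timeout


tracker = PresenceTracker(timeout_seconds=30.0)
tracker.heartbeat("user_a", now=100.0)
still_online = tracker.is_online("user_a", now=110.0)  # brief gap, within the timeout
gone_offline = tracker.is_online("user_a", now=140.0)  # no heartbeat for 40 seconds
```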

Pasted image 20230919233522.png