Design A Notification System
Step 1 — Understand the problem
Candidate: What types of notifications does the system support?
Interviewer: Push notification, SMS message, and email.
Candidate: Is it a real-time system?
Interviewer: Let us say it is a soft real-time system. We want a user to receive notifications as soon as possible. However, if the system is under a high workload, a slight delay is acceptable.
Candidate: What are the supported devices?
Interviewer: iOS devices, Android devices, and laptop/desktop.
Candidate: What triggers notifications?
Interviewer: Notifications can be triggered by client applications. They can also be scheduled on the server-side.
Candidate: Will users be able to opt out?
Interviewer: Yes, users who choose to opt out will no longer receive notifications.
Candidate: How many notifications are sent out each day?
Interviewer: 10 million mobile push notifications, 1 million SMS messages, and 5 million emails.
So the requirements are:
- Near real time (soft real-time)
- Push notification, SMS, email
- iOS, Android, laptop, desktop
- Notifications triggered by client or server
- Opt-out option
- 10 million push notifications, 1 million SMS messages, and 5 million emails per day
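As a rough sanity check, the daily volumes above can be converted into average per-second rates. This is a back-of-the-envelope sketch that assumes traffic is spread evenly over 24 hours. Real traffic is bursty, so peak rates will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes taken from the requirements above.
DAILY_VOLUME = {
    "push": 10_000_000,
    "sms": 1_000_000,
    "email": 5_000_000,
}


def average_rate(per_day: int) -> float:
    """Average notifications per second, assuming an even spread over the day."""
    return per_day / SECONDS_PER_DAY


rates = {channel: average_rate(n) for channel, n in DAILY_VOLUME.items()}
total_rate = sum(rates.values())

for channel, rate in rates.items():
    print(f"{channel}: ~{rate:.0f}/s")
print(f"total: ~{total_rate:.0f}/s")
```

On average this works out to roughly 116 push notifications, 12 SMS messages, and 58 emails per second, or about 185 notifications per second in total.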
Step 2 – Propose a high-level solution
We start by considering each type of notification separately.
We also need to consider alternative third-party services, because a given service is not available in every country. For example, Firebase Cloud Messaging (FCM) cannot be used in China.
Contact info gathering flow
To send notifications, we need the following contact info:
- Device tokens
- Phone number
- Email address
Therefore, when a customer signs up for our app, we need to store this contact info in the database.
A user can have multiple devices, so a push notification can be sent to all of that user's devices.
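The contact info described above could be modelled roughly as follows. This is a minimal sketch: the class and field names (`User`, `Device`, `push_targets`) are illustrative assumptions, not a schema from the notes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Device:
    device_token: str  # token issued by the push provider for this device
    platform: str      # assumed values: "ios" or "android"


@dataclass
class User:
    user_id: int
    email: Optional[str] = None
    phone_number: Optional[str] = None
    # One user can own several devices, so this is a one-to-many relation.
    devices: List[Device] = field(default_factory=list)

    def push_targets(self) -> List[str]:
        """Device tokens a push notification should be fanned out to."""
        return [device.device_token for device in self.devices]


user = User(user_id=1, email="alice@example.com", phone_number="+15550100")
user.devices.append(Device("token-iphone", "ios"))
user.devices.append(Device("token-pixel", "android"))
```

Keeping devices in their own collection is what allows one push notification to reach every device the user has registered.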
Notification sending / receiving flow
- Service: could be a microservice, a cron job, etc.
- Notification system: for simplicity, we assume only one notification server is used. It pulls user data from the database and talks directly to APNS, FCM, …
- iOS, Android, SMS, Email: where users receive notifications
- Third-party service: needed when the default service is not available in some countries, as explained above
Problems:
- Single point of failure
  - A single notification server means a single point of failure
- Hard to scale
  - The notification system handles all types of notification in one server, so it is harder to scale
- Performance bottleneck
  - We only have one server (the notification system) to process and send notifications. Long-running tasks, such as constructing HTML pages or waiting for responses from third-party services, can overload the system
High-level design (improved)
- Service 1 to N: microservices or cron jobs
- Notification Servers:
  - Provide basic APIs for services to send notifications. These APIs are only accessible internally, which prevents spam and lets us control the types of notifications that are sent.
    - For example:
      http://api.example.com/v/sms/send
  - Carry out basic validations (emails, phone numbers, …)
  - Query the database or cache to fetch the data needed to render a notification
  - Put the notification in a message queue for parallel processing
- Cache: user info, device info, and notification templates are cached
- DB: stores data about users and notification settings
- Message queues:
  - Buffer notifications when volumes are high
  - Each notification type is assigned its own queue, so that an outage in one queue does not affect the others
- Workers: process the events pulled from the message queues
- Third-party services
- iOS, Android, SMS, Email
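The path from notification server to per-type queue to worker can be sketched as below. This is a single-process illustration using Python's in-memory `queue.Queue`, whereas a real deployment would use a distributed message queue. The function names (`validate`, `submit`, `drain`) and the validation patterns are assumptions for illustration only.

```python
import queue
import re
from typing import Callable, Dict

CHANNELS = ("push", "sms", "email")

# One queue per notification type, so a backlog in one channel
# does not block the others.
queues: Dict[str, queue.Queue] = {channel: queue.Queue() for channel in CHANNELS}

# Deliberately loose patterns; real validation would be stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?\d{7,15}$")


def validate(channel: str, recipient: str) -> bool:
    """Basic validation of the recipient for the given channel."""
    if channel == "email":
        return bool(EMAIL_RE.match(recipient))
    if channel == "sms":
        return bool(PHONE_RE.match(recipient))
    if channel == "push":
        return bool(recipient)  # any non-empty device token
    return False


def submit(channel: str, recipient: str, body: str) -> bool:
    """Notification server: validate, then enqueue on the channel's queue."""
    if not validate(channel, recipient):
        return False
    queues[channel].put({"to": recipient, "body": body})
    return True


def drain(channel: str, send: Callable[[dict], None]) -> int:
    """Worker: send everything currently queued for one channel."""
    processed = 0
    while True:
        try:
            message = queues[channel].get_nowait()
        except queue.Empty:
            return processed
        send(message)
        processed += 1
```

For example, `submit("email", "alice@example.com", "hi")` enqueues a message on the email queue only, and a later `drain("email", send)` delivers it without touching the SMS or push queues.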
Step 3 – Design deep dive
Reliability
One of the requirements for the notification system is that it cannot lose data. A notification can be delayed or reordered, but never lost.
Therefore, we need to persist notification data in the database. To do this, we implement a notification log.
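One possible shape for such a notification log is sketched here with an in-memory SQLite database. The table name, columns, and status values are assumptions. The idea is only that each notification is recorded before it is sent, so anything still pending after a crash can be found and sent again.

```python
import sqlite3
from typing import List

# In-memory database for illustration; a real log would be durable.
log_db = sqlite3.connect(":memory:")
log_db.execute(
    """
    CREATE TABLE notification_log (
        event_id TEXT PRIMARY KEY,
        channel  TEXT NOT NULL,
        payload  TEXT NOT NULL,
        status   TEXT NOT NULL DEFAULT 'pending'
    )
    """
)


def record(event_id: str, channel: str, payload: str) -> None:
    """Persist the notification before any attempt to send it."""
    with log_db:
        log_db.execute(
            "INSERT INTO notification_log (event_id, channel, payload) VALUES (?, ?, ?)",
            (event_id, channel, payload),
        )


def mark_sent(event_id: str) -> None:
    """Update the log once the third-party service has accepted the message."""
    with log_db:
        log_db.execute(
            "UPDATE notification_log SET status = 'sent' WHERE event_id = ?",
            (event_id,),
        )


def pending() -> List[str]:
    """Event IDs that were recorded but never confirmed as sent."""
    rows = log_db.execute(
        "SELECT event_id FROM notification_log WHERE status = 'pending' ORDER BY event_id"
    )
    return [row[0] for row in rows]
```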
Deduplication
Recipients will not always receive exactly one notification; occasionally the same notification is delivered more than once. This can happen for various reasons:
- Network partition
- Failure to acknowledge a delivery
- Workers not processing fast enough, so the message goes back into the queue
Therefore, before sending a message, we first check whether a message with the same eventId has already been sent.
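The eventId check can be sketched as follows. The `Deduplicator` class is hypothetical and keeps seen IDs in process memory. With several workers, the set of seen IDs would have to live in a shared store, which is assumed here rather than shown.

```python
from typing import Callable, Set


class Deduplicator:
    """Skips events whose ID has already been sent (in-memory sketch)."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def send_once(self, event_id: str, send: Callable[[], None]) -> bool:
        """Call send() only if event_id is new. Returns True if it was sent."""
        if event_id in self._seen:
            return False
        send()
        # The ID is recorded only after a successful send, so a failed send
        # can be retried. A crash between send() and this line can still
        # produce a duplicate: the check reduces duplicates, it does not
        # guarantee exactly-once delivery.
        self._seen.add(event_id)
        return True
```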
Additional considerations
Message template
To avoid generating every notification from scratch, we can use a message template and fill in its parameters.
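A minimal illustration using `string.Template` from the Python standard library. The template text and parameter names are made up for the example.

```python
from string import Template

# A preformatted message with named placeholders.
ORDER_SHIPPED = Template("Hi $name, your order $order_id has shipped.")


def render(template: Template, **params: str) -> str:
    """Fill in the template; substitute() raises KeyError if a parameter is missing."""
    return template.substitute(**params)


message = render(ORDER_SHIPPED, name="Alice", order_id="A-1001")
print(message)  # Hi Alice, your order A-1001 has shipped.
```

Using `substitute` rather than `safe_substitute` means a missing parameter fails loudly instead of sending a notification with a raw placeholder in it.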
Notification setting
We want to give users the ability to opt in to or out of notifications. Therefore, having the following fields in the database is helpful:
user_id bigInt
channel varChar
opt_in boolean
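Before sending, the system can look up these fields to decide whether the user accepts the channel. The sketch below uses an in-memory SQLite table with the three fields above. The table name, the composite primary key, and the default of treating a missing row as opted in are assumptions, not part of the notes.

```python
import sqlite3

settings_db = sqlite3.connect(":memory:")
settings_db.execute(
    """
    CREATE TABLE notification_setting (
        user_id BIGINT  NOT NULL,
        channel VARCHAR NOT NULL,  -- push, sms, or email
        opt_in  BOOLEAN NOT NULL,
        PRIMARY KEY (user_id, channel)
    )
    """
)


def set_opt_in(user_id: int, channel: str, opt_in: bool) -> None:
    """Create or overwrite the user's preference for one channel."""
    with settings_db:
        settings_db.execute(
            "INSERT OR REPLACE INTO notification_setting VALUES (?, ?, ?)",
            (user_id, channel, opt_in),
        )


def is_opted_in(user_id: int, channel: str, default: bool = True) -> bool:
    """Check the preference; `default` applies when no row exists (assumed policy)."""
    row = settings_db.execute(
        "SELECT opt_in FROM notification_setting WHERE user_id = ? AND channel = ?",
        (user_id, channel),
    ).fetchone()
    return default if row is None else bool(row[0])
```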
Rate limiting
To avoid overwhelming users with too many notifications, we can limit the number of notifications a user receives. This is important because if we send too many notifications, the user may turn our notifications off completely.
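One way to enforce such a limit is a per-user sliding window, sketched below. The notes do not prescribe an algorithm, so the choice of a sliding window, the class name, and the parameters are all assumptions. The caller passes in the current time, which keeps the example deterministic.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class UserRateLimiter:
    """Allows at most max_per_window notifications per user per window."""

    def __init__(self, max_per_window: int, window_seconds: float) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._sent: Dict[int, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: int, now: float) -> bool:
        timestamps = self._sent[user_id]
        # Drop send times that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_per_window:
            return False
        timestamps.append(now)
        return True


limiter = UserRateLimiter(max_per_window=2, window_seconds=60.0)
```

With this limiter, a third notification to the same user within 60 seconds is rejected, while other users are unaffected.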
Retry mechanism
When a third-party service fails to send a notification, the notification is added back to the message queue for retrying.
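A sketch of the requeue-on-failure loop. The attempt cap (`MAX_ATTEMPTS`) and the `dead_letters` list for messages that keep failing are assumptions added so the loop terminates. The notes only say that failed notifications go back on the queue.

```python
import queue
from typing import Callable, List

MAX_ATTEMPTS = 3  # assumed cap; not specified in the notes


def process_with_retry(
    q: queue.Queue,
    send: Callable[[dict], bool],
    dead_letters: List[dict],
) -> None:
    """Drain the queue, putting failed messages back until the cap is reached."""
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            return
        if send(message):
            continue
        message["attempts"] = message.get("attempts", 0) + 1
        if message["attempts"] < MAX_ATTEMPTS:
            q.put(message)  # back on the queue for another try
        else:
            dead_letters.append(message)  # give up and keep for follow-up
```

In practice a delay between attempts (for example exponential backoff) is usually added so that a struggling third-party service is not hit again immediately.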
Security
We can use appSecret and appKey to verify that the client is genuine.
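The notes do not say how the key pair is used. One common scheme, assumed here, is for the client to sign each request body with its appSecret using HMAC-SHA256 and send the signature together with its appKey. The server then recomputes the signature and compares. The registry and function names below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical registry of issued credentials: appKey -> appSecret.
APP_SECRETS = {"demo-app-key": b"demo-app-secret"}


def sign(app_secret: bytes, body: bytes) -> str:
    """Client side: HMAC-SHA256 signature of the request body."""
    return hmac.new(app_secret, body, hashlib.sha256).hexdigest()


def is_genuine(app_key: str, body: bytes, signature: str) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    secret = APP_SECRETS.get(app_key)
    if secret is None:
        return False
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The secret itself never travels with the request; only the appKey and the signature do.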
Monitor queued notifications
We can monitor the queues to see how many notifications are waiting. If the number is too large, the workers cannot send notifications fast enough, so we need to increase the number of workers.
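As a rough illustration of turning queue depth into a worker count: the function and its parameters are hypothetical, and it ignores newly arriving messages, so it gives only a lower bound for draining the current backlog.

```python
import math


def workers_needed(
    queue_depth: int,
    per_worker_rate: float,
    target_drain_seconds: float,
) -> int:
    """Workers required to drain queue_depth messages within the target time.

    per_worker_rate is the number of messages one worker sends per second.
    """
    if queue_depth <= 0:
        return 0
    capacity_per_worker = per_worker_rate * target_drain_seconds
    return math.ceil(queue_depth / capacity_per_worker)
```

For example, with 12,000 queued notifications, workers that each send 50 per second, and a one-minute target, `workers_needed(12000, 50.0, 60.0)` returns 4.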
Events tracking
For analytics purposes, we want to keep track of how many notifications have been sent and how many clicks and unsubscribes they led to.
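A minimal in-memory sketch of such tracking. The event names ("sent", "click", "unsubscribe") and the `EventTracker` class are assumptions. A real system would write these events to an analytics store.

```python
from collections import Counter


class EventTracker:
    """Counts notification events per (event, channel) pair."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def track(self, event: str, channel: str) -> None:
        self.counts[(event, channel)] += 1

    def click_rate(self, channel: str) -> float:
        """Clicks divided by sends for one channel (0.0 if nothing was sent)."""
        sent = self.counts[("sent", channel)]
        if sent == 0:
            return 0.0
        return self.counts[("click", channel)] / sent
```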
Updated design
Step 4 — Wrap up
Discuss any additional features that are needed.