Design A URL Shortener
We follow the same steps as System Design Framework
Step 1. Establish a design scope
A few question we can ask, just clarify and gather all the information as much as possible
- Can you give me an example of how the shorten url works
- Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long is the original URL. Your service creates an alias with shorter length: https://tinyurl.com/ y7keocwj. If you click the alias, it redirects you to the original URL.
- What's the traffic volume
- 100 million URLs are generated per day.
- How long is the URL
- As short as possible.
- What characters are allowed in shortened URL
- 0-9, (a-z, A-Z)
- Can the shortened URL be deleted or updated
- No
As a result, we can conclude some of the basic usage:
- URL shortening:
- Given a long url, return a shorter URL
- URL redirecting:
- Given a short URL, redirect to original URL
- High availability, scalability and fault tolerance
Back of envelop estimation
QPS estimation
- Write operation:
- 100 million URLS per day
- 100 million / 24 / 3600 = 1160 write per seconds
- 100 million URLS per day
- Read operation:
- Assume the read ratio vs write is
10:1
, read operation per seconds = $1160 \times 10 = 11600$
- Assume the read ratio vs write is
Storage estimation
- Assume that we need to support this service for 10 years. That means we need to support:
- 100 (million urls / day) $\times$ 365 days $\times$ 10 (years) = 365 billion records
- Assume average URL length is 100 characters
- Storage requirement over 10 years: 365 billion $\times 100 bytes $\times$ 10 years = 365 TB
Step 2 - Propose a high-level design and get buy in
API design
Since the system is web-based, we allow our user to communicate with our system via REST endpoints:
1. URL Shortening:
User can send a POST
request, contains 1 parameter: originalURL
.
The URL will look similar to this:
POST api/v1/data/shorten
- request parameter: originalURL
- return: shortenURL
2. URL redirecting
First we need to determine which status code we would then return to our users. For redirecting we have either 301 (permanent)
or 302 (temporary)
:
301
redirect:- This is a ==permanant redirect==, which means the browser will cache the response. Subsequent requests will not be send to our shortening server, but send to the long URL server directly.
- This can improve the perfomance by reducing load on our server.
302
redirect:- This is a ==termporary redirect==, which means the browser will not cache any response. Subsequent request will still be sending to our servers for shortening.
- This is better if we want to keep track of data analytics such as click rate, IP and so on⦠. However it will create load to our server.
An intuitive way to implement redirection is we can do a hash table:
longURL = hashTable.get(shortUrl)
redirect longURL
We must satisfy the following requirement:
- Each
longURL
hashed toshortURL
- Each
shortURL
can be mapped back to alongURL
Step 3 - Design deep dive
Data model
Hash table is not feasible for real-world system as memory is limited and expensive.
A better option would be to store in a relational database
Hash function
Since the hash value consist of
- 0 - 9: 10 characters
- a-z: 26 characters
- A-Z: 26 characters
We have a total of 62 possible characters. From Back of envelop estimation, we conclude that the system must be able to support 365 billion URLs.
Given the length of the shortURL
is n
. The following table shows what's the minimum n
we need to store this maxURL
When we have n = 7
we can store ~3.5 trillion URLs which is enough to store 365 billions
Hash + collision resolution
So now we need to find a hash function that can have a 7-length character string.
We have the following common hash function:
However, even the CRC32
is way too long for our requirement.
A solution would be collect the first 7 characters of the hash value. However this method could potentially lead to a hash-collisions.
To resolve this, we can recursively append new predefined string until there is no hash-colision:
However it's expensive to check if there is a shortURL exists for every request. We can use Bloom Filter to improve performance if the shortURL
is in the database.
Base62 conversion
Base62 is used as there are 62 possible character. We can simply hash the ID
of the URL in our table. However the URL length is not fixed as it goes up with the ID
. But we have a maximum of 7
characters.
Comparision of the 2 approaches
URL shortening deep drive
We have the following system
So for example:
- We have a
longURL
: https://en.wikipedia.org/wiki/Systems_design - Unique ID generator return: 2009215674938
- Convert the ID to shortURL using base 62 conversion:
- ID
2009215674938
converted tozn9edcu
- ID
- Save the
ID
,shortURL
,longURL
in the database
The unique ID is worth mentioning as it needs to be working in a highly distributed system. For this refer to Design Unique ID generator in distributed system
URL redirect deep dive
- User click on shortURL:
tinyurl.com/zn9edcu
- Load balancer forwards the request to our server
- If short URL already in cache, we return the
longURL
directly - If it's not in cache, we fetch the
longURL
from the database. If it's not in the database either, it's likely that the user has entered an invalid shortURL - The
longURL
is returned to the user
Step 4 - Wrap up
If there's extra time, we can consider the following:
- Rate limiter: malicious users send overwhelmingly large number of URL shortening. Refer to Design rate limiter
- Web server scaling: Since the web tier is stateless, it's easy to scale the web tier by adding or removing the webservers
- Database scaling:
- Database replication and sharding
- Analytics:
- We can use data-analytics in the URL to make sure if they're clicking the link or not
- Avialiability, consistency reliability