Design A URL Shortener

We follow the same steps as System Design Framework

Step 1. Establish a design scope

A few question we can ask, just clarify and gather all the information as much as possible

  1. Can you give me an example of how the shorten url works
    • Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long is the original URL. Your service creates an alias with shorter length: https://tinyurl.com/ y7keocwj. If you click the alias, it redirects you to the original URL.
  2. What's the traffic volume
    • 100 million URLs are generated per day.
  3. How long is the URL
    • As short as possible.
  4. What characters are allowed in shortened URL
    • 0-9, (a-z, A-Z)
  5. Can the shortened URL be deleted or updated
    • No

As a result, we can conclude some of the basic usage:

  1. URL shortening:
    • Given a long url, return a shorter URL
  2. URL redirecting:
    • Given a short URL, redirect to original URL
  3. High availability, scalability and fault tolerance

Back of envelop estimation

QPS estimation

  • Write operation:
    • 100 million URLS per day
      • 100 million / 24 / 3600 = 1160 write per seconds
  • Read operation:
    • Assume the read ratio vs write is 10:1, read operation per seconds = $1160 \times 10 = 11600$

Storage estimation

  • Assume that we need to support this service for 10 years. That means we need to support:
    • 100 (million urls / day) $\times$ 365 days $\times$ 10 (years) = 365 billion records
  • Assume average URL length is 100 characters
    • Storage requirement over 10 years: 365 billion $\times 100 bytes $\times$ 10 years = 365 TB

Step 2 - Propose a high-level design and get buy in

API design

Since the system is web-based, we allow our user to communicate with our system via REST endpoints:

1. URL Shortening:

User can send a POST request, contains 1 parameter: originalURL.

The URL will look similar to this:

POST api/v1/data/shorten
- request parameter: originalURL
- return: shortenURL

2. URL redirecting

First we need to determine which status code we would then return to our users. For redirecting we have either 301 (permanent) or 302 (temporary):

  • 301 redirect:
    • This is a ==permanant redirect==, which means the browser will cache the response. Subsequent requests will not be send to our shortening server, but send to the long URL server directly.
    • This can improve the perfomance by reducing load on our server.
  • 302 redirect:
    • This is a ==termporary redirect==, which means the browser will not cache any response. Subsequent request will still be sending to our servers for shortening.
    • This is better if we want to keep track of data analytics such as click rate, IP and so on… . However it will create load to our server.

An intuitive way to implement redirection is we can do a hash table:

longURL = hashTable.get(shortUrl)
redirect longURL

Pasted image 20230712081015.png

We must satisfy the following requirement:

  • Each longURL hashed to shortURL
  • Each shortURL can be mapped back to a longURL

Step 3 - Design deep dive

Data model

Hash table is not feasible for real-world system as memory is limited and expensive.

A better option would be to store in a relational database

Pasted image 20230712081149.png

Hash function

Since the hash value consist of

  • 0 - 9: 10 characters
  • a-z: 26 characters
  • A-Z: 26 characters

We have a total of 62 possible characters. From Back of envelop estimation, we conclude that the system must be able to support 365 billion URLs.

Given the length of the shortURL is n. The following table shows what's the minimum n we need to store this maxURL

Pasted image 20230712081516.png

When we have n = 7 we can store ~3.5 trillion URLs which is enough to store 365 billions

Hash + collision resolution

So now we need to find a hash function that can have a 7-length character string.

We have the following common hash function:

Pasted image 20230712081641.png

However, even the CRC32 is way too long for our requirement.

A solution would be collect the first 7 characters of the hash value. However this method could potentially lead to a hash-collisions.

To resolve this, we can recursively append new predefined string until there is no hash-colision:

Pasted image 20230712081808.png

However it's expensive to check if there is a shortURL exists for every request. We can use Bloom Filter to improve performance if the shortURL is in the database.

Base62 conversion

Base62 is used as there are 62 possible character. We can simply hash the ID of the URL in our table. However the URL length is not fixed as it goes up with the ID. But we have a maximum of 7 characters.

Comparision of the 2 approaches

Pasted image 20230712083348.png

URL shortening deep drive

We have the following system

Pasted image 20230712083513.png

So for example:

  1. We have a longURL: https://en.wikipedia.org/wiki/Systems_design
  2. Unique ID generator return: 2009215674938
  3. Convert the ID to shortURL using base 62 conversion:
    • ID 2009215674938 converted to zn9edcu
  4. Save the ID, shortURL, longURL in the database

Pasted image 20230712083707.png

The unique ID is worth mentioning as it needs to be working in a highly distributed system. For this refer to Design Unique ID generator in distributed system

URL redirect deep dive

Pasted image 20230712083830.png

  1. User click on shortURL: tinyurl.com/zn9edcu
  2. Load balancer forwards the request to our server
  3. If short URL already in cache, we return the longURL directly
  4. If it's not in cache, we fetch the longURL from the database. If it's not in the database either, it's likely that the user has entered an invalid shortURL
  5. The longURL is returned to the user

Step 4 - Wrap up

If there's extra time, we can consider the following:

  • Rate limiter: malicious users send overwhelmingly large number of URL shortening. Refer to Design rate limiter
  • Web server scaling: Since the web tier is stateless, it's easy to scale the web tier by adding or removing the webservers
  • Database scaling:
    • Database replication and sharding
  • Analytics:
    • We can use data-analytics in the URL to make sure if they're clicking the link or not
  • Avialiability, consistency reliability