Design Key-Value Store

Single server key-value store

For a single server, we can simply use a hashmap to store key-value pairs (a minimal sketch follows the list below).

For optimisation, we can:

  • Data compression
  • Store frequently used data in memory and the rest on disk
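
A minimal sketch of the single-server approach, assuming a plain Python dict as the hashmap (the class and method names are illustrative):

```python
class KeyValueStore:
    """Single-server key-value store backed by an in-memory hashmap."""

    def __init__(self):
        self._data = {}  # the hashmap holding key -> value

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)


store = KeyValueStore()
store.put("user:1", {"name": "alice"})
print(store.get("user:1"))  # {'name': 'alice'}
```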

Distributed server key-value store

Considerations

The CAP theorem needs to be considered, in which case we must decide between:

  • CP (consistency + network partition tolerance): supports consistency and sacrifices availability
  • AP (availability + network partition tolerance): supports availability and sacrifices consistency

Data partition

Problem

Since it's not possible to fit everything into a single server, we need to distribute data in a way that:

  • Spreads it evenly across servers
  • Minimises data movement when servers are added or removed

Solution

Using Consistent Hashing, we can distribute our data evenly across the relevant servers and minimise data movement (a sketch follows the list below). This is good for: ^952639

  • Automatic scaling: servers can be added and removed automatically depending on the load.
  • Heterogeneity: the number of virtual nodes is proportional to server capacity. A server with higher capacity gets more virtual nodes.
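
A minimal sketch of consistent hashing with virtual nodes, assuming MD5 as the ring hash and a sorted list as the ring (names like `vnodes_per_server` and the `weight` parameter are illustrative):

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, vnodes_per_server=3):
        self.vnodes = vnodes_per_server
        self.ring = []  # sorted list of (hash, server) points on the ring

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server, weight=1):
        # Heterogeneity: higher-capacity servers get more virtual nodes.
        for i in range(self.vnodes * weight):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        # Only the keys owned by this server's virtual nodes move.
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, key):
        # Walk clockwise: the first virtual node at or after the key's hash.
        idx = bisect.bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]


ring = ConsistentHashRing()
ring.add_server("node-a")
ring.add_server("node-b", weight=2)  # twice the capacity, twice the virtual nodes
print(ring.get_server("key0"))       # e.g. "node-a"
```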

Data replication

Replication is important to achieve availability and reliability. We choose the first $N$ servers walking clockwise around the Consistent Hashing ring to replicate the data.

Pasted image 20230623204651.png

For example, here we replicate key0 with $N = 3$.

Note: These $N$ servers are actual physical servers, not virtual nodes. This guarantees that the data is spread across $N$ distinct servers.

For more reliability, replicas are placed at different data centers for disaster recovery.

Data Consistency, Synchronisation

Using the concepts of Consensus and Quorum, we can synchronise data by having a fixed number of nodes agree that a value is valid.

Quorum consistency

So if we have

  • $N$: total number of replica nodes
  • $W$: minimum number of nodes that must acknowledge a write
  • $R$: minimum number of nodes that must acknowledge a read

For example, in a system where $N = 3$ we have the following diagram.

Pasted image 20230623220624.png

We have the following conditions:

  • $W + R \gt N$: Strong consistency
    • If $W + R = N$, the $W$ write acknowledgers and the $R$ read acknowledgers could be completely disjoint sets of nodes.
    • When $W + R > N$, by the pigeonhole principle at least 1 node is an acknowledger for both the read and the write.
    • That overlapping node always has the latest data.
    • Therefore we get strong consistency.
  • $W + R \leq N$: Strong consistency not guaranteed
    • There can exist a read set containing no node that acknowledged the latest write.

So how do we choose $W$ and $R$? Consider the case where we want strong consistency (i.e. $W + R > N$):

  1. Fast write is important
    • We can set $W = 1$ and $R = N$.
    • Writes are very fast, since only 1 node needs to acknowledge for the write to register.
      • After the write, the system synchronises the remaining nodes in the background.
    • However, to make sure the data we read is consistent, we need to read from all $N$ nodes, which guarantees we have the latest data.
      • As a result, reads are very slow.
  2. Fast read is important
    • We can set $R = 1$ and $W = N$.
    • Reads are very fast: we just pick any single node to read from.
    • The trade-off is that writes are very slow, because every write must be acknowledged by all $N$ nodes before it completes.

How does it work in the background

For example:

  • every time you read, you read from $R$ nodes
  • every time you write, you write to $W$ nodes
  • in a system where $W + R > N$, it's mathematically impossible for the read set and the write set not to overlap in at least 1 node

Let's imagine a 3-node scenario:

  • Node A: {"user_id": 1, "balance": 100, "version": 5}
  • Node B: {"user_id": 1, "balance": 100, "version": 5}
  • Node C: {"user_id": 1, "balance": 100, "version": 5}

Let's choose $W = 2$ and $R = 2$, so $W + R > N$.

  1. Write balance = 120, choosing 2 random nodes since $W = 2$
    • Node B: {"user_id": 1, "balance": 120, "version": 6}
    • Node C: {"user_id": 1, "balance": 120, "version": 6}
  2. Now the data is
    • Node A: {"user_id": 1, "balance": 100, "version": 5}
    • Node B: {"user_id": 1, "balance": 120, "version": 6}
    • Node C: {"user_id": 1, "balance": 120, "version": 6}
  3. Try to read the latest balance with $R = 2$ (sketched after this walkthrough). It's impossible to pick a pair from A, B, C that would not return a version-6 replica:
    1. If you read from A and B, B has the higher version. The coordinator resolves it and returns the latest version with balance = 120.
      1. It realises that Node A is out of date and updates Node A to version 6. See: Anti Entropy > Passive anti-entropy
    2. If you read from C and A, the same thing happens.
    3. If you read from B and C, you of course have the latest version.
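
A minimal sketch of this quorum read, assuming each replica stores a versioned record and the coordinator performs the read repair described above (`read_quorum` is an illustrative name):

```python
import random

replicas = {
    "A": {"user_id": 1, "balance": 100, "version": 5},
    "B": {"user_id": 1, "balance": 120, "version": 6},
    "C": {"user_id": 1, "balance": 120, "version": 6},
}


def read_quorum(store, r=2):
    # Read from R randomly chosen replicas.
    chosen = random.sample(list(store), r)
    records = [(node, store[node]) for node in chosen]
    # Resolve: the highest version wins.
    latest = max(records, key=lambda nr: nr[1]["version"])[1]
    # Read repair: bring stale replicas in the read set up to date.
    for node, record in records:
        if record["version"] < latest["version"]:
            store[node] = dict(latest)
    return latest


# With W = 2 and R = 2 over N = 3, every read pair overlaps the write set,
# so this always returns balance 120, version 6.
print(read_quorum(replicas))
```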

Deciding the consistency model

  • Strong consistency: reads always return the most up-to-date value
  • Weak consistency: reads may not return the most up-to-date value
    • Eventual consistency:
      • A form of weak consistency
      • Given enough time, all updates are propagated and replicas become consistent. The same concept appears in BASE
      • To achieve this, set $W + R \leq N$: a single read might return stale data but is very fast
        • Use cases: social media feeds, view counts, etc.

Inconsistency resolution

For example, when two clients put different values for the same key at the same time, this can create a conflict between two servers.

Pasted image 20230623222658.png

In this case, we can use a Vector Clock to detect whether there is a conflict and then have our client resolve it. Vector clocks work not only for writes but for reads as well.
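
A minimal sketch of conflict detection with vector clocks, assuming each clock is a dict mapping a server id to its counter (the function names are illustrative):

```python
def happens_before(a, b):
    """True if clock a causally precedes clock b:
    a <= b on every server, and a != b."""
    servers = set(a) | set(b)
    return all(a.get(s, 0) <= b.get(s, 0) for s in servers) and a != b


def is_conflict(a, b):
    # Neither clock precedes the other: the writes were concurrent,
    # so the client must resolve the conflict.
    return not happens_before(a, b) and not happens_before(b, a)


v1 = {"server1": 2, "server2": 0}  # written via server1
v2 = {"server1": 1, "server2": 1}  # written via server2 from an older value
print(is_conflict(v1, v2))  # True: concurrent writes
```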

Error handling

Failure detection

We can use the Gossip Protocol to detect whether a node is down. It's not enough to conclude that a node is down just because one other node says so; we need a number of nodes to confirm it.

Compared to all-to-all multicasting, the Gossip Protocol is more efficient when we have many servers in our system.
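
A minimal sketch of the heartbeat mechanism behind gossip-based failure detection, assuming each node keeps a membership table of heartbeat counters and when each counter last grew (the names and the 5-second threshold are illustrative):

```python
import time

SUSPECT_AFTER = 5.0  # seconds without heartbeat progress before suspecting a node


class GossipNode:
    def __init__(self, name, peers):
        self.name = name
        # membership table: member -> [heartbeat counter, last time it grew]
        self.table = {m: [0, time.time()] for m in peers + [name]}

    def beat(self):
        # Each node periodically increments its own heartbeat counter.
        self.table[self.name][0] += 1
        self.table[self.name][1] = time.time()

    def gossip_to(self, other):
        # Push our table to a randomly chosen peer; the peer keeps the
        # higher heartbeat counter for every member.
        now = time.time()
        for member, (beats, _) in self.table.items():
            if member not in other.table or beats > other.table[member][0]:
                other.table[member] = [beats, now]

    def suspected_down(self):
        # A member whose counter has not grown for a while is suspected down;
        # the suspicion spreads via gossip until enough nodes confirm it.
        now = time.time()
        return [m for m, (_, seen) in self.table.items()
                if m != self.name and now - seen > SUSPECT_AFTER]
```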

Handling failure

After detecting a failure, we need to determine whether it's a:

  • Temporary failure
    • For a temporary failure, we can use a Sloppy Quorum to make sure that all writes and reads are still served even though a node is offline.
    • When a node is down, another node temporarily processes the down node's requests and hands the job back when that node is available again (Hinted Handoff, sketched below).
  • Permanent failure
    • For a permanent failure, we can use an Anti Entropy protocol to synchronise stale nodes. This is very helpful when we have a replica that potentially holds the same information as the unavailable node; we can then use Anti Entropy to synchronise from that replica.
  • Data center outage
    • For a data center outage, we make sure that our servers are distributed across multiple data centers.
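
A minimal sketch of hinted handoff for the temporary-failure case, assuming the coordinator queues writes for a down node and replays them on recovery (`hints`, `write`, and `on_recover` are illustrative names):

```python
from collections import defaultdict

hints = defaultdict(list)  # down node -> queued (key, value) writes


def write(store, alive, node, key, value):
    if alive[node]:
        store[node][key] = value
    else:
        # Sloppy quorum: a healthy node accepts the write on behalf of the
        # down node and keeps a hint so the write is not lost.
        hints[node].append((key, value))


def on_recover(store, alive, node):
    # Hinted handoff: replay the queued writes once the node is back.
    alive[node] = True
    for key, value in hints.pop(node, []):
        store[node][key] = value
```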

Design key-value store

Because the system is distributed, there must be a coordinator that acts as an entry point for clients to talk to.

Pasted image 20230628094406.png

  1. Clients communicate with the key-value store via a coordinator through two APIs:
    • get(key) and put(key, value)
    • The coordinator node takes requests from clients and passes them to the corresponding nodes.
  2. Nodes are distributed using Consistent Hashing as discussed in data partition^952639|.
    • Because of the nature of consistent hashing, adding and removing a node doesn't require any manual adjustment.
  3. Data is then replicated at multiple nodes.
  4. There is no single point of failure because every node shares the same responsibilities:
    Pasted image 20230628101334.png

When a node handles a write request

Note that this example is based on the architecture of Cassandra.

Pasted image 20230628102644.png

When the node receives a write from the client (a sketch follows these steps):

  1. The write request goes to a commit log file first.
    • This file stores the changes to the data; it ensures data recoverability in case of failure.
  2. The data is then saved to a memory cache.
  3. When the cache is full or reaches a predefined threshold, the data is flushed to an SSTable.
    • SSTables perform sequential writes, which are much faster than structures like B Trees and B+ Trees that use a lot of random I/O.
    • Since the data is sorted, we can perform Binary Search to quickly find data.
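
A minimal sketch of this write path, assuming an append-only commit log file, a plain dict as the memory cache (memtable), and sorted key order on flush (the `MEMTABLE_LIMIT` threshold and file name are illustrative):

```python
import json

MEMTABLE_LIMIT = 2  # illustrative flush threshold
memtable = {}       # in-memory cache of recent writes
sstables = []       # each flush produces one immutable sorted table


def put(key, value, log_path="commit.log"):
    # 1. Append to the commit log first, for crash recovery.
    with open(log_path, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
    # 2. Apply the write to the memtable.
    memtable[key] = value
    # 3. Flush to an SSTable once the memtable reaches the threshold.
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(sorted(memtable.items()))  # sorted -> sequential write
        memtable.clear()
```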

When a node handles a read request

  1. It checks whether the data is in the memory cache. If so, it returns the result.
    Pasted image 20230628103658.png
  2. If it's not in memory, we find the SSTable that contains the key. Pasted image 20230628103844.png
  3. Here we use a Bloom Filter to determine which SSTable holds the data.
  4. The SSTable returns the result.
  5. The result is passed back to the client (see the sketch below).
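
A minimal sketch of this read path, continuing the structures from the write-path sketch and assuming each flush also built a small Bloom filter over its table's keys (the `filters` list and the `BloomFilter` class are illustrative):

```python
import bisect
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits & (1 << pos) for pos in self._positions(key))


def get(key, memtable, sstables, filters):
    if key in memtable:  # 1. check the memory cache first
        return memtable[key]
    # Check the newest SSTable first, since later flushes hold fresher data.
    for table, bloom in zip(reversed(sstables), reversed(filters)):
        if not bloom.might_contain(key):  # 2-3. Bloom filter rules tables out
            continue
        i = bisect.bisect_left(table, (key,))  # 4. binary search on sorted keys
        if i < len(table) and table[i][0] == key:
            return table[i][1]
    return None  # 5. the result (or a miss) goes back to the client
```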