Vector Clock
Is a [server, version]
pair with a data item $D$ to detect conflicts in a distributed system. For example a data item can look like this:
D([S1, V1], [S2, V1], [S3, V2], ... [Sn,Vn])
This means the data is distributed equally to different servers. Each server keeps track of a version of the data. In this case, our data $D$ was distributed in $n$ different server. In which in $S_1$ it has 1 version, and in $S_3$ it was 2 versions - which means was modified twice.
Rules for writing data
When writing data $D$ to the server Sx
. The server must then increase its corresponding Vi
for [Sx, Vi]
if its exists. Otherwiese it should add a new entry.
For example:
- Client write data
D
that handled byS1
. S1
now addD
as a vector clock with valueD([S1,V1])
- Client read data
D
and write dataD
that again received byS1
. S1
now increment its vector clock inD
which results intoD([S1,V2])
- Client read data
D
and write dataD
that received byS2
. S2
now addD
as a vector clock with valueD([S1,V2], [S2, V1])
Conflict detection
In order to resolve a conflict, we need to start code our own client to do it. However, Vector Clock provides an easy way to detect confliction. A conflict happens if you can detect different in version if your ancestor (the previous change) vector's clock has any participant that greater than your current vector clock.
For example, supposed we have
D1([S0, 1], [S1, 2]) -> D2([S0, 2], [S1, 1])
In this case, D1 is the parent of D2, we compare each element in D1
and make sure it's <
than each corresponding element in D2
.
[S0, 1]
<[S0, 2]
: No conflict[S1, 2]
>[S1, 1]
: Conflict
Therefore this 2 keypair in here is a conflicts. Of course now you need to know which one comes first. So in a real-world system, a timestamp or other criteria is necessary to be tie-breaking rule.
Example
Considering the following scenario:
- Client write data
D
that served byS1
. S1
now addD
as a vector clock with valueD([S1,V1])
- Client write data
D
to the server.
Now for whatever reasons, network partition happens and the servers are not able to talk to eachother
S1
picks up the write and updateD([S1, V2])
S2
also picks up the write and updateD([S1, V1], [S2, V1])
Now network has been recovered. and the servers can talk to each other again.
- Client read
D
, however there is a conflict. Server doesn't know how to resolve this conflict and send both back to our client. Therefore our client see:D([S1, V2])
: value for the data isfoo
D([S1, V1], [S2, V1])
: value for the data isbar
- The client now needs to resolve this conflict. Let's supposed that the client decided to merge these changes and determined that the latest value is
bar
. We have the final valueD([S1, V2], [S2, V1])
with valuebar
- This value is then sent to the server let's say
S1
takes this value. This value is then get updated and store in the server:D([S1, V3], [S2, V1])
with valuebar
Note
For DynamoDB, conflict resolvation decided to resolve when there is a
client
that reads the conflicted value. Otherwise, it stays conflicted — therefore eventually consistentcyDue to the keypair
[server, version]
keeps growing, DynamoDB has set a threshold to remove the oldest pair when the size exceeded this limit. So far Amazon has not encountered any problem with this approach in production.
Resources: Vector Clocks and Conflicting Data - Grokking the Advanced System Design Interview (designgurus.io)