Database Sharding

There are multiple type of sharding, we need to consider doing the easy one first before continue with the hard one. The following order will be from easy to hard.

1. Vertical sharding

A table could be split to smaller table. For example:

Pasted image 20230930112059.png

In this example, we have defined 2 group of fields:

  • In frequently changes (blue)
  • Frequently changes (red)

If the red fields in here are updated more frequent, we can split that table out to one in-frequently table and one fre0quently table.

Pros:

  • reduce the potential lock for querying on the product_basic table
    Cons:
  • Table size still increasing
    • To address this problem, we can consider cold storage for archiving

2. Table partitioning

Pasted image 20230930112653.png

One table can create multiple partition for example:

CREATE TABLE sales (
    sale_id serial PRIMARY KEY,
    sale_date date,
    amount numeric
);

CREATE TABLE sales_january PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2023-01-31');

CREATE TABLE sales_february PARTITION OF sales
    FOR VALUES FROM ('2023-02-01') TO ('2023-02-28');

As a result, the data will be stored in the partition table. The main table will only be fore the schema and so on.

If you want each table in different location. You can create a table space:

CREATE TABLESPACE my_tablespace LOCATION '/path/to/disk1';

CREATE TABLE sales_january PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2023-01-31')
    TABLESPACE my_tablespace;

Pros:

  • Reduce the table size and potentially can store the table across different disk
    Cons:
  • Still one machine serving the database hence bottle neck at the machine level

3. Range based sharding

Pasted image 20230930115142.png

In here we can shard our table based on range. The sharded table can be located in different machine.

If one machine has a problem, it only affect a particular customer.

You can also choose to shard between different range, for example customer_id, user_id, location_id or so on.

Pros:

  • Reduce table size
  • Split the table to different machines, hence off load from one machine
  • Can still use incremental id as the primary key
    Cons:
  • hotspot — for example if we choose to shard our data based on the key location_id, if everyone stays in the same place then it will create a hotspot.

4. Key-based sharding

Pasted image 20230930115646.png

In here we have to use a hash function. The hashfunction has to be unique enough so that it distribute to different shard equally. We can useConsistent Hashing concept

Pros:

As a result, maybt this is more suitable for NoSQL, for Database/SQL is very complicated and has drawback.