Chapter 01 - Reliable, Scalable, And Maintainable Applications

CPU is rarely a problem, the problem is the throughput and amount of data

When design a system, there are 3 most important aspects:

  1. Reliability: how reliable your system is under software fault/hardware fault
  2. Scalability: how the system perform under high load
  3. Maintainability: how easy for new engineer to maintain it

Reliability

Means the application continues to work correctly even in the event that something failed. Sometimes it make sense to randomly fail something to see if the application can survive (i.e Netflix Chaos Monkey)

Hardware errors:

  • Hardware components often have errors, we can deal with this by adding RAID, Hotswap CPU etc.
  • Solution:
    • Multiple servers for the same hardware, we're leaning towards to build hard-ware fault tolerance application. i.e we afford to have a single server down for maintenance

Software errors:

  • Software bug, stale process that consuming rams, cascading failures from one process to the other
  • These failure are often hidden until there is a specific condition that trigger the problem.
  • Solution:
    • it's important to isolate process, testing, allow process to crash and restart etc

Human errors:

  • Configuration errors etc
  • Solution:
    • Make the API easy to do the right thing and discourage the wrong thing
    • Make throughough unittest or system integration, manual testing
    • Make fast to roll back config
    • Setup telemetry for monitoring errors - failure

Reliability is very important to the application whether it's small or large. In some scenario, we need to sacrifice reliability due to operation and development cost, we should be very conscious if we're cutting corners here.

Scalability

Is to answer the question of if the system grow to a certain level, what's our option to handle the growth.

Describe the load:

  • To answer the question related to growth, we need to find out what's our load parameter — what determine the load in the system, whether it's: number of request per seconds, number of concurrent user, number of jobs per server etc.
    • For example:
      • Twitter has the load parameter is number of follower per user. This is determine by its own operations: Tweet Post + view tweet from home

Describe Performance:

  • When we have the load parameter, we can then investigate what happen when we increase the load:
    1. Increase the load, keep the same number of resources (CPU/RAM/Network), how does the system perform?
    2. How much resources you need to increase to balance the load back
  • To determine this, we first need to figure out the performance number — what determine the server as good performance (throughput, latency, response time, etc…)

[!NOTE]
Latency: how long does it take until the request is taken to process
Response time: what the client see of the whole process end to end

Because of this, it's important to think of the response time as percentile (i.e the same request hits over and over again can have different Percentile).

  • p50 (median): Good metric on how long users typically have to wait
  • p95, p99, p99.9: tail latency — very important metrics since the outliers are often the one that process the majority of data.

Depends on this we can have our Service Level Agreement (SLA) i.e p50 is under 200ms and p99 is under 1s.

The majority of response time is slow because of head-of-line blocking: i.e the user experience a slow response because 1 request is taken very long time.

sequenceDiagram
        participant C as Client
        participant S as Server (1 worker)
        participant Q as Queue

        Note over C,Q: 3 requests arrive together. Worker handles ONE at a time.

        C->>Q: R1 (slow query, 500ms)
        C->>Q: R2 (fast, 5ms) — waits
        C->>Q: R3 (fast, 5ms) — waits

        Note over Q,S: R2 & R3 are stuck behind R1 (head of line)

        Q->>S: R1 starts
        activate S
        Note right of S: processing R1… 500ms
        S-->>C: R1 done @ 500ms
        deactivate S

        Q->>S: R2 starts (only now!)
        activate S
        S-->>C: R2 done @ 505ms
        deactivate S

        Q->>S: R3 starts
        activate S
        S-->>C: R3 done @ 510ms
        deactivate S

        Note over C: R2/R3 needed 5ms of work<br/>but waited ~500ms. Latency = queue + service.