Chapter 01 - Reliable, Scalable, And Maintainable Applications

CPU power is rarely the limiting factor. The harder problems are usually the amount of data and the rate at which it has to be processed (throughput).

When designing a system, three aspects matter most:

  1. Reliability: how well the system keeps working correctly under hardware faults, software faults, and human error
  2. Scalability: how the system performs as load grows
  3. Maintainability: how easy it is for engineers, including new ones, to work on and operate it

Reliability

Reliability means the application continues to work correctly even when something fails. Sometimes it makes sense to inject failures deliberately to check that the application survives them (e.g. Netflix's Chaos Monkey).

Hardware errors:

  • Hardware components fail regularly. The first response is to add redundancy to individual components, e.g. RAID for disks, hot-swappable CPUs.
  • Solution:
    • Run multiple servers instead of relying on a single machine. The trend is towards applications that tolerate hardware faults in software, e.g. we can afford to take a single server down for maintenance

Software errors:

  • Software bugs, runaway processes that use up RAM, cascading failures that spread from one process to another
  • These faults often stay hidden until a specific condition triggers them.
  • Solution:
    • Isolate processes, test thoroughly, allow processes to crash and restart, etc.
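
The "allow a process to crash and restart" idea can be sketched as a tiny supervisor loop. This is only an illustration under simplifying assumptions (the task is a plain function, and a crash is any exception); `run_with_restarts`, `make_flaky_worker`, and the restart limit are hypothetical names and values made up for the example.

```python
import time
from typing import Callable


def run_with_restarts(task: Callable[[], str], max_restarts: int = 3,
                      backoff_seconds: float = 0.0) -> str:
    """Run task; if it crashes, restart it up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return task()
        except Exception as exc:
            restarts += 1
            if restarts > max_restarts:
                # Give up and surface the failure instead of hiding it.
                raise RuntimeError(f"gave up after {max_restarts} restarts") from exc
            time.sleep(backoff_seconds * restarts)  # simple linear backoff


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def worker() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ValueError("simulated crash")
        return f"ok after {state['calls']} attempts"

    return worker


print(run_with_restarts(make_flaky_worker(2)))  # ok after 3 attempts
```

The restart limit matters: restarting forever would hide a fault that never goes away, which is the kind of hidden failure the notes warn about.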

Human errors:

  • Configuration errors, etc.
  • Solution:
    • Design APIs that make it easy to do the right thing and discourage the wrong thing
    • Test thoroughly: unit tests, whole-system integration tests, manual testing
    • Make it fast to roll back configuration changes
    • Set up telemetry to monitor errors and failures
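
As one way to picture fast configuration rollback, the sketch below keeps every applied configuration version in memory so that a bad change can be undone in one step. `VersionedConfig` and the `timeout_s` setting are hypothetical, invented for this example; a real system would persist the history and roll out changes gradually.

```python
class VersionedConfig:
    """Keeps every applied config version so a bad change can be undone quickly."""

    def __init__(self, initial: dict):
        self._history = [dict(initial)]

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot mutate the stored version.
        return dict(self._history[-1])

    def apply(self, changes: dict) -> None:
        """Record a new version made of the current config plus the changes."""
        self._history.append({**self._history[-1], **changes})

    def rollback(self) -> None:
        """Drop the latest version and return to the previous one."""
        if len(self._history) == 1:
            raise RuntimeError("nothing to roll back")
        self._history.pop()


config = VersionedConfig({"timeout_s": 30})
config.apply({"timeout_s": 0})   # a human error: zero timeout
config.rollback()                # one call restores the last good version
print(config.current)            # {'timeout_s': 30}
```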

Reliability matters for every application, small or large. In some scenarios we may sacrifice reliability to reduce development or operational cost, but we should be very conscious of when we are cutting corners.

Scalability

Scalability answers the question: if the system grows to a certain level, what are our options for handling that growth?

Describe the load:

  • To answer questions about growth, we first need to identify our load parameters, i.e. what determines the load on the system: requests per second, number of concurrent users, number of jobs per server, etc.
    • For example:
      • Twitter's key load parameter is the number of followers per user. It comes from its two main operations: posting a tweet and viewing the home timeline. Each posted tweet has to reach the home timeline of every follower
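
To see why followers per user drives the load, here is a rough sketch that counts home-timeline writes when every posted tweet is copied to each follower's timeline (often called fan-out on write). The `User` type, the function name, and all figures are made up for illustration; they are not Twitter's real numbers.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    followers: int
    tweets_per_day: int


def timeline_writes_per_day(users: list[User]) -> int:
    """Home-timeline writes per day if each tweet is delivered to every follower."""
    return sum(u.tweets_per_day * u.followers for u in users)


# Hypothetical users: the casual user tweets more often,
# but the celebrity's follower count dominates the load.
users = [
    User("casual", followers=100, tweets_per_day=10),
    User("celebrity", followers=30_000_000, tweets_per_day=2),
]
print(timeline_writes_per_day(users))  # 60001000
```

The number of tweets alone says little here: 12 tweets turn into about 60 million timeline writes, almost all of them caused by the one account with many followers.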

Describe Performance:

  • Once we have the load parameters, we can investigate what happens when the load increases:
    1. Increase the load (as measured by the load parameters) while keeping resources (CPU/RAM/network) the same: how does the system perform?
    2. How much do the resources need to increase to keep performance unchanged?
  • To answer this, we first need to pick performance numbers, i.e. what counts as good performance for the server (throughput, latency, response time, etc.)

[!NOTE]
Latency: how long a request waits before it starts being processed
Response time: what the client sees end to end, including waiting, processing, and network delays

Because response time varies from request to request, it's important to think of it as a distribution and report percentiles rather than a single number (e.g. the same request sent over and over can get a different response time each time).

  • p50 (median): a good measure of how long users typically have to wait
  • p95, p99, p99.9: tail latency. These matter a lot because the slowest requests often belong to the users with the most data, who tend to be the most valuable customers.

Based on these percentiles we can define a Service Level Agreement (SLA), e.g. p50 under 200ms and p99 under 1s.
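
A minimal sketch of computing percentiles from measured response times and checking them against the example thresholds above (p50 under 200ms, p99 under 1s). It uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names and the sample data are assumptions made for this example.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_sla(samples_ms: list[float], p50_limit_ms: float = 200,
              p99_limit_ms: float = 1000) -> bool:
    """True if the median and the 99th percentile are both under their limits."""
    return (percentile(samples_ms, 50) < p50_limit_ms
            and percentile(samples_ms, 99) < p99_limit_ms)


# Hypothetical measurements: 98 fast requests and 2 slow outliers.
samples_ms = [100] * 98 + [1500] * 2
print(percentile(samples_ms, 50))  # 100
print(percentile(samples_ms, 99))  # 1500
print(meets_sla(samples_ms))       # False
```

The median looks healthy at 100ms, yet the two outliers push p99 to 1500ms and break the threshold, which is why the tail percentiles are tracked separately.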

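At high percentiles, much of the response time is often queueing delay caused by head-of-line blocking: a server can only process a few requests in parallel, so one slow request holds up the requests queued behind it, and users see slow responses even when their own request is cheap.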

sequenceDiagram
        participant C as Client
        participant S as Server (1 worker)
        participant Q as Queue

        Note over C,Q: 3 requests arrive together. Worker handles ONE at a time.

        C->>Q: R1 (slow query, 500ms)
        C->>Q: R2 (fast, 5ms) — waits
        C->>Q: R3 (fast, 5ms) — waits

        Note over Q,S: R2 & R3 are stuck behind R1 (head of line)

        Q->>S: R1 starts
        activate S
        Note right of S: processing R1… 500ms
        S-->>C: R1 done @ 500ms
        deactivate S

        Q->>S: R2 starts (only now!)
        activate S
        S-->>C: R2 done @ 505ms
        deactivate S

        Q->>S: R3 starts
        activate S
        S-->>C: R3 done @ 510ms
        deactivate S

        Note over C: R2/R3 needed 5ms of work<br/>but waited ~500ms. Response time = queue + service.
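
The scenario in the diagram can be reproduced with a few lines of arithmetic. This sketch assumes all requests arrive at the same moment and a single worker serves them strictly in order; `fifo_response_times` is a hypothetical helper name.

```python
def fifo_response_times(service_times_ms: list[int]) -> list[int]:
    """Response time of each request when all arrive at t=0 and one worker
    serves them in order. Response time = time spent queued + own service time."""
    finished_at = 0
    response_times = []
    for service in service_times_ms:
        finished_at += service  # the worker is busy until this request ends
        response_times.append(finished_at)
    return response_times


print(fifo_response_times([500, 5, 5]))  # [500, 505, 510]  slow request first
print(fifo_response_times([5, 5, 500]))  # [5, 10, 510]     same work, slow request last
```

The total work is identical in both runs. Only the position of the slow request changes, and with it the response time seen by the two fast requests.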

Approaches for adapting to load

It's normal, even likely, that you will need to rethink your architecture on every order-of-magnitude increase in load.

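There are two ways to scale: vertically (scaling up to a more powerful machine) and horizontally (scaling out across multiple smaller machines). Good architectures usually use a mixture of both:

  • Using a more powerful machine is often simpler than using a bunch of smaller machines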

A system is elastic if it can add or remove computing resources automatically as load changes. A system that is scaled manually is much simpler. Elastic systems suit unpredictable workloads.
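
The core decision an elastic system makes, how many machines the current load needs, can be sketched as a single function. This assumes load is measured in requests per second and that every server handles a fixed, known number of them; the function name, parameters, and limits are hypothetical, and real autoscalers also smooth measurements over time to avoid flapping.

```python
def desired_servers(requests_per_second: int, capacity_per_server: int,
                    min_servers: int = 1, max_servers: int = 20) -> int:
    """Number of servers needed for the current load, kept within fixed bounds."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Integer ceiling division: round up so no server is overloaded.
    needed = (requests_per_second + capacity_per_server - 1) // capacity_per_server
    return max(min_servers, min(max_servers, needed))


print(desired_servers(2_500, capacity_per_server=1_000))      # 3
print(desired_servers(0, capacity_per_server=1_000))          # 1 (never below the minimum)
print(desired_servers(1_000_000, capacity_per_server=1_000))  # 20 (capped at the maximum)
```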

Breaking a stateful application into a distributed system introduces a lot of additional complexity.

The common wisdom is to keep your database on a single node until you're forced to scale out.

  • However, this common wisdom may change as the tools for distributed systems get better

The architecture of a distributed system needs to be tailored to your application; there is no one-size-fits-all solution. It is built around assumptions about the load parameters.

  • Because of this, in the early stages of an application it's more important to iterate quickly on product features than to optimize for hypothetical future load
  • Scalable architectures are still built from general-purpose building blocks

Maintainability

Maintainability covers fixing bugs, keeping systems operational, investigating failures, adapting to new platforms, modifying the system for new use cases, and adding new features.

Most people dislike maintaining legacy systems. We should design our software to minimize pain during maintenance, so that we avoid creating legacy software ourselves. There are three key points to focus on:

  1. Operability: make it easy for operations teams to keep the software running smoothly
    • Good software cannot run reliably with bad operations. Operations teams are vital to keeping the software running smoothly. They:
      • Monitor health and track down the causes of problems, such as system failures or degraded performance
      • Keep software and platforms up to date
      • Keep tabs on how different systems affect each other
      • Anticipate future problems before they occur, perform maintenance, apply security patches, etc.
  2. Simplicity: make it easy for new engineers to understand the system by removing as much complexity as possible.
    • Early in a project the code is simple and expressive; as the project gets larger, it often becomes very complex and difficult to understand.
      • This slows down everyone who needs to work on the system
    • Complexity makes maintenance hard: budgets and schedules overrun, and the risk of introducing bugs when making a change grows
    • One of the best tools for making a system simpler is abstraction, which also leads to higher-quality software. A well-abstracted component benefits every application that uses it
  3. Evolvability: make it easy for engineers to change the system in the future, adapting it to unanticipated use cases as requirements change
    • Making changes should be easy, since it's unlikely that the system's requirements will stay the same forever: we learn new facts, business priorities shift, users request new features
    • Agile working patterns provide a framework for adapting to change, with tools such as test-driven development and refactoring.
    • This is closely linked to simplicity and abstraction: simple software is easier to maintain and modify than complex software.

Summary

  1. An application has two types of requirements: functional requirements (what the application should do) and nonfunctional requirements (reliability, scalability, maintainability)
  2. Reliability: making the system work correctly even when faults occur, e.g. in the hardware, the software, or the humans operating it
  3. Scalability: strategies for keeping performance good, even when load increases
  4. Maintainability: making life better for the engineering and operations teams who need to work with the system