Replication Explained - Designing Distributed Systems (Part 2)


Replication in distributed systems involves the process in which we can have multiple copies of data at various locations to guard against unknown and random hardware failures, hence ensuring the availability of the system as a whole.

While considering replication, we have to consider the basic assumption that machines will have uncorrelated failures, otherwise, replication will not help in any way.

Replication should be considered or not, the number of replicas, etc all depends on the use case and the amount of inconvenience or how much it will cost you if you lose the data and compute power at a given point in time.

Replication is achieved, intuitively, when we have two servers, one primary and a replica server, and we have to keep them in synchronize in a way if the primary server fails at any point in time, the replica server should have everything it needs to take over from the primary and start behaving like nothing has ever happened from a client's perspective.

Replication architectures in distributed systems:

There are multiple ways to replicate data, but here we will focus on the two most widely used architectures, best suited architecture for you will depend on the different consistency models used like Linearizability, Sequential consistency, Causal consistency, Eventual consistency, etc.
  1. State Transfer(Passive Replication): One way we can keep primary and replica in sync, is we can configure the primary server to send a copy of its entire state, including RAM, Disk, etc to the replica server instantaneously or at some predefined frequency. The replica server just restores the entire state and takes over in case the primary server fails.
    • Upsides:
      • State transfer is a very robust approach for systems supporting multi-core parallelism.
      • If we are starting from an intermediate state, and the replica server is not aware of the current state of the primary server, this can be a good way to initiate the replication process.
      • Downsides:
        • Heavy data transfer has to be done, as the entire state has to be sent, and the time taken by network delay will add a lot to the latency to keep primary and replica in sync.
          • As an optimization, we can solve this by transferring only the delta of the previous state and current state, i.e. only the data that got updated/deleted. This way we can avoid heavy data transfer. Though this will reduce the amount of data transferred by a big factor, still it will be a bottleneck.
        • Frequency of data transfer from primary to the replica. As soon as data is updated on primary, we can do two things:
          • Send a copy of the data to the replica as soon as it is updated on the primary. This will incur more network calls, hence higher network latency and cost.
          • Batch the data for a certain amount of time, and send the copy of data as a batch to the replica. This will incur lower network calls, but the replica has to wait till the "batching time" to receive a new update, so it will be difficult to keep the replica in sync with the primary, and till the update is sent to replica, the replica will be in stale state.
    1. Replicated State Machine(Active Replication): This approach assumes that every application will have some internal operations until an external event happens upon the system. An external event can be in any form such as an API call, network packets arriving at a random time to the server, and so on. If we consider the system as a state machine, and there are no external events to a system, then it just executes one function after another, and every function is deterministic in nature and depends on what is in memory and registers. It's only when the external event comes which can change the state of the system. Replicated State Machine approach doesn't send the states from primary to backup instead they just send those input events from the external world. Intuitively, if two computer starts from the same state and they come across exactly same inputs, in the same order, at the same time, the two of them will continue to replicate each other at any point in time.
    [source: created using]

    Consider two servers, primary and replicated server, running virtual machines with the same application and OS, the goal is to make them identical irrespective of application or OS running on them.

    A client sends an input to the primary server in form of a network packet, which generates an interrupt that goes to the hypervisor(or virtual machine monitor).

    1. The hypervisor simulates the interrupt into the primary guest OS to deliver it to the primary application.
    2. Along with that, it sends a copy of the network packet to the hypervisor on the replicated server.
    3. Now both the primary and the replicated server will have a copy of the network packet and they process it in a similar manner.
    4. The primary application generates a reply/acknowledgment and sends it to the client.

    Dealing with Non-deterministic events:

    Non-deterministic events can be defined as events which when triggered can have different outcomes on different runs.
    Some common non-deterministic events that we deal with:
    • The input will always in the form of a network packet which = data + timing of interrupt(which depends on network interface card(NIC), as it is the hardware it can vary from server to server even though data arrives at the same time).
    • Randomized instructions - random number generator, unique id generator, time of day, current processor number, etc.
    Instruction like these have to be executed only on a single server, if we execute these types of instruction on replica as well as primary, they will have a different answer and will divert from each other. 
    We can execute these instructions on primary, and send the results to replicas through a logging channel, and replicas will get results for these particular instructions from the logging channel instead of computing on their own.

    These events have the form of a tuple of 3 fields (instruction number, instruction type, data):
    • Instruction number - the number of instructions since the machine booted, we need to know this to execute the instruction in the same order on the replica.
    • Instruction type - if the instruction is a network interrupt or randomized instruction.
    • Data - packet data in case of a network interrupt, the result of instruction in case of randomized instruction.

    Replicated State Machine approach has its own upsides and downsides:

    • Upsides
      • We only send the external events to the replica, there is no heavy data that is transferred across the network like in the other approach.
      • There is no network lag, or batch waiting time as was in the State Transfer approach, so the primary and replica can be assumed as fully synchronized with each other at any point in time.
    • Downsides
      • More complicated to design.
      • Rely on more assumptions about how computers operate.
      • The basic assumptions only deal with the single-core machine, is not clear how it can be extended to multi-core machines where interleaving of instructions is also non-deterministic.

      Q: What if the replica is somehow faster in execution than primary due to some reason, will it get ahead of primary as an event is sent to the replica as soon as it gets to primary?
      • We cannot let the replica get ahead of primary in execution.
      • For this, we'll need to have a buffer where events will wait in the replica that arrived from primary, and we will restrict that at least one event is always there in the events buffer.
      Q: Consider a scenario
      • when primary gets an update request from a client
      • primary made the update, sends an acknowledgment to the client, and fails/goes down, without sending an event to replica
      • and then replica takes over, as primary went down
      • the client will not see the update on the replica, though it already has got the acknowledgment of the update
      This is an inconsistent state that we were avoiding throughout!!

      We can solve this by restricting the acknowledgment to be only sent when we are sure we have sent the event to the replica's buffer.
      It's not we will restrict the execution on primary, it's just we will hold the acknowledgment output till then.



      Post a Comment

      Popular posts from this blog

      Designing Distributed File Storage Systems - Explained with Google File System(GFS) - Part 1

      How they work - The 3 Magical Functions of Python : map, filter and lambda