Designing Distributed File Storage Systems - Explained with Google File System(GFS) - Part 1


Original paper: The Google File System
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

A distributed file system(DFS) is a file system that allows you to access files stored in multiple hosts distributed across them.

In contrast to other data stores like MySQL or NoSQL which can also be made distributed, the data is DFS is stored in raw form and as files instead of a table or a fixed-document format.

The use cases where DFS is used are very vast. From storing media content like Youtube videos to Instagram images to blogs like these, anything that can be stored as a file, is big and is not suitable to store in other data stores, DFS is generally a preferred choice for everyone.

Key points to consider while designing a DFS:

Understanding from the perspective of using a distributed file system, there are some characteristics that need to be considered while choosing the data store and tweaked according to the business requirements of the project.
  1. Consistency - Strong consistency for a user means the system should behave as if the user is talking to a single server in spite of behind the scene they are in 1000s. If a user chooses to build a distributed system that is also consistent in nature, performance should be traded off for this, which is mostly the write performance. The general case is a write is acknowledged successful if and only if the writes to all the replicas are done, and the write performance takes a hit with this. Most of the systems are made eventual consistent due to this reason if it can be afforded for the busses use case. If we want strong consistency, it will trade-off performance, as it cannot send an acknowledgment until all replicas are updated.
  2. Parallel performance(through sharding and replication) - As most of the data is independent, we can split up the data(shard) to store in multiple machines to scale horizontally with an increased amount of data as well as traffic load. This will surely increase the query and maintenance complexity of storing data, as we have to manage the metadata at the application layer as well, but a good way to store if the data is very huge, like for a typical file data store. We can achieve increase query performance as most of the queries can be done parallelly, even for the same file if we split the file as well(we'll talk about chunking)
  3. Fault-tolerant (through replication) - As we want the system to be highly available, i.e. even if a few servers fail, it should continue working without any change for the end-user. This can be achieved with replication, i.e. duplicating and storing the same data across multiple machines. Though replication can be very tricky to achieve, and we have to trade off consistency to achieve this, this can certainly be very useful if losing a file, even a chunk of it can be very costly for the business. Ex: if we are trusting google drive to keep all our data for a lifetime, we don't want them to lose any of it, and losing even a little data can be very harmful for their reputation.
  4. Automatic failure recovery - As we will be dealing with 1000s of servers storing and maintaining the data, any failure if fixed manually can take some amount of time and human hours, which also cost the company. And if the servers are in 1000s, the chances of any kind of failure is very high. We can achieve this by persisting some metadata so that after any failure or restart, the node can backup the state from where it was before the failure.


Metadata service(servers)

Metadata service(servers) is responsible for keeping track of all the metadata like where a file is stored, how it is chunked, how it is replicated, update pattern(versions).

Considering the above points, we will be using a master-slave architecture means there will be a single node that will take care of the update to files and there will be replicas that will be used for reading the files, and in the case of the master failure, coordinate between them to select a new master among them.

Responsibilities of the master node:

  • master keeps mapping from filename to chunks
  • we query the master to ask for servers where the file is stored
  • the master contains two main tables
    • map of the file name to an array of chunk ids or chunks handles
    • maps chunk handle to 
      • a list of chunk servers
      • the version number of the chunk
      • which chunk servers are primary
  • all the tabular data should be kept in memory in master for faster reads and writes
    • periodically syncs all data to non-volatile memory(disks) in case of failure, the newly elected master will load these data from disk
Responsibilities of slave node(s):
  • keep all data in sync with the master
  • in case of failure, elect a new master among the remaining nodes to act as the primary server for redirection

Chunking service(servers)

The chunking service(servers) is responsible for chunking the file and storing it on disk(non-volatile). Files are chunked in smaller parts, as the files can be very big, and to update, we may not require the whole file to be loaded again and again. It will also be easier to replicate and synchronize small files across secondary servers and clients if multiple clients are connected at the same time.

How reads and writes happen?

Both metadata service and chunking service are involved in any kind of reads or write.

  • API: read(file_name, offset)
    • file_name: will be a unique identifier for the file
    • offset: will be the starting point in the whole file, from where it will start reading
  • the read request from the client will first go to metadata service, which will return
    • chunk handle for the file and offset
    • list of chunk servers, including the primary and secondary(replicas) servers
  • the above information will also be cached, as the same call will be very frequent
  • then the read request will go to the chunk server as read(chunk_handle, offset) where chunk_handle will have the information about the chunk version
    • this will return the chunk of the file from one of the secondary chunk servers
  • API: write(data, file_name)
    • data: is the data that will be appended to the end of the file
    • file_name: will be a unique identifier of file to be written
  • the request will go to metadata service, which will create a new chunk identifier for new data and
    • will store chunk handle, version
    • find up to data replicas(chunk servers)
      • elect one of them to be primary, others to be secondary
  • master of metadata server will increment the version number and writes it to disk
  • sends primary and secondary a message for the updated version number, elected primary server, and list of secondary servers for the given chunk handle
  • the primary server picks and sets the offset for the data
  • send all replicas a request to write at the offset
    • if all replicas respond with "success", returns "success" to the client
    • if any secondary server responds with "fail" or doesn't respond within a given time, returns "failed" to the client

WHAT IF the metadata service thinks the primary chunk server is dead?

This may be due to various scenarios such as network failure, packet loss, etc due to which the metadata service thinks the primary chunk server is dead, so it will elect and assign a new primary server and update its database.

But if in actuality, it's not dead, it can be the case that the primary server is even responding to the client and to other secondary servers as well, but disconnected from metadata service.

In this case, it will elect a new primary server, and we will be having two primary servers, which is a big problem.

This problem is called a "split-brain" problem which mainly occurs due to network partition.

To solve this, the master can assign a TTL to primary, i.e. for how much time it will be primary, and stop taking any requests after the TTL ends.
After TTL ends, it will check for available nodes and chooses one of them to be primary for a given lease again making the remaining secondary, and updates its database accordingly.

The value of TTL can be chosen as per the consistency and availability requirements, as it will not be updating it's primary even if the primary dies until the TTL ends.

* The details about replication and fault tolerance coming soon in part 2


Post a comment

Popular posts from this blog

Natural Language Interface for Relational Database - What is it and How it can be made?

How they work - The 3 Magical Functions of Python : map, filter and lambda