Designing Distributed File Storage Systems - Explained with the Google File System (GFS) - Part 1
A distributed file system (DFS) is a file system whose files are stored across multiple hosts but can be accessed as if they lived on a single machine.
In contrast to other data stores like MySQL or NoSQL databases, which can also be made distributed, the data in a DFS is stored in raw form as files rather than as tables or fixed-format documents.
The use cases for a DFS are vast. From media content like YouTube videos and Instagram images to blogs like this one: for anything that can be stored as a file, is big, and is not a good fit for other data stores, a DFS is generally the preferred choice.
Key points to consider while designing a DFS:
- Consistency - Strong consistency means the system behaves as if the user is talking to a single server, even though behind the scenes there may be thousands. Building a distributed system that is also strongly consistent means trading off performance, mostly write performance: in the general case, a write is acknowledged as successful if and only if it has been applied to all replicas, so the system cannot send an acknowledgment until every replica is updated, and write latency takes a hit. For this reason, most systems are made eventually consistent when the business use case can afford it.
- Parallel performance (through sharding and replication) - As most of the data is independent, we can split it up (shard) across multiple machines to scale horizontally with a growing amount of data as well as traffic load. This does increase the query and maintenance complexity, since we also have to manage the metadata at the application layer, but it is a good way to store data that is very large, as in a typical file store. Query performance increases because most queries can run in parallel, even within the same file if we split the file itself (we'll talk about chunking).
- Fault tolerance (through replication) - We want the system to be highly available, i.e. even if a few servers fail, it should continue working with no change for the end user. This can be achieved with replication, i.e. duplicating and storing the same data across multiple machines. Replication can be tricky to get right, and we have to trade off consistency to achieve it, but it is essential when losing a file, or even a chunk of one, would be very costly for the business. For example, if we trust Google Drive to keep all our data for a lifetime, we don't want it to lose any of it; losing even a little data would be very harmful to their reputation.
- Automatic failure recovery - As we will be dealing with thousands of servers storing and maintaining the data, any failure that is fixed manually takes time and human hours, which cost the company; and with thousands of servers, the chance of some failure at any moment is very high. We can achieve automatic recovery by persisting some metadata so that after any failure or restart, a node can restore the state it was in before the failure.
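The sharding idea above can be sketched in a few lines: split a file into fixed-size chunks so each chunk can live on a different machine and be read in parallel. GFS chose 64 MB chunks; the helper name below is illustrative, not from the paper.

```python
# A minimal sketch of sharding a file into fixed-size chunks.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size GFS chose

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split raw file bytes into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# 13 bytes with a toy 5-byte chunk size -> chunks of 5, 5 and 3 bytes
chunks = split_into_chunks(b"hello, chunks", chunk_size=5)
```

Each resulting chunk can then be placed on a different chunk server, which is what makes parallel reads of a single large file possible.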
Responsibilities of the master node:
- the master keeps the mapping from file names to chunks
- clients query the master to find the servers where a file is stored
- the master contains two main tables
- map of the file name to an array of chunk IDs (chunk handles)
- maps chunk handle to
- a list of chunk servers
- the version number of the chunk
- which chunk servers are primary
- all the tabular data is kept in memory in the master for faster reads and writes
- the master periodically syncs this data to non-volatile storage (disk); after a failure, the newly elected master loads it back from disk
- backup (shadow) masters keep all their data in sync with the master
- if the master fails, a new master is elected among the remaining nodes to act as the primary server for redirection
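The two master tables above can be sketched as plain in-memory maps. The class and field names here are assumptions for illustration, not GFS's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class ChunkInfo:
    servers: list          # chunk servers holding a replica of this chunk
    version: int           # version number of the chunk
    primary: str = ""      # which chunk server currently acts as primary

class MasterState:
    """Illustrative sketch of the master's in-memory metadata tables."""

    def __init__(self):
        # table 1: file name -> ordered list of chunk handles
        self.files = {}
        # table 2: chunk handle -> replica locations, version, primary
        self.chunks = {}

    def lookup(self, file_name, chunk_index):
        """Resolve (file, chunk index) to (chunk handle, its metadata)."""
        handle = self.files[file_name][chunk_index]
        return handle, self.chunks[handle]

master = MasterState()
master.files["movie.mp4"] = ["handle-0"]
master.chunks["handle-0"] = ChunkInfo(["cs1", "cs2", "cs3"], version=1, primary="cs1")
handle, info = master.lookup("movie.mp4", 0)
```

Keeping both tables in RAM is what makes the master fast; the periodic sync to disk (and the operation log) is what makes its state recoverable after a crash.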
How do reads and writes happen?
- API: read(file_name, offset)
- file_name: will be a unique identifier for the file
- offset: the position within the whole file from which reading starts
- the read request from the client first goes to the metadata service, which returns
- the chunk handle for the given file and offset
- the list of chunk servers, including the primary and secondary (replica) servers
- the client also caches this information, as the same lookup will be very frequent
- the read request then goes to a chunk server as read(chunk_handle, offset), where the chunk_handle carries the version information of the chunk
- this returns the chunk of the file from one of the chunk servers, which can be a secondary replica
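The read path above can be sketched end to end. The master table, the chunk store, and the cache here are plain dicts standing in for real RPCs; all names are illustrative:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS

# stand-in for the master: (file_name, chunk_index) -> (chunk_handle, servers)
master_table = {("movie.mp4", 0): ("handle-0", ["cs1", "cs2", "cs3"])}
# stand-in for the chunk servers: (server, chunk_handle) -> chunk bytes
chunk_store = {("cs2", "handle-0"): b"hello world"}
client_cache = {}

def read(file_name, offset):
    # translate the file offset into a chunk index + offset within the chunk
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
    key = (file_name, chunk_index)
    if key not in client_cache:          # ask the master only on a cache miss
        client_cache[key] = master_table[key]
    handle, servers = client_cache[key]
    # read from any replica that holds the chunk (here only cs2 has it)
    for server in servers:
        if (server, handle) in chunk_store:
            return chunk_store[(server, handle)][chunk_offset:]
    raise IOError("no replica available")

data = read("movie.mp4", 6)  # -> b"world"
```

Note how the client-side cache means repeated reads of the same chunk never touch the master, which keeps the single master from becoming a bottleneck.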
- API: write(data, file_name)
- data: the content that will be appended to the end of the file
- file_name: a unique identifier of the file to be written
- the request goes to the metadata service, which creates a new chunk identifier for the new data and
- stores the chunk handle and version
- finds the up-to-date replicas (chunk servers)
- elects one of them to be primary, the others to be secondary
- the master of the metadata service increments the version number and writes it to disk
- it then sends the primary and secondaries a message with the updated version number, the elected primary server, and the list of secondary servers for the given chunk handle
- the primary server picks and sets the offset for the data
- it sends all replicas a request to write at that offset
- if all replicas respond with "success", it returns "success" to the client
- if any secondary server responds with "fail" or doesn't respond within a given time, it returns "failed" to the client
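The append steps above can be sketched as follows. The classes and method names are illustrative stand-ins; a real system would use RPCs, leases, and timeouts rather than direct method calls:

```python
# Sketch of the write path: the primary picks the offset, forwards the write
# to all secondaries, and reports success only if every replica acknowledges.

class Replica:
    def __init__(self):
        self.chunks = {}  # chunk_handle -> bytearray of chunk contents

    def write_at(self, handle, offset, data):
        buf = self.chunks.setdefault(handle, bytearray())
        buf[offset:offset + len(data)] = data
        return True  # ack; a real secondary might fail or time out

class Primary(Replica):
    def append(self, handle, data, secondaries):
        # the primary picks and sets the offset for the new data...
        offset = len(self.chunks.get(handle, b""))
        self.write_at(handle, offset, data)
        # ...then asks every secondary to write at the same offset
        acks = [s.write_at(handle, offset, data) for s in secondaries]
        # success only if all replicas acknowledged the write
        return "success" if all(acks) else "failed"

primary, secondaries = Primary(), [Replica(), Replica()]
status = primary.append("handle-0", b"log entry", secondaries)
```

Because every replica is told to write at the offset the primary chose, all replicas end up with the data at the same position, which is what keeps the copies consistent when the write succeeds.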