Dropbox system design | Google Drive system design
Overview
In this blog, let’s understand the architecture of file storage systems like Google Drive or Dropbox. Most of us use file hosting services almost daily to upload, share, and edit files and important documents. This is a good system design interview question because it requires an understanding of scalability, concurrency, file storage, caching, and more. We will go through an overview of how to approach the design of such a huge system in just 45-60 minutes. This article uses Dropbox as the reference, but a similar approach applies to Google Drive and OneDrive as well.
Here, we will concentrate on designing Dropbox in 45 minutes.
At the start of the interview, do not jump straight into technical details. Begin by discussing the high-level ideas behind the system. Analyzing the problem and talking through your approach gives the interviewer a view of how you are going to design the system. There is a step-by-step guide for the system design interview.
Discuss the requirements
Before we start designing any system, we should make all our assumed functional and non-functional requirements clear with the interviewer. This is the stage to ask the interviewer as many questions as we need.
Functional Requirements:
- Users should be able to upload and download their files from any device.
- Users should be able to share files and folders with other users.
- The system should support file versioning, so that users can restore a previous version of a file.
- The system should support automatic synchronization across the devices.
- The system should support offline editing. Users should be able to add/delete/modify files offline and once they come online, changes should be synchronized to remote servers and other online devices.
- The system should support storing large files up to a GB. (Dropbox currently supports up to 50 GB)
Non-Functional Requirements:
- Reliability: The system should be highly reliable. Any file uploaded should not be lost. Users should be able to trust the system to store their important documents.
- Availability: The system should be highly available.
- Scalability: The system should scale so that users can get as much cloud storage as they need, as long as they are ready to pay for it.
- Interoperability: Users must be able to integrate their systems with Dropbox easily, and information must be exchanged smoothly.
Capacity Estimation
- Number of users = 500 Million
- Number of active users = 100 Million
- The average number of files stored by user = 200
- The average size of each file = 100 KB
Storage Estimations
- Total number of files = Number of users * Number of files = 500 Million * 200 = 100 Billion
- Total storage required = Number of files * Average size of file = 100 billion * 100 KB = 10 PB
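The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and decimal units are assumed (1 KB = 10^3 bytes, 1 PB = 10^15 bytes).

```python
# Back-of-the-envelope storage estimate using the figures listed above.
# Assumption: decimal units (1 KB = 10**3 bytes, 1 PB = 10**15 bytes).
TOTAL_USERS = 500_000_000
FILES_PER_USER = 200
AVG_FILE_SIZE_BYTES = 100 * 10**3  # 100 KB

total_files = TOTAL_USERS * FILES_PER_USER
total_bytes = total_files * AVG_FILE_SIZE_BYTES
total_petabytes = total_bytes / 10**15

print(f"Total files: {total_files:,}")
print(f"Total storage: {total_petabytes:.0f} PB")
```

Note that this estimate covers a single copy of every file. Replication and stored versions would multiply the figure.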
Design Consideration
Two aspects, space utilization and concurrency, become especially important in this design:
Bandwidth and cloud space utilization: Storing multiple versions of a file that changes many times significantly increases the bandwidth and space required. Re-uploading the whole file for every change is not a good idea, because each upload to the cloud is an overhead in itself. We need a design that overcomes this limitation.
Latency and concurrency: Uploading and downloading the complete file again and again also takes more time, so the system should perform these transfers in an optimized and efficient manner. In addition, when a file is handled as a single unit, multiple processes cannot be used to write it in parallel.
System APIs
- upload(string uploadToken, fileInfo file, userInfo user)
- edit(string authToken, fileInfo file, userInfo user)
- delete(string authToken, fileInfo file, userInfo user)
- download(string authToken, fileInfo file, userInfo user)
- generateToken(string userName, string password)
Component Design and Architecture
Now let’s design the high-level components of the system.
Client
The client is the desktop or mobile application installed on your device to access Dropbox. It keeps synchronizing files with the remote server to maintain consistency. Broadly, a client consists of four basic components: Watcher, Chunker, Indexer, and Internal DB. The basic responsibilities of a client are:
- Uploading and downloading files, and synchronizing file changes with the remote server.
- Handling concurrency issues when multiple operations are performed on a single file.
- Actively watching for changes to files.
- Maintaining metadata for each file, for which the client interacts with the Metadata DB, Messaging Queue, and Synchronization Service.
- Storing the actual files in cloud storage (like Amazon S3), which also takes care of data availability.
Let’s assume we have a 1 GB file and four changes are made to it. If the whole file is transferred on every change, it is sent to the remote server four times and downloaded by another client four times. In total, 4 GB of bandwidth is consumed for upload and 4 GB for download.
Also, if a connectivity issue occurs in the middle of a transfer, the client has to upload or download the complete file again.
We can overcome these problems by breaking files into multiple chunks. There is then no need to upload or download the whole file after every change. Only the chunk that was updated needs to be transferred, which takes less bandwidth and time. Chunks also make it easier to keep different versions of a file.
So far we have considered one file divided into chunks. With multiple files, we need to know which chunks belong to which file. To keep this information, we create one more file called a metadata file. It contains the index of the chunks (chunk names and their order). The metadata file records the hash of each chunk (or some other reference) and is itself synced to the cloud. We can download the metadata file from the cloud whenever we want and recreate the original file from its chunks.
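The chunk-plus-metadata idea can be sketched in Python as follows. This is a minimal illustration, not Dropbox's actual implementation: the function names, the manifest layout, the in-memory `chunk_store` dictionary standing in for cloud storage, and the tiny 4-byte chunk size are all assumptions made for the example.

```python
import hashlib

# Assumption: a tiny chunk size so the example is easy to follow.
# A real system would use much larger chunks.
CHUNK_SIZE = 4


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split the file content into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def build_manifest(file_name: str, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (manifest, chunk_store).

    manifest    plays the role of the metadata file: ordered chunk hashes.
    chunk_store stands in for cloud storage: hash -> chunk bytes.
    """
    chunk_store = {}
    ordered_hashes = []
    for chunk in split_into_chunks(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store[digest] = chunk
        ordered_hashes.append(digest)
    manifest = {"file_name": file_name, "chunks": ordered_hashes}
    return manifest, chunk_store


def rebuild_file(manifest: dict, chunk_store: dict) -> bytes:
    """Recreate the file by joining the chunks in the order the manifest lists."""
    return b"".join(chunk_store[digest] for digest in manifest["chunks"])


original = b"hello dropbox design"
manifest, chunk_store = build_manifest("notes.txt", original)
restored = rebuild_file(manifest, chunk_store)
print(len(manifest["chunks"]), restored == original)
```

Keying chunks by their content hash is a design choice in this sketch. It means identical chunks are stored only once, and the hash doubles as an integrity check when a chunk is downloaded.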
Client Components
- The Watcher monitors the synchronization folder and keeps track of operations performed by the user, such as creating, updating, or deleting files and folders. It notifies the Indexer and Chunker of any such change.
- The Chunker, as the name suggests, breaks a file into multiple small pieces called chunks and uploads them to cloud storage, each with a unique id. It can also recreate a file by joining its chunks together. When a file is updated, the Chunker uploads only the chunks that changed instead of the whole file, which saves bandwidth and reduces latency.
- The Indexer updates the internal database as soon as the Watcher notifies it of changes to files or folders. It also receives from the Chunker the URL and unique id of each uploaded chunk, and updates the file's record with the modified chunk. Once the chunks are successfully submitted to cloud storage, the Indexer communicates with the Synchronization Service through the Message Queue.
- The Internal DB stores the metadata of all files and their chunks, including their versions and their locations in the file system.
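How the Chunker might decide which chunks to upload can be sketched by comparing the chunk hashes of the old and new versions of a file. The helper names and the 1 KB chunk size below are hypothetical, chosen only for this example.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int) -> list:
    """Hash every fixed-size chunk of the file, in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunk_indexes(old_hashes: list, new_hashes: list) -> list:
    """Indexes of chunks in the new version that are new or differ from the old one."""
    changed = []
    for index, digest in enumerate(new_hashes):
        if index >= len(old_hashes) or old_hashes[index] != digest:
            changed.append(index)
    return changed


CHUNK_SIZE = 1024                      # assumption: 1 KB chunks for the demo
old_data = bytes(10 * CHUNK_SIZE)      # ten zero-filled chunks
edited = bytearray(old_data)
edited[3 * CHUNK_SIZE] = 1             # change a single byte inside chunk 3
new_data = bytes(edited)

to_upload = changed_chunk_indexes(
    chunk_hashes(old_data, CHUNK_SIZE),
    chunk_hashes(new_data, CHUNK_SIZE),
)
bytes_uploaded = len(to_upload) * CHUNK_SIZE
print(to_upload, bytes_uploaded, len(new_data))
```

Here only one chunk out of ten needs to be uploaded. One limitation of this simple fixed-size scheme is that inserting bytes near the start of a file shifts every later chunk boundary, so all following chunks appear changed.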
Other Components:
Metadata Database
The metadata database maintains the indices of the chunks. As the name suggests, it stores metadata: the names of files and chunks and their different versions, along with information about users and workspaces. Consistency plays a very important role here because multiple clients may work on the same file, so an RDBMS is preferred over NoSQL databases (many NoSQL stores offer only eventual consistency by default). The drawback of an RDBMS is that it is harder to scale, which can be addressed by sharding the database. The metadata store can also be implemented using a lightweight database like SQLite. As per Dropbox’s design, file metadata is stored in a MySQL-backed database service and is sharded and replicated as needed to meet performance and high availability requirements.
{
  "chunk_id": "string",
  "chunk_order": "number",
  "object": {
    "version": "number",
    "is_folder": "boolean",
    "modified": "number",
    "file_name": "string",
    "file_extension": "string",
    "file_size": "number",
    "file_path": "string"
  },
  "user": {
    "user_name": "string",
    "email": "string",
    "quota_limit": "number",
    "quota_used": "number",
    "device": {
      "device_name": "string",
      "sync_folder": "string"
    }
  }
}
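The sharding mentioned for the metadata database can be illustrated with a small routing function. This is a hedged sketch only: sharding by `user_name`, the shard count of 8, and the function name are all assumptions for the example, not details of Dropbox's actual setup.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical number of metadata database shards


def shard_for_user(user_name: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard with a stable hash.

    hashlib is used instead of the built-in hash(), because hash() of a
    string can differ between Python processes.
    """
    digest = hashlib.sha256(user_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for name in ("alice", "bob", "carol"):
    print(name, "-> shard", shard_for_user(name))
```

Routing by user keeps all of one user's metadata on the same shard, which keeps per-user queries on a single database. A plain modulo scheme like this one remaps most users when the shard count changes, so a growing system might prefer consistent hashing or a lookup table instead.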
Message Queuing Service
The messaging queue provides asynchronous communication between the client and the Synchronization Service.
Why do we need a messaging service?
- It handles the bulk of read and write requests.
- With a large volume of messages, availability and reliability are major requirements, and a messaging queue helps provide both.
- It offers high performance and high scalability.
- It enables load balancing and elasticity across multiple instances of the Synchronization Service.
Components of Messaging Queue:
Request Queue
This queue is shared among all clients. Whenever a client makes an update or a request, it is sent to the request queue. The Synchronization Service then picks it up and updates the Metadata database.
Response Queue
Each client has its own separate response queue. Together with the Synchronization Service, these queues keep all files and folders in sync with the remote server. As soon as a file is updated, the Synchronization Service publishes the change to the response queues, and each response queue delivers it to its client so the change can be applied locally. The most important advantage of using a messaging queue is that a client does not lose an update even if it is offline. Once it comes back online, it can catch up and be in sync with the remote server.
Synchronization Service
This is one of the most important components of the Dropbox design. It is responsible for keeping all files and folders on the local side in sync with the content on the remote server. The client communicates with the Synchronization Service to fetch the latest details from Cloud Storage or to push updates to Cloud Storage.
It receives a request from the request queue and updates the metadata database accordingly. Once the update is done, it broadcasts the update to all other clients through their response queues. Each client’s Indexer can then fetch the changed chunks from Cloud Storage and recreate the file with the latest updates.
Each client also updates its own local database with the new information from the Metadata database to stay consistent with the remote server.
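The flow through the request queue, the Synchronization Service, and the per-client response queues can be modelled in a few lines. This is a single-process sketch using Python's standard `queue` module as a stand-in for a real message broker. The class, its method names, and the message fields are hypothetical.

```python
import queue


class SynchronizationService:
    """In-memory stand-in for the Synchronization Service (illustrative only)."""

    def __init__(self):
        self.request_queue = queue.Queue()  # shared by all clients
        self.response_queues = {}           # client_id -> that client's own queue
        self.metadata_db = {}               # file_path -> {"version", "chunks"}

    def register_client(self, client_id):
        self.response_queues[client_id] = queue.Queue()

    def submit_update(self, client_id, file_path, chunk_hashes):
        """Called by a client after its chunks are uploaded to cloud storage."""
        self.request_queue.put(
            {"client_id": client_id, "file_path": file_path, "chunks": chunk_hashes}
        )

    def process_requests(self):
        """Drain the request queue, update metadata, notify the other clients."""
        while True:
            try:
                request = self.request_queue.get_nowait()
            except queue.Empty:
                break
            path = request["file_path"]
            previous = self.metadata_db.get(path, {"version": 0})
            record = {"version": previous["version"] + 1, "chunks": request["chunks"]}
            self.metadata_db[path] = record
            update = {"file_path": path, **record}
            for client_id, response_queue in self.response_queues.items():
                if client_id != request["client_id"]:
                    response_queue.put(update)


service = SynchronizationService()
service.register_client("laptop")
service.register_client("phone")
service.submit_update("laptop", "/notes.txt", ["hash-a", "hash-b"])
service.process_requests()

# The phone may have been offline; the update waits in its queue until it reads it.
phone_update = service.response_queues["phone"].get_nowait()
print(phone_update)
```

Because each client has its own queue, an update stays buffered for an offline client while online clients consume theirs immediately. A production system would need a durable broker so that queued updates survive restarts.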
Cloud Storage
You can use any cloud storage service, like Amazon S3, to store the chunks of a file. The client then communicates with the cloud storage through the APIs exposed by the cloud service provider to upload chunks and to update its local version of files.
About Author
Surendra Lalwani is a passionate software engineer, and large-scale systems are his favorite. He has strong expertise in big data technologies. He has worked with companies like Impetus, ServiceNow, and Morgan Stanley. His hobbies include participating in coding competitions and hackathons.
Contact him on LinkedIn