Tuesday, January 5, 2021

WhatsApp System Design

Problem: We need to design a real-time messaging system like WhatsApp or WeChat. We can't cover every feature of WhatsApp, so I came up with the following feature set to discuss:

  1. One to one messaging
  2. Acknowledgement of message
  3. Last Seen
  4. Media message
  5. Group messaging

Design: 

1. One to one messaging:

Let's say UserA wants to send a message to UserB, and UserA knows the address of UserB (for example, on an intranet). In that case we don't need anything extra: UserA can send the message directly. The design will look like:


Right, there is no need for a server, as both users know each other's address. In the real world it is a totally different story. On the internet it is almost impossible for one device to know the address of another, so we need a middleman: a server whose address is known by all of its clients. Whenever one client, say UserA, wants to send a message to another client, say UserB, UserA just asks the server to deliver the message to UserB. Now the design will look like:



Now let's look at the server in more detail. First, what kind of connection do we want between client and server? We could use HTTP, but it causes a problem. Say UserA wants to send a message to UserB. UserA can simply send a request to the server, no problem there, but how will UserB get the message? With plain HTTP the server cannot initiate a request to the client; HTTP is a pull protocol. A workaround is HTTP long polling, but it makes the design unnecessarily complex.

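To keep it simple, we can use web sockets (WSS), which give us a persistent, bidirectional (full-duplex) channel over a single TCP connection. The client still opens the connection, but once it is established either side can send a message at any time, so the server can push a message to UserB without waiting for a request. So I am going to use WSS only.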

Given the huge traffic WhatsApp handles, one server obviously won't be enough. We will need lots of servers and a load balancer to spread the load across them. The load balancer can route a client request to a particular server based on load, or based on a sticky session, which tells the load balancer that this client was previously connected to a particular server.
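The two routing policies mentioned above can be sketched roughly as follows. This is only an illustration: the class `LoadBalancer`, its methods, and the use of open-connection counts as the load metric are my assumptions, not how any real load balancer is configured.

```python
class LoadBalancer:
    """Toy balancer: sticky session first, otherwise least-loaded server."""

    def __init__(self, servers):
        # Assumed load metric: number of connections handed to each server.
        self.load = {server: 0 for server in servers}
        self.sticky = {}  # client id -> server it was last sent to

    def pick_server(self, client_id):
        server = self.sticky.get(client_id)
        if server is None or server not in self.load:
            # No usable sticky session: choose the least-loaded server.
            server = min(self.load, key=self.load.get)
            self.sticky[client_id] = server
        self.load[server] += 1
        return server

    def remove_server(self, server):
        # Would be called when a health check (not shown) finds a dead server.
        self.load.pop(server, None)


lb = LoadBalancer(["Server 1", "Server 2"])
first = lb.pick_server("UserA")
again = lb.pick_server("UserA")  # sticky: same server as before
other = lb.pick_server("UserB")  # goes to the less loaded server
```

Sticky sessions are checked first so that a returning client lands on the server it used before; the load-based choice is the fallback.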

Let's not jump further and see what the design looks like so far. I am putting the DB and cache in the image below too. Don't worry about them yet; I will explain them later.





Now let's see what happens when a client connects to the server for the first time. The client initiates a connection request, and that request goes to the load balancer. The load balancer chooses a server based on load to serve this request. Say the load balancer chose Server 2, as shown in the above image. Now a WSS connection is established between Server 2 and UserA. At this point Server 2 makes an entry in the cache. This cache could be any popular distributed cache like Redis or Memcached. The entry will look like:


Client Id | Server Id
UserA     | Server 2
UserB     | Server 4
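As a rough sketch, the client-to-server mapping in this table could be wrapped like this. A plain dictionary stands in for the distributed cache, and the names `SessionRegistry`, `register`, `unregister` and `lookup` are hypothetical. The guard in `unregister` is my own addition, not something the design above requires.

```python
class SessionRegistry:
    """In-process stand-in for the distributed cache (e.g. Redis)."""

    def __init__(self):
        self._server_by_client = {}

    def register(self, client_id, server_id):
        # A reconnect simply overwrites the previous entry.
        self._server_by_client[client_id] = server_id

    def unregister(self, client_id, server_id):
        # Assumed safeguard: only remove the entry if it still points at the
        # server that is closing, so a stale close cannot erase a newer one.
        if self._server_by_client.get(client_id) == server_id:
            del self._server_by_client[client_id]

    def lookup(self, client_id):
        # Returns None when the client is not connected anywhere.
        return self._server_by_client.get(client_id)


registry = SessionRegistry()
registry.register("UserA", "Server 2")
registry.register("UserB", "Server 4")
```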


As you can see, I have made a cache entry for UserB too, since the same steps are followed when UserB connects. Once these entries are made and the WSS connections are established, the clients can either talk directly to their servers or keep going through the load balancer using sticky sessions. I prefer the former approach, where the client talks directly to its server over the web socket. It also reduces the load on the load balancers. There are a few questions here:

1. What if Server 2 dies, how will UserA operate? The answer is simple: if its server crashes, the client will find out because it stops getting responses. The client can then check whether its own internet connection is down or the server has actually crashed. Once the client knows that the server has crashed, after some retries (if any), it can send a new connection request to the load balancer. That follows the same process as before and establishes a new WSS connection between the client and a new server, overwriting the cache entry for UserA.

2. Why put this entry into a cache? First, a cache is faster than a DB, and we will be using this entry for frequent operations, as we will see later. Second, these entries are small, so even for 1 billion users we don't need many machines.

3. How long to retain the WSS connection and cache entry? We can retain the WSS connection as long as possible. If the client drops off the network, the connection is obviously gone. If there is no activity from the client, say no message sent or received for longer than some threshold, the server can close the connection. Whenever a connection is closed, we can remove its entry from the cache too.
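The idle-timeout rule from question 3 could look something like the sketch below. The names `ConnectionTracker`, `touch` and `close_idle` are made up for illustration, and the injectable `clock` is only there to make the behaviour easy to demonstrate.

```python
import time


class ConnectionTracker:
    """Tracks the last activity per client and closes idle connections."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        self.last_activity = {}  # client id -> time of last send/receive

    def touch(self, client_id):
        # Call on connect and on every message sent or received.
        self.last_activity[client_id] = self.clock()

    def close_idle(self, on_close):
        # on_close is a hypothetical callback that closes the socket and
        # removes the client's entry from the cache.
        now = self.clock()
        idle = [client for client, seen in self.last_activity.items()
                if now - seen > self.idle_timeout_s]
        for client_id in idle:
            del self.last_activity[client_id]
            on_close(client_id)
        return idle
```

A server would call `close_idle` periodically, for example from a timer, rather than on every message.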

Fine. The final step is to understand how a server machine internally manages so many web socket TCP connections. In this design the server process attaches a thread to each connection; say the server creates a very lightweight thread to handle each one. Once the connection has been made, the server makes an entry in a local, in-memory cache:

Client Id | Thread Id
UserA     | Thread 1
UserC     | Thread 2


Is it the right approach to keep this cache in memory? I think so, as the data for one server's connections should fit in the memory of a single machine, but it is debatable. If that is not the case, we can attach an external cache, or reuse the global cache that we write to at connection time. In that case the first cache will look like:

Client Id | Server Id | Thread Id
UserA     | Server 2  | Thread 1
UserB     | Server 4  | Thread 7


We can also mix the two approaches: use the in-memory cache for frequently active clients and the global cache for the others. Whatever approach works at this data size is fine. So now our server will look like the following:



Fine. Now that we have understood how connections are established and managed, let's come back to our original feature, one to one messaging. Here are the steps for UserA sending a message to UserB:

  1. UserA sends the message "To: UserB, Text: Hi" to Server 2 over the established connection. Before that, the client saves the message in its local device database, such as SQLite.
  2. Server 2's Thread 1 validates that the WSS connection and user id are correct.
  3. Thread 1 looks for UserB in Server 2's local cache. If UserB is found, it just puts the message on the corresponding thread's queue. If not, it sends the message to a different microservice, say the "Messaging service", which checks the global cache to find UserB's server. In this case it is Server 4. Once the record is found, the Messaging service hands the message over to Server 4. The Messaging service also asynchronously writes the message, with a timestamp, to the database by calling the database server's API.
  4. Once Server 4 receives the message, it looks up the thread id for UserB in its local cache and puts the message on that thread's queue. The corresponding thread, Thread 7, then pushes the message to client UserB.
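The steps above can be sketched roughly as follows. Everything here is a simplified stand-in: `ChatServer`, `MessagingService` and their methods are hypothetical names, a per-client `deque` plays the role of the handling thread's queue, a dictionary plays the global cache, and a list plays the database that would really be written asynchronously.

```python
from collections import deque


class ChatServer:
    def __init__(self, server_id, global_cache, messaging_service):
        self.server_id = server_id
        self.global_cache = global_cache            # client id -> server id
        self.messaging_service = messaging_service
        self.queues = {}                            # local cache: client id -> queue

    def connect(self, client_id):
        self.queues[client_id] = deque()
        self.global_cache[client_id] = self.server_id

    def send(self, sender, recipient, text):
        message = {"from": sender, "to": recipient, "text": text}
        if recipient in self.queues:
            # Step 3, local case: recipient is connected to this server.
            self.queues[recipient].append(message)
            return "local"
        # Step 3, remote case: hand over to the Messaging service.
        return self.messaging_service.route(message)

    def deliver(self, message):
        # Step 4: enqueue for the thread that pushes to the recipient.
        self.queues[message["to"]].append(message)


class MessagingService:
    def __init__(self, global_cache):
        self.global_cache = global_cache
        self.servers = {}    # server id -> ChatServer
        self.database = []   # stand-in for the asynchronous DB write

    def route(self, message):
        self.database.append(message)
        server_id = self.global_cache.get(message["to"])
        if server_id is None:
            return "stored"  # recipient offline; the message waits in the DB
        self.servers[server_id].deliver(message)
        return "forwarded"


cache = {}
service = MessagingService(cache)
server2 = ChatServer("Server 2", cache, service)
server4 = ChatServer("Server 4", cache, service)
service.servers = {"Server 2": server2, "Server 4": server4}
server2.connect("UserA")
server4.connect("UserB")
result = server2.send("UserA", "UserB", "Hi")
```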

Here is the flow:


Now a few questions here:

1. What if UserB is not connected? That's why we are saving the message in the database. Once UserB connects again, all pending messages from the database will be delivered to UserB.

2. What if Server 4 dies after receiving the message? Again the same thing. If Server 4 dies, UserB will make a new connection request, and once the connection is established, as in question 1, the messages will be read from the database and delivered to UserB.

The only problem here is that we will be bombarding the DB with lots of requests. The solutions could be:

1. Save a message to the DB only when the recipient is not connected, and ignore the case where the server dies after receiving the message, which is rare anyway. This saves a lot of DB requests.

2. Write another microservice in front of the DB which receives all the messages and writes them to the DB in batches.
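A minimal sketch of the batching idea in option 2, assuming a hypothetical `write_batch` callable that performs one bulk insert into the DB. `BatchWriter` and its methods are illustrative names only.

```python
class BatchWriter:
    """Buffers messages and writes them to the DB in batches."""

    def __init__(self, write_batch, batch_size=100):
        self.write_batch = write_batch  # assumed: one bulk insert per call
        self.batch_size = batch_size
        self.pending = []

    def add(self, message):
        self.pending.append(message)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Also worth calling on a timer, so a quiet period does not leave
        # messages sitting in memory indefinitely.
        if self.pending:
            batch, self.pending = self.pending, []
            self.write_batch(batch)
```

The trade-off is that messages still in the buffer are lost if this service crashes before a flush, so the batch size and flush interval bound how much can be lost.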


2. Acknowledgement of messages:

We have understood the flow of messages. Now let's understand how acknowledgements work. In WhatsApp, there are three types of acknowledgement:
  1. Single tick: The message was sent by the sender and reached the server.
  2. Double ticks: The message reached the receiver.
  3. Blue ticks: The receiver read the message.
Here is the flow of acknowledgements:
  1. Once the server receives the message from the client, it sends an acknowledgement back to the client saying the message was received. That's ack #1; single tick.
  2. Once the receiver gets the message, it sends a message back to the sender, something like: "To: UserA, MSG Id # {message id} Recvd, Type: ACK". This message is sent to UserA using the same steps we followed to send a message in the section above (the DB write can be skipped). That's our ack #2; double tick.
  3. Similarly, once the receiver reads the message, it sends a similar message to UserA, something like "To: UserA, MSG Id # {message id} Read, Type: ACK". Once it is received on UserA's device, it acts as our ack #3; blue tick.
That's all about this feature. Please note that if we opted for the approach where every message is saved to the DB on send, we can remove the message from the DB once it is delivered to the client. That is, at the time of ack #2 we can remove the message entry from the DB.
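On the sender's device, the three acknowledgements can be tracked as a small status per message. The sketch below is only one way to do it: `Status` and `Outbox` are hypothetical names, and the rule that a status never moves backwards is my own assumption, added to cope with acks that arrive out of order.

```python
from enum import IntEnum


class Status(IntEnum):
    PENDING = 0    # saved locally, no ack yet
    SENT = 1       # ack #1: single tick
    DELIVERED = 2  # ack #2: double tick
    READ = 3       # ack #3: blue tick


class Outbox:
    def __init__(self):
        self.status = {}  # message id -> Status

    def queue(self, message_id):
        self.status[message_id] = Status.PENDING

    def on_ack(self, message_id, ack):
        # Assumption: acks may arrive out of order, so keep the highest seen.
        current = self.status.get(message_id, Status.PENDING)
        self.status[message_id] = max(current, ack)
        return self.status[message_id]


outbox = Outbox()
outbox.queue("m1")
outbox.on_ack("m1", Status.SENT)
outbox.on_ack("m1", Status.READ)
final = outbox.on_ack("m1", Status.DELIVERED)  # late ack does not downgrade
```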


3. Last seen: 

To enable this feature, let's add another microservice to our system: the LastSeen microservice. Let's also add another field to our global cache:

Client Id | Server Id | Last activity at
UserA     | Server 2  | Timestamp1
UserB     | Server 4  | Timestamp2


Once we have added this field to our cache, it is very simple to enable this feature. Say UserB wants to know the last seen time of UserA. UserB queries the LastSeen microservice for the last activity time of UserA, and the service reads this value from the cache and returns it to the client.

Now let's see when to update the last activity time of a client. We can use either of the two approaches below:
  1. Whenever a user sends anything, whether it's an actual message, a read acknowledgement, or even a query for another client's last seen time, we asynchronously update the entry in the cache.
  2. The client app sends a keep-alive message at a regular interval whenever the user is using the app.
Keep-alive messages give more accurate timing, but they keep the server busy. Each one is just a cache update, which is very fast, but there will still be a large number of requests. If we opt for approach #1, we don't need any additional messages, but the timing might not be as accurate. In practice, though, a user who is using the app will usually at least be querying the last seen time of another user.
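Approach #1 can be sketched as below, with a dictionary standing in for the "Last activity at" column of the global cache. `LastSeenService`, `handle_request` and the request shapes are hypothetical, and the update is done synchronously here only to keep the example short.

```python
import time


class LastSeenService:
    def __init__(self, clock=time.time):
        self.clock = clock
        self.last_activity = {}  # client id -> timestamp of last activity

    def record_activity(self, client_id):
        self.last_activity[client_id] = self.clock()

    def last_seen(self, client_id):
        # None means we have never seen this client.
        return self.last_activity.get(client_id)


def handle_request(service, sender, request):
    # Any request from a client counts as activity, including a query
    # for somebody else's last seen time.
    service.record_activity(sender)
    if request["type"] == "last_seen_query":
        return service.last_seen(request["user"])
    return None
```

With approach #2, the keep-alive handler would simply call `record_activity` as well, so the service itself stays the same.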


4.  Media message:

In WhatsApp we can send images, audio and video too. Let's see how we can design this feature. For this we will introduce a new server, the HTTP Media Server. Let's take an example where UserA wants to send an image to UserB. We can follow these steps:
  1. UserA sends an HTTP request to the HTTP Media Server to upload the image. The server might store the image on a file system with its metadata in a DB, or store it in blob storage. Once the upload succeeds, the server returns a hash or unique id for the image to the client.
  2. Once the hash is received, UserA sends the message "To: UserB, Type: Media, MediaType: JPG, Hash: {received_hash}" to UserB.
  3. Once UserB receives this message, it looks at the message type and sends a request to the HTTP Media Server to download the image using the MediaType and the Hash.
  4. UserB gets the image :).
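The steps above can be sketched with an in-memory stand-in for the media server. `MediaServer` and `build_media_message` are hypothetical names. Using a SHA-256 hash of the content as the id is my assumption; the design only requires some hash or unique id.

```python
import hashlib


class MediaServer:
    """In-memory stand-in for the HTTP Media Server and its blob storage."""

    def __init__(self):
        self.blobs = {}  # media hash -> raw bytes

    def upload(self, data: bytes) -> str:
        # Assumed id scheme: SHA-256 of the content, as a hex string.
        media_hash = hashlib.sha256(data).hexdigest()
        self.blobs[media_hash] = data
        return media_hash

    def download(self, media_hash: str) -> bytes:
        return self.blobs[media_hash]


def build_media_message(recipient, media_type, media_hash):
    # Mirrors "To: UserB, Type: Media, MediaType: JPG, Hash: {received_hash}".
    return {"to": recipient, "type": "Media",
            "media_type": media_type, "hash": media_hash}


server = MediaServer()
image = b"fake image bytes"
media_hash = server.upload(image)                      # step 1
message = build_media_message("UserB", "JPG", media_hash)  # step 2
received = server.download(message["hash"])            # steps 3 and 4
```

A content hash has the side effect that the same file uploaded twice is stored once; a random unique id would not give that.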
Here is the flow:


 That's it for this feature.

5. Group messaging: 

To handle group messaging, we will add another microservice, say the Group microservice. The role of this service is to create a group, add users to a group, query a group, and provide all other group-related functionality. With this new service, here are the steps we can follow:
  1. UserA sends a message to GroupA.
  2. The Messaging service gets the members of GroupA by calling the Group service API. Say the output is User1, User2, User3... UserN, UserA.
  3. The Messaging service asynchronously follows the one to one messaging steps for every user in the output of step 2 except the sender.
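The fan-out in steps 2 and 3 can be sketched with `asyncio`. The function `send_to_group` and its two callbacks are hypothetical: `get_members` stands in for the Group service API and `send_one_to_one` for the one to one messaging flow.

```python
import asyncio


async def send_to_group(sender, group_id, text, get_members, send_one_to_one):
    # Step 2: ask the Group service for the members, then drop the sender.
    recipients = [m for m in get_members(group_id) if m != sender]
    # Step 3: run the one to one sends concurrently.
    await asyncio.gather(
        *(send_one_to_one(sender, member, text) for member in recipients)
    )
    return recipients


groups = {"GroupA": ["User1", "User2", "UserA"]}
delivered = []


async def fake_send(sender, recipient, text):
    # Stand-in for the real one to one delivery path.
    delivered.append((recipient, text))


recipients = asyncio.run(
    send_to_group("UserA", "GroupA", "Hi all", groups.get, fake_send)
)
```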
That's all. Now we have designed all the features which we targeted. There are other features like:

  1. Encryption
  2. Audio call
  3. Video call
  4. Profile
There may be more, but I could not cover them. Please note that this is not a perfect design; these are just my thoughts on how to design WhatsApp.
