Friday, June 20, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 2)

In Part 1, we explored how Netflix moved from a bloated token model to a more elegant, scalable architecture using a centralized Metadata Registry Service (MRS). That shift allowed Netflix to scale its ad event tracking system across vendors, while keeping client payloads small and observability high.

But Netflix wasn’t done yet.

Netflix built its own ad tech platform: the Ad Decision Server:

Netflix made a strategic bet to build an in-house advertising technology platform, including its own Ad Decision Server (ADS). This was about more than just cost or speed; it was about control, flexibility, and future-proofing. It began with a clear inventory of business and technical use cases:

  • In-house frequency capping
  • Pricing and billing for impression events
  • Robust campaign reporting and analytics
  • Scalable, multi-vendor tracking via event handler
But this Ad Event Processing Pipeline needed a significant overhaul to:
  • Match or exceed the capabilities of existing third-party solutions
  • Rapidly support new ad formats like Pause/Display ads
  • Enable advanced features like frequency capping and billing events
  • Consolidate and centralize telemetry pipelines across use cases


Road to a versatile and unified pipeline:

Netflix recognized that future ad types like Pause/Display ads would introduce new variants of telemetry. That means the upstream telemetry pipelines might vary, but downstream use‑cases remain largely consistent.

So what is upstream and what is downstream in this Netflix event processing context?

Upstream (Producers of telemetry / Events): This refers to the systems and sources that generate or send ad-related telemetry data into the event processing pipeline.

Example of upstream in this context:

  • Devices (Phones, TVs, Browsers)
  • Different ad types (e.g., Video Ads, Display Ads, Pause Ads)
  • Ad Media Players
  • Ad Servers
These upstream systems send events like:
  • Ad Started
  • Ad Paused
  • Ad Clicked
  • Token for ad metadata
These systems may:
  • Use different formats
  • Have different logging behavior
  • Be customized per ad type or region

Downstream (Consumers of processed data): This refers to the systems that consume the processed ad events for various business use cases.

Example of downstream in this context:

  • Frequency capping service
  • Billing and pricing engine
  • Reporting dashboards for advertisers
  • Analytical systems (like Druid)
  • Ad session generators
  • External vendor tracking
These downstream consumers:
  • Expect standardized, enriched, reliable event data
  • Rely on consistent schema
  • Need latency and correctness guarantees

So "upstream telemetry pipelines might vary, but downstream use‑cases remain largely consistent" means:

  • Even if ad events come from diverse sources, with different formats and behaviors (upstream variation),
  • The core processing needs (like billing, metrics, sessionization, and reporting) remain the same (downstream consistency).

Based on that, they outlined a vision:

  • Centralized ad event collection service that:
    • Decrypts tokens
    • Enriches metadata (via the Registry)
    • Hashes identifiers
    • Publishes a unified data contract, agnostic of ad type or server
  • All consumers (for reporting, capping, billing) read from this central service, ensuring clean separation and consistency. A minimal sketch of such a unified event record follows.
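
The blog does not publish the actual contract, but to make the idea concrete, here is a minimal Python sketch of what a unified, ad-type-agnostic event record could look like once tokens are decrypted, metadata is enriched, and identifiers are hashed. All field names are illustrative assumptions, not Netflix's real schema.

from dataclasses import dataclass, field
import hashlib

@dataclass
class EnrichedAdEvent:
    """Illustrative unified data contract emitted by the central collection service.
    Field names are assumptions, not Netflix's actual schema."""
    event_type: str            # e.g. "impression", "firstQuartile", "click"
    event_time_ms: int         # client-side event timestamp
    ad_id: str                 # resolved from the token via the Metadata Registry
    campaign_id: str
    ad_format: str             # "video", "pause", "display" -- agnostic of ad type or server
    hashed_device_id: str      # identifiers are hashed before publishing
    tracking_urls: list[str] = field(default_factory=list)  # enriched from the registry

def hash_identifier(raw_id: str, salt: str = "demo-salt") -> str:
    # Hash identifiers so downstream consumers never see raw device IDs.
    return hashlib.sha256((salt + raw_id).encode()).hexdigest()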


Building Ad sessions & Downstream Pipelines:

A key architectural decision was "Sessionization happens downstream, not in the ingestion flow." But what does that mean? Let's break it down:

Ingestion flow (Upstream): This is the initial (real-time) path where raw ad telemetry events are ingested into the system, like:

  •  Ad started playing
  • User clicked
  • 25% of video watched

These events are lightweight, real time and are not yet grouped into a meaningful session. Think of it as raw, granular events flying into Kafka or a central service.

Sessionization (Downstream Processing): This is a batch or stream aggregation process where the system:

  • Looks at multiple ad events from the same ad playback session
  • Groups them together
  • Creates a meaningful structure called an "Ad Session"

An Ad session might include:

  • When the ad started and ended
  • What the user's interaction was (viewed / clicked / skipped)
  • Whether it was a complete or partial view
  • What the pricing and vendor context was
So "Sessionization happens downstream, not in the ingestion flow" means:
  • The ingestion system just collects and stores raw events quickly and reliably.
  • The heavier work of aggregating these into a complete session (sessionization) happens later, in the downstream pipeline where there is:
    • More time
    • Access to more context
    • Better processing capabilities
Why was it required?
  • Separation of concerns: Ingestion stays fast and simple; sessionization logic is centralized and easier to iterate on.
  • Extensibility: You can add new logic to sessionization without touching ingestion.
  • Scalability: Real time pipelines stay light; heavy lifting is done downstream.
  • Consistency: One place to define and compute what an "ad session" means.
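
To make the idea concrete, here is a minimal Python sketch of downstream sessionization: raw telemetry events grouped by a playback/session identifier and collapsed into one "ad session" record. The event shape and field names are assumptions for illustration only.

from collections import defaultdict

def sessionize(events):
    """Group raw telemetry events into ad session records.
    Each event is a dict like {"ad_session_id": ..., "event_type": ..., "ts": ...};
    the field names are illustrative assumptions."""
    by_session = defaultdict(list)
    for e in events:
        by_session[e["ad_session_id"]].append(e)

    sessions = []
    for session_id, evts in by_session.items():
        evts.sort(key=lambda e: e["ts"])
        types = {e["event_type"] for e in evts}
        sessions.append({
            "ad_session_id": session_id,
            "start_ts": evts[0]["ts"],
            "end_ts": evts[-1]["ts"],
            "clicked": "click" in types,
            "completed": "complete" in types,   # full view vs. partial view
        })
    return sessions

# Example: two events from the same playback collapse into one session record.
print(sessionize([
    {"ad_session_id": "s1", "event_type": "impression", "ts": 100},
    {"ad_session_id": "s1", "event_type": "complete", "ts": 130},
]))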

They also built out a suite of downstream systems:
  • Frequency Capping
  • Ads Metrics 
  • Ads Sessionizer
  • Ads Event Handler 
  • Billing and revenue jobs
  • Reporting & Metrics APIs


Final Architecture of Ad Event Processing:


Ad Event Processing Pipeline


Here is the breakdown of the flow:

  1. Netflix Player -> Netflix Edge:
    • The client (TV, browser, mobile app) plays an ad and emits telemetry (e.g., impression, quartile events).
    • These events are forwarded to Netflix’s Edge infrastructure, ensuring low-latency ingestion globally.
  2. Telemetry Infrastructure:
    • Once events hit the edge, they are processed by a Telemetry Infrastructure service.
    • This layer is responsible for:
      • Validating events
      • Attaching metadata (e.g., device info, playback context)
      • Interfacing with the Metadata Registry Service (MRS) to enrich events with ad context based on token references.
  3. Kafka:
    • Events are published to a Kafka topic.
    • This decouples ingestion from downstream processing and allows multiple consumers to operate in parallel.
  4. Ads Event Publisher:
    • This is the central orchestration layer.
    • It performs:
      • Token decryption
      • Metadata enrichment (macros, tracking URLs)
      • Identifier hashing
      • Emits a unified and extensible data contract for downstream use
    • Also writes select events to the data warehouse to support Ads Finance and Billing workflows.
  5. Kafka -> Downstream Consumers:
    • Once enriched, the events are republished to Kafka for the following consumers:
      • Frequency Capping:
        • Records ad views in a Key-Value abstraction layer.
        • Helps enforce logic like “don’t show the same ad more than 3 times per day per user.”
      • Ads Metrics Processor:
        • Processes delivery metrics (e.g., impressions, completion rates).
        • Writes real-time data to Druid for interactive analytics.
      • Ads Sessionizer:
        • Groups related telemetry events into ad sessions (e.g., full playback window).
        • Feeds both data warehouse and offline vendor integrations.
      • Ads Event Handler:
        • Responsible for firing external tracking pixels and HTTP callbacks to third-party vendors (like Nielsen, DoubleVerify).
        • Ensures vendors get event pings per their integration needs.
  6. Analytics & Offline Exchange:
    • The sessionized and aggregated data from the warehouse is used for:
      • Offline measurement
      • Revenue analytics
      • Advertiser reporting
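
To illustrate step 5, here is a minimal sketch of how one downstream consumer (say, the metrics processor) might read the enriched events from Kafka using the kafka-python client. The topic name, consumer group, and event fields are assumptions, not Netflix's actual configuration.

# pip install kafka-python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "enriched_ad_events",                        # topic name is an assumption
    bootstrap_servers="localhost:9092",
    group_id="ads-metrics-processor",            # each downstream system uses its own group
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

for msg in consumer:
    event = msg.value
    if event.get("event_type") == "impression":
        # e.g. increment delivery metrics, update frequency caps, write to Druid...
        print(f"impression for ad {event.get('ad_id')} on partition {msg.partition}")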

Benefits of design:

Principle -> Implementation:
  • Separation of concerns: Ingestion (Telemetry Infra) ≠ Enrichment (Publisher) ≠ Consumption (Consumers)
  • Scalability: Kafka enables multi-consumer, partitioned scaling
  • Extensibility: Central Ads Event Publisher emits a unified schema
  • Observability: Real-time metrics via Druid; sessionized logs via warehouse
  • Partner-friendly: Vendor hooks via Event Handler + Offline exchange

By centralizing sessionization and metadata logic, Netflix ensures that:
  • Client-side logic remains lightweight and stable
  • Backend systems remain flexible to change and easy to audit

A few clarifications:

1. What is the Key-Value (KV) abstraction in Frequency Capping?

In Netflix’s Ad Event Processing Pipeline, frequency capping is the mechanism that ensures a user doesn’t see the same ad (or type of ad) more than a specified number of times, e.g., “Don’t show Ad X to user Y more than 3 times a day.”

To implement this efficiently, Netflix uses a Key-Value abstraction, which is essentially a highly scalable key-value store optimized for fast reads and writes.

How it works:
  • Key:  A unique identifier for the "audience + ad + context"
    • Example: user_id:ad_id:date
  • Value: A counter or timestamp that tracks views or interactions
    • Example: {count: 2, last_shown: 2025-06-20T12:15:30}
Let's understand it by example:

Key -> Value:
  • user123:ad789:2025-06-20 -> { count: 2, last_shown: 12:15 }
  • user123:ad456:2025-06-20 -> { count: 1, last_shown: 10:04 }

When a new ad event comes in:
  • Netflix looks up the key
  • Checks the count
  • If under the cap (e.g., 3), they allow it and increment the counter
  • If at or above the cap, the ad is skipped or replaced (a small sketch follows)
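
Here is a minimal Python sketch of that check-and-increment logic against a Redis-backed KV store. The key layout and daily cap are assumptions, and a production version would make the check-and-increment atomic (e.g., via a Lua script).

# pip install redis
from datetime import date
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
DAILY_CAP = 3

def can_show(user_id: str, ad_id: str) -> bool:
    key = f"{user_id}:{ad_id}:{date.today().isoformat()}"   # "audience + ad + context" key
    count = int(r.get(key) or 0)            # look up the counter
    if count >= DAILY_CAP:
        return False                        # at or above the cap: skip or replace the ad
    pipe = r.pipeline()
    pipe.incr(key)                          # under the cap: allow and increment
    pipe.expire(key, 24 * 60 * 60)          # the counter naturally expires after a day
    pipe.execute()
    return True
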
Why "Abstraction"?

Netflix may use multiple underlying KV stores (like Redis, Cassandra, or custom services), but abstracts them behind a unified interface so:
  • Ad engineers don’t worry about implementation
  • Multiple ad systems can share the same cap logic
  • It can scale across regions and millions of users
Benefits:

Feature -> Why it matters:
  • Speed: Real-time lookups for every ad play
  • Scalability: Handles millions of users per day
  • Flexibility: Supports different capping rules
  • Multi-tenancy: Can cap by user, device, campaign, etc.


2. What is Druid and why is it used here?

Apache Druid is a real-time analytics database designed for low-latency, high-concurrency, OLAP-style queries on large volumes of streaming or batch data.
It’s built for interactive dashboards, slice-and-dice analytics, and fast aggregations over time-series or event data.
Think of it as a high-performance hybrid of a data warehouse and a time-series database.

Netflix uses Druid in the Ads Metrics Processor and Analytics layer. Here’s what it powers:
  • Streaming Ad Delivery Metrics
    • As ads are played, Netflix captures events like:
      • Impressions
      • Quartile completions (25%, 50%, etc.)
      • Clicks, errors, skips
    • These events are streamed and aggregated in near-real-time via Flink or similar streaming processors
    • Druid ingests these metrics for interactive querying.
  • Advertiser Reporting and Internal Dashboards
    • Ad operations teams and partners want to know:
      • "How many impressions did Ad X get in Region Y yesterday?"
      • "What's the completion rate for Campaign Z?"
    • These queries are run against Druid and return in milliseconds, even over billions of records
Why does Netflix use Druid (instead of just a data warehouse like Redshift or BigQuery)?
 
Feature -> Why it matters for Netflix Ads:
  • Real-time ingestion: Metrics appear in dashboards within seconds of the event
  • Sub-second queries: Advertiser reports and internal dashboards need speed
  • High concurrency: Supports thousands of parallel queries from internal tools
  • Flexible aggregation: Group by ad_id, region, time, device_type, etc. instantly
  • Schema-on-read & roll-ups: Druid efficiently stores pre-aggregated data for common metrics
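
For a feel of what those interactive queries look like, here is a small sketch that sends a SQL query to Druid's SQL endpoint over HTTP. The datasource, column names, and router address are assumptions for illustration, not Netflix's actual setup.

import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"   # router/broker address is an assumption

query = """
SELECT ad_id, region, COUNT(*) AS impressions
FROM ads_delivery_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  AND event_type = 'impression'
GROUP BY ad_id, region
ORDER BY impressions DESC
LIMIT 20
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=10)
for row in resp.json():
    print(row)   # e.g. {"ad_id": "ad789", "region": "US", "impressions": 123456}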


Conclusion:

As Netflix shifted toward building its own advertising technology platform, the Ads Event Processing Pipeline evolved into a thoughtfully architected, modular telemetry platform capable of handling diverse ad formats, scaling globally, and meeting the demands of both real-time analytics and long-term reporting.

This transformation was not just about infrastructure, but about platformization:
  • A centralized event publisher that abstracts complexity.
  • A unified schema that serves a wide range of downstream consumers.
  • A clean separation of concerns that allows rapid innovation in ad types without touching backend logic.
  • Observability and control, made possible through systems like Druid and the Metadata Registry Service.
With this evolution, Netflix ensured:
  • Scalability: Handling billions of ad events across regions and devices
  • Agility: Enabling quick integration of new ad formats like Pause/Display Ads
  • Partner alignment: Supporting vendor integrations with minimal friction
  • Business confidence: Powering frequency capping, billing, and reporting with precision
Here is the original Netflix blog

Monday, June 16, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 1)

As Netflix steps into the ad-supported streaming world, its engineering backbone must operate with the same level of scale, reliability, and precision that powers its core viewing experience. But building a system that can track billions of ad events (impressions, clicks, quartile views) across diverse devices and ad partners is anything but trivial.

In this two-part series, I’ll dissect the architecture behind Netflix’s Ads Event Processing Pipeline, inspired by their tech blog. Part 1 focuses on the early phase, from ad request initiation to the evolution of encrypted tokens and the eventual need for a Metadata Registry Service. You'll see how Netflix tackled key design challenges such as bandwidth optimization, cross-device compatibility, vendor flexibility, and observability at scale.

If you're interested in distributed systems, real-time telemetry, or just want to peek behind the curtain of Netflix’s ad tech, you’re in the right place.

Let’s dive in!

The Initial System (Pilot) At a Glance:

When a Netflix-supported device is ready to show an ad, it doesn’t pick an ad on its own. Instead, the system goes through the following high-level steps: 

  1. Client Ad Request: The device (TV/mobile/browser) sends a request to Netflix’s Ads Manager service when it reaches an ad slot.
  2. VAST Response from Microsoft Ad Server: Ads Manager forwards this request to Microsoft’s ad server, which returns a VAST XML. This contains the actual ad media URL, tracking pixels, and metadata like vendor IDs and macros.
  3. Encrypted Token Created: Instead of sending this raw VAST data to the device, Ads Manager:
    • Extracts relevant metadata (e.g., tracking URLs, event triggers, vendor IDs)
    • Encodes it using protobuf
    • Encrypts it into a compact structure called a token
    • Sends this token along with the actual ad video URL to the device
  4. Ad Playback and Event Emission: As the ad plays, the device emits telemetry events like impression, firstQuartile, click etc., attaching the token to each event.
  5. Event Processing via Kafka: These events are queued in Kafka, where the Ads Event Handler service reads them, decrypts the token, interprets the metadata, and fires the appropriate tracking URLs.
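
The blog does not show the token format, but the general idea of step 3 (serialize the metadata compactly, then encrypt it into an opaque token) can be sketched as below. Netflix encodes the payload with protobuf; plain JSON is used here only to keep the sketch self-contained, and the key handling and field names are assumptions.

# pip install cryptography
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in reality the key would come from a key-management service
fernet = Fernet(key)

metadata = {                         # illustrative stand-in for the extracted VAST metadata
    "vendor_ids": ["nielsen", "doubleverify"],
    "tracking_urls": {"impression": "https://track.example.com/imp?ts=[TIMESTAMP]"},
}

# Serialize compactly (Netflix uses protobuf here), then encrypt into an opaque token.
token = fernet.encrypt(json.dumps(metadata).encode())

# Later, the backend decrypts the token and interprets the metadata to fire tracking URLs.
decoded = json.loads(fernet.decrypt(token))
assert decoded == metadata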




Ad Events flow


What is VAST?
VAST stands for Video Ad Serving Template. It’s an XML-based standard developed by the IAB (Interactive Advertising Bureau) to standardize the delivery of video ads from ad servers to video players (like Netflix's video player). In simpler terms:
  • VAST is a template that tells the video player:
    • What ad to play
    • Where to find it
    • How to track it (e.g., impressions, clicks, view progress)

Key Parts of a VAST Response:
  • Ad Media URL
    • Link to the video file
    • The player will use this to stream the ad.
<MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
</MediaFile>
  • Tracking Events
    • URLs the player should "ping" (via HTTP GET) when certain things happen:
      • Impression (ad started)
      • FirstQuartile (25% watched)
      • Midpoint (50% watched)
      • ThirdQuartile (75% watched)
      • Complete (100%)
      • Click (User clicked the ad)
<TrackingEvents>
  <Tracking event="impression">https://track.com/imp</Tracking>
  <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  ...
</TrackingEvents>
  • Click Tracking
    • URL to notify when the user clicks the ad.
    • Destination URL (where user is taken on click).
<VideoClicks>
  <ClickThrough>https://brand.com</ClickThrough>
  <ClickTracking>https://track.com/click</ClickTracking>
</VideoClicks>
  • Companion Ads (Optional)
    • Static banner or rich media that shows alongside the video ad

Why VAST?

Purpose -> How it helps:
  • Standardization: All ad servers speak the same language to video players
  • Tracking: Advertisers know when/how their ads were shown
  • Flexibility: Supports linear, companion, and interactive video ad formats
  • Easy integration: Netflix could plug in with Microsoft without building a custom API


Why Netflix chose protobuf and not JSON:
  • Token size matters on client devices: Protobuf is highly compact; smaller tokens mean less overhead. For example, a JSON payload that’s 1 KB might only be 100–200 bytes in protobuf.
  • Performance: Ad events are high volume (billions per day). Protobuf parsing is much faster than JSON, both on backend and client.
  • Encryption efficiency: Netflix encrypted the protobuf payload to create the token. Encrypting a compact binary blob is faster and lighter than encrypting JSON text.
  • Schema Evolution and Compatibility: Protobuf supports versioning, optional/required fields and backward/forward compatibility. That means Netflix could evolve the ad metadata schema without breaking clients or backend systems. JSON lacks this.

Feature: JSON vs. Protobuf
  • Format: JSON is text-based; protobuf is binary (compact)
  • Size: JSON is larger; protobuf is much smaller
  • Parse performance: JSON is slower; protobuf is faster
  • Schema enforcement: JSON has none (free-form); protobuf has a strongly typed schema
  • Encryption impact: JSON is heavier (more bytes to encrypt); protobuf is lighter (smaller encrypted payload)
  • Tamper resistance: JSON is weaker (can be read/edited); protobuf is stronger (harder to interpret)


Why Kafka over other queues like RabbitMQ:
  • Massive Scale and Parallelism
    • Kafka can handle billions of events/day, easily.
    • Partitions allow horizontal scaling of consumers:
      • e.g., 100 partitions allow up to 100 consumers, i.e., massive parallel processing.
    • RabbitMQ struggles when you need high fan-out and high throughput.
  • Event Replay and Reliability
    • Ad tracking must be durable (e.g., billing, audits)
    • Kafka stores messages on disk with configurable retention (e.g., 7 days)
    • If a service fails, it can re-read from the last offset, i.e., no data is lost.
    • RabbitMQ can lose messages unless you enable heavy persistence (with a perf hit).
  •  Built-in Partitioning and Ordering
    • Kafka guarantees message ordering per partition.
    • Ads Event Handler can partition by:
      • ad_id, device_id, etc.
    • This ensures all related events are processed in order, by the same consumer.
Topic: ad_events
Partitions: 100
Key: device_id (events from the same device always land in the same partition)
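
Here is a small sketch of that partitioning-by-key idea with the kafka-python producer: using device_id as the message key so all events from one device land in the same partition and are processed in order. Topic and field names are assumptions.

# pip install kafka-python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"event_type": "firstQuartile", "ad_id": "ad789", "token": "<opaque token>"}
producer.send("ad_events", key="device-42", value=event)  # same key -> same partition
producer.flush()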


Problem in the Pilot System: The Token got too smart

In the pilot, the Ads Manager was:
  • Extracting all tracking metadata from the VAST (like URLs, vendor IDs, event types)
  • Packing all of it into a token (as protobuf)
  • Encrypting and sending this token to the client
  • The token was self-contained — everything the backend needed to fire tracking URLs was inside it
It worked at small scale, but it didn't scale:

As Netflix onboarded multiple ad partners (e.g., Nielsen, DoubleVerify, Microsoft, etc.), each vendor wanted:
  • Their own tracking logic.
  • Unique macros to be replaced in URLs
  • Complex conditional tracking rules
  • Custom logic based on device type, timing, campaign, etc.

What are these macros we are talking about?

Macros are placeholder variables in tracking URLs that get dynamically replaced with actual values at runtime, before the URL is fired.
They let ad vendors capture contextual information like:
  • The timestamp when an ad was viewed
  • The type of device
  • Whether the user clicked or skipped
  • The playback position, etc.
Example:

A vendor might provide a tracking URL like:

https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]
Here:
  • [TIMESTAMP] is a macro for when the ad started playing
  • [DEVICE_TYPE] is a macro for the kind of device (TV, phone, etc.)
Before this URL is fired, those macros must be resolved to actual values:

https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv
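
A minimal sketch of that substitution step (the macro names follow the example above; real vendors differ in syntax, which is exactly the problem described next):

import time

def resolve_macros(url: str, context: dict) -> str:
    # Replace each [MACRO] placeholder with its runtime value before the URL is fired.
    for macro, value in context.items():
        url = url.replace(f"[{macro}]", str(value))
    return url

url = "https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]"
print(resolve_macros(url, {"TIMESTAMP": int(time.time() * 1000), "DEVICE_TYPE": "tv"}))
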
Why macros became a problem at scale:
  1. Inconsistent Syntax Across Vendors
    • Some used [TIMESTAMP], others used %ts%, or even custom placeholders.
    • Netflix had to implement logic per partner to resolve them correctly.
  2. Dynamic Values Grew
    • New use cases brought in more macros: region, episode title, ad ID, player version, UI language, etc.
    • Supporting new macros meant:
      • Updating the token schema
      • Updating Ads Manager logic
      • Sometimes even updating clients
  3. Difficult to Test or Validate
    • Because everything was encrypted in the token, validating whether macros were substituted correctly was hard.
    • A bug in macro resolution could break tracking silently.

This meant the token payload was growing; more vendors, more metadata, more per-event config.
So to summarize the problems, here is the table:

Problem -> Why it hurt:
  • Payload bloat: Token size increased → slower to transmit → harder to encrypt
  • Hard to evolve: Any schema change required new client logic and re-deployment
  • No dynamic behavior: Logic was fixed at ad decision time; couldn’t update later
  • Redundant processing: Every token had repeated config for the same vendors
  • Hard to debug/troubleshoot: No human-readable registry of how tokens should behave



How did Netflix solve the problem: the Metadata Registry Service (MRS):

To overcome the scaling bottlenecks of self-contained tokens, Netflix introduced a new component: the Metadata Registry Service (MRS).

Its core idea was simple but powerful: separate the metadata from the token. Instead of encoding everything inside the token, Netflix now stores reusable metadata centrally in MRS, and the token merely references it via an ID.


How MRS works:
  1. Ads Manager Gets VAST: Ads Manager still fetches the VAST response from the ad decisioning system (e.g. Microsoft’s ad server).
  2. Metadata Extracted & Stored: Instead of encrypting all metadata into a token, Ads Manager:
    • Extracts metadata (tracking pixels, macros, event mappings, etc.)
    • Stores it in Metadata Registry Service (MRS) with a registry key
  3. Token Is Now Just an ID: The token now just contains:
    • The registry key
    • A small payload (e.g. dynamic fields like ad_start_time or session ID)
  4. Device Sends Events with Token: Devices emit telemetry events with this lightweight token.
  5. Backend Looks Up MRS: Ads Event Handler uses the registry key to:
    • Fetch the tracking metadata from MRS
    • Apply macros
    • Fire tracking URLs
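
Here is a minimal sketch of the event-handler side with MRS: the token carries only a registry key plus a few dynamic fields, and the tracking metadata is looked up centrally. The registry contents and field names are illustrative assumptions; in reality the registry is a service with caching, not an in-memory dict.

REGISTRY = {  # stands in for the Metadata Registry Service (normally a cached remote lookup)
    "reg-123": {
        "tracking_urls": {"impression": "https://track.vendor.com/imp?ts=[TIMESTAMP]"},
    }
}

def handle_event(event_type: str, token: dict) -> str:
    entry = REGISTRY[token["registry_key"]]          # lookup instead of decoding a fat token
    url = entry["tracking_urls"][event_type]
    # Apply macros from the token's small dynamic payload, then fire the URL.
    return url.replace("[TIMESTAMP]", str(token["ad_start_time"]))

print(handle_event("impression", {"registry_key": "reg-123", "ad_start_time": 1717910340000}))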


Ad Events flow with MRS


So here is a summary of what exactly changed:

Before MRS -> After MRS:
  • Token = all tracking logic inside -> Token = just a reference ID
  • Tracking metadata repeated often -> Metadata stored once in the registry
  • Hard to update or debug -> Dynamic updates via the registry
  • Schema tightly coupled to client -> Centralized schema owned by the server



Benefits of MRS approach:
  1. Dynamic Metadata Updates
    • Vendor changes their URLs or macro format? Just update MRS.
    • No need to redeploy client or regenerate tokens.
  2. Improved Performance
    • Tokens are smaller, which means less encryption overhead and less network cost.
    • Registry entries are cached, which means lookups are fast and efficient.
  3. Better Observability
    • MRS is a source of truth for all campaign metadata.
    • Easy to search, debug, and trace logic across vendors and campaigns.
  4. Reusability
    • Multiple campaigns can reference the same registry entry.
    • No duplication of tracking logic.
  5. Separation of Concerns
    • Client focuses on playback and telemetry.
    • Backend owns tracking logic and event firing.

* You might be thinking there is no token logic on the client, i.e., the logic to fire tracking pixels was never on the client. The client just played the ad and emitted telemetry events along with the token, so why does client redeployment come into the picture even without MRS? In my view, the protobuf schema can become a bottleneck here: as the schema changes, client code also needs to change. With MRS, we can freeze the client payload schema.


In this first part of our deep dive into Netflix’s Ads Event Processing Pipeline, we traced the evolution from a tightly-coupled, token-heavy system to a more modular and scalable architecture powered by the Metadata Registry Service (MRS).

By decoupling metadata from tokens and centralizing macro resolution and tracking logic, Netflix unlocked:
  • Smaller, lighter tokens
  • Faster and safer iteration
  • Dynamic partner integration
  • Debuggability at scale
What started as a necessary fix for bloated tokens ended up becoming the foundation for a much more agile and partner-friendly ad tracking ecosystem.

But Netflix didn’t stop there!

In [Part 2], we’ll explore the next big leap: how Netflix brought ad decisioning in-house — giving them end-to-end control over targeting, creative rotation, pacing, and auction logic and how this plugged seamlessly into the MRS-backed pipeline we just explored.

We will learn how they:
  • Built an Ad Decision Server (ADS) from scratch
  • Integrated it with the metadata pipeline
  • Enabled real-time, personalized ad serving on a global scale

Thanks a lot and stay tuned, the story gets even more fascinating.



Monday, October 28, 2024

Design Uber

Problem: Design a highly scalable rideshare service.

Requirements:

1. Functional requirements:

  1. Only registered users can request a cab.
    • Driver registration is manual and is out of scope here.
  2. Drivers and users need to log in to request and receive rides.
  3. Drivers need to make themselves available to receive rides.
  4. Each driver can be matched to multiple riders depending on the riders and the occupancy of the car.
  5. Price of the ride depends on a base flat fee, distance and duration.
    • The exact formula is not given and is subject to change
  6. We need to minimize the:
    1. Wait time of users
    2. Distance / time for drivers to pick up the user.
  7. Collect payment information from users.
  8. Collect bank information of drivers to pay them.
  9. Driver should start/end the trip.
  10. For actual payment process, we can use third party services.


2. Nonfunctional requirements:

  1. Scalability:
    • 100 M to 200 M daily users.
    • 100 K to 200 K drivers.
    • Peak hours support where demand is 10 times of average.
    • Collect driver's location every ~5-10 seconds =~ 3B messages / day
  2. Availability:
    • 99.99% uptime or higher
  3. Performance:
    • Match users with drivers within 10 seconds at the 99th percentile
  4. AP over CP:
    • Higher availability is required here over consistency.


Design:

Step 1: API Design:

This looks like a very complex system, which makes it even more difficult to come up with all the APIs needed for the system to work.

However, we will use a sequence diagram to simplify it, as we did in our previous design problems:




Just one thing to notice: given the sequence diagram, we can see the need for a bidirectional connection between the server & drivers and also the server & riders.

From the sequence diagram we can now have the clear APIs:

Drivers:

  • POST /drivers/login
  • POST /drivers/{driver_id}/join
  • POST /drivers/{driver_id}/location
  • POST /trips/{trip_id}/start
  • POST /trips/{trip_id}/end

Riders:

  • POST /riders/ - Register a rider
  • POST /riders/login
  • POST /rides


Step 2: Mapping functional requirements to architectural diagram:

1. Only registered users can request a cab.

2. Driver and user needs to login to request and receive the ride.

7. Collect payment information from users.

8. Collect Bank information of drivers to pay them.

To serve requirements 1 and 2, we can have a single User service. The User service will have a SQL DB which contains two tables:

Rider: 

  • rider_id
  • username
  • password
  • age
  • image_url

Driver:

  • driver_id
  • username
  • password
  • license
  • vehicle
  • image_url

For the profile image we will store the image in object store and put the image url in DB.

To support requirements 7 and 8, we will have a separate Payment Service which will have its own SQL DB. This DB will contain the RiderPaymentInfo and DriverPaymentInfo tables.

RiderPaymentInfo:

  • user_id
  • credit_card
  • billing address

DriverPaymentInfo:

  • driver_id
  • bank_name
  • account_number

This payment service will connect with credit card companies / banks for the payment processing.

Here is how our architectural diagram looks like:




3. Driver needs to make himself available to receive the rides:

For this, the driver needs to call the join API and keep sending its location. We need a new service, say Driver Service, which will create a bidirectional connection using web sockets when the driver calls the join API. This has the following benefits:

  • The driver needs to send location updates very frequently, so this removes the overhead of establishing a TCP connection every time.
  • The server can also push messages to the driver, such as pick_up rider, etc.

To record these locations we will use a new service, say Location Service. The Driver service will queue these locations to the Location service. We will use a write-performant, LSM-tree-based NoSQL DB to store these location updates. It will contain driver_id and location (lat and long), and we keep updating this record until the driver is booked for a trip.





6. Rider can request the ride and the driver must be matched with minimum wait time:

4. Each driver can be matched to multiple riders depends on users and the occupancy of the car:

To request a ride, we will have a new service called Rider Service. Upon requesting the ride, a bidirectional websocket connection is established between the rider and this service, as we need to send updates to the riders.

To match a rider to a driver we will have a new service Matching Service. Let's double click into this matching service to see how to match a driver.

First approach which directly comes into our mind is simple:

  • Take the lat, long of rider. 
  • Draw a circle centered at the rider's location with a radius of 2-3 KM and try to find drivers within this circle
  • Given the location of drivers is already present in the Location Service DB, we can use this DB to get the list of drivers.
  • Take the driver with the least distance.

However, the direct distance between two points is not a good measure because of road structure, traffic, etc.

Given that the ETA calculation is complex and is not our functional requirement, we can use a third-party ETA calculation service. With this third-party service, here is how our flow changes:

  • The Matching service sends the rider's location to the Location service.
  • The Location service finds all the closest drivers as per our previous approach of a 2-3 KM radius circle.
  • The Location service now sends the list of pairs {user location, driver location} to the third-party service.
  • The third-party service sends the ETA of every pair of locations back to the Location service.
  • The Location service sends these pairs {driver_id, ETA} back to the Matching service to make the final decision.

There are a few considerations based on which we can match the driver:

  • Match the driver with the lowest ETA
  • Match a driver who is about to finish a trip near the rider.
  • Allow multiple riders to share a ride
  • Prioritize highly rated drivers for highly rated riders.
  • Premium service
  • Prioritize drivers who have earned less so far.

Once the Matching service matches the driver, it needs to store this info somewhere. For this we will have a new service, say Trip service, with its own SQL DB, which will contain source and destination, user_id, driver_id, fare, start_time, end_time, distance, etc. Here is the next set of steps:

  • Create a new Trip using the Trip service. Get the trip_id in response.
  • Send the signal to Location service to tell it that Driver is booked. It will also send the trip_info. The trip_info will contain trip_id and rider_id for sure.
  • Location service will add two new columns: Driver state and Trip Info. 
  • The Matching service now sends the trip info and driver details, including location, to the Rider service, which pushes this info to the rider.
  • Similarly, the Matching service sends the trip info and rider details, including location, to the Driver service, which pushes this info to the driver.
  • Now, when the driver sends its location to the Driver service, the Location service appends this record to its DB and also sends this location update to the Rider service, which updates the rider.

You can see how useful the web socket connection is in these scenarios.


9. Driver should start/end the trip:

Upon reaching the source/destination (or wherever the user wants to be dropped), the driver will start/end the trip by calling the APIs. This will be redirected to the Trip service. The Trip service will then notify the Location Service to change the status of the driver to In_Trip / Free.

Trip service can get the details of locations from Location Service to calculate the distance and everything else for the payments etc.



5. Price of the ride depends on base flat fee, distance and duration:

10. For actual payment process, we can use third party services:

Let's come to the post-trip part, where we need to collect the payment from the rider and send the driver fee to the driver's account. We also need to send this info as an email to the rider and the driver.

Here is the flow:

  • The Trip service will calculate the ride cost and driver fee (a small pricing sketch follows this list).
  • The Trip service will queue the trip_info, including user_id, driver_id and payment_info, to the Payment Service.
  • The Payment service will interact with the third-party service to make the payment.
  • Once the payment is successful, it will queue two messages, one for the user and one for the driver, including the trip info and their corresponding payment_info, to a new service, Notification Service.
  • The Notification service will send an email to the driver and the user.
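
Since the exact pricing formula is not given (and is subject to change), here is only a placeholder sketch of the fare calculation the Trip service might run; the base fee and rates are made-up assumptions.

def ride_price(distance_km: float, duration_min: float,
               base_fee: float = 2.0, per_km: float = 1.2, per_min: float = 0.3) -> float:
    # Base flat fee + distance component + duration component (placeholder rates).
    return round(base_fee + per_km * distance_km + per_min * duration_min, 2)

print(ride_price(distance_km=8.5, duration_min=22))  # 2.0 + 10.2 + 6.6 = 18.8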


That's all about the functional requirements 


Step 3: Mapping non functional requirements to architectural diagram:

1. Scalability:

Given the traffic, we need to run multiple instances of every service. Even though the Notification service is not a public-facing service, at peak times we might need to scale it up too.

Now the problem is with the Driver service and Rider service, as these hold persistent web socket connections. How will the Location service know which instance to contact to send a driver's location to a rider, and how will the Matching service know which instance to contact to send trip info to a driver?

To solve this issue, we will have a new service called Connection Manager, which will use a very efficient key-value NoSQL DB like Redis to maintain this mapping. Now the Location Service or Matching Service can query the Connection Manager Service to get the right instance.




2. Availability:

For this we just need to add replicas of our DBs in order to achieve availability.




3. Performance:

We need to match a rider to a driver within 10 seconds. Let's see the major steps we have for a match:

  1. Get the closest drivers list to the rider.
  2. Get the ETA of these drivers from third party service.
  3. Take the Driver with the lowest ETA.

We can't do much about step 2, and step 3 is fairly simple. Let's see how we can get the list of closest drivers.

We need to calculate the distance of every free/in_trip driver from the user's location, which is fairly expensive. Even if we somehow apply an index on the Location DB over lat and long (I am not sure how we would use it), it would still be very expensive.

Also, these floating-point calculations are a little time-consuming for computers. It looks like we won't be able to meet our latency target.

We will use geohash here to divide the earth into geo cells. These geo cell ids will be stored in the Location DB as per the driver locations. That means we are going to add a new column, geo_cell_id, to the Location DB.

Now the circle we draw centered at the user's location covers around 3 or 4 geo cells. Here is how the flow looks for step 1, getting the closest drivers:

  • Get the geo cell id as per the user's location.
  • Get all the geo cell ids covered by the 2 KM radius circle
  • Now the query to DB will be something like SELECT * from Location where geo_cell_ids in [list of geo cells]

To make the above query efficient, we can shard the Location DB based on geo_cell_id or create a hash index over it. Here is a small sketch of the lookup.
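
The sketch below uses the pygeohash library to compute the rider's cell and builds an illustrative SQL query against the Location DB. The neighboring-cell expansion is left as an assumption (real geohash libraries provide neighbor calculation).

# pip install pygeohash
import pygeohash as pgh

def nearby_cells(lat: float, lon: float, precision: int = 6) -> list[str]:
    # Precision 6 gives roughly a 1.2 km x 0.6 km cell.
    center = pgh.encode(lat, lon, precision=precision)
    # Assumption: a real system would also add the surrounding cells to cover the 2 KM circle.
    return [center]

cells = nearby_cells(12.9716, 77.5946)
query = "SELECT driver_id, lat, lon FROM location WHERE geo_cell_id IN ({})".format(
    ", ".join(f"'{c}'" for c in cells)
)
print(query)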

That's all about the performance. Now that we have covered every functional and non-functional requirement, here is our final architectural diagram:




I know the diagram became a bit messy so I'll improve it later!

Design Youtube

Problem: Design a highly scalable video on demand (VOD) streaming platform.

Requirements:

Before we jump into requirements, we should realize we have two different types of users:

  • Content creators
  • Viewers

If we try to observe how they use our platform, we can see the requirements are different, both functional and nonfunctional.

1. Functional requirements:

a. Content creators:

  1. Upload any format/codec video.
    • Once the video is uploaded, it can be deleted but not be modified.
  2. Each video should have metadata:
    • Mandatory: title, author, description.
    • Optional: List of categories / tags.
    • Metadata can be updated anytime.
  3. Get an email notification when the video is available publicly.
  4. Live streaming is not supported.

b. Viewers:

  1. Only registered users can view the content.
  2. Can search using free text in all of video metadata.
  3. Can watch videos on any kind of device (desktop / phone / TV) and under any network conditions.

2. Nonfunctional requirements:

a. Content creators:

  1. Scalability:
    • Thousands of content creators.
    • Upload 1 video/week per creator
    • Average video size ~50 GB (50 TB / week)
  2. Consistency:
    • We prefer to have consistency here over availability.
  3. Availability:
    • 99.9 percentile
  4. Performance:
    • Response time of page load < 500 ms at 99 percentile.
    • Video becomes available to view in hours.

b. Viewers:

  1. Scalability:
    • 100K-200K daily active users
  2. Availability:
    • 99.99 percentile
  3. Performance:
    • Search results and page loads < 500 ms at 99 percentile
    • Zero buffer time for video play


Design:

Step 1: API design:

As usual, for the API design let's first translate our functional requirements into a sequence diagram, which should look as follows:




Now we can easily identify the entities and URIs using above sequence diagram:

  • Users
    • /users
  • Videos:
    • /videos
    • /videos/{video_id}
  • Search:
    • /search

Now let's put the HTTP methods:

  • Users:
    • POST /users: Create / register user
    • POST /users/login: Login a user
  • Videos:
    • POST /videos: Start a video upload
    • PUT /videos/{video_id}/metadata: Create/update metadata of video
    • GET /videos/{video_id}: Get the video url
    • POST /videos/{video_id}/play: Send the video content to client
    • DELETE /videos/{video_id}: Delete a video
  • Search:
    • GET /search


Step 2: Mapping functional requirements to architectural diagram:

a.1 Content creator can upload any format/codec video:

b.3 Viewers can watch video on any device and at different network conditions:

I am taking both of these requirements together as they are interrelated. We need to upload any video in such a manner that it can be played on any device and adapted to different network conditions.

Let's first understand what a video file is.

The video file is a container which contains:

  • Video stream
  • Audio stream
  • Subtitles
  • Metadata like codec, bitrate, resolution, frame rate

The binary representation of these containers (like mpg/avi/mp4) can differ based on how the streams are encoded, and the algorithms used to encode and decode these streams are called codecs, like H.264, AV1 or VP9.

The video captured by a camera is encoded with a lossless compression algorithm, which makes it suitable for professional editing, but it is too big in size, so this kind of codec is not suitable for streaming and storage at scale. The first step is therefore to apply a lossy compression algorithm. This method of converting an encoded stream into a differently encoded stream is called transcoding.

Size of a video = Bit rate (bits/second) * Video length (seconds)

So it is obvious that we need to reduce the bit rate in order to reduce the size of the video, but given that we need to support different devices and different network bandwidths, we can't depend on only one bit rate. We need to produce multiple output files supporting different bit rates, which are directly proportional to resolutions. The standard resolutions are 360p, 480p, 720p, 1080p and 4K.

This will partially take care of supporting multiple devices, but it won't handle varying network conditions, like a home network being shared by multiple users or a user who is travelling. To support this requirement, we will use a technique called adaptive bitrate (adaptive streaming).

In adaptive streaming, we break our stream into multiple small chunks of, say, 5 or 10 seconds. We put references to all of these streams in a text file called a manifest (e.g., an MPD). When the player tries to play the video, it first downloads this manifest file, chooses a default resolution (say 720p), and plays the first 4-5 chunks to gauge the network conditions. If the network conditions are good, it switches to a better resolution, say 1080p or even 4K. If there are download delays, it drops to lower-resolution chunks. The player keeps analysing chunk download speed to decide which resolution to go for.
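
A toy sketch of that player-side heuristic: start at a default rendition and step up or down based on measured chunk throughput. The bitrate ladder and thresholds are illustrative assumptions, not a real player implementation.

LADDER_KBPS = {"360p": 800, "480p": 1200, "720p": 2500, "1080p": 5000, "4K": 15000}
RENDITIONS = list(LADDER_KBPS)   # ordered from lowest to highest quality

def next_rendition(current: str, measured_kbps: float) -> str:
    i = RENDITIONS.index(current)
    if measured_kbps > LADDER_KBPS[current] * 1.5 and i < len(RENDITIONS) - 1:
        return RENDITIONS[i + 1]      # plenty of headroom: step up
    if measured_kbps < LADDER_KBPS[current] and i > 0:
        return RENDITIONS[i - 1]      # downloads lagging: step down
    return current

print(next_rendition("720p", measured_kbps=6000))  # -> "1080p"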

The next step is to fully support every kind of device. For this we need to package our video content to support different streaming protocols, since different OSes/browsers support different protocols. Here we can also apply DRM (Digital Rights Management) to protect our video, which supports FR# b.1 (only registered users can watch the video). Using DRM we can also support subscriptions when we want to introduce them.

Now you can see there are several steps we need to take in order to upload the video and make it available for different devices, locations and network bandwidths. We will use the pipes-and-filters pattern here to support it.

We will have a Video service as the public interface for this whole activity. This service will have its own DB, which is a NoSQL DB in order to support a fluid schema. So here is the flow of video upload:

  1. The content creator calls the Video service to upload a video.
  2. The Video service starts the upload to the object store asynchronously, saves the metadata in its DB, and returns a confirmation to the user with the video id.
  3. Once the video upload to the object store is complete, it queues a message to a new service, the Transcoding service.
  4. The Transcoding service first converts the video into 5-10-second chunks, transcodes each chunk into multiple resolutions, and uploads them to its object store. It also generates the manifest file for adaptive streaming.
  5. The Transcoding service then queues a message to a new service, say the Packaging service.
  6. The Packaging service packages these streams according to the streaming protocols and saves them into its own object store.
  7. The Packaging service then queues the video_id and video_url (which is ultimately the manifest download) back to the Video service, and the Video service updates its DB using the video_id.

We can debate a point about the Transcoding service: we could propose a new service to break the uncompressed video into chunks and queue those chunks so the Transcoding service only transcodes them in parallel. But we can achieve the same thing using multithreading in the Transcoding service itself. That way it is much easier to debug issues and also much easier to support restartability, as all the chunking and transcoding happens in just one service.

We can support FR# a.3 (notify the creator when the video is publicly available) here by adding a new service, say a Notification service. The Video service will queue the video details to this service, and the Notification service can then send the notification (email) to the content creator.

With this knowledge, let's see what our architectural diagram looks like.



a.2 Update the video metadata:

a.1 Delete the video:

The user can simply call the Video service to perform these operations, so here is how the diagram looks now.



b.2 Viewer can search for the video against the video metadata:

To support search we need a different service, say a Search service, whose DB is optimized for search, like Elasticsearch. The Video service can queue the metadata to this service, so you can assume it also becomes part of video upload / metadata update / metadata deletion.

We also need to use pagination here.



b.3. Watch video on any device:

We have already done enough to support this requirement. The client first needs to call the Video service to get the video_url. The client then downloads the manifest file, and the client/player will directly stream the chunks of video from the object store as per adaptive streaming, which I have already explained.


 So now we are done with every functional requirement we have.


Step 3: Mapping non functional requirements to architectural diagram:

a.1 Content creator scalability:

There is not much to do in terms of scalability for the first requirement, as the frequency of video uploads is not high (1 video/week/user), which means ~10K video uploads in a week. However, at a particular moment we could get thousands of upload requests; to support those we can have multiple instances of the Video service and Web app service.

To tackle the video size we have already created a pipeline to compress the video, but there is still one problem:

If we bring the whole video content to the Video service first and then upload it to the object store, the whole process will consume lots of resources, and as it can take hours to upload an uncompressed video, we might end up over-scaling the Video service.

To resolve the problem we can use presigned URLs from the object store. Presigned URLs are URLs with limited permissions and a limited lifetime. With presigned URLs, the flow of video upload looks like this (a small sketch follows the list):

  1. The client sends a video upload request to the Video service.
  2. The Video service requests a presigned URL from the object store using its own permissions.
  3. The Video service sends the presigned URL to the client as the response of the video upload API.
  4. The client now uploads the video directly to the object store using the presigned URL.
  5. The rest of the pipeline remains the same.
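
A minimal sketch of steps 2-4 using S3-style presigned URLs via boto3; the bucket, object key, and expiry are placeholder assumptions.

# pip install boto3 requests
import boto3
import requests

s3 = boto3.client("s3")
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "raw-video-uploads", "Key": "videos/creator42/video123.mov"},
    ExpiresIn=3600,                      # the URL is valid for one hour
)

# The client then uploads directly to the object store, bypassing the Video service.
with open("video123.mov", "rb") as f:
    requests.put(upload_url, data=f)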

With this flow, we can see that the Video service doesn't have to scale as much and we save lots of resources, so with these changes here is how the architecture diagram looks:




a.2 Content creator availability:

We have already taken care of availability using multiple instances of the Web app and Video services. We can use cloud-native services for video transcoding and packaging, like AWS Elemental MediaConvert and AWS Elemental MediaPackage, to support availability.

We can replicate the Video service DB to support the availability.



a.3 Content Creator Performance:

We have already taken care of the performance, as there is not much to do. The only problem is when many content creators try to upload videos at the same time. In such a scenario, we might not be able to complete the video upload pipeline within hours.

We need to parallelize this process. That means we need to have multiple instances of Transcoding service and Packaging service.




a.4 Content Creator CP over AP:

To achieve this we just need to choose the right Video service DB, or a DB configuration, that supports consistency over availability. That's all!


b.1 Viewers Scalability:

To support 100K-200K daily user visits we have already scaled our Web app service and Video service, but to scale the search functionality we need multiple instances of the Search service. This takes care of the scalability of the service.

Since self-managed Elasticsearch is not cloud native, we can use AWS OpenSearch / Elastic Cloud on AWS, which autoscales.




b.2 Viewers Availability: 

We have done mostly everything for availability, but since our availability requirement here is higher, we can use multi-region deployment and a global load balancer too. This will also help with performance.


b.3. Viewers Performance:

We are using adaptive bitrate streaming for zero buffer time, but we still need to download the initial chunks from the object store, which might be expensive, so we can use a CDN. We can keep the initial chunks in the CDN to speed up the download, and then the client can use adaptive streaming to fetch the right chunks.

Please note that we could put the whole video on the CDN too, which would definitely improve the performance and the quality of the video, but it can be very expensive, so I am still opting for only the initial chunks of videos.





 

b.4. Viewer AP over CP:

We are already using lots of async operations / a message broker, which gives availability with eventual consistency. For search we are already receiving metadata updates via a queue, and Elasticsearch also favors availability over consistency, so this requirement is already satisfied.


With this we are done with every functional and non-functional requirement, and here is how our final architectural diagram looks:





That's all!

* Please note that we should have a User service here to support user login and registration, but that's fairly obvious so I have not discussed it. Given the user volume is in the hundreds of thousands, we don't need to do much.

Friday, October 18, 2024

Design Instagram

Problem: Design a highly scalable image-sharing social media platform like Instagram.

Requirements:

1. Functional Requirements:

  1. Only registered users can access the platform. They need to provide the following info while registering:
    • Mandatory: first name, last name, email, phone number, profile image, password
    • Optional: age, sex, location, interests etc.
  2. User can share/post only images
    • We need to design it in a way to extend it to videos or text.
  3. Search a user using different attributes
  4. Unidirectional relationship: User A follows User B does not mean User B also follows User A.
  5. Load a timeline of latest images posted by people they follow sorted by recency in descending order
 

2. Non-functional Requirements:

  1. Scalability:
    • ~1-2 billion active users
    • 100-500 million visits / day
    • Each user uploads ~1 image / day
    • Each image size ~2 MB.
    • Data processing volume: ~1 PB / day
  2. Availability: Prioritize availability over consistency, as it is okay even if a user does not see the latest data. We are targeting 99.99% here.
  3.  Performance: 
    • Response time < 500 ms at 99 percentile.
    • Timeline load time <1000 ms at 99 percentile.


Design:

Step 1: API design:

For the API design, let's first translate our functional requirements into a sequence diagram, which should look as follows:


Now if you see above, we can easily identify the entities and URIs:

  • Users
    • /users
    • /users/{user-id}
  • Images
    • /images
    • /images/{image-id}
  • Search
    • /search
Now let's put HTTP methods:

  • Users 
    • POST /users - Create user / register user
    • POST /users/login - Login a user
    • POST /users/{user-id}/follow - Follow a user
    • DELETE /users/{user-id}/follow - Unfollow a user
    • GET /users/{user-id}/timeline - Get user time line. This requires pagination.
  • Images
    • POST /images - Post an image
    • GET /images/{image-id} - Get an image
  • Search
    • GET /search - Search a user


Step 2: Mapping functional requirements to architectural diagram:

Before we move to our main functional requirements, I want to add a service, say a "Web app service", to serve the HTML pages, since we also support desktop web browsers.

1. Registering user to our platform:

We will have a User service with its own DB which will help with login and registration of users. The User service database will have the following fields as per FR#1:

  • user_id
  • user_name
  • first_name
  • last_name
  • email
  • password
  • phone_number
  • profile_image_url
  • and some optional fields like age, sex, interests, location etc.
If you see, we have many optional fields which can change with time: some optional fields can be removed and some can be added. That means our schema is fluid, and that's why we are going with a NoSQL document DB. Given we have around ~1-2 billion users, these schema changes become a considerable overhead.

In the schema we store the profile image URL instead of the profile image itself, because DBs are not optimized for blob storage. So we will store the profile image in a blob store or object store and then save the image URL in our DB.

Once the registration is completed, the client will get a user id and an auth token, which will help the client with further operations.



2. Post an image:

Let's have a Post service for this. We are not calling it an image service so that we can extend it to any type of post. This service will have its own DB, which contains the following fields:

  • post_id
  • user_id
  • post_url
  • timestamp
  • post_type - This will be by default image but can be extended to video etc.
Since this is a fixed schema, I am going to use a SQL DB for storing this info. Here is the flow of image upload:
  1. The user sends a request to the Post service to share an image
  2. The Post service sends the image to the object store
  3. The object store returns the URL of the uploaded image
  4. Save the image metadata and URL in the DB
  5. Return a confirmation to the user.


3. Search users using different attributes:

We could use the same User service to search users, but this DB is not optimized for search. Since we don't know in advance which attributes are searchable or what kind of search we are going to support, it's better to use a different service, say a Search service, with a DB optimized for search scenarios, like Elasticsearch or another Lucene-based DB.

So now, whenever a new user is added or there is an update to a user's record, we can queue it to the Search service. The Search service will update its DB.




4.: Follow/Unfollow user:

We could create another service to handle follow/unfollow activity, but for me that's not very useful and is unnecessary. We can use the existing User service; we just need to create another collection with the following fields:

  1. follower_user_id
  2. target_user_id 
  3. target_user_name



5. Loading the timeline of the user:

That's the most complex problem. Within the current design, here is how we can achieve it:

  • The client sends the request to the User service.
  • The User service gets all the users whom the user is following.
  • It then queries the posts of all those users, sorted by timestamp, with a page size of 20-50, using the Post service.
  • A page-size number of posts, sorted by timestamp, is sent to the client.
  • The client can then download the images from the object store using the post_urls.
This will definitely solve the problem but it is very inefficient. I know we are not solving the non-functional requirements now but we should think about the performance.

To make it efficient, we will use the CQRS pattern. We will have a new service called Timeline service. This will have an efficient key-value DB where the key is the user_id and the value is the list of post records from all the users followed by that user. Every post record will have {post_url, user_id, timestamp}.

Now here is what we do with this new service:

  • The Post service will queue new posts containing post_url, user_id, timestamp to the Timeline service.
  • The Timeline service will take the user_id and get its followers from the User service.
  • It then adds this post to the front of each follower's list of post records. It can trim the list if it has more records than we need to show in the timeline.
With this new service, our flow of showing the timeline becomes straightforward (a small sketch follows below):
  • The client requests the Timeline service for the timeline.
  • The Timeline service returns the list of post records from its key-value DB, keyed by the requesting user's user_id.
  • Now the client can download the images from the object store using the post_url.
This will be eventually consistent, but that's okay as per our non-functional requirements, and it is much more efficient than the older design.
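
A minimal sketch of this fan-out-on-write flow, using a Redis list per follower as the key-value store; key names and the timeline length are assumptions.

# pip install redis
import json
import redis

r = redis.Redis(decode_responses=True)
TIMELINE_LEN = 200

def fan_out(post: dict, follower_ids: list[str]) -> None:
    for fid in follower_ids:
        key = f"timeline:{fid}"
        r.lpush(key, json.dumps(post))      # newest post goes to the front
        r.ltrim(key, 0, TIMELINE_LEN - 1)   # keep only what the timeline needs

def read_timeline(user_id: str, page_size: int = 20) -> list[dict]:
    return [json.loads(p) for p in r.lrange(f"timeline:{user_id}", 0, page_size - 1)]

fan_out({"post_url": "https://cdn.example.com/img1.jpg", "user_id": "u1", "ts": 1729}, ["u2", "u3"])
print(read_timeline("u2"))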





Step 3: Mapping nonfunctional requirements to architectural diagram:


1. Scalability: 

If you see we have following scalability requirements here:

  • Number of users: We have 1-2 billion users' data, which will be huge, so we can't rely on just one DB instance; we have to shard the User service DB. We can shard using a hashing technique:
    • User DB: Shard on the basis of user_id.
    • Follower DB: Shard based on target_user_id, as our main use case is to get the followers, which is a call from the Timeline service.
  • For search we are already using Elasticsearch, and we can shard it too.
  • Number of visits: To support this number of visits, we have to run multiple instances of the different services behind a load balancer.
  • Number of posts: This has two parts:
    1. Data in the DB: This will be huge also so we have to shard Post DB as well as Timeline DB. Post DB can be sharded using post_id and Timeline DB using user_id.
    2. Image Data: As per the requirements we need to store petabytes of data because we are getting uncompressed images. Not only will these images take lots of space, but they are also not optimized for viewing on a mobile device or browser. We can introduce an async image processing pipeline (like the AWS Serverless Image Handler) to compress these images, which will reduce the petabytes of data to TBs or even GBs.



2. Availability:

By using multiple instances of our services behind a load balancer, we have almost achieved the availability target. We can add replicas of our DBs to meet the availability requirements. Additionally, we can do a multi-region deployment and have a global load balancer in case one region goes down. It will also improve performance.



3. Performance:

We have two performance benchmarks:

  • 500 ms for every page load: To achieve it we can use CDNs to store HTML pages, CSS and post images.
  • 1 second for timeline page load: We have already decided to use the CQRS pattern and made the Timeline service serve the timeline page. This takes care of the performance part, but it creates a different problem. There are a few thousand users whom we call celebrities / influencers, and millions of users follow them. If a celebrity makes a post, millions of entries in the Timeline service DB will be updated, which might slow down the DB and hence the performance. To tackle this situation we can take the following steps:
    1. Define which user is a celebrity; say a celebrity has 1M+ followers.
    2. Add a column IsCelebrity to the User DB which tells whether the user is a celebrity or not.
    3. When a celebrity posts an image and the Timeline service calls the User service to get the follower list, instead of returning the follower list the User service will return IsCelebrity = true.
    4. Another key-value store, say the Celebrity Post DB, will be added to the Timeline service, where the key is celebrity_user_id and the value is a sorted list of post records.
    5. When the Timeline service receives IsCelebrity = true, it just adds the post to the Celebrity Post DB.
    6. While loading the timeline, the Timeline service gets the list of celebrities the user is following from the User service, using a new REST endpoint on the User service, say GET /users/{user-id}/celebrity.
    7. The Timeline service then merges the sorted results it gets from the celebrity store and the user's own timeline store and returns them to the user (a small merge sketch follows).
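
A minimal sketch of step 7, merging the user's precomputed timeline with the celebrity posts at read time; the post-record shape follows the earlier example.

from heapq import merge

def merged_timeline(own_timeline, celebrity_posts, page_size=20):
    # Both inputs are already sorted by timestamp descending, so the merge preserves recency.
    combined = merge(own_timeline, celebrity_posts, key=lambda p: p["ts"], reverse=True)
    return list(combined)[:page_size]

own = [{"post_url": "a.jpg", "ts": 300}, {"post_url": "b.jpg", "ts": 100}]
celeb = [{"post_url": "c.jpg", "ts": 200}]
print(merged_timeline(own, celeb))   # posts ordered by ts: 300, 200, 100
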
And that's all about the performance. Now that we have addressed every requirement, functional or non-functional, here is our final architecture diagram:



Have fun!