Friday, June 20, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 2)

In Part 1, we explored how Netflix moved from a bloated token model to a more elegant, scalable architecture using a centralized Metadata Registry Service (MRS). That shift allowed Netflix to scale its ad event tracking system across vendors, while keeping client payloads small and observability high.

But Netflix wasn’t done yet.

Netflix built its own ad tech platform: the Ad Decision Server:

Netflix made a strategic bet to build an in-house advertising technology platform, including its own Ad Decision Server (ADS). This was about more than just cost or speed; it was about control, flexibility, and future-proofing. It began with a clear inventory of business and technical use cases:

  • In-house frequency capping
  • Pricing and billing for impression events
  • Robust campaign reporting and analytics
  • Scalable, multi-vendor tracking via event handler
But this Ad Event Processing Pipeline needed a significant overhaul to:
  • Match or exceed the capabilities of existing third-party solutions
  • Rapidly support new ad formats like Pause/Display ads
  • Enable advanced features like frequency capping and billing events
  • Consolidate and centralize telemetry pipelines across use cases


Road to a versatile and unified pipeline:

Netflix recognized that future ad types like Pause/Display ads would introduce new variants of telemetry. That means the upstream telemetry pipelines might vary, but downstream use‑cases remain largely consistent.

So what is upstream and what is downstream in this Netflix event processing context?

Upstream (Producers of telemetry / Events): This refers to the systems and sources that generate or send ad-related telemetry data into the event processing pipeline.

Example of upstream in this context:

  • Devices (Phones, TVs, Browsers)
  • Different ad types (e.g., Video Ads, Display Ads, Pause Ads)
  • Ad Media Players
  • Ad Servers
These upstream systems send events like:
  • Ad Started
  • Ad Paused
  • Ad Clicked
  • Token for ad metadata
These systems may:
  • Use different formats
  • Have different logging behavior
  • Be customized per ad type or region

Downstream (Consumers of processed data): This refers to the systems that consume the processed ad events for various business use cases.

Example of downstream in this context:

  • Frequency capping service
  • Billing and pricing engine
  • Reporting dashboards for advertisers
  • Analytical systems (like Druid)
  • Ad session generators
  • External vendor tracking
These downstream consumers:
  • Expect standardized, enriched, reliable event data
  • Rely on consistent schema
  • Need latency and correctness guarantees

So "upstream telemetry pipelines might vary, but downstream use‑cases remain largely consistent" means:

  • Even if ad events come from diverse sources, with different formats and behaviors (upstream variation),
  • The core processing needs (like billing, metrics, sessionization, and reporting) remain the same (downstream consistency).

Based on that, they outlined a vision:

  • Centralized ad event collection service that:
    • Decrypts tokens
    • Enriches metadata (via the Registry)
    • Hashes identifiers
    • Publishes a unified data contract, agnostic of ad type or server
  • All consumers (for reporting, capping, billing) read from this central service, ensuring clean separation and consistency. A minimal sketch of such a unified event record follows.
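
The blog does not publish the actual contract, but to make the idea concrete, here is a minimal Python sketch of what a unified, ad-type-agnostic event record could look like once tokens are decrypted, metadata is enriched, and identifiers are hashed. All field names are illustrative assumptions, not Netflix's real schema.

from dataclasses import dataclass, field
import hashlib

@dataclass
class EnrichedAdEvent:
    """Illustrative unified data contract emitted by the central collection service.
    Field names are assumptions, not Netflix's actual schema."""
    event_type: str            # e.g. "impression", "firstQuartile", "click"
    event_time_ms: int         # client-side event timestamp
    ad_id: str                 # resolved from the token via the Metadata Registry
    campaign_id: str
    ad_format: str             # "video", "pause", "display" -- agnostic of ad type or server
    hashed_device_id: str      # identifiers are hashed before publishing
    tracking_urls: list[str] = field(default_factory=list)  # enriched from the registry

def hash_identifier(raw_id: str, salt: str = "demo-salt") -> str:
    # Hash identifiers so downstream consumers never see raw device IDs.
    return hashlib.sha256((salt + raw_id).encode()).hexdigest()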


Building Ad sessions & Downstream Pipelines:

A key architectural decision was "Sessionization happens downstream, not in the ingestion flow." But what does that mean? Let's break it down:

Ingestion flow (Upstream): This is the initial (real-time) path where raw ad telemetry events are ingested into the system, like:

  •  Ad started playing
  • User clicked
  • 25% of video watched

These events are lightweight, real time and are not yet grouped into a meaningful session. Think of it as raw, granular events flying into Kafka or a central service.

Sessionization (Downstream Processing): This is a batch or stream aggregation process where the system:

  • Looks at multiple ad events from the same ad playback session
  • Groups them together
  • Creates a meaningful structure called an "Ad Session"

An Ad session might include:

  • When the ad started and ended
  • What the user's interaction was (viewed / clicked / skipped)
  • Whether it was a complete or partial view
  • What the pricing and vendor context was
So "Sessionization happens downstream, not in the ingestion flow" means:
  • The ingestion system just collects and stores raw events quickly and reliably.
  • The heavier work of aggregating these into a complete session (sessionization) happens later, in the downstream pipeline where there is:
    • More time
    • Access to more context
    • Better processing capabilities
Why was it required?
  • Separation of concerns: Ingestion stays fast and simple; sessionization logic is centralized and easier to iterate on.
  • Extensibility: You can add new logic to sessionization without touching ingestion.
  • Scalability: Real time pipelines stay light; heavy lifting is done downstream.
  • Consistency: One place to define and compute what an "ad session" means.
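
To make the idea concrete, here is a minimal Python sketch of downstream sessionization: raw telemetry events grouped by a playback/session identifier and collapsed into one "ad session" record. The event shape and field names are assumptions for illustration only.

from collections import defaultdict

def sessionize(events):
    """Group raw telemetry events into ad session records.
    Each event is a dict like {"ad_session_id": ..., "event_type": ..., "ts": ...};
    the field names are illustrative assumptions."""
    by_session = defaultdict(list)
    for e in events:
        by_session[e["ad_session_id"]].append(e)

    sessions = []
    for session_id, evts in by_session.items():
        evts.sort(key=lambda e: e["ts"])
        types = {e["event_type"] for e in evts}
        sessions.append({
            "ad_session_id": session_id,
            "start_ts": evts[0]["ts"],
            "end_ts": evts[-1]["ts"],
            "clicked": "click" in types,
            "completed": "complete" in types,   # full view vs. partial view
        })
    return sessions

# Example: two events from the same playback collapse into one session record.
print(sessionize([
    {"ad_session_id": "s1", "event_type": "impression", "ts": 100},
    {"ad_session_id": "s1", "event_type": "complete", "ts": 130},
]))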

They also built out a suite of downstream systems:
  • Frequency Capping
  • Ads Metrics 
  • Ads Sessionizer
  • Ads Event Handler 
  • Billing and revenue jobs
  • Reporting & Metrics APIs


Final Architecture of Ad Event Processing:


Ad Event Processing Pipeline


Here is the breakdown of the flow:

  1. Netflix Player -> Netflix Edge:
    • The client (TV, browser, mobile app) plays an ad and emits telemetry (e.g., impression, quartile events).
    • These events are forwarded to Netflix’s Edge infrastructure, ensuring low-latency ingestion globally.
  2. Telemetry Infrastructure:
    • Once events hit the edge, they are processed by a Telemetry Infrastructure service.
    • This layer is responsible for:
      • Validating events
      • Attaching metadata (e.g., device info, playback context)
      • Interfacing with the Metadata Registry Service (MRS) to enrich events with ad context based on token references.
  3. Kafka:
    • Events are published to a Kafka topic.
    • This decouples ingestion from downstream processing and allows multiple consumers to operate in parallel.
  4. Ads Event Publisher:
    • This is the central orchestration layer.
    • It performs:
      • Token decryption
      • Metadata enrichment (macros, tracking URLs)
      • Identifier hashing
      • Emits a unified and extensible data contract for downstream use
    • Also writes select events to the data warehouse to support Ads Finance and Billing workflows.
  5. Kafka -> Downstream Consumers:
    • Once enriched, the events are republished to Kafka for the following consumers:
      • Frequency Capping:
        • Records ad views in a Key-Value abstraction layer.
        • Helps enforce logic like “don’t show the same ad more than 3 times per day per user.”
      • Ads Metrics Processor:
        • Processes delivery metrics (e.g., impressions, completion rates).
        • Writes real-time data to Druid for interactive analytics.
      • Ads Sessionizer:
        • Groups related telemetry events into ad sessions (e.g., full playback window).
        • Feeds both data warehouse and offline vendor integrations.
      • Ads Event Handler:
        • Responsible for firing external tracking pixels and HTTP callbacks to third-party vendors (like Nielsen, DoubleVerify).
        • Ensures vendors get event pings per their integration needs.
  6. Analytics & Offline Exchange:
    • The sessionized and aggregated data from the warehouse is used for:
      • Offline measurement
      • Revenue analytics
      • Advertiser reporting
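
To illustrate step 5, here is a minimal sketch of how one downstream consumer (say, the metrics processor) might read the enriched events from Kafka using the kafka-python client. The topic name, consumer group, and event fields are assumptions, not Netflix's actual configuration.

# pip install kafka-python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "enriched_ad_events",                        # topic name is an assumption
    bootstrap_servers="localhost:9092",
    group_id="ads-metrics-processor",            # each downstream system uses its own group
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

for msg in consumer:
    event = msg.value
    if event.get("event_type") == "impression":
        # e.g. increment delivery metrics, update frequency caps, write to Druid...
        print(f"impression for ad {event.get('ad_id')} on partition {msg.partition}")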

Benefits of design:

Principle -> Implementation:
  • Separation of concerns: Ingestion (Telemetry Infra) ≠ Enrichment (Publisher) ≠ Consumption (Consumers)
  • Scalability: Kafka enables multi-consumer, partitioned scaling
  • Extensibility: Central Ads Event Publisher emits a unified schema
  • Observability: Real-time metrics via Druid; sessionized logs via warehouse
  • Partner-friendly: Vendor hooks via Event Handler + Offline exchange

By centralizing sessionization and metadata logic, Netflix ensures that:
  • Client-side logic remains lightweight and stable
  • Backend systems remain flexible to change and easy to audit

A few clarifications:

1. What is the Key-Value (KV) abstraction in Frequency Capping?

In Netflix’s Ad Event Processing Pipeline, frequency capping is the mechanism that ensures a user doesn’t see the same ad (or type of ad) more than a specified number of times, e.g., “Don’t show Ad X to user Y more than 3 times a day.”

To implement this efficiently, Netflix uses a Key-Value abstraction, which is essentially a highly scalable key-value store optimized for fast reads and writes.

How it works:
  • Key:  A unique identifier for the "audience + ad + context"
    • Example: user_id:ad_id:date
  • Value: A counter or timestamp that tracks views or interactions
    • Example: {count: 2, last_shown: 2025-06-20T12:15:30}
Let's understand it by example:

Key -> Value:
  • user123:ad789:2025-06-20 -> { count: 2, last_shown: 12:15 }
  • user123:ad456:2025-06-20 -> { count: 1, last_shown: 10:04 }

When a new ad event comes in:
  • Netflix looks up the key
  • Checks the count
  • If under the cap (e.g., 3), they allow it and increment the counter
  • If at or above the cap, the ad is skipped or replaced (a small sketch follows)
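
Here is a minimal Python sketch of that check-and-increment logic against a Redis-backed KV store. The key layout and daily cap are assumptions, and a production version would make the check-and-increment atomic (e.g., via a Lua script).

# pip install redis
from datetime import date
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
DAILY_CAP = 3

def can_show(user_id: str, ad_id: str) -> bool:
    key = f"{user_id}:{ad_id}:{date.today().isoformat()}"   # "audience + ad + context" key
    count = int(r.get(key) or 0)            # look up the counter
    if count >= DAILY_CAP:
        return False                        # at or above the cap: skip or replace the ad
    pipe = r.pipeline()
    pipe.incr(key)                          # under the cap: allow and increment
    pipe.expire(key, 24 * 60 * 60)          # the counter naturally expires after a day
    pipe.execute()
    return True
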
Why "Abstraction"?

Netflix may use multiple underlying KV stores (like Redis, Cassandra, or custom services), but abstracts them behind a unified interface so:
  • Ad engineers don’t worry about implementation
  • Multiple ad systems can share the same cap logic
  • It can scale across regions and millions of users
Benefits:

Feature -> Why it matters:
  • Speed: Real-time lookups for every ad play
  • Scalability: Handles millions of users per day
  • Flexibility: Supports different capping rules
  • Multi-tenancy: Can cap by user, device, campaign, etc.


2. What is Druid and why is it used here?

Apache Druid is a real-time analytics database designed for low-latency, high-concurrency, OLAP-style queries on large volumes of streaming or batch data.
It’s built for interactive dashboards, slice-and-dice analytics, and fast aggregations over time-series or event data.
Think of it as a high-performance hybrid of a data warehouse and a time-series database.

Netflix uses Druid in the Ads Metrics Processor and Analytics layer. Here’s what it powers:
  • Streaming Ad Delivery Metrics
    • As ads are played, Netflix captures events like:
      • Impressions
      • Quartile completions (25%, 50%, etc.)
      • Clicks, errors, skips
    • These events are streamed and aggregated in near-real-time via Flink or similar streaming processors
    • Druid ingests these metrics for interactive querying.
  • Advertiser Reporting and Internal Dashboards
    • Ad operations teams and partners want to know:
      • "How many impressions did Ad X get in Region Y yesterday?"
      • "What's the completion rate for Campaign Z?"
    • These queries are run against Druid and return in milliseconds, even over billions of records
Why does Netflix use Druid (instead of just a data warehouse like Redshift or BigQuery)?
 
Feature -> Why it matters for Netflix Ads:
  • Real-time ingestion: Metrics appear in dashboards within seconds of the event
  • Sub-second queries: Advertiser reports and internal dashboards need speed
  • High concurrency: Supports thousands of parallel queries from internal tools
  • Flexible aggregation: Group by ad_id, region, time, device_type, etc. instantly
  • Schema-on-read & roll-ups: Druid efficiently stores pre-aggregated data for common metrics
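
For a feel of what those interactive queries look like, here is a small sketch that sends a SQL query to Druid's SQL endpoint over HTTP. The datasource, column names, and router address are assumptions for illustration, not Netflix's actual setup.

import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"   # router/broker address is an assumption

query = """
SELECT ad_id, region, COUNT(*) AS impressions
FROM ads_delivery_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  AND event_type = 'impression'
GROUP BY ad_id, region
ORDER BY impressions DESC
LIMIT 20
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=10)
for row in resp.json():
    print(row)   # e.g. {"ad_id": "ad789", "region": "US", "impressions": 123456}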


Conclusion:

As Netflix shifted toward building its own advertising technology platform, the Ads Event Processing Pipeline evolved into a thoughtfully architected, modular telemetry platform capable of handling diverse ad formats, scaling globally, and meeting the demands of both real-time analytics and long-term reporting.

This transformation was not just about infrastructure, but about platformization:
  • A centralized event publisher that abstracts complexity.
  • A unified schema that serves a wide range of downstream consumers.
  • A clean separation of concerns that allows rapid innovation in ad types without touching backend logic.
  • Observability and control, made possible through systems like Druid and the Metadata Registry Service.
With this evolution, Netflix ensured:
  • Scalability: Handling billions of ad events across regions and devices
  • Agility: Enabling quick integration of new ad formats like Pause/Display Ads
  • Partner alignment: Supporting vendor integrations with minimal friction
  • Business confidence: Powering frequency capping, billing, and reporting with precision
Here is the original Netflix blog

Monday, June 16, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 1)

As Netflix steps into the ad-supported streaming world, its engineering backbone must operate with the same level of scale, reliability, and precision that powers its core viewing experience. But building a system that can track billions of ad events (impressions, clicks, quartile views) across diverse devices and ad partners is anything but trivial.

In this two-part series, I’ll dissect the architecture behind Netflix’s Ads Event Processing Pipeline, inspired by their tech blog. Part 1 focuses on the early phase, from ad request initiation to the evolution of encrypted tokens and the eventual need for a Metadata Registry Service. You'll see how Netflix tackled key design challenges such as bandwidth optimization, cross-device compatibility, vendor flexibility, and observability at scale.

If you're interested in distributed systems, real-time telemetry, or just want to peek behind the curtain of Netflix’s ad tech, you’re in the right place.

Let’s dive in!

The Initial System (Pilot) At a Glance:

When a Netflix-supported device is ready to show an ad, it doesn’t pick an ad on its own. Instead, the system goes through the following high-level steps: 

  1. Client Ad Request: The device (TV/mobile/browser) sends a request to Netflix’s Ads Manager service when it reaches an ad slot.
  2. VAST Response from Microsoft Ad Server: Ads Manager forwards this request to Microsoft’s ad server, which returns a VAST XML. This contains the actual ad media URL, tracking pixels, and metadata like vendor IDs and macros.
  3. Encrypted Token Created: Instead of sending this raw VAST data to the device, Ads Manager:
    • Extracts relevant metadata (e.g., tracking URLs, event triggers, vendor IDs)
    • Encodes it using protobuf
    • Encrypts it into a compact structure called a token
    • Sends this token along with the actual ad video URL to the device
  4. Ad Playback and Event Emission: As the ad plays, the device emits telemetry events like impression, firstQuartile, click etc., attaching the token to each event.
  5. Event Processing via Kafka: These events are queued in Kafka, where the Ads Event Handler service reads them, decrypts the token, interprets the metadata, and fires the appropriate tracking URLs.
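
The blog does not show the token format, but the general idea of step 3 (serialize the metadata compactly, then encrypt it into an opaque token) can be sketched as below. Netflix encodes the payload with protobuf; plain JSON is used here only to keep the sketch self-contained, and the key handling and field names are assumptions.

# pip install cryptography
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in reality the key would come from a key-management service
fernet = Fernet(key)

metadata = {                         # illustrative stand-in for the extracted VAST metadata
    "vendor_ids": ["nielsen", "doubleverify"],
    "tracking_urls": {"impression": "https://track.example.com/imp?ts=[TIMESTAMP]"},
}

# Serialize compactly (Netflix uses protobuf here), then encrypt into an opaque token.
token = fernet.encrypt(json.dumps(metadata).encode())

# Later, the backend decrypts the token and interprets the metadata to fire tracking URLs.
decoded = json.loads(fernet.decrypt(token))
assert decoded == metadata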




Ad Events flow


What is VAST?
VAST stands for Video Ad Serving Template. It’s an XML-based standard developed by the IAB (Interactive Advertising Bureau) to standardize the delivery of video ads from ad servers to video players (like Netflix's video player). In simpler terms:
  • VAST is a template that tells the video player:
    • What ad to play
    • Where to find it
    • How to track it (e.g., impressions, clicks, view progress)

Key Parts of a VAST Response:
  • Ad Media URL
    • Link to the video file
    • The player will use this to stream the ad.
<MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
</MediaFile>
  • Tracking Events
    • URLs the player should "ping" (via HTTP GET) when certain things happen:
      • Impression (ad started)
      • FirstQuartile (25% watched)
      • Midpoint (50% watched)
      • ThirdQuartile (75% watched)
      • Complete (100%)
      • Click (User clicked the ad)
<TrackingEvents>
  <Tracking event="impression">https://track.com/imp</Tracking>
  <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  ...
</TrackingEvents>
  • Click Tracking
    • URL to notify when the user clicks the ad.
    • Destination URL (where user is taken on click).
<VideoClicks>
  <ClickThrough>https://brand.com</ClickThrough>
  <ClickTracking>https://track.com/click</ClickTracking>
</VideoClicks>
  • Companion Ads (Optional)
    • Static banner or rich media that shows alongside the video ad

Why VAST?

Purpose -> How it helps:
  • Standardization: All ad servers speak the same language to video players
  • Tracking: Advertisers know when/how their ads were shown
  • Flexibility: Supports linear, companion, and interactive video ad formats
  • Easy integration: Netflix could plug in with Microsoft without building a custom API


Why Netflix chose protobuf and not JSON:
  • Token size matters on client devices: Protobuf is highly compact; smaller tokens mean less overhead. For example, a JSON payload that’s 1 KB might only be 100–200 bytes in protobuf.
  • Performance: Ad events are high volume (billions per day). Protobuf parsing is much faster than JSON, both on backend and client.
  • Encryption efficiency: Netflix encrypted the protobuf payload to create the token. Encrypting a compact binary blob is faster and lighter than encrypting JSON text.
  • Schema Evolution and Compatibility: Protobuf supports versioning, optional/required fields and backward/forward compatibility. That means Netflix could evolve the ad metadata schema without breaking clients or backend systems. JSON lacks this.

Feature: JSON vs. Protobuf
  • Format: JSON is text-based; protobuf is binary (compact)
  • Size: JSON is larger; protobuf is much smaller
  • Parse performance: JSON is slower; protobuf is faster
  • Schema enforcement: JSON has none (free-form); protobuf has a strongly typed schema
  • Encryption impact: JSON is heavier (more bytes to encrypt); protobuf is lighter (smaller encrypted payload)
  • Tamper resistance: JSON is weaker (can be read/edited); protobuf is stronger (harder to interpret)


Why Kafka over other queues like RabbitMQ:
  • Massive Scale and Parallelism
    • Kafka can handle billions of events/day, easily.
    • Partitions allow horizontal scaling of consumers:
      • e.g., 100 partitions allow up to 100 consumers, i.e., massive parallel processing.
    • RabbitMQ struggles when you need high fan-out and high throughput.
  • Event Replay and Reliability
    • Ad tracking must be durable (e.g., billing, audits)
    • Kafka stores messages on disk with configurable retention (e.g., 7 days)
    • If a service fails, it can re-read from the last offset, i.e., no data is lost.
    • RabbitMQ can lose messages unless you enable heavy persistence (with a perf hit).
  •  Built-in Partitioning and Ordering
    • Kafka guarantees message ordering per partition.
    • Ads Event Handler can partition by:
      • ad_id, device_id, etc.
    • This ensures all related events are processed in order, by the same consumer.
Topic: ad_events
Partitions: 100
Key: device_id (events from the same device always land in the same partition)
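
Here is a small sketch of that partitioning-by-key idea with the kafka-python producer: using device_id as the message key so all events from one device land in the same partition and are processed in order. Topic and field names are assumptions.

# pip install kafka-python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"event_type": "firstQuartile", "ad_id": "ad789", "token": "<opaque token>"}
producer.send("ad_events", key="device-42", value=event)  # same key -> same partition
producer.flush()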


Problem in the Pilot System: The Token got too smart

In the pilot, the Ads Manager was:
  • Extracting all tracking metadata from the VAST (like URLs, vendor IDs, event types)
  • Packing all of it into a token (as protobuf)
  • Encrypting and sending this token to the client
  • The token was self-contained — everything the backend needed to fire tracking URLs was inside it
It worked at small scale, but it didn't scale:

As Netflix onboarded multiple ad partners (e.g., Nielsen, DoubleVerify, Microsoft, etc.), each vendor wanted:
  • Their own tracking logic.
  • Unique macros to be replaced in URLs
  • Complex conditional tracking rules
  • Custom logic based on device type, timing, campaign, etc.

What are these macros we are talking about?

Macros are placeholder variables in tracking URLs that get dynamically replaced with actual values at runtime, before the URL is fired.
They let ad vendors capture contextual information like:
  • The timestamp when an ad was viewed
  • The type of device
  • Whether the user clicked or skipped
  • The playback position, etc.
Example:

A vendor might provide a tracking URL like:

https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]
Here:
  • [TIMESTAMP] is a macro for when the ad started playing
  • [DEVICE_TYPE] is a macro for the kind of device (TV, phone, etc.)
Before this URL is fired, those macros must be resolved to actual values:

https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv
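
A minimal sketch of that substitution step (the macro names follow the example above; real vendors differ in syntax, which is exactly the problem described next):

import time

def resolve_macros(url: str, context: dict) -> str:
    # Replace each [MACRO] placeholder with its runtime value before the URL is fired.
    for macro, value in context.items():
        url = url.replace(f"[{macro}]", str(value))
    return url

url = "https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]"
print(resolve_macros(url, {"TIMESTAMP": int(time.time() * 1000), "DEVICE_TYPE": "tv"}))
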
Why macros became a problem at scale:
  1. Inconsistent Syntax Across Vendors
    • Some used [TIMESTAMP], others used %ts%, or even custom placeholders.
    • Netflix had to implement logic per partner to resolve them correctly.
  2. Dynamic Values Grew
    • New use cases brought in more macros: region, episode title, ad ID, player version, UI language, etc.
    • Supporting new macros meant:
      • Updating the token schema
      • Updating Ads Manager logic
      • Sometimes even updating clients
  3. Difficult to Test or Validate
    • Because everything was encrypted in the token, validating whether macros were substituted correctly was hard.
    • A bug in macro resolution could break tracking silently.

This meant the token payload was growing; more vendors, more metadata, more per-event config.
So to summarize the problems, here is the table:

Problem -> Why it hurt:
  • Payload bloat: Token size increased → slower to transmit → harder to encrypt
  • Hard to evolve: Any schema change required new client logic and re-deployment
  • No dynamic behavior: Logic was fixed at ad decision time; couldn’t update later
  • Redundant processing: Every token had repeated config for the same vendors
  • Hard to debug/troubleshoot: No human-readable registry of how tokens should behave



How did Netflix solve the problem: the Metadata Registry Service (MRS):

To overcome the scaling bottlenecks of self-contained tokens, Netflix introduced a new component: the Metadata Registry Service (MRS).

Its core idea was simple but powerful: separate the metadata from the token. Instead of encoding everything inside the token, Netflix now stores reusable metadata centrally in MRS, and the token merely references it via an ID.


How MRS works:
  1. Ads Manager Gets VAST: Ads Manager still fetches the VAST response from the ad decisioning system (e.g. Microsoft’s ad server).
  2. Metadata Extracted & Stored: Instead of encrypting all metadata into a token, Ads Manager:
    • Extracts metadata (tracking pixels, macros, event mappings, etc.)
    • Stores it in Metadata Registry Service (MRS) with a registry key
  3. Token Is Now Just an ID: The token now just contains:
    • The registry key
    • A small payload (e.g. dynamic fields like ad_start_time or session ID)
  4. Device Sends Events with Token: Devices emit telemetry events with this lightweight token.
  5. Backend Looks Up MRS: Ads Event Handler uses the registry key to:
    • Fetch the tracking metadata from MRS
    • Apply macros
    • Fire tracking URLs
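
Here is a minimal sketch of the event-handler side with MRS: the token carries only a registry key plus a few dynamic fields, and the tracking metadata is looked up centrally. The registry contents and field names are illustrative assumptions; in reality the registry is a service with caching, not an in-memory dict.

REGISTRY = {  # stands in for the Metadata Registry Service (normally a cached remote lookup)
    "reg-123": {
        "tracking_urls": {"impression": "https://track.vendor.com/imp?ts=[TIMESTAMP]"},
    }
}

def handle_event(event_type: str, token: dict) -> str:
    entry = REGISTRY[token["registry_key"]]          # lookup instead of decoding a fat token
    url = entry["tracking_urls"][event_type]
    # Apply macros from the token's small dynamic payload, then fire the URL.
    return url.replace("[TIMESTAMP]", str(token["ad_start_time"]))

print(handle_event("impression", {"registry_key": "reg-123", "ad_start_time": 1717910340000}))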


Ad Events flow with MRS


So here is a summary of what exactly changed:

Before MRS -> After MRS:
  • Token = all tracking logic inside -> Token = just a reference ID
  • Tracking metadata repeated often -> Metadata stored once in the registry
  • Hard to update or debug -> Dynamic updates via the registry
  • Schema tightly coupled to client -> Centralized schema owned by the server



Benefits of MRS approach:
  1. Dynamic Metadata Updates
    • Vendor changes their URLs or macro format? Just update MRS.
    • No need to redeploy client or regenerate tokens.
  2. Improved Performance
    • Tokens are smaller, which means less encryption overhead and less network cost.
    • Registry entries are cached, which means lookups are fast and efficient.
  3. Better Observability
    • MRS is a source of truth for all campaign metadata.
    • Easy to search, debug, and trace logic across vendors and campaigns.
  4. Reusability
    • Multiple campaigns can reference the same registry entry.
    • No duplication of tracking logic.
  5. Separation of Concerns
    • Client focuses on playback and telemetry.
    • Backend owns tracking logic and event firing.

* You might be thinking there is no token logic on the client, i.e., the logic to fire tracking pixels was never on the client. The client just played the ad and emitted telemetry events along with the token, so why does client redeployment come into the picture even without MRS? In my view, the protobuf schema can become a bottleneck here: as the schema changes, client code also needs to change. With MRS, we can freeze the client payload schema.


In this first part of our deep dive into Netflix’s Ads Event Processing Pipeline, we traced the evolution from a tightly-coupled, token-heavy system to a more modular and scalable architecture powered by the Metadata Registry Service (MRS).

By decoupling metadata from tokens and centralizing macro resolution and tracking logic, Netflix unlocked:
  • Smaller, lighter tokens
  • Faster and safer iteration
  • Dynamic partner integration
  • Debuggability at scale
What started as a necessary fix for bloated tokens ended up becoming the foundation for a much more agile and partner-friendly ad tracking ecosystem.

But Netflix didn’t stop there!

In [Part 2], we’ll explore the next big leap: how Netflix brought ad decisioning in-house — giving them end-to-end control over targeting, creative rotation, pacing, and auction logic and how this plugged seamlessly into the MRS-backed pipeline we just explored.

We will learn how they:
  • Built an Ad Decision Server (ADS) from scratch
  • Integrated it with the metadata pipeline
  • Enabled real-time, personalized ad serving on a global scale

Thanks a lot and stay tuned, the story gets even more fascinating.



Monday, October 28, 2024

Design Uber

Problem: Design a highly scalable rideshare service.

Requirements:

1. Functional requirements:

  1. Only registered users can request a cab.
    • Driver registration is manual and is out of scope here.
  2. Drivers and users need to log in to request and receive rides.
  3. Drivers need to make themselves available to receive rides.
  4. Each driver can be matched to multiple riders depending on the riders and the occupancy of the car.
  5. Price of the ride depends on a base flat fee, distance and duration.
    • The exact formula is not given and is subject to change
  6. We need to minimize the:
    1. Wait time of users
    2. Distance / time for drivers to pick up the user.
  7. Collect payment information from users.
  8. Collect bank information of drivers to pay them.
  9. Driver should start/end the trip.
  10. For actual payment process, we can use third party services.


2. Nonfunctional requirements:

  1. Scalability:
    • 100 M to 200 M daily users.
    • 100 K to 200 K drivers.
    • Peak hours support where demand is 10 times of average.
    • Collect driver's location every ~5-10 seconds =~ 3B messages / day
  2. Availability:
    • 99.99% uptime or higher
  3. Performance:
    • Match users with drivers within 10 seconds at the 99th percentile
  4. AP over CP:
    • Higher availability is required here over consistency.


Design:

Step 1: API Design:

This looks like a very complex system, which makes it even more difficult to come up with all the APIs needed for the system to work.

However, we will use a sequence diagram to simplify it, as we did in our previous design problems:




Just one thing to notice: given the sequence diagram, we can see the need for a bidirectional connection between the server & drivers and also the server & riders.

From the sequence diagram we can now have the clear APIs:

Drivers:

  • POST /drivers/login
  • POST /drivers/{driver_id}/join
  • POST /drivers/{driver_id}/location
  • POST /trips/{trip_id}/start
  • POST /trips/{trip_id}/end

Riders:

  • POST /riders/ - Register a rider
  • POST /riders/login
  • POST /rides


Step 2: Mapping functional requirements to architectural diagram:

1. Only registered users can request a cab.

2. Driver and user needs to login to request and receive the ride.

7. Collect payment information from users.

8. Collect Bank information of drivers to pay them.

To serve requirements 1 and 2, we can have a single User service. The User service will have a SQL DB which contains two tables:

Rider: 

  • rider_id
  • username
  • password
  • age
  • image_url

Driver:

  • driver_id
  • username
  • password
  • license
  • vehicle
  • image_url

For the profile image we will store the image in object store and put the image url in DB.

To support requirements 7 and 8, we will have a separate Payment Service which will have its own SQL DB. This DB will contain the RiderPaymentInfo and DriverPaymentInfo tables.

RiderPaymentInfo:

  • user_id
  • credit_card
  • billing address

DriverPaymentInfo:

  • driver_id
  • bank_name
  • account_number

This payment service will connect with credit card companies / banks for the payment processing.

Here is how our architectural diagram looks like:




3. Driver needs to make himself available to receive the rides:

For this, the driver needs to call the join API and keep sending its location. We need a new service, say Driver Service, which will create a bidirectional connection using web sockets when the driver calls the join API. This has the following benefits:

  • The driver needs to send location updates very frequently, so this removes the overhead of establishing a TCP connection every time.
  • The server can also push messages to the driver, such as pick_up rider, etc.

To record these locations we will use a new service, say Location Service. The Driver service will queue these locations to the Location service. We will use a write-performant, LSM-tree-based NoSQL DB to store these location updates. It will contain driver_id and location (lat and long), and we keep updating this record until the driver is booked for a trip.





6. Rider can request the ride and the driver must be matched with minimum wait time:

4. Each driver can be matched to multiple riders depends on users and the occupancy of the car:

To request a ride, we will have a new service called Rider Service. Upon requesting the ride, a bidirectional websocket connection is established between the rider and this service, as we need to send updates to the riders.

To match a rider to a driver we will have a new service Matching Service. Let's double click into this matching service to see how to match a driver.

First approach which directly comes into our mind is simple:

  • Take the lat, long of rider. 
  • Draw a circle centered at the rider's location with a radius of 2-3 KM and try to find drivers within this circle
  • Given the location of drivers is already present in the Location Service DB, we can use this DB to get the list of drivers.
  • Take the driver with the least distance.

However, the direct distance between two points is not a good measure because of road structure, traffic, etc.

Given that the ETA calculation is complex and is not our functional requirement, we can use a third-party ETA calculation service. With this third-party service, here is how our flow changes:

  • The Matching service sends the rider's location to the Location service.
  • The Location service finds all the closest drivers as per our previous approach of a 2-3 KM radius circle.
  • The Location service now sends the list of pairs {user location, driver location} to the third-party service.
  • The third-party service sends the ETA of every pair of locations back to the Location service.
  • The Location service sends these pairs {driver_id, ETA} back to the Matching service to make the final decision.

There are a few considerations based on which we can match the driver:

  • Match the driver with the lowest ETA
  • Match a driver who is about to finish a trip near the rider.
  • Allow multiple riders to share a ride
  • Prioritize highly rated drivers for highly rated riders.
  • Premium service
  • Prioritize drivers who have earned less so far.

Once the Matching service matches the driver, it needs to store this info somewhere. For this we will have a new service, say Trip service, with its own SQL DB, which will contain source and destination, user_id, driver_id, fare, start_time, end_time, distance, etc. Here is the next set of steps:

  • Create a new Trip using the Trip service. Get the trip_id in response.
  • Send the signal to Location service to tell it that Driver is booked. It will also send the trip_info. The trip_info will contain trip_id and rider_id for sure.
  • Location service will add two new columns: Driver state and Trip Info. 
  • The Matching service now sends the trip info and driver details, including location, to the Rider service, which pushes this info to the rider.
  • Similarly, the Matching service sends the trip info and rider details, including location, to the Driver service, which pushes this info to the driver.
  • Now, when the driver sends its location to the Driver service, the Location service appends this record to its DB and also sends this location update to the Rider service, which updates the rider.

You can see how useful the web socket connection is in these scenarios.


9. Driver should start/end the trip:

Upon reaching the source/destination (or wherever the user wants to be dropped), the driver will start/end the trip by calling the APIs. This will be redirected to the Trip service. The Trip service will then notify the Location Service to change the status of the driver to In_Trip / Free.

Trip service can get the details of locations from Location Service to calculate the distance and everything else for the payments etc.



5. Price of the ride depends on base flat fee, distance and duration:

10. For actual payment process, we can use third party services:

Let's come to the post-trip part, where we need to collect the payment from the rider and send the driver fee to the driver's account. We also need to send this info as an email to the rider and the driver.

Here is the flow:

  • The Trip service will calculate the ride cost and driver fee (a small pricing sketch follows this list).
  • The Trip service will queue the trip_info, including user_id, driver_id and payment_info, to the Payment Service.
  • The Payment service will interact with the third-party service to make the payment.
  • Once the payment is successful, it will queue two messages, one for the user and one for the driver, including the trip info and their corresponding payment_info, to a new service, Notification Service.
  • The Notification service will send an email to the driver and the user.
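
Since the exact pricing formula is not given (and is subject to change), here is only a placeholder sketch of the fare calculation the Trip service might run; the base fee and rates are made-up assumptions.

def ride_price(distance_km: float, duration_min: float,
               base_fee: float = 2.0, per_km: float = 1.2, per_min: float = 0.3) -> float:
    # Base flat fee + distance component + duration component (placeholder rates).
    return round(base_fee + per_km * distance_km + per_min * duration_min, 2)

print(ride_price(distance_km=8.5, duration_min=22))  # 2.0 + 10.2 + 6.6 = 18.8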


That's all about the functional requirements 


Step 3: Mapping non functional requirements to architectural diagram:

1. Scalability:

Given the traffic, we need to run multiple instances of every service. Even though the Notification service is not a public-facing service, at peak times we might need to scale it up too.

Now the problem is with the Driver service and Rider service, as these hold persistent web socket connections. How will the Location service know which instance to contact to send a driver's location to a rider, and how will the Matching service know which instance to contact to send trip info to a driver?

To solve this issue, we will have a new service called Connection Manager, which will use a very efficient key-value NoSQL DB like Redis to maintain this mapping. Now the Location Service or Matching Service can query the Connection Manager Service to get the right instance.




2. Availability:

For this we just need to add replicas of our DBs in order to achieve availability.




3. Performance:

We need to match a rider to a driver within 10 seconds. Let's see the major steps we have for a match:

  1. Get the closest drivers list to the rider.
  2. Get the ETA of these drivers from third party service.
  3. Take the Driver with the lowest ETA.

We can't do much about step 2, and step 3 is fairly simple. Let's see how we can get the list of closest drivers.

We need to calculate the distance of every free/in_trip driver from the user's location, which is fairly expensive. Even if we somehow apply an index on the Location DB over lat and long (I am not sure how we would use it), it would still be very expensive.

Also, these floating-point calculations are a little time-consuming for computers. It looks like we won't be able to meet our latency target.

We will use geohash here to divide the earth into geo cells. These geo cell ids will be stored in the Location DB as per the driver locations. That means we are going to add a new column, geo_cell_id, to the Location DB.

Now the circle we draw centered at the user's location covers around 3 or 4 geo cells. Here is how the flow looks for step 1, getting the closest drivers:

  • Get the geo cell id as per the user's location.
  • Get all the geo cell ids covered by the 2 KM radius circle
  • Now the query to DB will be something like SELECT * from Location where geo_cell_ids in [list of geo cells]

To make the above query efficient, we can shard the Location DB based on geo_cell_id or create a hash index over it. Here is a small sketch of the lookup.
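
The sketch below uses the pygeohash library to compute the rider's cell and builds an illustrative SQL query against the Location DB. The neighboring-cell expansion is left as an assumption (real geohash libraries provide neighbor calculation).

# pip install pygeohash
import pygeohash as pgh

def nearby_cells(lat: float, lon: float, precision: int = 6) -> list[str]:
    # Precision 6 gives roughly a 1.2 km x 0.6 km cell.
    center = pgh.encode(lat, lon, precision=precision)
    # Assumption: a real system would also add the surrounding cells to cover the 2 KM circle.
    return [center]

cells = nearby_cells(12.9716, 77.5946)
query = "SELECT driver_id, lat, lon FROM location WHERE geo_cell_id IN ({})".format(
    ", ".join(f"'{c}'" for c in cells)
)
print(query)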

That's all about the performance. Now that we have covered every functional and non-functional requirement, here is our final architectural diagram:




I know the diagram became a bit messy so I'll improve it later!

Design Youtube

Problem: Design a highly scalable video on demand (VOD) streaming platform.

Requirements:

Before we jump into requirements, we should realize we have two different types of users:

  • Content creators
  • Viewers

If we try to observe how they use our platform, we can see the requirements are different, both functional and nonfunctional.

1. Functional requirements:

a. Content creators:

  1. Upload any format/codec video.
    • Once the video is uploaded, it can be deleted but not be modified.
  2. Each video should have metadata:
    • Mandatory: title, author, description.
    • Optional: List of categories / tags.
    • Metadata can be updated anytime.
  3. Get an email notification when the video is available publicly.
  4. Live streaming is not supported.

b. Viewers:

  1. Only registered users can view the content.
  2. Can search using free text in all of video metadata.
  3. Can watch videos on any kind of device (desktop / phone / TV) and under any network conditions.

2. Nonfunctional requirements:

a. Content creators:

  1. Scalability:
    • Thousands of content creators.
    • Upload 1 video/week per creator
    • Average video size ~50 GB (50 TB / week)
  2. Consistency:
    • We prefer to have consistency here over availability.
  3. Availability:
    • 99.9 percentile
  4. Performance:
    • Response time of page load < 500 ms at 99 percentile.
    • Video becomes available to view in hours.

b. Viewers:

  1. Scalability:
    • 100K-200K daily active users
  2. Availability:
    • 99.99 percentile
  3. Performance:
    • Search results and page loads < 500 ms at 99 percentile
    • Zero buffer time for video play


Design:

Step 1: API design:

As usual, for the API design let's first translate our functional requirements into a sequence diagram, which should look as follows:




Now we can easily identify the entities and URIs using above sequence diagram:

  • Users
    • /users
  • Videos:
    • /videos
    • /videos/{video_id}
  • Search:
    • /search

Now let's put the HTTP methods:

  • Users:
    • POST /users: Create / register user
    • POST /users/login: Login a user
  • Videos:
    • POST /videos: Start a video upload
    • PUT /videos/{video_id}/metadata: Create/update metadata of video
    • GET /videos/{video_id}: Get the video url
    • POST /videos/{video_id}/play: Send the video content to client
    • DELETE /videos/{video_id}: Delete a video
  • Search:
    • GET /search


Step 2: Mapping functional requirements to architectural diagram:

a.1 Content creator can upload any format/codec video:

b.3 Viewers can watch video on any device and at different network conditions:

I am taking both of these requirements together as they are interrelated. We need to upload any video in such a manner that it can be played on any device and adapted to different network conditions.

Let's first understand what a video file is.

The video file is a container which contains:

  • Video stream
  • Audio stream
  • Subtitles
  • Metadata like codec, bitrate, resolution, frame rate

The binary representation of these containers (like mpg/avi/mp4) can differ based on how the streams are encoded, and the algorithms used to encode and decode these streams are called codecs, like H.264, AV1 or VP9.

The video captured by a camera is encoded with a lossless compression algorithm, which makes it suitable for professional editing, but it is too big in size, so this kind of codec is not suitable for streaming and storage at scale. The first step is therefore to apply a lossy compression algorithm. This method of converting an encoded stream into a differently encoded stream is called transcoding.

Size of a video = Bit rate (bits/second) * Video length (seconds)

So it is obvious that we need to reduce the bit rate in order to reduce the size of the video, but given that we need to support different devices and different network bandwidths, we can't depend on only one bit rate. We need to produce multiple output files supporting different bit rates, which are directly proportional to resolutions. The standard resolutions are 360p, 480p, 720p, 1080p and 4K.

This will partially take care of supporting multiple devices, but it won't handle varying network conditions, like a home network being shared by multiple users or a user who is travelling. To support this requirement, we will use a technique called adaptive bitrate (adaptive streaming).

In adaptive streaming, we break our stream into multiple small chunks of, say, 5 or 10 seconds. We put references to all of these streams in a text file called a manifest (e.g., an MPD). When the player tries to play the video, it first downloads this manifest file, chooses a default resolution (say 720p), and plays the first 4-5 chunks to gauge the network conditions. If the network conditions are good, it switches to a better resolution, say 1080p or even 4K. If there are download delays, it drops to lower-resolution chunks. The player keeps analysing chunk download speed to decide which resolution to go for.
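
A toy sketch of that player-side heuristic: start at a default rendition and step up or down based on measured chunk throughput. The bitrate ladder and thresholds are illustrative assumptions, not a real player implementation.

LADDER_KBPS = {"360p": 800, "480p": 1200, "720p": 2500, "1080p": 5000, "4K": 15000}
RENDITIONS = list(LADDER_KBPS)   # ordered from lowest to highest quality

def next_rendition(current: str, measured_kbps: float) -> str:
    i = RENDITIONS.index(current)
    if measured_kbps > LADDER_KBPS[current] * 1.5 and i < len(RENDITIONS) - 1:
        return RENDITIONS[i + 1]      # plenty of headroom: step up
    if measured_kbps < LADDER_KBPS[current] and i > 0:
        return RENDITIONS[i - 1]      # downloads lagging: step down
    return current

print(next_rendition("720p", measured_kbps=6000))  # -> "1080p"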

The next step is to fully support every kind of device. For this we need to package our video content to support different streaming protocols, since different OSes/browsers support different protocols. Here we can also apply DRM (Digital Rights Management) to protect our video, which supports FR# b.1 (only registered users can watch the video). Using DRM we can also support subscriptions when we want to introduce them.

Now you can see there are several steps we need to take in order to upload the video and make it available for different devices, locations and network bandwidths. We will use the pipes-and-filters pattern here to support it.

We will have a Video service as the public interface for this whole activity. This service will have its own DB, which is a NoSQL DB in order to support a fluid schema. So here is the flow of video upload:

  1. The content creator calls the Video service to upload a video.
  2. The Video service starts the upload to the object store asynchronously, saves the metadata in its DB, and returns a confirmation to the user with the video id.
  3. Once the video upload to the object store is complete, it queues a message to a new service, the Transcoding service.
  4. The Transcoding service first converts the video into 5-10-second chunks, transcodes each chunk into multiple resolutions, and uploads them to its object store. It also generates the manifest file for adaptive streaming.
  5. The Transcoding service then queues a message to a new service, say the Packaging service.
  6. The Packaging service packages these streams according to the streaming protocols and saves them into its own object store.
  7. The Packaging service then queues the video_id and video_url (which is ultimately the manifest download) back to the Video service, and the Video service updates its DB using the video_id.

We can debate a point about the Transcoding service: we could propose a new service to break the uncompressed video into chunks and queue those chunks so the Transcoding service only transcodes them in parallel. But we can achieve the same thing using multithreading in the Transcoding service itself. That way it is much easier to debug issues and also much easier to support restartability, as all the chunking and transcoding happens in just one service.

We can support FR# a.3 (notify the creator when the video is publicly available) here by adding a new service, say a Notification service. The Video service will queue the video details to this service, and the Notification service can then send the notification (email) to the content creator.

With this knowledge, let's see what our architectural diagram looks like.



a.2 Update the video metadata:

a.1 Delete the video:

The user can simply call the Video service to perform these operations, so here is how the diagram looks now.



b.2 Viewer can search for the video against the video metadata:

To support search we need a different service, say a Search service, whose DB is optimized for search, like Elasticsearch. The Video service can queue the metadata to this service, so you can assume it also becomes part of video upload / metadata update / metadata deletion.

We also need to use pagination here.



b.3. Watch video on any device:

We have already done enough to support this requirement. The client first needs to call the Video service to get the video_url. The client then downloads the manifest file, and the client/player will directly stream the chunks of video from the object store as per adaptive streaming, which I have already explained.


 So now we are done with every functional requirement we have.


Step 3: Mapping non functional requirements to architectural diagram:

a.1 Content creator scalability:

There is not much to do in terms of scalability for the first requirement, as the frequency of video uploads is not high (1 video/week/user), which means ~10K video uploads in a week. However, at a particular moment we could get thousands of upload requests; to support those we can have multiple instances of the Video service and Web app service.

To tackle the video size we have already created a pipeline to compress the video, but there is still one problem:

If we bring the whole video content to the Video service first and then upload it to the object store, the whole process will consume lots of resources, and as it can take hours to upload an uncompressed video, we might end up over-scaling the Video service.

To resolve the problem we can use presigned URLs from the object store. Presigned URLs are URLs with limited permissions and a limited lifetime. With presigned URLs, the flow of video upload looks like this (a small sketch follows the list):

  1. The client sends a video upload request to the Video service.
  2. The Video service requests a presigned URL from the object store using its own permissions.
  3. The Video service sends the presigned URL to the client as the response of the video upload API.
  4. The client now uploads the video directly to the object store using the presigned URL.
  5. The rest of the pipeline remains the same.
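
A minimal sketch of steps 2-4 using S3-style presigned URLs via boto3; the bucket, object key, and expiry are placeholder assumptions.

# pip install boto3 requests
import boto3
import requests

s3 = boto3.client("s3")
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "raw-video-uploads", "Key": "videos/creator42/video123.mov"},
    ExpiresIn=3600,                      # the URL is valid for one hour
)

# The client then uploads directly to the object store, bypassing the Video service.
with open("video123.mov", "rb") as f:
    requests.put(upload_url, data=f)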

With this flow, we can see that the Video service doesn't have to scale as much and we save lots of resources, so with these changes here is how the architecture diagram looks:




a.2 Content creator availability:

We have already taken care of availability using multiple instances of the Web app and Video services. We can use cloud-native services for video transcoding and packaging, like AWS Elemental MediaConvert and AWS Elemental MediaPackage, to support availability.

We can replicate the Video service DB to support the availability.



a.3 Content Creator Performance:

We have already taken care of the performance, as there is not much to do. The only problem is when many content creators try to upload videos at the same time. In such a scenario, we might not be able to complete the video upload pipeline within hours.

We need to parallelize this process. That means we need to have multiple instances of Transcoding service and Packaging service.




a.4 Content Creator CP over AP:

To achieve this we just need to choose the right Video service DB, or a DB configuration, that supports consistency over availability. That's all!


b.1 Viewers Scalability:

To support 100K-200K daily user visits we have already scaled our Web app service and Video service, but to scale the search functionality we need multiple instances of the Search service. This takes care of the scalability of the service.

Since self-managed Elasticsearch is not cloud native, we can use AWS OpenSearch / Elastic Cloud on AWS, which autoscales.




b.2 Viewers Availability: 

We have done mostly everything for availability, but since our availability requirement here is higher, we can use multi-region deployment and a global load balancer too. This will also help with performance.


b.3. Viewers Performance:

We are using adaptive bitrate streaming for zero buffer time, but we still need to download the initial chunks from the object store, which might be expensive, so we can use a CDN. We can keep the initial chunks in the CDN to speed up the download, and then the client can use adaptive streaming to fetch the right chunks.

Please note that we could put the whole video on the CDN too, which would definitely improve the performance and the quality of the video, but it can be very expensive, so I am still opting for only the initial chunks of videos.





 

b.4. Viewer AP over CP:

We are already using lots of async operations / a message broker, which gives availability with eventual consistency. For search we are already receiving metadata updates via a queue, and Elasticsearch also favors availability over consistency, so this requirement is already satisfied.


With this we are done with every functional and non-functional requirement, and here is how our final architectural diagram looks:





That's all!

* Please note that we should have a User service here to support user login and registration, but that's fairly obvious so I have not discussed it. Given the user volume is in the hundreds of thousands, we don't need to do much.

Friday, October 18, 2024

Design Instagram

Problem: Design a highly scalable image-sharing social media platform like Instagram.

Requirements:

1. Functional Requirements:

  1. Only registered users can access the platform. They need to provide the following info while registering:
    • Mandatory: first name, last name, email, phone number, profile image, password
    • Optional: age, sex, location, interests etc.
  2. User can share/post only images
    • We need to design it in a way to extend it to videos or text.
  3. Search a user using different attributes
  4. Unidirectional relationship: User A follows User B does not mean User B also follows User A.
  5. Load a timeline of latest images posted by people they follow sorted by recency in descending order
 

2. Non-functional Requirements:

  1. Scalability:
    • ~1-2 billion active users
    • 100-500 million visits / day
    • Each user uploads ~1 image / day
    • Each image size ~2 MB.
    • Data processing volume: ~1 PB / day
  2. Availability: Prioritize availability over consistency, as it is okay even if a user does not see the latest data. We are targeting 99.99% here.
  3.  Performance: 
    • Response time < 500 ms at 99 percentile.
    • Timeline load time <1000 ms at 99 percentile.


Design:

Step 1: API design:

For the API design, let's first translate our functional requirements into a sequence diagram, which should look as follows:


Now if you see above, we can easily identify the entities and URIs:

  • Users
    • /users
    • /users/{user-id}
  • Images
    • /images
    • /images/{image-id}
  • Search
    • /search
Now let's put HTTP methods:

  • Users 
    • POST /users - Create user / register user
    • POST /users/login - Login a user
    • POST /users/{user-id}/follow - Follow a user
    • DELETE /users/{user-id}/follow - Unfollow a user
    • GET /users/{user-id}/timeline - Get user time line. This requires pagination.
  • Images
    • POST /images - Post an image
    • GET /images/{image-id} - Get an image
  • Search
    • GET /search - Search a user


Step 2: Mapping functional requirements to architectural diagram:

Before we move to our main functional requirements, I want to add a service, say a "Web app service", to serve the HTML pages, since we also support desktop web browsers.

1. Registering user to our platform:

We will have a User service with its own DB which will help with login and registration of users. The User service database will have the following fields as per FR#1:

  • user_id
  • user_name
  • first_name
  • last_name
  • email
  • password
  • phone_number
  • profile_image_url
  • and some optional fields like age, sex, interests, location etc.
If you see, we have many optional fields which can change with time: some optional fields can be removed and some can be added. That means our schema is fluid, and that's why we are going with a NoSQL document DB. Given we have around ~1-2 billion users, these schema changes become a considerable overhead.

In the schema we store the profile image URL instead of the profile image itself, because DBs are not optimized for blob storage. So we will store the profile image in a blob store or object store and then save the image URL in our DB.

Once the registration is completed, the client will get a user id and an auth token, which will help the client with further operations.



2. Post an image:

Let's have a Post service for this. We are not calling it an image service so that we can extend it to any type of post. This service will have its own DB, which contains the following fields:

  • post_id
  • user_id
  • post_url
  • timestamp
  • post_type - This will be by default image but can be extended to video etc.
Since this is a fixed schema, I am going to use a SQL DB for storing this info. Here is the flow of image upload:
  1. The user sends a request to the Post service to share an image
  2. The Post service sends the image to the object store
  3. The object store returns the URL of the uploaded image
  4. Save the image metadata and URL in the DB
  5. Return a confirmation to the user.


3. Search users using different attributes:

We could use the same User service to search users, but this DB is not optimized for search. Since we don't know in advance which attributes are searchable or what kind of search we are going to support, it's better to use a different service, say a Search service, with a DB optimized for search scenarios, like Elasticsearch or another Lucene-based DB.

So now, whenever a new user is added or there is an update to a user's record, we can queue it to the Search service. The Search service will update its DB.




4.: Follow/Unfollow user:

We could create another service to handle follow/unfollow activity, but for me that's not very useful and is unnecessary. We can use the existing User service; we just need to create another collection with the following fields:

  1. follower_user_id
  2. target_user_id 
  3. target_user_name



5. Loading the timeline of the user:

That's the most complex problem. Within the current design, here is how we can achieve it:

  • The client sends the request to the User service.
  • The User service gets all the users whom the user is following.
  • It then queries the posts of all those users, sorted by timestamp, with a page size of 20-50, using the Post service.
  • A page-size number of posts, sorted by timestamp, is sent to the client.
  • The client can then download the images from the object store using the post_urls.
This will definitely solve the problem but it is very inefficient. I know we are not solving the non-functional requirements now but we should think about the performance.

To make it efficient, we will use the CQRS pattern. We will have a new service called Timeline service. This will have an efficient key-value DB where the key is the user_id and the value is the list of post records from all the users followed by that user. Every post record will have {post_url, user_id, timestamp}.

Now here is what we do with this new service:

  • The Post service will queue new posts containing post_url, user_id, timestamp to the Timeline service.
  • The Timeline service will take the user_id and get its followers from the User service.
  • It then adds this post to the front of each follower's list of post records. It can trim the list if it has more records than we need to show in the timeline.
With this new service, our flow of showing the timeline becomes straightforward (a small sketch follows below):
  • The client requests the Timeline service for the timeline.
  • The Timeline service returns the list of post records from its key-value DB, keyed by the requesting user's user_id.
  • Now the client can download the images from the object store using the post_url.
This will be eventually consistent, but that's okay as per our non-functional requirements, and it is much more efficient than the older design.
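
A minimal sketch of this fan-out-on-write flow, using a Redis list per follower as the key-value store; key names and the timeline length are assumptions.

# pip install redis
import json
import redis

r = redis.Redis(decode_responses=True)
TIMELINE_LEN = 200

def fan_out(post: dict, follower_ids: list[str]) -> None:
    for fid in follower_ids:
        key = f"timeline:{fid}"
        r.lpush(key, json.dumps(post))      # newest post goes to the front
        r.ltrim(key, 0, TIMELINE_LEN - 1)   # keep only what the timeline needs

def read_timeline(user_id: str, page_size: int = 20) -> list[dict]:
    return [json.loads(p) for p in r.lrange(f"timeline:{user_id}", 0, page_size - 1)]

fan_out({"post_url": "https://cdn.example.com/img1.jpg", "user_id": "u1", "ts": 1729}, ["u2", "u3"])
print(read_timeline("u2"))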





Step 3: Mapping nonfunctional requirements to architectural diagram:


1. Scalability: 

If you see we have following scalability requirements here:

  • Number of users: We have 1-2 billion users' data, which will be huge, so we can't rely on just one DB instance; we have to shard the User service DB. We can shard using a hashing technique:
    • User DB: Shard on the basis of user_id.
    • Follower DB: Shard based on target_user_id, as our main use case is to get the followers, which is a call from the Timeline service.
  • For search we are already using Elasticsearch, and we can shard it too.
  • Number of visits: To support this number of visits, we have to run multiple instances of the different services behind a load balancer.
  • Number of posts: This has two parts:
    1. Data in the DB: This will be huge also so we have to shard Post DB as well as Timeline DB. Post DB can be sharded using post_id and Timeline DB using user_id.
    2. Image Data: As per the requirements we need to store petabytes of data because we are getting uncompressed images. Not only will these images take lots of space, but they are also not optimized for viewing on a mobile device or browser. We can introduce an async image processing pipeline (like the AWS Serverless Image Handler) to compress these images, which will reduce the petabytes of data to TBs or even GBs.



2. Availability:

By using multiple instances of our services behind a load balancer, we have almost achieved the availability target. We can add replicas of our DBs to meet the availability requirements. Additionally, we can do a multi-region deployment and have a global load balancer in case one region goes down. It will also improve performance.



3. Performance:

We have two performance benchmarks:

  • 500 ms for every page load: To achieve it we can use CDNs to store HTML pages, CSS and post images.
  • 1 second for timeline page load: We have already decided to use the CQRS pattern and made the Timeline service serve the timeline page. This takes care of the performance part, but it creates a different problem. There are a few thousand users whom we call celebrities / influencers, and millions of users follow them. If a celebrity makes a post, millions of entries in the Timeline service DB will be updated, which might slow down the DB and hence the performance. To tackle this situation we can take the following steps:
    1. Define which user is a celebrity; say a celebrity has 1M+ followers.
    2. Add a column IsCelebrity to the User DB which tells whether the user is a celebrity or not.
    3. When a celebrity posts an image and the Timeline service calls the User service to get the follower list, instead of returning the follower list the User service will return IsCelebrity = true.
    4. Another key-value store, say the Celebrity Post DB, will be added to the Timeline service, where the key is celebrity_user_id and the value is a sorted list of post records.
    5. When the Timeline service receives IsCelebrity = true, it just adds the post to the Celebrity Post DB.
    6. While loading the timeline, the Timeline service gets the list of celebrities the user is following from the User service, using a new REST endpoint on the User service, say GET /users/{user-id}/celebrity.
    7. The Timeline service then merges the sorted results it gets from the celebrity store and the user's own timeline store and returns them to the user (a small merge sketch follows).
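
A minimal sketch of step 7, merging the user's precomputed timeline with the celebrity posts at read time; the post-record shape follows the earlier example.

from heapq import merge

def merged_timeline(own_timeline, celebrity_posts, page_size=20):
    # Both inputs are already sorted by timestamp descending, so the merge preserves recency.
    combined = merge(own_timeline, celebrity_posts, key=lambda p: p["ts"], reverse=True)
    return list(combined)[:page_size]

own = [{"post_url": "a.jpg", "ts": 300}, {"post_url": "b.jpg", "ts": 100}]
celeb = [{"post_url": "c.jpg", "ts": 200}]
print(merged_timeline(own, celeb))   # posts ordered by ts: 300, 200, 100
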
And that's all about the performance. Now that we have addressed every requirement, functional or non-functional, here is our final architecture diagram:



Have fun!