
Friday, June 20, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 2)

In Part 1, we explored how Netflix moved from a bloated token model to a more elegant, scalable architecture using a centralized Metadata Registry Service (MRS). That shift allowed Netflix to scale its ad event tracking system across vendors, while keeping client payloads small and observability high.

But Netflix wasn’t done yet.

Netflix built its own ad tech platform: the Ad Decision Server:

Netflix made a strategic bet to build an in-house advertising technology platform, including its own Ad Decision Server (ADS). This was about more than just cost or speed; it was about control, flexibility, and future-proofing. The effort began with a clear inventory of business and technical use cases:

  • In-house frequency capping
  • Pricing and billing for impression events
  • Robust campaign reporting and analytics
  • Scalable, multi-vendor tracking via event handler
But this Ad Event Processing Pipeline needed a significant overhaul to:
  • Match or exceed the capabilities of existing third-party solutions
  • Rapidly support new ad formats like Pause/Display ads
  • Enable advanced features like frequency capping and billing events
  • Consolidate and centralize telemetry pipelines across use cases


Road to a versatile and unified pipeline:

Netflix recognized that future ad types like Pause/Display ads would introduce variant telemetry. That means the upstream telemetry pipelines might vary, but downstream use cases remain largely consistent.

So what is upstream and what is downstream in this Netflix event-processing context?

Upstream (Producers of telemetry / Events): This refers to the systems and sources that generate or send ad-related telemetry data into the event processing pipeline.

Examples of upstream systems in this context:

  • Devices (Phones, TVs, Browsers)
  • Different Ad Types (e.g., Video Ads, Display Ads, Pause Ads)
  • Ad Media Players
  • Ad Servers
These upstream systems send events like:
  • Ad Started
  • Ad Paused
  • Ad Clicked
  • Token for ad metadata
These systems may:
  • Use different formats
  • Have different logging behavior
  • Be customized per ad type or region

Downstream (Consumers of processed data): This refers to the systems that consume the processed ad events for various business use cases.

Examples of downstream systems in this context:

  • Frequency capping service
  • Billing and pricing engine
  • Reporting dashboards for advertisers
  • Analytical systems (like Druid)
  • Ad session generators
  • External vendor tracking
These downstream consumers:
  • Expect standardized, enriched, reliable event data
  • Rely on consistent schema
  • Need latency and correctness guarantees

So "upstream telemetry pipelines might vary, but downstream use‑cases remain largely consistent" means:

  • Even if ad events come from diverse sources, with different formats and behaviors (upstream variation),
  • The core processing needs (like billing, metrics, sessionization, and reporting) remain the same (downstream consistency).

Based on that, they outlined a vision:

  • Centralized ad event collection service that:
    • Decrypts tokens
    • Enriches metadata (via the Registry)
    • Hashes identifiers
    • Publishes a unified data contract, agnostic of ad type or server
  • All consumers (for reporting, capping, billing) read from this central service, ensuring clean separation and consistency. A minimal sketch of such a unified contract follows below.
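To make the "unified data contract, agnostic of ad type or server" idea concrete, here is a minimal sketch in Python. The class and field names (UnifiedAdEvent, hashed_device_id, vendor_tracking, etc.) are illustrative assumptions, not Netflix's actual schema.

from dataclasses import dataclass, field

# Hypothetical unified ad event contract: every ad type (video, pause, display)
# is normalized into this one shape before any downstream consumer sees it.
@dataclass
class UnifiedAdEvent:
    event_id: str                # unique id, useful for de-duplication
    event_type: str              # "impression", "firstQuartile", "click", ...
    event_time_ms: int           # client timestamp, epoch millis
    ad_id: str                   # enriched via the Metadata Registry
    campaign_id: str
    ad_format: str               # "video", "pause", "display", ...
    hashed_device_id: str        # identifiers are hashed before publishing
    vendor_tracking: dict = field(default_factory=dict)   # resolved vendor tracking URLs
    extras: dict = field(default_factory=dict)            # extension point for new ad types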


Building Ad sessions & Downstream Pipelines:

A key architectural decision was "Sessionization happens downstream, not in the ingestion flow." But what does that mean? Let's break it down:

Ingestion flow (Upstream): This is the initial (real-time) path where raw ad telemetry events are ingested into the system, like:

  •  Ad started playing
  • User clicked
  • 25% of video watched

These events are lightweight, real time and are not yet grouped into a meaningful session. Think of it as raw, granular events flying into Kafka or a central service.

Sessionization (Downstream Processing): This is a batch or stream aggregation process where the system:

  • Looks at multiple ad events from the same ad playback session
  • Groups them together
  • Creates a meaningful structure called an "Ad Session"

An Ad session might include:

  • When the ad started and ended
  • What the user’s interaction was (viewed / clicked / skipped)
  • Whether it was a complete or partial view
  • What the pricing and vendor context was
So "Sessionization happens downstream, not in the ingestion flow" means:
  • The ingestion system just collects and stores raw events quickly and reliably.
  • The heavier work of aggregating these into a complete session (sessionization) happens later, in the downstream pipeline where there is:
    • More time
    • Access to more context
    • Better processing capabilities
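To make sessionization less abstract, here is a rough Python sketch that groups raw telemetry events by an assumed ad_session_id and folds them into a single "Ad Session" record. The field names and rules are illustrative assumptions, not Netflix's actual Ads Sessionizer logic.

from collections import defaultdict

# Toy sessionizer: group raw events by (assumed) ad_session_id and summarize.
# A real sessionizer also handles late/out-of-order events, watermarks,
# and richer pricing/vendor context.
def sessionize(raw_events):
    by_session = defaultdict(list)
    for event in raw_events:
        by_session[event["ad_session_id"]].append(event)

    sessions = []
    for session_id, events in by_session.items():
        events.sort(key=lambda e: e["event_time_ms"])
        seen = {e["event_type"] for e in events}
        sessions.append({
            "ad_session_id": session_id,
            "ad_id": events[0]["ad_id"],
            "started_at": events[0]["event_time_ms"],
            "ended_at": events[-1]["event_time_ms"],
            "clicked": "click" in seen,
            "completed": "complete" in seen,   # complete vs. partial view
        })
    return sessions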
Why was it required?
  • Separation of concerns: Ingestion stays fast and simple; sessionization logic is centralized and easier to iterate on.
  • Extensibility: You can add new logic to sessionization without touching ingestion.
  • Scalability: Real time pipelines stay light; heavy lifting is done downstream.
  • Consistency: One place to define and compute what an "ad session" means.

They also built out a suite of downstream systems:
  • Frequency Capping
  • Ads Metrics 
  • Ads Sessionizer
  • Ads Event Handler 
  • Billing and revenue jobs
  • Reporting & Metrics APIs


Final Architecture of Ad Event Processing:


Ad Event Processing Pipeline


Here is the breakdown of the flow:

  1. Netflix Player -> Netflix Edge:
    • The client (TV, browser, mobile app) plays an ad and emits telemetry (e.g., impression, quartile events).
    • These events are forwarded to Netflix’s Edge infrastructure, ensuring low-latency ingestion globally.
  2. Telemetry Infrastructure:
    • Once events hit the edge, they are processed by a Telemetry Infrastructure service.
    • This layer is responsible for:
      • Validating events
      • Attaching metadata (e.g., device info, playback context)
      • Interfacing with the Metadata Registry Service (MRS) to enrich events with ad context based on token references.
  3. Kafka:
    • Events are published to a Kafka topic.
    • This decouples ingestion from downstream processing and allows multiple consumers to operate in parallel.
  4. Ads Event Publisher:
    • This is the central orchestration layer.
    • It performs:
      • Token decryption
      • Metadata enrichment (macros, tracking URLs)
      • Identifier hashing
      • Emits a unified and extensible data contract for downstream use
    • Also writes select events to the data warehouse to support Ads Finance and Billing workflows.
  5. Kafka -> Downstream Consumers:
    • Once enriched, the events are republished to Kafka for the following consumers:
      • Frequency Capping:
        • Records ad views in a Key-Value abstraction layer.
        • Helps enforce logic like “don’t show the same ad more than 3 times per day per user.”
      • Ads Metrics Processor:
        • Processes delivery metrics (e.g., impressions, completion rates).
        • Writes real-time data to Druid for interactive analytics.
      • Ads Sessionizer:
        • Groups related telemetry events into ad sessions (e.g., full playback window).
        • Feeds both data warehouse and offline vendor integrations.
      • Ads Event Handler:
        • Responsible for firing external tracking pixels and HTTP callbacks to third-party vendors (like Nielsen, DoubleVerify).
        • Ensures vendors get event pings per their integration needs.
  6. Analytics & Offline Exchange:
    • The sessionized and aggregated data from the warehouse is used for:
      • Offline measurement
      • Revenue analytics
      • Advertiser reporting
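To tie steps 2 through 5 together, here is a hedged Python sketch of the enrichment hop in the middle of the flow: take a raw event off Kafka, resolve its token, hash the identifier, and emit the unified record. The helper names (decrypt_token, lookup_metadata) and field names are hypothetical stand-ins, not Netflix's real services or schema.

import hashlib

# Conceptual sketch of the Ads Event Publisher's job for one event.
# decrypt_token and lookup_metadata are injected stand-ins for the token
# service and the Metadata Registry Service.
def enrich(raw_event, decrypt_token, lookup_metadata):
    token = decrypt_token(raw_event["token"])            # registry key + small dynamic payload
    metadata = lookup_metadata(token["registry_key"])    # tracking URLs, macros, vendor config
    hashed_id = hashlib.sha256(raw_event["device_id"].encode()).hexdigest()
    return {
        "event_type": raw_event["event_type"],
        "event_time_ms": raw_event["event_time_ms"],
        "ad_id": metadata["ad_id"],
        "campaign_id": metadata["campaign_id"],
        "hashed_device_id": hashed_id,                    # raw identifier never leaves this hop
        "vendor_tracking": metadata["tracking_urls"],     # used later by the Ads Event Handler
    }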

Benefits of design:

  • Separation of concerns: Ingestion (Telemetry Infra) ≠ Enrichment (Publisher) ≠ Consumption (Consumers)
  • Scalability: Kafka enables multi-consumer, partitioned scaling
  • Extensibility: Central Ads Event Publisher emits a unified schema
  • Observability: Real-time metrics via Druid; sessionized logs via warehouse
  • Partner-friendly: Vendor hooks via Event Handler + Offline exchange

By centralizing sessionization and metadata logic, Netflix ensures that:
  • Client-side logic remains lightweight and stable
  • Backend systems remain flexible to change and easy to audit

A few clarifications:

1. What is the Key-Value (KV) abstraction in Frequency Capping?

In Netflix’s Ad Event Processing Pipeline, frequency capping is the mechanism that ensures a user doesn’t see the same ad (or type of ad) more than a specified number of times, e.g., “Don’t show Ad X to user Y more than 3 times a day.”

To implement this efficiently, Netflix uses a Key-Value abstraction, which is essentially a highly scalable key-value store optimized for fast reads and writes.

How it works:
  • Key:  A unique identifier for the "audience + ad + context"
    • Example: user_id:ad_id:date
  • Value: A counter or timestamp that tracks views or interactions
    • Example: {count: 2, last_shown: 2025-06-20T12:15:30}
Let's understand it with an example:

  • Key: user123:ad789:2025-06-20 → Value: { count: 2, last_shown: 12:15 }
  • Key: user123:ad456:2025-06-20 → Value: { count: 1, last_shown: 10:04 }

When a new ad event comes in:
  • Netflix looks up the key
  • Checks the count
  • If under the cap (e.g., 3), they allow it and increment the counter
  • If at or above cap, the ad is skipped or replaced
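Here is a minimal sketch of that check-and-increment flow, using a plain Python dict in place of the distributed KV abstraction. The daily cap of 3 and the key format mirror the example above; everything else is an illustrative assumption.

from datetime import datetime, timezone

cap_store = {}        # stand-in for the distributed key-value abstraction
DAILY_CAP = 3

def should_show_ad(user_id, ad_id, now=None):
    now = now or datetime.now(timezone.utc)
    key = f"{user_id}:{ad_id}:{now.date()}"              # audience + ad + day
    entry = cap_store.get(key, {"count": 0, "last_shown": None})
    if entry["count"] >= DAILY_CAP:
        return False                                      # at or above cap: skip or replace the ad
    entry["count"] += 1                                   # under cap: allow and increment
    entry["last_shown"] = now.isoformat()
    cap_store[key] = entry
    return True

print(should_show_ad("user123", "ad789"))   # True for the first three calls on a given day, then False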
Why "Abstraction"?

Netflix may use multiple underlying KV stores (like Redis, Cassandra, or custom services), but abstracts them behind a unified interface so:
  • Ad engineers don’t worry about implementation
  • Multiple ad systems can share the same cap logic
  • It can scale across regions and millions of users
Benefits:

  • Speed: Real-time lookups for every ad play
  • Scalability: Handles millions of users per day
  • Flexibility: Supports different capping rules
  • Multi-tenancy: Can cap by user, device, campaign, etc.


2. What is Druid and why is it used here?

Apache Druid is a real-time analytics database designed for low-latency, high-concurrency, OLAP-style queries on large volumes of streaming or batch data.
It’s built for interactive dashboards, slice-and-dice analytics, and fast aggregations over time-series or event data.
Think of it as a high-performance hybrid of a data warehouse and a time-series database.

Netflix uses Druid in the Ads Metrics Processor and Analytics layer. Here’s what it powers:
  • Streaming Ad Delivery Metrics
    • As ads are played, Netflix captures events like:
      • Impressions
      • Quartile completions (25%, 50%, etc.)
      • Clicks, errors, skips
    • These events are streamed and aggregated in near-real-time via Flink or similar streaming processors
    • Druid ingests these metrics for interactive querying.
  • Advertiser Reporting and Internal Dashboards
    • Ad operations teams and partners want to know:
      • "How many impressions did Ad X get in Region Y yesterday?"
      • "What's the completion rate for Campaign Z?"
    • These queries are run against Druid and return in milliseconds, even over billions of records
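For a flavor of what those queries look like, here is a sketch that issues a Druid SQL query over Druid's HTTP SQL endpoint from Python. The broker URL, the ad_metrics datasource, and the column names are assumptions for illustration, not Netflix's actual setup.

import requests

# Druid exposes SQL over HTTP; __time is Druid's built-in event-time column.
DRUID_SQL_URL = "http://druid-broker:8082/druid/v2/sql"

sql = """
SELECT ad_id, region, COUNT(*) AS impressions
FROM ad_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  AND event_type = 'impression'
GROUP BY ad_id, region
ORDER BY impressions DESC
"""

response = requests.post(DRUID_SQL_URL, json={"query": sql}, timeout=10)
for row in response.json():
    print(row["ad_id"], row["region"], row["impressions"])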
Why does Netflix use Druid (instead of just a data warehouse like Redshift or BigQuery)?
 
  • Real-time ingestion: Metrics appear in dashboards within seconds of the event
  • Sub-second queries: Advertiser reports and internal dashboards need speed
  • High concurrency: Supports thousands of parallel queries from internal tools
  • Flexible aggregation: Group by ad_id, region, time, device_type, etc. instantly
  • Schema-on-read & roll-ups: Druid efficiently stores pre-aggregated data for common metrics


Conclusion:

As Netflix shifted toward building its own advertising technology platform, the Ads Event Processing Pipeline evolved into a thoughtfully architected, modular telemetry platform capable of handling diverse ad formats, scaling globally, and meeting the demands of both real-time analytics and long-term reporting.

This transformation was not just about infrastructure, but about platformization:
  • A centralized event publisher that abstracts complexity.
  • A unified schema that serves a wide range of downstream consumers.
  • A clean separation of concerns that allows rapid innovation in ad types without touching backend logic.
  • Observability and control, made possible through systems like Druid and the Metadata Registry Service.
With this evolution, Netflix ensured:
  • Scalability: Handling billions of ad events across regions and devices
  • Agility: Enabling quick integration of new ad formats like Pause/Display Ads
  • Partner alignment: Supporting vendor integrations with minimal friction
  • Business confidence: Powering frequency capping, billing, and reporting with precision
Here is the original Netflix blog

Monday, June 16, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 1)

As Netflix steps into the ad-supported streaming world, its engineering backbone must operate with the same level of scale, reliability, and precision that powers its core viewing experience. But building a system that can track billions of ad events like impressions, clicks, quartile views, across diverse devices and ad partners is anything but trivial.

In this two-part series, I’ll dissect the architecture behind Netflix’s Ads Event Processing Pipeline, inspired by their tech blog. Part 1 focuses on the early phase, from ad request initiation to the evolution of encrypted tokens and the eventual need for a Metadata Registry Service. You'll see how Netflix tackled key design challenges such as bandwidth optimization, cross-device compatibility, vendor flexibility, and observability at scale.

If you're interested in distributed systems, real-time telemetry, or just want to peek behind the curtain of Netflix’s ad tech, you’re in the right place.

Let’s dive in!

The Initial System (Pilot) At a Glance:

When a Netflix-supported device is ready to show an ad, it doesn’t pick an ad on its own. Instead, the system goes through the following high-level steps: 

  1. Client Ad Request: The device (TV/mobile/browser) sends a request to Netflix’s Ads Manager service when it reaches an ad slot.
  2. VAST Response from Microsoft Ad Server: Ads Manager forwards this request to Microsoft’s ad server, which returns a VAST XML. This contains the actual ad media URL, tracking pixels, and metadata like vendor IDs and macros.
  3. Encrypted Token Created: Instead of sending this raw VAST data to the device, Ads Manager:
    • Extracts relevant metadata (e.g., tracking URLs, event triggers, vendor IDs)
    • Encodes it using protobuf
    • Encrypts it into a compact structure called a token
    • Sends this token along with the actual ad video URL to the device
  4. Ad Playback and Event Emission: As the ad plays, the device emits telemetry events like impression, firstQuartile, click etc., attaching the token to each event.
  5. Event Processing via Kafka: These events are queued in Kafka, where the Ads Event Handler service reads them, decrypts the token, interprets the metadata, and fires the appropriate tracking URLs.
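As a rough illustration of step 3, here is a hedged Python sketch of how such a token could be assembled: pack the extracted metadata into a blob and encrypt it into an opaque string for the device. Netflix encodes with protobuf and manages its own keys; the JSON encoding and Fernet cipher below are stand-ins for illustration only.

import json
from cryptography.fernet import Fernet

# Stand-in for the metadata Ads Manager extracts from the VAST response.
metadata = {
    "vendor_ids": ["nielsen", "doubleverify"],
    "tracking": {
        "impression": "https://track.com/imp",
        "firstQuartile": "https://track.com/q1",
    },
    "macros": ["[TIMESTAMP]", "[DEVICE_TYPE]"],
}

key = Fernet.generate_key()                       # in practice a managed, rotated key
token = Fernet(key).encrypt(json.dumps(metadata).encode())

# The device receives this opaque token plus the ad media URL, plays the ad,
# and attaches the token to every telemetry event it emits.
print(len(token), "bytes of opaque token sent to the device")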




Ad Events flow


What is VAST?
VAST stands for Video Ad Serving Template. It’s an XML-based standard developed by the IAB (Interactive Advertising Bureau) to standardize the delivery of video ads from ad servers to video players (like Netflix's video player). In simpler terms:
  • VAST is a template that tells the video player:
    • What ad to play
    • Where to find it
    • How to track it (e.g., impressions, clicks, view progress)

Key Parts of a VAST Response:
  • Ad Media URL
    • Link to the video file
    • The player will use this to stream the ad.
<MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
</MediaFile>
  • Tracking Events
    • URLs the player should "ping" (via HTTP GET) when certain things happen:
      • Impression (ad started)
      • FirstQuartile (25% watched)
      • Midpoint (50% watched)
      • ThirdQuartile (75% watched)
      • Complete (100%)
      • Click (User clicked the ad)
<TrackingEvents>
  <Tracking event="impression">https://track.com/imp</Tracking>
  <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  ...
</TrackingEvents>
  • Click Tracking
    • URL to notify when the user clicks the ad.
    • Destination URL (where user is taken on click).
<VideoClicks>
  <ClickThrough>https://brand.com</ClickThrough>
  <ClickTracking>https://track.com/click</ClickTracking>
</VideoClicks>
  • Companion Ads (Optional)
    • Static banner or rich media that shows alongside the video ad
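Putting those pieces together, here is a small sketch of pulling the media URL, tracking events, and click-through out of a VAST document with Python's standard XML parser. The snippet below is a simplified, flattened VAST; real responses nest these elements inside Ad/InLine/Creative wrappers, which .iter() conveniently ignores.

import xml.etree.ElementTree as ET

vast_xml = """
<VAST version="3.0">
  <MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
  </MediaFile>
  <TrackingEvents>
    <Tracking event="impression">https://track.com/imp</Tracking>
    <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  </TrackingEvents>
  <VideoClicks>
    <ClickThrough>https://brand.com</ClickThrough>
    <ClickTracking>https://track.com/click</ClickTracking>
  </VideoClicks>
</VAST>
"""

root = ET.fromstring(vast_xml)
media_url = next(root.iter("MediaFile")).text.strip()                       # what to stream
tracking = {t.get("event"): t.text.strip() for t in root.iter("Tracking")}  # what to ping, and when
click_through = next(root.iter("ClickThrough")).text.strip()                # where a click lands

print(media_url, tracking, click_through)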

Why VAST?

  • Standardization: All ad servers speak the same language to video players
  • Tracking: Advertisers know when/how their ads were shown
  • Flexibility: Supports linear, companion, and interactive video ad formats
  • Easy integration: Netflix could plug in with Microsoft without building a custom API


Why Netflix chose protobuf and not JSON:
  • Token size matters on client devices: Protobuf is highly compact, and smaller tokens mean less overhead. For example, a JSON payload that’s 1 KB might only be 100–200 bytes in protobuf.
  • Performance: Ad events are high volume (billions per day). Protobuf parsing is much faster than JSON, both on backend and client.
  • Encryption efficiency: Netflix encrypted the protobuf payload to create the token. Encrypting a compact binary blob is faster and lighter than encrypting JSON text.
  • Schema Evolution and Compatibility: Protobuf supports versioning, optional/required fields and backward/forward compatibility. That means Netflix could evolve the ad metadata schema without breaking clients or backend systems. JSON lacks this.

  • Format: JSON is text-based; protobuf is binary (compact)
  • Size: JSON is larger; protobuf is much smaller
  • Parse performance: JSON is slower; protobuf is faster
  • Schema enforcement: JSON has none (free-form); protobuf has a strongly typed schema
  • Encryption impact: JSON is heavier (more bytes to encrypt); protobuf is lighter (smaller encrypted payload)
  • Tamper resistance: JSON is weaker (can be read/edited); protobuf is stronger (harder to interpret)
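As a back-of-the-envelope illustration of the size gap, the sketch below compares the same three fields as JSON text versus a fixed-width binary packing with Python's struct module. This is only a stand-in for protobuf's wire format (real protobuf needs a compiled schema), but it shows why a binary encoding shrinks the token.

import json
import struct

event = {"ad_id": 987654321, "event_type": 3, "ts_ms": 1717910340000}

as_json = json.dumps(event).encode()
# Fixed-width binary: unsigned 64-bit ad id, unsigned 8-bit event type,
# unsigned 64-bit timestamp (a crude stand-in for protobuf's encoding).
as_binary = struct.pack("<QBQ", event["ad_id"], event["event_type"], event["ts_ms"])

print(len(as_json), "bytes as JSON")      # ~60 bytes
print(len(as_binary), "bytes as binary")  # 17 bytes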


Why Kafka over other queues like RabbitMQ:
  • Massive Scale and Parallelism
    • Kafka can handle billions of events/day, easily.
    • Partitions allow horizontal scaling of consumers:
      • e.g., 100 partitions means up to 100 consumers processing in parallel.
    • RabbitMQ struggles when you need high fan-out and high throughput.
  • Event Replay and Reliability
    • Ad tracking must be durable (e.g., billing, audits)
    • Kafka stores messages on disk with configurable retention (e.g., 7 days)
    • If a service fails, it can re-read from the last committed offset, i.e., no data is lost.
    • RabbitMQ can lose messages unless you enable heavy persistence (with a perf hit).
  •  Built-in Partitioning and Ordering
    • Kafka guarantees message ordering per partition.
    • Ads Event Handler can partition by:
      • ad_id, device_id, etc.
    • This ensures all related events are processed in order, by the same consumer.
Topic: ad_events
Partitions: 100
Key: device_id (events from the same device always land in the same partition)
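A quick sketch of what keying by device_id looks like with a Python Kafka client (kafka-python here; the topic name and broker address are illustrative). Kafka hashes the key to pick the partition, so every event from one device lands on the same partition and is consumed in order.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

event = {"event_type": "firstQuartile", "ad_id": "ad789", "ts_ms": 1717910340000}
# Same key -> same partition -> ordered processing for this device's events.
producer.send("ad_events", key="device_123", value=event)
producer.flush()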


Problem in the Pilot System: The Token got too smart

In the pilot, the Ads Manager was:
  • Extracting all tracking metadata from the VAST (like URLs, vendor IDs, event types)
  • Packing all of it into a token (as protobuf)
  • Encrypting and sending this token to the client
  • The token was self-contained — everything the backend needed to fire tracking URLs was inside it
It worked at small scale, but it didn't scale:

As Netflix onboarded multiple ad partners (e.g., Nielsen, DoubleVerify, Microsoft, etc.), each vendor wanted:
  • Their own tracking logic.
  • Unique macros to be replaced in URLs
  • Complex conditional tracking rules
  • Custom logic based on device type, timing, campaign, etc.

What are these macros we are talking about?

Macros are placeholder variables in tracking URLs that get dynamically replaced with actual values at runtime, before the URL is fired.
They let ad vendors capture contextual information like:
  • The timestamp when an ad was viewed
  • The type of device
  • Whether the user clicked or skipped
  • The playback position, etc.
Example:

A vendor might provide a tracking URL like:

https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]
Here:
  • [TIMESTAMP] is a macro for when the ad started playing
  • [DEVICE_TYPE] is a macro for the kind of device (TV, phone, etc.)
Before this URL is fired, those macros must be resolved to actual values:

https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv
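Resolving those macros is essentially string substitution at fire time. A minimal sketch, handling only the bracket-style syntax and using made-up values, might look like this:

# Replace bracket-style placeholders with runtime values before firing the URL.
# Real systems also handle per-vendor syntaxes like %ts% and URL-encode values.
def resolve_macros(url, context):
    for macro, value in context.items():
        url = url.replace(f"[{macro}]", str(value))
    return url

template = "https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]"
print(resolve_macros(template, {"TIMESTAMP": 1717910340000, "DEVICE_TYPE": "tv"}))
# -> https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv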
Why Macros become a problem at scale:
  1. Inconsistent Syntax Across Vendors
    • Some used [TIMESTAMP], others used %ts%, or even custom placeholders.
    • Netflix had to implement logic per partner to resolve them correctly.
  2. Dynamic Values Grew
    • New use cases brought in more macros: region, episode title, ad ID, player version, UI language, etc
    • Supporting new macros meant:
      • Updating the token schema
      • Updating Ads Manager logic
      • Sometimes even updating clients
  3. Difficult to Test or Validate
    • Because everything was encrypted in the token, validating whether macros were substituted correctly was hard.
    • A bug in macro resolution could break tracking silently.

This meant the token payload was growing; more vendors, more metadata, more per-event config.
So, to summarize the problems:

  • Payload bloat: Token size increased → slower to transmit → harder to encrypt
  • Hard to evolve: Any schema change required new client logic and re-deployment
  • No dynamic behavior: Logic was fixed at ad decision time; couldn’t update later
  • Redundant processing: Every token had repeated config for the same vendors
  • Hard to debug/troubleshoot: No human-readable registry of how tokens should behave



How did Netflix solve the problem: Metadata Registry Service (MRS):

To overcome the scaling bottlenecks of self-contained tokens, Netflix introduced a new component: the Metadata Registry Service (MRS).

Its core idea was simple but powerful: separate the metadata from the token. Instead of encoding everything inside the token, Netflix now stores reusable metadata centrally in MRS, and the token merely references it via an ID.


How MRS works:
  1. Ads Manager Gets VAST: Ads Manager still fetches the VAST response from the ad decisioning system (e.g. Microsoft’s ad server).
  2. Metadata Extracted & Stored: Instead of encrypting all metadata into a token, Ads Manager:
    • Extracts metadata (tracking pixels, macros, event mappings, etc.)
    • Stores it in Metadata Registry Service (MRS) with a registry key
  3. Token Now Just an Id: The token now just contains:
    • The registry key
    • A small payload (e.g. dynamic fields like ad_start_time or session ID)
  4. Device Sends Events with Token: Devices emit telemetry events with this lightweight token.
  5. Backend Looks Up MRS: Ads Event Handler uses the registry key to:
    • Fetch the tracking metadata from MRS
    • Apply macros
    • Fire tracking URLs
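Conceptually, the backend half of this flow (steps 4 and 5) reduces to a registry lookup plus macro resolution. Here is a hedged sketch; decrypt_token, mrs_client, fire_url, and the field names are hypothetical placeholders, not Netflix's real interfaces.

# Hypothetical handler-side flow with MRS: the token carries only a registry
# key plus a small dynamic payload; the heavy tracking metadata lives in MRS.
def handle_ad_event(event, decrypt_token, mrs_client, fire_url):
    token = decrypt_token(event["token"])
    metadata = mrs_client.get(token["registry_key"])        # cached registry lookup
    runtime_values = {"[TIMESTAMP]": str(event["ts_ms"]),
                      "[DEVICE_TYPE]": event["device_type"]}
    for url in metadata["tracking_urls"].get(event["event_type"], []):
        for macro, value in runtime_values.items():          # resolve macros at fire time
            url = url.replace(macro, value)
        fire_url(url)                                         # HTTP GET to the vendor pixel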


Ad Events flow with MRS


So, here is a summary of what exactly changed:

  • Before: Token = all tracking logic inside → After: Token = just a reference ID
  • Before: Tracking metadata repeated often → After: Metadata stored once in the registry
  • Before: Hard to update or debug → After: Dynamic updates via the registry
  • Before: Schema tightly coupled to client → After: Centralized schema owned by the server



Benefits of MRS approach:
  1. Dynamic Metadata Updates
    • Vendor changes their URLs or macro format? Just update MRS.
    • No need to redeploy client or regenerate tokens.
  2. Improved Performance
    • Tokens are smaller, which means less encryption overhead and lower network cost.
    • Registry entries are cached, which makes lookups fast and efficient.
  3. Better Observability
    • MRS is a source of truth for all campaign metadata.
    • Easy to search, debug, and trace logic across vendors and campaigns.
  4. Reusability
    • Multiple campaigns can reference the same registry entry.
    • No duplication of tracking logic.
  5. Separation of Concerns
    • Client focuses on playback and telemetry.
    • Backend owns tracking logic and event firing.

* You might be thinking that there is no token logic on the client, i.e., the logic to fire tracking pixels was never on the client; the client just played the ad and emitted telemetry events along with the token. So why does client redeployment come into the picture even without MRS? In my view, the protobuf schema can become a bottleneck here: as the schema changes, client code also needs to change. With MRS, the client payload schema can be frozen.


In this first part of our deep dive into Netflix’s Ads Event Processing Pipeline, we traced the evolution from a tightly-coupled, token-heavy system to a more modular and scalable architecture powered by the Metadata Registry Service (MRS).

By decoupling metadata from tokens and centralizing macro resolution and tracking logic, Netflix unlocked:
  • Smaller, lighter tokens
  • Faster and safer iteration
  • Dynamic partner integration
  • Debuggability at scale
What started as a necessary fix for bloated tokens ended up becoming the foundation for a much more agile and partner-friendly ad tracking ecosystem.

But Netflix didn’t stop there!

In [Part 2], we’ll explore the next big leap: how Netflix brought ad decisioning in-house — giving them end-to-end control over targeting, creative rotation, pacing, and auction logic, and how this plugged seamlessly into the MRS-backed pipeline we just explored.

We will learn how they:
  • Built an Ad Decision Server (ADS) from scratch
  • Integrated it with the metadata pipeline
  • Enabled real-time, personalized ad serving on a global scale

Thanks a lot, and stay tuned: the story gets even more fascinating.