Monday, June 16, 2025

[Netflix Tech Blog] Inside Netflix's Ads Event Engine: A Deep Dive (Part 1)

As Netflix steps into the ad-supported streaming world, its engineering backbone must operate with the same level of scale, reliability, and precision that powers its core viewing experience. But building a system that can track billions of ad events, such as impressions, clicks, and quartile views, across diverse devices and ad partners is anything but trivial.

In this two-part series, I’ll dissect the architecture behind Netflix’s Ads Event Processing Pipeline, inspired by their tech blog. Part 1 focuses on the early phase, from ad request initiation to the evolution of encrypted tokens and the eventual need for a Metadata Registry Service. You'll see how Netflix tackled key design challenges such as bandwidth optimization, cross-device compatibility, vendor flexibility, and observability at scale.

If you're interested in distributed systems, real-time telemetry, or just want to peek behind the curtain of Netflix’s ad tech, you’re in the right place.

Let’s dive in!

The Initial System (Pilot) At a Glance:

When a Netflix-supported device is ready to show an ad, it doesn’t pick an ad on its own. Instead, the system goes through the following high-level steps: 

  1. Client Ad Request: The device (TV/mobile/browser) sends a request to Netflix’s Ads Manager service when it reaches an ad slot.
  2. VAST Response from Microsoft Ad Server: Ads Manager forwards this request to Microsoft's ad server, which returns a VAST XML response. This contains the actual ad media URL, tracking pixels, and metadata like vendor IDs and macros.
  3. Encrypted Token Created: Instead of sending this raw VAST data to the device, Ads Manager:
    • Extracts relevant metadata (e.g., tracking URLs, event triggers, vendor IDs)
    • Encodes it using protobuf
    • Encrypts it into a compact structure called a token
    • Sends this token along with the actual ad video URL to the device
  4. Ad Playback and Event Emission: As the ad plays, the device emits telemetry events such as impression, firstQuartile, and click, attaching the token to each event.
  5. Event Processing via Kafka: These events are queued in Kafka, where the Ads Event Handler service reads them, decrypts the token, interprets the metadata, and fires the appropriate tracking URLs.
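
To make step 3 a bit more concrete, here is a minimal Python sketch of the encrypt-and-decrypt idea behind the token. The protobuf bytes are represented by a placeholder, and Fernet (from the cryptography package) stands in for whatever encryption scheme Netflix actually uses; none of these names come from the original post.

from cryptography.fernet import Fernet

def build_token(metadata_bytes: bytes, key: bytes) -> str:
    # Encrypt the compact binary metadata into an opaque string the client
    # can attach to telemetry events without being able to read it.
    return Fernet(key).encrypt(metadata_bytes).decode("ascii")

def read_token(token: str, key: bytes) -> bytes:
    # Backend side (Ads Event Handler): recover the protobuf bytes.
    return Fernet(key).decrypt(token.encode("ascii"))

key = Fernet.generate_key()
fake_metadata = b"\x0a\x0bimpressions"   # stand-in for protobuf-encoded metadata
token = build_token(fake_metadata, key)
assert read_token(token, key) == fake_metadata

The important property is that the device never interprets the token; it just echoes it back with every telemetry event.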




Ad Events flow


What is VAST?
VAST stands for Video Ad Serving Template. It's an XML-based standard developed by the IAB (Interactive Advertising Bureau) to standardize the delivery of video ads from ad servers to video players (like Netflix's video player). In simpler terms, VAST is a template that tells the video player:
  • What ad to play
  • Where to find it
  • How to track it (e.g., impressions, clicks, view progress)

Key Parts of a VAST Response:
  • Ad Media URL
    • Link to the video file
    • The player will use this to stream the ad.
<MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
</MediaFile>
  • Tracking Events
    • URLs the player should "ping" (via HTTP GET) when certain things happen:
      • Impression (ad started)
      • FirstQuartile (25% watched)
      • Midpoint (50% watched)
      • ThirdQuartile (75% watched)
      • Complete (100%)
      • Click (User clicked the ad)
<TrackingEvents>
  <Tracking event="impression">https://track.com/imp</Tracking>
  <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  ...
</TrackingEvents>
  • Click Tracking
    • URL to notify when the user clicks the ad.
    • Destination URL (where user is taken on click).
<VideoClicks>
  <ClickThrough>https://brand.com</ClickThrough>
  <ClickTracking>https://track.com/click</ClickTracking>
</VideoClicks>
  • Companion Ads (Optional)
    • Static banner or rich media that shows alongside the video ad
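
To see how these pieces get pulled out in practice, here is a small Python sketch that parses a simplified VAST fragment with the standard library. Real VAST documents nest these elements under <VAST><Ad><InLine><Creatives>..., so treat this as an illustration of the fields above, not a production parser.

import xml.etree.ElementTree as ET

vast = """
<Ad>
  <MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
  </MediaFile>
  <TrackingEvents>
    <Tracking event="impression">https://track.com/imp</Tracking>
    <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  </TrackingEvents>
  <VideoClicks>
    <ClickThrough>https://brand.com</ClickThrough>
    <ClickTracking>https://track.com/click</ClickTracking>
  </VideoClicks>
</Ad>
"""

root = ET.fromstring(vast)
media_url = root.find("MediaFile").text.strip()             # ad media URL
tracking = {t.get("event"): t.text.strip()                  # event -> tracking pixel
            for t in root.findall(".//Tracking")}
click_through = root.find(".//ClickThrough").text.strip()   # click destination

print(media_url)       # https://adserver.com/path/to/ad.mp4
print(tracking)        # {'impression': ..., 'firstQuartile': ...}
print(click_through)   # https://brand.com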

Why VAST?

Purpose          | How it helps
Standardization  | All ad servers speak the same language to video players
Tracking         | Advertisers know when/how their ads were shown
Flexibility      | Supports linear, companion, and interactive video ad formats
Easy integration | Netflix could plug in with Microsoft without building a custom API


Why Netflix chose Protobuf and not JSON:
  • Token size matters on client devices: Protobuf is highly compact, and smaller tokens mean less overhead. For example, a JSON payload that's 1 KB might only be 100–200 bytes in protobuf.
  • Performance: Ad events are high volume (billions per day). Protobuf parsing is much faster than JSON, on both the backend and the client.
  • Encryption efficiency: Netflix encrypted the protobuf payload to create the token. Encrypting a compact binary blob is faster and lighter than encrypting JSON text.
  • Schema Evolution and Compatibility: Protobuf supports versioning, optional/required fields, and backward/forward compatibility. That means Netflix could evolve the ad metadata schema without breaking clients or backend systems. JSON has no built-in equivalent.

Feature            | JSON                            | Protobuf
Format             | Text-based                      | Binary (compact)
Size               | Larger                          | Much smaller
Parse performance  | Slower                          | Faster
Schema enforcement | No (free-form)                  | Yes (strongly typed schema)
Encryption impact  | Heavier (more bytes to encrypt) | Lighter (smaller encrypted payload)
Tamper resistance  | Weaker (can be read/edited)     | Stronger (harder to interpret)


Why Kafka over other queues like RabbitMQ:
  • Massive Scale and Parallelism
    • Kafka can handle billions of events/day, easily.
    • Partitions allow horizontal scaling of consumers:
      • e.g., 100 partitions means up to 100 consumers working in parallel, i.e., massive parallel processing.
    • RabbitMQ struggles when you need high fan-out and high throughput.
  • Event Replay and Reliability
    • Ad tracking must be durable (e.g., billing, audits)
    • Kafka stores messages on disk with configurable retention (e.g., 7 days)
    • If a service fails, it can re-read from its last committed offset, so no data is lost.
    • RabbitMQ can lose messages unless you enable heavy persistence (with a performance hit).
  •  Built-in Partitioning and Ordering
    • Kafka guarantees message ordering per partition.
    • Ads Event Handler can partition by:
      • ad_id, device_id, etc.
    • This ensures all related events are processed in order, by the same consumer.
Topic: ad_events
Partitions: 100
Key: device_id (events from the same device always land in the same partition)
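
Here's a minimal producer-side sketch of that keying idea, using the kafka-python client. The broker address, topic name, and event shape are illustrative assumptions, not details from the post.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed local broker
    key_serializer=str.encode,                        # device_id -> bytes
    value_serializer=lambda v: json.dumps(v).encode()
)

event = {"event": "firstQuartile", "token": "<encrypted-token>", "ts": 1717910340000}

# Keying by device_id means every event from this device hashes to the same
# partition, so a single consumer sees them in order.
producer.send("ad_events", key="device-123", value=event)
producer.flush()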


Problem in the Pilot System: The Token got too smart

In the pilot, the Ads Manager was:
  • Extracting all tracking metadata from the VAST (like URLs, vendor IDs, event types)
  • Packing all of it into a token (as protobuf)
  • Encrypting and sending this token to the client
  • The token was self-contained — everything the backend needed to fire tracking URLs was inside it
It worked at small scale, but it didn't scale:

As Netflix onboarded multiple ad partners (e.g., Nielsen, DoubleVerify, Microsoft, etc.), each vendor wanted:
  • Their own tracking logic.
  • Unique macros to be replaced in URLs
  • Complex conditional tracking rules
  • Custom logic based on device type, timing, campaign, etc.

What are these macros we're talking about?

Macros are placeholder variables in tracking URLs that get dynamically replaced with actual values at runtime, before the URL is fired.
They let ad vendors capture contextual information like:
  • The timestamp when an ad was viewed
  • The type of device
  • Whether the user clicked or skipped
  • The playback position, etc.
Example:

A vendor might provide a tracking URL like:

https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]
Here:
  • [TIMESTAMP] is a macro for when the ad started playing
  • [DEVICE_TYPE] is a macro for the kind of device (TV, phone, etc.)
Before this URL is fired, those macros must be resolved to actual values:

https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv
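
Conceptually, macro resolution is just a substitution pass over the URL before it is fired. Here is a minimal Python sketch that only handles the bracket-style [MACRO] syntax shown above; a vendor using %ts% instead would need its own pattern, which is exactly the problem described next.

import re

def resolve_macros(url: str, values: dict) -> str:
    # Replace [MACRO] placeholders with runtime values; leave unknown macros untouched.
    return re.sub(r"\[([A-Z_]+)\]",
                  lambda m: values.get(m.group(1), m.group(0)), url)

url = ("https://track.vendor.com/pixel?event=impression"
       "&ts=[TIMESTAMP]&device=[DEVICE_TYPE]")
print(resolve_macros(url, {"TIMESTAMP": "1717910340000", "DEVICE_TYPE": "tv"}))
# -> https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv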
Why Macros become a problem at scale:
  1. Inconsistent Syntax Across Vendors
    • Some used [TIMESTAMP], others used %ts%, or even custom placeholders.
    • Netflix had to implement logic per partner to resolve them correctly.
  2. Dynamic Values Grew
    • New use cases brought in more macros: region, episode title, ad ID, player version, UI language, etc
    • Supporting new macros meant:
      • Updating the token schema
      • Updating Ads Manager logic
      • Sometimes even updating clients
  3. Difficult to Test or Validate
    • Because everything was encrypted in the token, validating whether macros were substituted correctly was hard.
    • A bug in macro resolution could break tracking silently.

This meant the token payload kept growing: more vendors, more metadata, more per-event config.
To summarize the problems, here is the table:

Problem                    | Why it hurt
Payload bloat              | Token size increased → slower to transmit → harder to encrypt
Hard to evolve             | Any schema change required new client logic and re-deployment
No dynamic behavior        | Logic was fixed at ad decision time; couldn't update later
Redundant processing       | Every token had repeated config for the same vendors
Hard to debug/troubleshoot | No human-readable registry of how tokens should behave



How did Netflix solve the problem: Metadata Registry Service (MRS):

To overcome the scaling bottlenecks of self-contained tokens, Netflix introduced a new component: the Metadata Registry Service (MRS).

Its core idea was simple but powerful: separate the metadata from the token. Instead of encoding everything inside the token, Netflix now stores reusable metadata centrally in MRS, and the token merely references it via an ID.


How MRS works:
  1. Ads Manager Gets VAST: Ads Manager still fetches the VAST response from the ad decisioning system (e.g. Microsoft’s ad server).
  2. Metadata Extracted & Stored: Instead of encrypting all metadata into a token, Ads Manager:
    • Extracts metadata (tracking pixels, macros, event mappings, etc.)
    • Stores it in Metadata Registry Service (MRS) with a registry key
  3. Token Now Just an ID: The token now contains only:
    • The registry key
    • A small payload (e.g. dynamic fields like ad_start_time or session ID)
  4. Device Sends Events with Token: Devices emit telemetry events with this lightweight token.
  5. Backend Looks Up MRS: Ads Event Handler uses the registry key to:
    • Fetch the tracking metadata from MRS
    • Apply macros
    • Fire tracking URLs
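
Putting steps 3 to 5 together, here is an in-memory Python sketch of what the Ads Event Handler side might look like. The registry contents, token shape, and names are illustrative assumptions, not Netflix's actual API.

# Stand-in for MRS: reusable tracking metadata stored once, referenced by key.
REGISTRY = {
    "reg-key-42": {
        "impression": "https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]",
        "firstQuartile": "https://track.vendor.com/pixel?event=q1&ts=[TIMESTAMP]&device=[DEVICE_TYPE]",
    }
}

def handle_event(event_name: str, token: dict) -> str:
    # 1. Resolve the lightweight token to the shared metadata stored in MRS.
    metadata = REGISTRY[token["registry_key"]]
    url = metadata[event_name]
    # 2. Apply the per-session macro values carried in the small token payload.
    for macro, value in token["payload"].items():
        url = url.replace(f"[{macro}]", str(value))
    # 3. In the real handler, this resolved URL would now be fired via HTTP GET.
    return url

token = {"registry_key": "reg-key-42",
         "payload": {"TIMESTAMP": 1717910340000, "DEVICE_TYPE": "tv"}}
print(handle_event("impression", token))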


Ad Events flow with MRS


To summarize what exactly changed:

Before MRS                        | After MRS
Token = all tracking logic inside | Token = just a reference ID
Tracking metadata repeated often  | Metadata stored once in the registry
Hard to update or debug           | Dynamic updates via the registry
Schema tightly coupled to client  | Centralized schema owned by the server



Benefits of MRS approach:
  1. Dynamic Metadata Updates
    • Vendor changes their URLs or macro format? Just update MRS.
    • No need to redeploy client or regenerate tokens.
  2. Improved Performance
    • Smaller tokens mean less encryption overhead and lower network cost.
    • Registry entries can be cached, making lookups fast and efficient.
  3. Better Observability
    • MRS is a source of truth for all campaign metadata.
    • Easy to search, debug, and trace logic across vendors and campaigns.
  4. Reusability
    • Multiple campaigns can reference the same registry entry.
    • No duplication of tracking logic.
  5. Separation of Concerns
    • Client focuses on playback and telemetry.
    • Backend owns tracking logic and event firing.

* You might be wondering why client redeployment comes into the picture at all, since the logic to fire tracking pixels was never on the client; the client just played the ad and emitted telemetry events along with the token. In my view, the protobuf token schema was the bottleneck here: whenever that schema changed, the client code had to change with it. With MRS, the client-facing payload schema can effectively be frozen.


In this first part of our deep dive into Netflix’s Ads Event Processing Pipeline, we traced the evolution from a tightly-coupled, token-heavy system to a more modular and scalable architecture powered by the Metadata Registry Service (MRS).

By decoupling metadata from tokens and centralizing macro resolution and tracking logic, Netflix unlocked:
  • Smaller, lighter tokens
  • Faster and safer iteration
  • Dynamic partner integration
  • Debuggability at scale
What started as a necessary fix for bloated tokens ended up becoming the foundation for a much more agile and partner-friendly ad tracking ecosystem.

But Netflix didn’t stop there!

In [Part 2], we'll explore the next big leap: how Netflix brought ad decisioning in-house, giving it end-to-end control over targeting, creative rotation, pacing, and auction logic, and how this plugged seamlessly into the MRS-backed pipeline we just explored.

We will learn how they:
  • Built an Ad Decision Server (ADS) from scratch
  • Integrated it with the metadata pipeline
  • Enabled real-time, personalized ad serving on a global scale

Thanks a lot, and stay tuned; the story gets even more fascinating.


