As Netflix steps into the ad-supported streaming world, its engineering backbone must operate with the same level of scale, reliability, and precision that powers its core viewing experience. But building a system that can track billions of ad events (impressions, clicks, quartile views) across diverse devices and ad partners is anything but trivial.
In this two-part series, I’ll dissect the architecture behind Netflix’s Ads Event Processing Pipeline, inspired by their tech blog. Part 1 focuses on the early phase, from ad request initiation to the evolution of encrypted tokens and the eventual need for a Metadata Registry Service. You'll see how Netflix tackled key design challenges such as bandwidth optimization, cross-device compatibility, vendor flexibility, and observability at scale.
If you're interested in distributed systems, real-time telemetry, or just want to peek behind the curtain of Netflix’s ad tech, you’re in the right place.
Let’s dive in!
The Initial System (Pilot) At a Glance:
When a Netflix-supported device is ready to show an ad, it doesn’t pick an ad on its own. Instead, the system goes through the following high-level steps:
- Client Ad Request: The device (TV/mobile/browser) sends a request to Netflix’s Ads Manager service when it reaches an ad slot.
- VAST Response from Microsoft Ad Server: Ads Manager forwards this request to Microsoft’s ad server, which returns a VAST XML response. This contains the actual ad media URL, tracking pixels, and metadata like vendor IDs and macros.
- Encrypted Token Created: Instead of sending this raw VAST data to the device, Ads Manager:
- Extracts relevant metadata (e.g., tracking URLs, event triggers, vendor IDs)
- Encodes it using protobuf
- Encrypts it into a compact structure called a token
- Sends this token along with the actual ad video URL to the device (see the sketch after this list)
- Ad Playback and Event Emission: As the ad plays, the device emits telemetry events such as impression, firstQuartile, and click, attaching the token to each event.
- Event Processing via Kafka: These events are queued in Kafka, where the Ads Event Handler service reads them, decrypts the token, interprets the metadata, and fires the appropriate tracking URLs.
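To make the token-creation step concrete, here's a minimal Python sketch of what Ads Manager conceptually does. This is an illustration under assumptions, not Netflix's actual code: the field names are hypothetical, JSON stands in for the protobuf encoding Netflix uses, and Fernet (from the cryptography package) stands in for whatever encryption scheme they actually chose.

```python
import json
from cryptography.fernet import Fernet

# Hypothetical metadata pulled out of the VAST response (illustrative fields only).
ad_metadata = {
    "ad_id": "ad-12345",
    "vendor_ids": ["vendorA", "vendorB"],
    "tracking": {
        "impression": "https://track.com/imp",
        "firstQuartile": "https://track.com/q1",
        "click": "https://track.com/click",
    },
}

# In production the key would come from a key-management service.
cipher = Fernet(Fernet.generate_key())

# 1. Encode the metadata (JSON stands in for protobuf in this sketch).
payload = json.dumps(ad_metadata).encode("utf-8")

# 2. Encrypt it into an opaque token the device can carry but not read or tamper with.
token = cipher.encrypt(payload)

# 3. Return the token alongside the ad media URL to the device.
response_to_device = {
    "media_url": "https://adserver.com/path/to/ad.mp4",
    "token": token.decode("ascii"),
}
print(response_to_device)
```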
- VAST (Video Ad Serving Template) is an XML template that tells the video player:
- What ad to play
- Where to find it
- How to track it (e.g., impressions, clicks, view progress)
- Ad Media URL
- Link to the video file
- The player will use this to stream the ad.
<MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
https://adserver.com/path/to/ad.mp4
</MediaFile>
- Tracking Events
- URLs the player should "ping" (via HTTP GET) when certain things happen:
- Impression (ad started)
- FirstQuartile (25% watched)
- Midpoint (50% watched)
- ThirdQuartile (75% watched)
- Complete (100%)
- Click (User clicked the ad)
<TrackingEvents>
<Tracking event="impression">https://track.com/imp</Tracking>
<Tracking event="firstQuartile">https://track.com/q1</Tracking>
...
</TrackingEvents>
- Click Tracking
- URL to notify when the user clicks the ad.
- Destination URL (where user is taken on click).
<VideoClicks>
<ClickThrough>https://brand.com</ClickThrough>
<ClickTracking>https://track.com/click</ClickTracking>
</VideoClicks>
- Companion Ads (Optional)
- Static banner or rich media that shows alongside the video ad
Why VAST?
Purpose | How it helps |
---|---|
Standardization | All ad servers speak the same language to video players |
Tracking | Advertisers know when/how their ads were shown |
Flexibility | Supports linear, companion, and interactive video ad formats |
Easy integration | Netflix could plug in with Microsoft without building a custom API |
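To ground the VAST structure above, here is a small Python sketch that pulls the media URL, tracking URLs, and click-through out of a VAST-style document using only the standard library. The XML is a simplified fragment; real VAST responses wrap these elements in Ad/InLine/Creatives layers.

```python
import xml.etree.ElementTree as ET

# Simplified VAST-style fragment (real responses nest these more deeply).
vast_xml = """
<VAST version="3.0">
  <MediaFile delivery="progressive" type="video/mp4" width="640" height="360">
    https://adserver.com/path/to/ad.mp4
  </MediaFile>
  <TrackingEvents>
    <Tracking event="impression">https://track.com/imp</Tracking>
    <Tracking event="firstQuartile">https://track.com/q1</Tracking>
  </TrackingEvents>
  <VideoClicks>
    <ClickThrough>https://brand.com</ClickThrough>
    <ClickTracking>https://track.com/click</ClickTracking>
  </VideoClicks>
</VAST>
"""

root = ET.fromstring(vast_xml)

media_url = root.find("MediaFile").text.strip()             # what to play
tracking = {t.get("event"): t.text.strip()                  # what to ping, and when
            for t in root.findall(".//Tracking")}
click_through = root.find(".//ClickThrough").text.strip()   # where a click leads

print(media_url)
print(tracking)
print(click_through)
```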
Why protobuf for the token?
- Token size matters on client devices: Protobuf is highly compact, and smaller tokens mean less overhead. For example, a JSON payload that’s 1 KB might only be 100–200 bytes in protobuf.
- Performance: Ad events are high volume (billions per day). Protobuf parsing is much faster than JSON, both on backend and client.
- Encryption efficiency: Netflix encrypted the protobuf payload to create the token. Encrypting a compact binary blob is faster and lighter than encrypting JSON text.
- Schema Evolution and Compatibility: Protobuf supports versioning, optional/required fields, and backward/forward compatibility. That means Netflix could evolve the ad metadata schema without breaking clients or backend systems. Plain JSON offers none of this out of the box.
Feature | JSON | Protobuf |
---|---|---|
Format | Text-based | Binary (compact) |
Size | Larger | Much smaller |
Parse performance | Slower | Faster |
Schema enforcement | No (free-form) | Yes (strongly typed schema) |
Encryption impact | Heavier (more bytes to encrypt) | Lighter (smaller encrypted payload) |
Tamper resistance | Weaker (can be read/edited) | Stronger (harder to interpret) |
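As a rough illustration of the text-versus-binary gap in the table above, the sketch below serializes the same ad event as JSON and as a packed binary record. Python's struct module is used purely as a stand-in for protobuf's wire format, and the field layout is made up; real protobuf savings depend on the schema, but the direction of the difference is the point.

```python
import json
import struct

# The same ad event, represented two ways (illustrative fields).
event = {"ad_id": 12345, "event_type": 1, "timestamp_ms": 1717910340000, "device_type": 2}

# Text encoding: every field name, quote, and digit costs bytes.
as_json = json.dumps(event).encode("utf-8")

# Binary encoding (struct as a stand-in for protobuf):
# 4-byte ad id, 1-byte event type, 8-byte timestamp, 1-byte device type.
as_binary = struct.pack(
    "<IBQB",
    event["ad_id"], event["event_type"], event["timestamp_ms"], event["device_type"],
)

print(len(as_json))    # ~80 bytes of text
print(len(as_binary))  # 14 bytes of binary
```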
Why Kafka (and not a traditional queue like RabbitMQ)?
- Massive Scale and Parallelism
- Kafka can handle billions of events/day, easily.
- Partitions allow horizontal scaling of consumers:
- e.g., 100 partitions can be consumed by up to 100 consumers in a group, i.e. massive parallel processing.
- RabbitMQ struggles when you need high fan-out and high throughput.
- Event Replay and Reliability
- Ad tracking must be durable (e.g., billing, audits)
- Kafka stores messages on disk with configurable retention (e.g., 7 days)
- If a service fails, it can re-read from its last committed offset, so no data is lost.
- RabbitMQ can lose messages unless you enable heavy persistence (at a performance cost).
- Built-in Partitioning and Ordering
- Kafka guarantees message ordering per partition.
- Ads Event Handler can partition by:
- ad_id, device_id, etc.
- This ensures all related events are processed in order, by the same consumer.
Topic: ad_events
Partitions: 100
Key: device_id (events from the same device always land in the same partition)
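Below is a minimal sketch of what that partitioning strategy looks like in code, using the kafka-python client; the broker address, topic name, and consumer group are placeholders, and the real Ads Event Handler is of course far more involved.

```python
from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"  # placeholder broker address

# Producer side: keying by device_id hashes every event from one device
# to the same partition, so its events stay in order.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(
    "ad_events",
    key=b"device-42",
    value=b'{"event": "firstQuartile", "token": "..."}',
)
producer.flush()

# Consumer side: one consumer group; with 100 partitions, up to 100
# instances of the Ads Event Handler can read in parallel.
consumer = KafkaConsumer(
    "ad_events",
    bootstrap_servers=BROKERS,
    group_id="ads-event-handler",
    enable_auto_commit=False,      # commit only after tracking URLs are fired
    auto_offset_reset="earliest",  # where to start if no committed offset exists
)
for record in consumer:
    # decrypt token, resolve metadata, fire tracking URLs...
    print(record.partition, record.key, record.value)
    consumer.commit()
```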
In the pilot design, Ads Manager did all of the heavy lifting up front:
- Extracting all tracking metadata from the VAST (like URLs, vendor IDs, event types)
- Packing all of it into a token (as protobuf)
- Encrypting and sending this token to the client
- The token was self-contained — everything the backend needed to fire tracking URLs was inside it
Each third-party tracking vendor, however, brought:
- Their own tracking logic
- Unique macros to be replaced in URLs
- Complex conditional tracking rules
- Custom logic based on device type, timing, campaign, etc.
Many of these tracking URLs contain macros: placeholders that are replaced at event time with dynamic values such as:
- The timestamp when an ad was viewed
- The type of device
- Whether the user clicked or skipped
- The playback position, etc.
For example:
https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]
- [TIMESTAMP] is a macro for when the ad started playing
- [DEVICE_TYPE] is a macro for the kind of device (TV, phone, etc.)
After substitution, the URL that actually gets fired looks like:
https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv
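Macro resolution itself is just string substitution. Here's a minimal Python sketch matching the example above; the function name and event fields are illustrative, not Netflix's. In the pilot, the equivalent of this substitution table lived inside the encrypted token.

```python
def resolve_macros(url_template: str, event: dict) -> str:
    """Replace known macros in a tracking URL with values from the event."""
    substitutions = {
        "[TIMESTAMP]": str(event["timestamp_ms"]),
        "[DEVICE_TYPE]": event["device_type"],
    }
    for macro, value in substitutions.items():
        url_template = url_template.replace(macro, value)
    return url_template

event = {"timestamp_ms": 1717910340000, "device_type": "tv"}
template = "https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]&device=[DEVICE_TYPE]"

print(resolve_macros(template, event))
# https://track.vendor.com/pixel?event=impression&ts=1717910340000&device=tv
```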
Macros quickly became a pain point:
- Inconsistent Syntax Across Vendors
- Some used [TIMESTAMP], others used %ts%, or even custom placeholders.
- Netflix had to implement logic per partner to resolve them correctly.
- Dynamic Values Grew
- New use cases brought in more macros: region, episode title, ad ID, player version, UI language, etc.
- Supporting new macros meant:
- Updating the token schema
- Updating Ads Manager logic
- Sometimes even updating clients
- Difficult to Test or Validate
- Because everything was encrypted in the token, validating whether macros were substituted correctly was hard.
- A bug in macro resolution could break tracking silently.
Problem | Why it hurt |
---|---|
Payload bloat | Token size increased → slower to transmit → harder to encrypt |
Hard to evolve | Any schema change required new client logic and re-deployment |
No dynamic behavior | Logic was fixed at ad decision time; couldn’t update later |
Redundant processing | Every token had repeated config for the same vendors |
Hard to debug/troubleshoot | No human-readable registry of how tokens should behave |
Enter the Metadata Registry Service (MRS). The new flow looks like this:
- Ads Manager Gets VAST: Ads Manager still fetches the VAST response from the ad decisioning system (e.g. Microsoft’s ad server).
- Metadata Extracted & Stored: Instead of encrypting all metadata into a token, Ads Manager:
- Extracts metadata (tracking pixels, macros, event mappings, etc.)
- Stores it in Metadata Registry Service (MRS) with a registry key
- Token Now Just an Id: The token now just contains:
- The registry key
- A small payload (e.g. dynamic fields like ad_start_time or a session ID)
- Device Sends Events with Token: Devices emit telemetry events with this lightweight token.
- Backend Looks Up MRS: Ads Event Handler uses the registry key to:
- Fetch the tracking metadata from MRS
- Apply macros
- Fire tracking URLs
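Here's a minimal sketch of the redesigned flow, with an in-memory dict standing in for the Metadata Registry Service and hypothetical field names throughout; the real MRS is a separate service with its own storage, API, and caching.

```python
from uuid import uuid4

# --- Metadata Registry Service (in-memory stand-in) ---
registry = {}

def register_metadata(metadata: dict) -> str:
    """Ads Manager stores the extracted VAST metadata once and gets back a registry key."""
    registry_key = str(uuid4())
    registry[registry_key] = metadata
    return registry_key

# --- Ads Manager, at ad-decision time ---
registry_key = register_metadata({
    "vendor": "vendorA",
    "tracking": {"impression": "https://track.vendor.com/pixel?event=impression&ts=[TIMESTAMP]"},
})

# The token is now just a reference plus a small dynamic payload.
token = {"registry_key": registry_key, "session_id": "abc123"}

# --- Ads Event Handler, when the device reports an event ---
def handle_event(event_name: str, token: dict, timestamp_ms: int) -> None:
    metadata = registry[token["registry_key"]]            # look up MRS
    url = metadata["tracking"][event_name]
    url = url.replace("[TIMESTAMP]", str(timestamp_ms))   # apply macros
    print("firing:", url)                                 # fire the tracking URL

handle_event("impression", token, 1717910340000)
```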
Before MRS | After MRS |
---|---|
Token = All tracking logic inside | Token = Just a reference id |
Tracking metadata repeated often | Metadata stored once in registry |
Hard to update or debug | Dynamic updates via registry |
Schema tightly coupled to client | Centralized schema owned by server |
- Dynamic Metadata Updates
- Vendor changes their URLs or macro format? Just update MRS.
- No need to redeploy client or regenerate tokens.
- Improved Performance
- Smaller tokens mean less encryption overhead and lower network cost.
- Registry entries can be cached, which keeps lookups fast and efficient (see the caching sketch after this list).
- Better Observability
- MRS is a source of truth for all campaign metadata.
- Easy to search, debug, and trace logic across vendors and campaigns.
- Reusability
- Multiple campaigns can reference the same registry entry.
- No duplication of tracking logic.
- Separation of Concerns
- Client focuses on playback and telemetry.
- Backend owns tracking logic and event firing.
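As promised above, here's a tiny sketch of the caching idea: registry entries change rarely, so the event handler can memoize lookups. functools.lru_cache is just one convenient way to express this in Python; a production service would more likely use a TTL-based cache so metadata updates propagate.

```python
from functools import lru_cache

def fetch_from_mrs(registry_key: str) -> dict:
    """Pretend network call to the Metadata Registry Service."""
    print(f"MRS lookup for {registry_key}")
    return {"tracking": {"impression": "https://track.com/imp"}}

@lru_cache(maxsize=10_000)
def get_metadata(registry_key: str) -> dict:
    # First call for a key hits MRS; repeats are served from memory.
    return fetch_from_mrs(registry_key)

get_metadata("campaign-42")  # prints "MRS lookup for campaign-42"
get_metadata("campaign-42")  # cached: no second lookup
```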
The MRS redesign gave Netflix:
- Smaller, lighter tokens
- Faster and safer iteration
- Dynamic partner integration
- Debuggability at scale
In Part 2, we'll look at how Netflix:
- Built an Ad Decision Server (ADS) from scratch
- Integrated it with the metadata pipeline
- Enabled real-time, personalized ad serving on a global scale