In Part 1, we explored how Netflix moved from a bloated token model to a more elegant, scalable architecture using a centralized Metadata Registry Service (MRS). That shift allowed Netflix to scale its ad event tracking system across vendors, while keeping client payloads small and observability high.
But Netflix wasn’t done yet.
Netflix built its own ad tech platform, the Ad Decision Server:
Netflix made a strategic bet to build an in-house advertising technology platform, including its own Ad Decision Server (ADS). This was about more than just cost or speed; it was about control, flexibility, and future-proofing. The effort began with a clear inventory of business and technical use cases:
- In-house frequency capping
- Pricing and billing for impression events
- Robust campaign reporting and analytics
- Scalable, multi-vendor tracking via an event handler
- Match or exceed the capabilities of existing third-party solutions
- Rapidly support new ad formats like Pause/Display ads
- Enable advanced features like frequency capping and billing events
- Consolidate and centralize telemetry pipelines across use cases
Road to a versatile and unified pipeline:
Netflix recognized that future ad types like Pause/Display ads would introduce new kinds of telemetry. In other words, the upstream telemetry pipelines might vary, but the downstream use cases remain largely consistent.
So what is upstream and what is downstream in Netflix's event processing context?
Upstream (Producers of telemetry / Events): This refers to the systems and sources that generate or send ad-related telemetry data into the event processing pipeline.
Examples of upstream in this context:
- Devices (phones, TVs, browsers)
- Different ad types (e.g., Video Ads, Display Ads, Pause Ads)
- Ad media players
- Ad servers
These sources emit events such as:
- Ad Started
- Ad Paused
- Ad Clicked
- A token carrying ad metadata
And they may:
- Use different formats
- Have different logging behavior
- Be customized per ad type or region
Downstream (Consumers of processed data): This refers to the systems that consume the processed ad events for various business use cases.
Examples of downstream in this context:
- Frequency capping service
- Billing and pricing engine
- Reporting dashboards for advertisers
- Analytical systems (like Druid)
- Ad session generators
- External vendor tracking
These consumers:
- Expect standardized, enriched, reliable event data
- Rely on a consistent schema
- Need latency and correctness guarantees
So "upstream telemetry pipelines might vary, but downstream use‑cases remain largely consistent" means:
- Even if ad events come from diverse sources, with different formats and behaviors (upstream variation),
- The core processing needs (like billing, metrics, sessionization, and reporting) remain the same (downstream consistency).
Based on that, Netflix outlined a vision:
- A centralized ad event collection service that:
- Decrypts tokens
- Enriches metadata (via the Registry)
- Hashes identifiers
- Publishes a unified data contract, agnostic of ad type or server
- All consumers (for reporting, capping, billing) read from this central service, ensuring clean separation and consistency.
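To make the unified data contract concrete, here is a minimal sketch of what a normalized event might look like. The field names (ad_creative_id, hashed_viewer_id, tracking_urls, and so on) are illustrative assumptions, not Netflix's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical unified ad event: every upstream source (video, display,
# pause ads) is normalized into this one shape before any consumer sees it.
@dataclass
class UnifiedAdEvent:
    event_type: str                  # e.g. "impression", "firstQuartile", "click"
    occurred_at: str                 # ISO-8601 timestamp from the device
    ad_creative_id: str              # resolved from the token via the Metadata Registry
    campaign_id: str
    hashed_viewer_id: str            # identifiers are hashed before publication
    device_type: str                 # "tv", "mobile", "browser", ...
    region: str
    tracking_urls: list[str] = field(default_factory=list)  # vendor pixels to fire
    pricing_context: Optional[dict] = None                  # consumed by billing jobs
```

Because the contract is agnostic of ad type or server, a new format like Pause ads only has to map into these fields; no downstream consumer needs to change.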
Building Ad sessions & Downstream Pipelines:
A key architectural decision was that "sessionization happens downstream, not in the ingestion flow." But what does that mean? Let's break it down:
Ingestion flow (Upstream): This is the initial (real-time) path where raw ad telemetry events are ingested into the system, like:
- Ad started playing
- User clicked
- 25% of video watched
These events are lightweight and real-time, and are not yet grouped into a meaningful session. Think of them as raw, granular events flying into Kafka or a central service.
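As a rough illustration of how light this ingestion path can stay, here is a sketch using the kafka-python client; the topic name and event fields are assumptions for the example, not Netflix's actual setup:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical ingestion-side producer: raw telemetry is forwarded as-is,
# with no grouping or session logic applied at this stage.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

raw_event = {
    "event_type": "firstQuartile",      # 25% of the ad watched
    "token": "opaque-ad-token",         # resolved later by the publisher
    "playback_session_id": "abc-123",
    "device_type": "tv",
    "occurred_at": "2025-06-20T12:15:30Z",
}
producer.send("raw-ad-telemetry", value=raw_event)
producer.flush()
```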
Sessionization (Downstream Processing): This is a batch or stream aggregation process where the system:
- Looks at multiple ad events from the same ad playback session
- Groups them together
- Creates a meaningful structure called an "Ad Session"
An Ad session might include:
- When the ad started and ended
- The user's interactions (viewed / clicked / skipped)
- Whether it was a complete or partial view
- The pricing and vendor context
- The ingestion system just collects and stores raw events quickly and reliably.
- The heavier work of aggregating these into a complete session (sessionization) happens later, in the downstream pipeline where there is:
- More time
- Access to more context
- Better processing capabilities
The benefits of this split:
- Separation of concerns: Ingestion stays fast and simple; sessionization logic is centralized and easier to iterate on.
- Extensibility: You can add new logic to sessionization without touching ingestion.
- Scalability: Real time pipelines stay light; heavy lifting is done downstream.
- Consistency: One place to define and compute what an "ad session" means.
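Here is a minimal sketch of what that downstream sessionization step might look like, assuming each raw event carries a playback_session_id; the field names and the completion rule are illustrative, not Netflix's actual logic:

```python
from collections import defaultdict

# Hypothetical sessionizer: fold raw telemetry events into one
# "ad session" record per playback. Field names are assumptions.
def sessionize(raw_events):
    by_session = defaultdict(list)
    for event in raw_events:
        by_session[event["playback_session_id"]].append(event)

    sessions = []
    for session_id, events in by_session.items():
        events.sort(key=lambda e: e["occurred_at"])
        seen = {e["event_type"] for e in events}
        sessions.append({
            "playback_session_id": session_id,
            "started_at": events[0]["occurred_at"],
            "ended_at": events[-1]["occurred_at"],
            "clicked": "click" in seen,
            "complete_view": "complete" in seen,  # reached the final quartile
        })
    return sessions
```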
The downstream pipelines built on this pattern include:
- Frequency Capping
- Ads Metrics
- Ads Sessionizer
- Ads Event Handler
- Billing and revenue jobs
- Reporting & Metrics APIs
Final Architecture of Ad Event Processing:
- Netflix Player -> Netflix Edge:
- The client (TV, browser, mobile app) plays an ad and emits telemetry (e.g., impression, quartile events).
- These events are forwarded to Netflix’s Edge infrastructure, ensuring low-latency ingestion globally.
- Telemetry Infrastructure:
- Once events hit the edge, they are processed by a Telemetry Infrastructure service.
- This layer is responsible for:
- Validating events
- Attaching metadata (e.g., device info, playback context)
- Interfacing with the Metadata Registry Service (MRS) to enrich events with ad context based on token references.
- Kafka:
- Events are published to a Kafka topic.
- This decouples ingestion from downstream processing and allows multiple consumers to operate in parallel.
- Ads Event Publisher:
- This is the central orchestration layer.
- It performs the following steps (sketched in code after this walkthrough):
- Token decryption
- Metadata enrichment (macros, tracking URLs)
- Identifier hashing
- Emits a unified and extensible data contract for downstream use
- Also writes select events to the data warehouse to support Ads Finance and Billing workflows.
- Kafka -> Downstream Consumers:
- Once enriched, the events are republished to Kafka for the following consumers:
- Frequency Capping:
- Records ad views in a Key-Value abstraction layer.
- Helps enforce logic like “don’t show the same ad more than 3 times per day per user.”
- Ads Metrics Processor:
- Processes delivery metrics (e.g., impressions, completion rates).
- Writes real-time data to Druid for interactive analytics.
- Ads Sessionizer:
- Groups related telemetry events into ad sessions (e.g., full playback window).
- Feeds both data warehouse and offline vendor integrations.
- Ads Event Handler:
- Responsible for firing external tracking pixels and HTTP callbacks to third-party vendors (like Nielsen, DoubleVerify).
- Ensures vendors get event pings per their integration needs.
- Analytics & Offline Exchange:
- The sessionized and aggregated data from the warehouse is used for:
- Offline measurement
- Revenue analytics
- Advertiser reporting
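To tie the Ads Event Publisher's steps together, here is a minimal sketch of the decrypt-enrich-hash-publish flow described above. The helper decrypt_token, the registry.lookup call, the SHA-256 hashing choice, and the topic name are all assumptions for illustration, not Netflix's implementation:

```python
import hashlib

def decrypt_token(token: str) -> str:
    """Stand-in for the real token decryption/validation step."""
    return token

# Hypothetical Ads Event Publisher flow: each step mirrors one bullet
# from the walkthrough above.
def publish_ad_event(raw_event, registry, kafka_producer):
    # 1. Token decryption: recover the ad reference carried by the client.
    ad_ref = decrypt_token(raw_event["token"])

    # 2. Metadata enrichment via the Metadata Registry Service
    #    (macros, tracking URLs, campaign context).
    metadata = registry.lookup(ad_ref)

    # 3. Identifier hashing before the event leaves the trusted boundary.
    hashed_viewer_id = hashlib.sha256(
        raw_event["viewer_id"].encode("utf-8")
    ).hexdigest()

    # 4. Emit the unified, ad-type-agnostic contract for all consumers.
    unified = {
        "event_type": raw_event["event_type"],
        "occurred_at": raw_event["occurred_at"],
        "ad_creative_id": metadata["ad_creative_id"],
        "campaign_id": metadata["campaign_id"],
        "tracking_urls": metadata["tracking_urls"],
        "hashed_viewer_id": hashed_viewer_id,
    }
    kafka_producer.send("unified-ad-events", value=unified)
    return unified
```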
The design principles at work:

| Principle | Implementation |
| --- | --- |
| Separation of concerns | Ingestion (Telemetry Infra) ≠ Enrichment (Publisher) ≠ Consumption (Consumers) |
| Scalability | Kafka enables multi-consumer, partitioned scaling |
| Extensibility | Central Ads Event Publisher emits a unified schema |
| Observability | Real-time metrics via Druid; sessionized logs via warehouse |
| Partner-friendly | Vendor hooks via Event Handler + offline exchange |
The net effect:
- Client-side logic remains lightweight and stable
- Backend systems remain flexible to change and easy to audit
How Frequency Capping uses the Key-Value abstraction:
- Key: A unique identifier for the "audience + ad + context"
- Example: user_id:ad_id:date
- Value: A counter or timestamp that tracks views or interactions
- Example: {count: 2, last_shown: 2025-06-20T12:15:30}
| Key | Value |
| --- | --- |
| user123:ad789:2025-06-20 | { count: 2, last_shown: 12:15 } |
| user123:ad456:2025-06-20 | { count: 1, last_shown: 10:04 } |
When an ad is about to be shown:
- Netflix looks up the key
- Checks the count
- If under the cap (e.g., 3), they allow it and increment the counter
- If at or above cap, the ad is skipped or replaced
Because this sits behind a Key-Value abstraction:
- Ad engineers don't worry about implementation
- Multiple ad systems can share the same cap logic
- It can scale across regions and millions of users
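A minimal sketch of that check-and-increment logic, with an in-memory dict standing in for the Key-Value abstraction; the key format and the cap of 3 follow the example above, everything else is illustrative:

```python
from datetime import datetime, timezone

# Hypothetical frequency-cap check: the dict stands in for the
# Key-Value abstraction layer described above.
store: dict[str, dict] = {}

def should_show_ad(user_id: str, ad_id: str, cap: int = 3) -> bool:
    today = datetime.now(timezone.utc).date().isoformat()
    key = f"{user_id}:{ad_id}:{today}"      # "audience + ad + context"
    entry = store.get(key, {"count": 0, "last_shown": None})

    if entry["count"] >= cap:
        return False                        # at or above the cap: skip or replace

    entry["count"] += 1                     # under the cap: allow and increment
    entry["last_shown"] = datetime.now(timezone.utc).isoformat()
    store[key] = entry
    return True
```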
| Feature | Why it matters |
| --- | --- |
| Speed | Real-time lookups for every ad play |
| Scalability | Handles millions of users per day |
| Flexibility | Supports different capping rules |
| Multi-tenancy | Can cap by user, device, campaign, etc. |
Where Druid fits in:
- Streaming Ad Delivery Metrics
- As ads are played, Netflix captures events like:
- Impressions
- Quartile completions (25%, 50%, etc.)
- Clicks, errors, skips
- These events are streamed and aggregated in near-real-time via Flink or similar streaming processors
- Druid ingests these metrics for interactive querying.
- Advertiser Reporting and Internal Dashboards
- Ad operations teams and partners want to know:
- "How many impressions did Ad X get in Region Y yesterday?"
- "What's the completion rate for Campaign Z?"
- These queries are run against Druid and return in milliseconds, even over billions of records
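As an illustration, a report like the ones above could be answered through Druid's SQL endpoint. The datasource name (ad_delivery_metrics), the column names, and the completed flag are assumptions for the example, not Netflix's actual schema:

```python
import requests

# Hypothetical advertiser-report query against a Druid SQL endpoint.
DRUID_SQL_URL = "http://druid-router:8888/druid/v2/sql/"

query = """
SELECT region,
       COUNT(*) AS impressions,
       SUM(completed) * 1.0 / COUNT(*) AS completion_rate
FROM ad_delivery_metrics
WHERE campaign_id = 'campaign-z'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY region
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
for row in response.json():
    print(row["region"], row["impressions"], row["completion_rate"])
```

Because Druid indexes by time and rolls up common aggregations at ingest, group-by queries like this over recent data return quickly, which is what keeps the dashboards interactive.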
| Feature | Why it matters for Netflix Ads |
| --- | --- |
| Real-time ingestion | Metrics appear in dashboards within seconds of the event |
| Sub-second queries | Advertiser reports and internal dashboards need speed |
| High concurrency | Supports thousands of parallel queries from internal tools |
| Flexible aggregation | Group by ad_id, region, time, device_type, etc. instantly |
| Schema-on-read & roll-ups | Druid efficiently stores pre-aggregated data for common metrics |
Putting it all together, Netflix ended up with:
- A centralized event publisher that abstracts complexity.
- A unified schema that serves a wide range of downstream consumers.
- A clean separation of concerns that allows rapid innovation in ad types without touching backend logic.
- Observability and control, made possible through systems like Druid and the Metadata Registry Service.
That foundation delivers:
- Scalability: Handling billions of ad events across regions and devices
- Agility: Enabling quick integration of new ad formats like Pause/Display Ads
- Partner alignment: Supporting vendor integrations with minimal friction
- Business confidence: Powering frequency capping, billing, and reporting with precision