Technical
How We Built a System That Scans 100,000 Websites for Cookie Consent Violations
GDPR Privacy Monitor Engineering · 2026-04-13 · 7 min read
Automated consent compliance checking sounds straightforward until you try to build it. The naive approach -- fetch a page, check for cookies, look for a banner -- misses most of what matters. Consent violations are behavioral, not structural. They manifest in the timing of script execution, the sequence of network requests, the response of UI elements to user interaction, and the persistence of state across page loads. You cannot assess any of this without running a real browser, interacting with the page the way a human would, and recording what actually happens at the network level.
This post describes how we built the scanning engine behind GDPR Monitor, the engineering challenges that consumed most of our time, the architectural decisions we made and why, and the limitations we are honest about. If you work on web compliance, browser automation, or large-scale web measurement, there should be something useful here.
The Pipeline Overview
Every scan passes through six stages. Understanding the pipeline is necessary context for the specific challenges that follow.
Stage 1: Browser launch and isolation. A fresh Chromium instance starts with zero state -- no cookies, no localStorage, no cache, no service workers. This is the clean-room guarantee that makes pre-consent measurement meaningful. We configure a standard viewport, realistic user-agent and Accept-Language headers matching the target country, and standard browser flags. Each scan gets its own browser process; there is no state leakage between scans.
Stage 2: Navigation and pre-consent snapshot. The scanner navigates to the target URL, waits for the page to reach a stable state (network idle, DOM settled), and captures everything that has happened: cookies set, network requests made (with full URL, timing, and response metadata), third-party domains contacted, and a full-page screenshot. This snapshot answers the fundamental question: what did this website do before the user had any opportunity to consent?
Stage 3: CMP detection and banner identification. The scanner attempts to identify the consent management platform and locate the consent banner, accept button, and reject button. This uses a layered detection system described in detail below.
Stage 4: Consent interaction. The scanner interacts with the banner -- clicking accept for the standard flow, clicking reject for the reject-flow test. It waits for the page to settle after interaction, accounting for animations, script re-evaluation, and delayed tag firing.
Stage 5: Post-consent snapshot and differential analysis. A second full snapshot captures the state after consent interaction. Comparing pre-consent and post-consent snapshots reveals what changed: new cookies, new tracking requests, consent state in CMP APIs.
Stage 6: Analysis, classification, and report generation. Raw data feeds into analysis modules: cookie classification against our database, tracker matching against known patterns, cookie lifetime evaluation, accessibility audit of the banner, Google Consent Mode validation, fingerprint signal detection, and risk scoring. The output is a structured report with findings, evidence artifacts, and a composite risk score.
Each stage produces timestamped evidence that is stored durably. Any finding can be traced back to specific network requests, cookie entries, or screenshots.
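The differential analysis in Stage 5 is, at its core, a set comparison between the two snapshots. A minimal Go sketch of the cookie half of that comparison -- the `Cookie` struct here is a hypothetical, simplified stand-in for the scanner's real record, which carries far more metadata:

```go
package main

import "fmt"

// Cookie is a minimal stand-in for the scanner's cookie record
// (hypothetical shape; the real pipeline stores expiry, flags, etc.).
type Cookie struct {
	Name   string
	Domain string
}

// diffCookies returns the cookies present in post but not in pre,
// keyed by name+domain -- the core of the differential analysis
// between the pre-consent and post-consent snapshots.
func diffCookies(pre, post []Cookie) []Cookie {
	seen := make(map[string]bool, len(pre))
	for _, c := range pre {
		seen[c.Name+"|"+c.Domain] = true
	}
	var added []Cookie
	for _, c := range post {
		if !seen[c.Name+"|"+c.Domain] {
			added = append(added, c)
		}
	}
	return added
}

func main() {
	pre := []Cookie{{"session", "example.com"}}
	post := []Cookie{{"session", "example.com"}, {"_ga", ".example.com"}}
	// The _ga entry only appears in the post-consent snapshot.
	fmt.Println(diffCookies(pre, post))
}
```

The same diff runs in the other direction for the reject flow: anything still present after a reject click is a candidate finding.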
Challenge 1: CMP Detection -- 45 Platforms, Infinite Variations
Consent management is not standardized. There is no universal HTML attribute, no mandatory JavaScript API, no consistent DOM structure that says "this is a consent banner." There are 45 distinct CMPs in our detection library, each with its own DOM structure, script signatures, JavaScript globals, and interaction patterns. Beyond those, 34.7% of the banners we detected in our 97,304-site study were generic or unidentified -- custom implementations, regional vendors, or minimal solutions that do not match any known CMP signature.
Our detection uses a confidence-based, layered approach:
Layer 1: Script signature detection
The scanner checks for the presence of known CMP scripts by URL pattern and JavaScript global variables. Cookiebot, for instance, loads from `consent.cookiebot.com` and exposes `window.Cookiebot`. OneTrust loads from `cdn.cookielaw.org` and exposes `window.OneTrust`. Each CMP has characteristic loading patterns that can be detected before examining the DOM.
This layer is fast and high-confidence when it matches. But it has a critical limitation: it tells you which CMP is present on the page, not necessarily which CMP is responsible for the consent banner. A site might load PiwikPRO for analytics (which includes a CMP component) while using tarteaucitron for actual consent management. Detecting both scripts is easy; knowing which one controls the banner is harder.
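Layer 1 reduces to matching loaded script URLs against a signature table. A sketch in Go, using the two example signatures from above (a production library covers all 45 CMPs and many more patterns per CMP):

```go
package main

import (
	"fmt"
	"strings"
)

// cmpSignature pairs a CMP name with URL substrings that identify its
// loader script. Only the two examples from the text are included here.
type cmpSignature struct {
	Name        string
	URLPatterns []string
}

var signatures = []cmpSignature{
	{"Cookiebot", []string{"consent.cookiebot.com"}},
	{"OneTrust", []string{"cdn.cookielaw.org"}},
}

func matchesAny(scriptURLs, patterns []string) bool {
	for _, u := range scriptURLs {
		for _, p := range patterns {
			if strings.Contains(u, p) {
				return true
			}
		}
	}
	return false
}

// detectByScripts returns every CMP whose pattern matches a loaded
// script. Note it can return multiple candidates -- exactly the
// ambiguity that the later layers and the ranking step must resolve.
func detectByScripts(scriptURLs []string) []string {
	var found []string
	for _, sig := range signatures {
		if matchesAny(scriptURLs, sig.URLPatterns) {
			found = append(found, sig.Name)
		}
	}
	return found
}

func main() {
	fmt.Println(detectByScripts([]string{"https://consent.cookiebot.com/uc.js"}))
}
```

Returning a list rather than a single winner is deliberate: the script layer only establishes candidates.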
Layer 2: Verified selector matching
For each known CMP, we maintain a curated set of CSS selectors that reliably identify the banner container, the accept button, and the reject button. These are selectors we have validated across multiple versions and configurations of each CMP. When a CMP is detected in Layer 1 and its verified selectors match elements in the DOM, we have high confidence in both the CMP identification and the banner interaction targets.
Layer 3: Compatible selector matching
A broader set of selectors that work across many versions of a CMP but are less precise. These handle cases where a CMP has been customized, themed, or is running a version not covered by our verified selectors. They trade precision for coverage.
Layer 4: Generic heuristics
For the 34.7% of banners not associated with a known CMP, we fall back to heuristic detection. The scanner looks for:
- Fixed-position or sticky-position elements near the bottom or top of the viewport
- Elements containing consent-related keywords in multiple languages ("cookies," "consent," "privacy," "akzeptieren," "accepter," "aceptar," etc.)
- Buttons with common consent-action labels ("Accept All," "Reject All," "Manage Preferences," and equivalents)
- Structural patterns typical of consent dialogs: overlay backgrounds, modal containers, dismiss buttons
This layer catches many custom implementations but is inherently less reliable. A fixed-position promotional banner or newsletter signup can look structurally similar to a consent dialog.
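The heuristic layer can be sketched as a text classifier over candidate elements. This is an illustrative simplification -- the keyword and label sets below are tiny subsets of the real multilingual pattern library, and the production version also weighs positioning and structure, not just text:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative subsets of the multilingual pattern sets described above.
var consentKeywords = []string{"cookie", "consent", "privacy", "akzeptieren", "accepter", "aceptar"}
var actionLabels = []string{"accept all", "reject all", "manage preferences", "alle akzeptieren"}

// looksLikeConsentBanner treats an element as a likely banner when it
// combines consent vocabulary with a recognizable action label --
// requiring both cuts down on newsletter-popup false positives.
func looksLikeConsentBanner(text string) bool {
	t := strings.ToLower(text)
	keywordHit := false
	for _, k := range consentKeywords {
		if strings.Contains(t, k) {
			keywordHit = true
			break
		}
	}
	if !keywordHit {
		return false
	}
	for _, a := range actionLabels {
		if strings.Contains(t, a) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeConsentBanner("We use cookies. Accept All | Reject All"))
	fmt.Println(looksLikeConsentBanner("Subscribe to our newsletter!"))
}
```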
Layer 5: CMP API probing
Some CMPs expose JavaScript APIs -- most notably the IAB Transparency and Consent Framework (TCF) API via `__tcfapi`. We probe for these APIs to both verify CMP detection and read the programmatic consent state, which we later compare against observed browser behavior.
The confidence model
Rather than treating detection as binary (found/not found), we assign confidence scores based on which layers matched and how strongly. A site where we detect a CMP script, match verified selectors, and find a TCF API gets high confidence. A site where only generic heuristics triggered gets lower confidence. This confidence score feeds into our risk classification -- lower detection confidence means findings are more likely to be classified as inconclusive rather than definitive.
The confidence model is why CMP misidentification, when it occurs, does not systematically bias our results. When detection is ambiguous, we say so rather than forcing a classification.
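In code, the model is a weighted combination of layer signals feeding a threshold. The weights and threshold below are invented for illustration -- the production values are tuned empirically -- but the shape matches the description above:

```go
package main

import "fmt"

// signals records which detection layers matched.
// The weights below are illustrative, not the production values.
type signals struct {
	ScriptMatch      bool // Layer 1: script signature
	VerifiedSelector bool // Layer 2: verified selectors
	CompatSelector   bool // Layer 3: compatible selectors
	HeuristicOnly    bool // Layer 4: generic heuristics only
	CMPAPIResponds   bool // Layer 5: live CMP API (e.g. __tcfapi)
}

func confidence(s signals) float64 {
	score := 0.0
	if s.ScriptMatch {
		score += 0.35
	}
	if s.VerifiedSelector {
		score += 0.35
	} else if s.CompatSelector {
		score += 0.20
	}
	if s.CMPAPIResponds {
		score += 0.30
	}
	if s.HeuristicOnly {
		score = 0.25 // generic detection alone stays low-confidence
	}
	return score
}

// classify mirrors the downstream rule: low-confidence detection pushes
// findings toward inconclusive rather than definitive.
func classify(score float64) string {
	if score >= 0.7 {
		return "definitive"
	}
	return "inconclusive"
}

func main() {
	strong := signals{ScriptMatch: true, VerifiedSelector: true, CMPAPIResponds: true}
	weak := signals{HeuristicOnly: true}
	fmt.Println(classify(confidence(strong)), classify(confidence(weak)))
}
```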
Challenge 2: The Reject Flow -- Why "Click and Check" Is Surprisingly Hard
Testing a reject button sounds simple: find it, click it, check if cookies are gone. In practice, every step is fraught with timing issues, async behavior, and platform-specific quirks.
Finding the reject button. Not all reject buttons are labeled "Reject." They may say "Decline All," "Refuse," "Only necessary cookies," "Manage settings" (leading to a second layer where rejection is possible), or equivalents in any of dozens of languages. Some CMPs place the reject option in a different visual location, at a different size, or in a different color from the accept option. Some hide it behind a "More options" or "Customize" link. Our scanner maintains a multilingual set of reject-action patterns and also detects second-layer reject options where the first layer only offers accept and customize.
Waiting for the right moment. After clicking reject, the page may undergo significant changes: the banner dismisses (often with animation), the CMP fires consent-state callbacks, tag managers re-evaluate their rules, and scripts may be loaded or unloaded. Checking cookies too early catches the mid-transition state. Checking too late misses transient tracking that fires and completes quickly. We use a multi-signal wait: network idle, DOM stability, and a minimum delay floor, tuned from empirical testing across hundreds of CMP configurations.
The reload test and consent respawn. The reload step is what revealed consent respawn as a phenomenon. We did not set out to find it -- our original reject-flow test only checked the immediate post-reject state. But during development, we noticed sites that looked clean after reject but had tracking cookies when we checked again after a page reload. Initial debugging assumed a scanner timing issue. Further investigation confirmed it was real: third-party scripts re-setting cookies on page load regardless of consent state.
We added explicit respawn detection to the pipeline: after the reject flow, the scanner reloads the page, waits for stability, and compares the cookie inventory against the post-reject snapshot. Any cookie that was removed by reject and reappears after reload is flagged as a respawn.
This caught 1,642 sites with 4,932 respawning cookies -- a finding that would have been invisible without the reload step.
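The respawn rule compares three cookie inventories. A minimal Go sketch, using name+domain keys as a simplification of the real matching logic:

```go
package main

import "fmt"

// detectRespawns flags cookies that were present before reject, absent
// immediately after reject, and present again after the reload -- the
// respawn pattern described above. Keys are a simplified name|domain
// identity; the real matching logic is richer.
func detectRespawns(preReject, postReject, postReload map[string]bool) []string {
	var respawned []string
	for key := range preReject {
		if !postReject[key] && postReload[key] {
			respawned = append(respawned, key)
		}
	}
	return respawned
}

func main() {
	pre := map[string]bool{"_ga|.example.com": true, "session|example.com": true}
	post := map[string]bool{"session|example.com": true} // _ga removed by reject
	reload := map[string]bool{"session|example.com": true, "_ga|.example.com": true}
	// _ga was removed by reject but reappeared after reload: a respawn.
	fmt.Println(detectRespawns(pre, post, reload))
}
```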
The `waitForScriptIdentifiedCMP` poll. Some CMPs load asynchronously and do not render their banner until several seconds after initial page load. If the scanner proceeds to the reject step before the CMP has initialized, it either misses the banner entirely or interacts with a partially loaded UI. We implemented a polling mechanism that waits for the CMP's JavaScript API to become available (e.g., `__tcfapi` for TCF-based CMPs, `Cookiebot` global for Cookiebot) before proceeding. This adds latency per scan but significantly reduces false negatives from async CMP loading.
Challenge 3: Pipeline Saturation at Scale
Scanning 97,304 websites is not a single-machine job. Each scan launches a Chromium process, navigates to a website, intercepts and classifies hundreds of network requests, takes multiple screenshots, and runs analysis modules. A single scan takes 30-90 seconds depending on site complexity. At 15 concurrent scans per worker, resource management becomes the primary engineering concern.
The semaphore architecture
We use a semaphore-based concurrency model to limit the number of simultaneous Chromium processes per worker. Each worker has a fixed semaphore (15 slots in our production configuration). A scan acquires a slot before launching its browser and releases it on completion. This prevents memory exhaustion -- 15 Chromium instances with full request interception already consume significant RAM -- and provides backpressure against the Redis queue.
The document request exemption
Early in development, we encountered a throughput problem: our request interception logic (which inspects every request for SSRF safety -- blocking requests to private IP ranges, internal networks, and other potentially dangerous targets) was adding latency to every resource load, including the main document request. Since the document URL has already been validated before the scan begins, we added a fast-path bypass: document-type requests to the pre-validated target URL skip the full interception pipeline. This seemingly small optimization had a significant impact on overall throughput because the document request blocks everything else.
DNS pre-warming
The first request to a new domain incurs a DNS lookup, which on our infrastructure could add 50-200ms per domain. With the average site contacting 10.4 third-party domains (and some contacting up to 171), DNS resolution time accumulated significantly. We implemented DNS pre-warming using a local Unbound DNS resolver cache: before each scan, we resolve the target domain and warm the cache. The Unbound instance serves cached responses for subsequent lookups within the scan, reducing per-domain DNS overhead to sub-millisecond.
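The pre-warming pattern itself is simple: resolve once before the browser launches, then serve cached answers for the rest of the scan. In production that role is played by the local Unbound resolver; the sketch below models it in-process, with the resolver function injected so it needs no network:

```go
package main

import (
	"fmt"
	"sync"
)

// dnsCache is a minimal in-process model of the pre-warming pattern.
// In production a local Unbound resolver plays this role; resolve is
// injected here so the sketch is self-contained.
type dnsCache struct {
	mu      sync.Mutex
	entries map[string][]string
	resolve func(host string) []string
	misses  int // resolver round trips actually performed
}

func (c *dnsCache) Lookup(host string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if addrs, ok := c.entries[host]; ok {
		return addrs // cache hit: no 50-200ms resolver round trip
	}
	c.misses++
	addrs := c.resolve(host)
	c.entries[host] = addrs
	return addrs
}

// Prewarm resolves the scan target before the browser launches.
func (c *dnsCache) Prewarm(host string) { c.Lookup(host) }

func main() {
	c := &dnsCache{
		entries: map[string][]string{},
		resolve: func(host string) []string { return []string{"203.0.113.10"} },
	}
	c.Prewarm("example.com")
	c.Lookup("example.com") // served from cache during the scan
	fmt.Println("resolver round trips:", c.misses)
}
```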
SSRF safety at scale
Every request intercepted by the scanner is checked against a set of safety rules before being allowed to proceed. Requests to private IP ranges (RFC 1918, RFC 4193, link-local, loopback) are blocked. This prevents a malicious target site from using the scanner as an SSRF vector to probe internal networks.
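The address check maps directly onto Go's standard library, which has classifiers for the ranges listed above. A sketch of the per-address rule (the real pipeline also re-checks after DNS resolution, since a hostname can resolve to a private address):

```go
package main

import (
	"fmt"
	"net"
)

// isBlockedAddr implements the SSRF rule described above: private
// ranges (RFC 1918 for IPv4, RFC 4193 ULAs for IPv6), link-local, and
// loopback addresses are never contacted by the scanner.
func isBlockedAddr(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return true // unparseable input: fail closed
	}
	return ip.IsPrivate() || ip.IsLoopback() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() ||
		ip.IsUnspecified()
}

func main() {
	for _, a := range []string{"10.0.0.5", "127.0.0.1", "fd00::1", "93.184.216.34"} {
		fmt.Printf("%-15s blocked=%v\n", a, isBlockedAddr(a))
	}
}
```

`net.IP.IsPrivate` (Go 1.17+) covers 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, and fc00::/7, which keeps the rule list short and auditable.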
The challenge at scale was distinguishing genuine SSRF blocks from semaphore saturation. When all 15 semaphore slots are in use and a scan cannot acquire a slot, the resulting timeout looks superficially similar to a request being blocked for safety reasons. We added explicit error categorization to distinguish "blocked because the target resolved to a private IP" from "blocked because the scanner is at capacity." This was essential for operational monitoring and for accurate scan failure classification.
Challenge 4: Bot Evasion Detection
During the study, we identified 137 websites that appear to deliberately hide their consent banner from automated scanners. The banner is served to human visitors but suppressed when the site detects characteristics of automated browsing.
The most common mechanism we identified involves the RCB (Real Cookie Banner) WordPress plugin's `isAcceptAllForBots` configuration option. When enabled, this setting detects automated browsers (via `navigator.webdriver`, user-agent heuristics, or behavioral signals) and either auto-accepts consent on their behalf or hides the banner entirely. The intent, as documented by the plugin, is to prevent automated visitors from being served a consent dialog they cannot meaningfully interact with. The effect is that compliance scanners -- and regulatory auditors using automated tools -- see a site that appears to have no consent mechanism, when human visitors see a full consent banner.
This is a transparency problem. If a website's consent mechanism is only visible to human visitors, it cannot be audited at scale. We flag these sites separately in our results because the finding is qualitatively different from "no banner detected." The site has a banner; it is choosing not to show it to us.
We detect bot evasion through a combination of signals: the presence of known bot-detection configuration in CMP settings (accessible via JavaScript inspection), discrepancies between what the DOM shows and what the CMP's API reports, and in some cases by comparing automated scan results with manual verification.
The 137 figure is certainly an undercount. We can only detect bot evasion when we can identify the mechanism. Sites using more sophisticated or custom bot detection may successfully evade both our scanner and our evasion detection.
Challenge 5: CMP Misidentification
A site can load multiple scripts that look like consent management platforms. PiwikPRO includes a CMP component but is primarily an analytics suite. Some WordPress sites load Complianz alongside a separate analytics plugin that has CMP-like features. Enterprise sites may have remnants of a previous CMP still loading alongside the current one.
Naive detection -- "if we see the script, it is the CMP" -- produces false identifications that cascade into incorrect banner interaction. If the scanner identifies PiwikPRO as the CMP and tries to use PiwikPRO's banner selectors, it may miss the actual tarteaucitron banner that controls consent on the site.
Our confidence-based approach addresses this by ranking CMP candidates. When multiple potential CMPs are detected:
1. We check which one has a visible banner in the DOM (script present but no banner means likely inactive or non-CMP usage).
2. We check which one exposes an active CMP API (e.g., a functioning `__tcfapi` or equivalent).
3. We prefer the CMP whose verified selectors match visible DOM elements over the one that is only detected by script URL.
This heuristic is not perfect, but it resolved the most common misidentification cases we encountered during development and testing.
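The three checks above amount to ranking candidates by signal strength. A hedged sketch -- the weights are illustrative, chosen only to encode the ordering described (visible banner outranks an active API, which outranks a script-only match):

```go
package main

import (
	"fmt"
	"sort"
)

// cmpCandidate carries the three ranking signals from the steps above.
type cmpCandidate struct {
	Name             string
	BannerVisible    bool // 1. visible banner in the DOM
	APIActive        bool // 2. functioning CMP API (e.g. __tcfapi)
	VerifiedSelector bool // 3. verified selectors matched visible elements
}

// rank encodes the preference order; the weights are illustrative.
func rank(c cmpCandidate) int {
	score := 0
	if c.BannerVisible {
		score += 4
	}
	if c.APIActive {
		score += 2
	}
	if c.VerifiedSelector {
		score++
	}
	return score
}

// pickCMP returns the highest-ranked candidate.
func pickCMP(cands []cmpCandidate) cmpCandidate {
	sort.SliceStable(cands, func(i, j int) bool { return rank(cands[i]) > rank(cands[j]) })
	return cands[0]
}

func main() {
	best := pickCMP([]cmpCandidate{
		{Name: "PiwikPRO"}, // script present, no visible banner
		{Name: "tarteaucitron", BannerVisible: true, VerifiedSelector: true},
	})
	fmt.Println(best.Name) // the CMP that actually controls the banner
}
```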
Limitations
No automated scanner perfectly replicates every human browsing experience. These are the known limitations:
GeoIP-dependent banners. Some CMPs, notably CookieYes, serve different consent experiences based on the visitor's IP geolocation. Our scans originate from specific network locations in Europe. A site that shows a consent banner to visitors from France but not to visitors from outside the EU will show different results depending on scan origin. We do not currently scan each site from every EU country.
Closed shadow DOM. Some CMPs render their banner inside a closed shadow DOM, which is inaccessible to standard DOM queries via `document.querySelector`. Transcend's CMP uses this approach. Our scanner can detect the shadow host element but cannot inspect its contents to find accept/reject buttons. These sites typically end up as inconclusive in our results.
Dynamic class names and obfuscation. Some CMPs, notably Admiral, use dynamically generated class names that change on each page load. Selector-based detection fails for these because the selectors are not stable across visits. We fall back to generic heuristics, but confidence is lower.
Single-page applications. SPAs that manage consent state entirely in client-side JavaScript and load the consent mechanism after initial route changes (rather than on initial page load) are harder to assess. Our scanner navigates to the URL and waits for the page to stabilize, but it does not simulate in-app navigation. A consent banner that only appears after the user navigates within the SPA may be missed.
Language coverage. Our reject-button detection uses text matching across a set of supported languages, but we do not cover every EU language equally. A banner in Maltese or Estonian may have reject-button labels that our text matching does not recognize, leading to a miss on reject-flow testing (though the banner itself may still be detected by structural heuristics).
Timing edge cases. A script that fires 30 seconds after page load will be missed by a scan that waits 15 seconds for network idle. We use generous timeouts, but the long tail of async behavior is inherently difficult to capture completely.
These limitations contribute to our 14.9% inconclusive rate.
The Infrastructure
The production scanning infrastructure consists of:
- Scanner engine: A Go application using chromedp as the CDP client for Chromium automation. Go was chosen for its concurrency model (goroutines and channels map naturally onto parallel scan orchestration), its performance characteristics, and its deployment simplicity (single static binary).
- Browser runtime: Headless Chromium launched per-scan via CDP. Each scan gets a fresh browser process with zero shared state.
- Queue: Redis-backed work queue distributing URLs to scanner workers. Redis handles job distribution, progress tracking, and rate limiting.
- Database: PostgreSQL for durable scan results, findings, evidence metadata, and all structured data. Scans, findings, cookies, requests, and analysis outputs are all stored relationally.
- DNS cache: Local Unbound resolver providing cached DNS lookups and SSRF-safe resolution.
- Evidence storage: Screenshots, HAR files, and PDF reports are stored as durable artifacts linked to scan records.
For the 97,304-site study, we processed 114,748 candidate URLs (97,304 completed successfully) over approximately 2.5 days using 3 server instances running scanner workers in parallel. Each server ran multiple worker processes with 15 concurrent scan slots each. Peak throughput was roughly 25-30 completed scans per minute per server.
The primary bottleneck was not CPU or memory but network: each scan generates hundreds of outbound requests (to the target site and its third-party resources), and the aggregate bandwidth and connection count across all concurrent scans saturated available network capacity before other resources were exhausted.
Open Challenges and Future Work
Several problems remain unsolved or partially solved:
Consent banner localization. Our text matching covers major EU languages but is incomplete for smaller language communities. Expanding coverage requires not just adding translations but validating that the selectors and interaction patterns work correctly with localized CMP versions.
Longitudinal monitoring. Our current architecture is optimized for point-in-time scanning. Detecting changes in consent behavior over time -- did a site improve after enforcement? Did a CMP update fix a class of reject-flow failures? -- requires repeated scans with differential analysis, which is architecturally different from one-shot assessment.
CMP compliance benchmarking. We have the data to assess per-CMP compliance rates (is Cookiebot associated with better compliance than OneTrust?), but disentangling CMP quality from site-operator configuration quality is methodologically complex. A CMP that is more often deployed by large enterprises with dedicated privacy teams will look better in aggregate even if the tool itself is no more compliant.
Real-time consent state verification. The current scanner operates in batch mode. Integrating consent verification into CI/CD pipelines or real-time monitoring requires a faster, lighter-weight scan mode that sacrifices some evidence depth for speed. We are exploring this.
The API
The same scanning engine described in this post is available through GDPR Monitor's public API. You can submit scan requests programmatically, poll for results, and retrieve structured findings and evidence artifacts. The API returns the same data our UI displays: pre-consent snapshots, cookie inventories, CMP detection results, reject-flow outcomes, risk scores, and full evidence chains.
If you are building compliance tooling, integrating privacy checks into CI/CD pipelines, conducting your own research, or building monitoring into a privacy program, the API provides access to behavioral consent analysis without the need to build and maintain your own Chromium automation infrastructure.
Try it yourself. API documentation is available at gdprprivacymonitor.eu/developers/api. Submit a single URL or integrate automated privacy monitoring into your workflow.