Credit Where Due
Deep, careful dives into marketing attribution — the models, the studies, the math, and why your 'last click' is lying to you.

Media mix modeling is back — but is your model measuring causation or just spend?

MMM (regressing aggregate outcomes like sales on aggregate marketing inputs over time) is enjoying a revival because it needs no user-level tracking. The revival comes with a quiet correlation trap.

How it works

You fit a model, historically linear regression and now often Bayesian (Google's Meridian) or regularized (Meta's Robyn, which is built on ridge regression rather than a Bayesian fit), that attributes sales to channels while controlling for price, seasonality, and macro factors. Two refinements matter: adstock (advertising effects carry over and decay across later periods instead of landing all at once) and saturation curves (diminishing returns at higher spend).
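A minimal sketch of those two refinements, assuming geometric adstock and a Hill-type saturation curve. Both are common choices, not necessarily what any given library does; the function names, `decay`, and `half_sat` values here are illustrative.

```python
def geometric_adstock(spend, decay):
    """Carry a fraction `decay` of the previous period's adstock into this one."""
    out = []
    carry = 0.0
    for x in spend:
        carry = x + decay * carry
        out.append(carry)
    return out


def hill_saturation(x, half_sat, shape=1.0):
    """Map adstocked spend to a 0..1 response with diminishing returns.

    `half_sat` is the adstock level that yields half of the maximum response.
    """
    if x <= 0:
        return 0.0
    return x**shape / (x**shape + half_sat**shape)


weekly_spend = [100.0, 0.0, 0.0, 0.0]  # one burst, then dark (made-up numbers)
adstocked = geometric_adstock(weekly_spend, decay=0.5)
response = [hill_saturation(a, half_sat=50.0) for a in adstocked]

print(adstocked)  # the effect lingers after spend stops
print(response)   # and each extra unit of adstock buys less response
```

In a fitted model, `decay` and `half_sat` are estimated per channel, and the regression is run on the transformed spend rather than on raw spend.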

The core danger

MMM is observational. If you always raise TV spend in Q4 when demand is already high, the model happily credits TV for seasonal sales it didn't cause. This is textbook confounding: spend correlates with the very demand it's supposed to explain.
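A toy simulation of that trap, under loud assumptions: TV has zero true effect, demand doubles in Q4, and the planner sets TV spend in proportion to demand plus a little noise. All numbers are invented.

```python
import random

random.seed(0)
TRUE_TV_EFFECT = 0.0  # assumption: TV causes no sales at all in this toy world


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def demean_by_group(values, groups):
    """Subtract each group's mean (same slope as a fit with one dummy per group)."""
    totals = {}
    for v, g in zip(values, groups):
        s, n = totals.get(g, (0.0, 0))
        totals[g] = (s + v, n + 1)
    return [v - totals[g][0] / totals[g][1] for v, g in zip(values, groups)]


weeks = range(104)
is_q4 = [(w % 52) >= 39 for w in weeks]
demand = [200.0 if q else 100.0 for q in is_q4]
tv = [0.5 * d + random.gauss(0, 5) for d in demand]  # spend chases demand
sales = [d + TRUE_TV_EFFECT * t + random.gauss(0, 10) for d, t in zip(demand, tv)]

naive = ols_slope(tv, sales)
adjusted = ols_slope(demean_by_group(tv, is_q4), demean_by_group(sales, is_q4))
print(f"naive TV coefficient: {naive:.2f}")
print(f"season-adjusted:      {adjusted:.2f}")
```

The adjusted estimate only recovers because the confounder is observed and TV still varies a little within each season. If spend tracked demand perfectly, there would be nothing left to identify the effect from.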

The nuance

Bayesian MMM doesn't fix this by itself — priors and regularization reduce overfitting but cannot manufacture causal identification out of correlated regressors. Garbage collinearity in, confident-looking ROI out.

What to actually do

— Inject experimental priors: feed incrementality-test results into the model as informative priors on channel effects. This is the single biggest credibility upgrade.
— Deliberately vary spend (flighting, regional pulses) to create the variation the model needs to identify effects.
— Report credible intervals, not point ROAS. A channel whose interval spans zero is not a finding.
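To make the first and third points concrete, here is a simplified sketch: a normal prior from a lift test combined with a normal MMM-only estimate via a precision-weighted average. Real Bayesian MMMs place the prior inside the model instead of blending afterward, and every number below is hypothetical.

```python
import math


def combine_normal(prior_mean, prior_sd, est_mean, est_se):
    """Precision-weighted blend of a normal prior and a normal estimate."""
    w_prior = 1.0 / prior_sd**2
    w_est = 1.0 / est_se**2
    post_var = 1.0 / (w_prior + w_est)
    post_mean = post_var * (w_prior * prior_mean + w_est * est_mean)
    return post_mean, math.sqrt(post_var)


def interval_95(mean, sd):
    """Central 95% interval under a normal approximation."""
    return mean - 1.96 * sd, mean + 1.96 * sd


# Hypothetical channel ROI: a tight experiment says 1.2, a noisy MMM says 3.0.
post_mean, post_sd = combine_normal(prior_mean=1.2, prior_sd=0.3,
                                    est_mean=3.0, est_se=1.0)
low, high = interval_95(post_mean, post_sd)
print(f"posterior ROI {post_mean:.2f}, 95% interval ({low:.2f}, {high:.2f})")
```

The tighter source dominates: the experiment pulls the blended ROI most of the way back from the MMM's optimistic read, and the interval is what gets reported, not the point value.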

Bottom line for practitioners: MMM's value scales with how much real experimental variation you feed it. Without calibration, it's an elegant way to launder seasonality into channel credit.
What's the cleanest way to measure ad incrementality — and why do most teams get the control group wrong?

The question is really about counterfactual quality. A bad control silently destroys an otherwise rigorous test.

The flawed default

Many teams compare exposed users to unexposed users. This is hopelessly confounded: the people the algorithm chose to show ads to differ systematically from those it skipped. Exposed users were targeted precisely because they were likely to convert. You measure selection, not effect.

Two designs that fix it

PSA / placebo tests: the control group is shown an unrelated charity ad in the slot your ad would have occupied. Now both groups passed the same auction and targeting filter; the only difference is creative.
Ghost ads: for users in the control group, the platform logs the auctions your ad would have won but serves whatever ad would otherwise have run in that slot. This removes the cost of buying placebo impressions while preserving the same selection mechanism.
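A toy simulation of why the control definition matters, assuming the platform targets the highest-intent users and the ad adds a fixed two points of conversion probability. The intent model, the 0.7 targeting cutoff, and the lift are all made up.

```python
import random

random.seed(1)
TRUE_LIFT = 0.02  # assumption: the ad adds 2 points of conversion probability
N = 200_000


def rate(flags):
    """Share of True values in a list of booleans."""
    return sum(flags) / len(flags)


exposed, unexposed, would_have_been_shown = [], [], []
for _ in range(N):
    intent = random.random()
    base_p = 0.10 * intent          # baseline conversion probability
    wins_auction = intent > 0.7     # targeting favors high-intent users
    if not wins_auction:
        unexposed.append(random.random() < base_p)
    elif random.random() < 0.5:     # randomize among auction winners
        exposed.append(random.random() < base_p + TRUE_LIFT)
    else:                           # ghost-ad / PSA style control
        would_have_been_shown.append(random.random() < base_p)

naive = rate(exposed) - rate(unexposed)
auction_based = rate(exposed) - rate(would_have_been_shown)
print(f"exposed vs unexposed:             {naive:.3f}")
print(f"exposed vs would-have-been-shown: {auction_based:.3f}")
```

The naive comparison bundles the targeting gap with the ad effect; the auction-defined control isolates the effect.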

The nuance

Ghost ads require platform cooperation and are only as trustworthy as the platform running them — the same entity grading its own homework. PSAs cost real money but are auditable by you.

What to actually do

— Insist the control be defined by the auction, not by exposure. "Would-have-been-shown" is the only valid counterfactual.
— Power the test before running it; underpowered lift studies produce confidently wrong zeros.
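For the power step, a rough sketch using the standard normal-approximation sample size for comparing two proportions. The baseline rate and lift are hypothetical, and platform lift tools may use different formulas.

```python
import math
from statistics import NormalDist


def users_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift (two-sided)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical: 2% baseline conversion rate, hunting a 5% vs a 10% relative lift.
n_small_lift = users_per_arm(0.02, 0.05)
n_large_lift = users_per_arm(0.02, 0.10)
print(n_small_lift, n_large_lift)
```

Halving the lift you want to detect roughly quadruples the audience you need, which is why small-lift tests on modest budgets so often read as zero.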

Bottom line for practitioners: incrementality is won or lost in the control group. If your unexposed group wasn't selected by the same machinery as your exposed group, you're measuring the algorithm, not your ads.
Google's Data-Driven Attribution gives you a number. Should you trust the mechanism behind it?

Worth interrogating, because DDA is now the default in many accounts and most users couldn't describe what it computes.

What it claims to do

DDA uses a counterfactual approach loosely related to Shapley value: it compares paths that converted to paths that didn't, and assigns fractional credit based on each touchpoint's apparent contribution to conversion probability. Conceptually sound — far better than last-click.
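To show what Shapley-style credit means, here is an exact computation on three channels. This is the textbook Shapley value over made-up conversion rates, not Google's implementation, which is not public in detail.

```python
from itertools import permutations

# Hypothetical conversion rates for paths containing exactly these channels.
conv_rate = {
    frozenset(): 0.00,
    frozenset({"search"}): 0.04,
    frozenset({"display"}): 0.01,
    frozenset({"email"}): 0.02,
    frozenset({"search", "display"}): 0.06,
    frozenset({"search", "email"}): 0.07,
    frozenset({"display", "email"}): 0.03,
    frozenset({"search", "display", "email"}): 0.10,
}


def shapley(channels, value):
    """Average each channel's marginal contribution over all join orders."""
    credit = {c: 0.0 for c in channels}
    orders = list(permutations(channels))
    for order in orders:
        seen = frozenset()
        for c in order:
            credit[c] += value[seen | {c}] - value[seen]
            seen = seen | {c}
    return {c: total / len(orders) for c, total in credit.items()}


credit = shapley(["search", "display", "email"], conv_rate)
for channel, share in sorted(credit.items(), key=lambda kv: -kv[1]):
    print(f"{channel:8s} {share:.4f}")
```

The credits sum to the full-path conversion rate, which is the property that makes the split "fair" in the game-theoretic sense. It says nothing about whether the underlying rates are causal.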

The nuance you don't see

— It is still a within-platform, observational model. It can only weigh touchpoints it observes, and it sees a shrinking slice of the journey as cross-site signals erode.
— The training is opaque. You cannot inspect the feature weights, validate the counterfactual construction, or reproduce the numbers. For a model making budget decisions, that's a meaningful audit gap.
— It optimizes toward conversions the platform can attribute to itself, which structurally favors the platform's own surfaces.

What the broader evidence says

When DDA outputs are checked against geo holdouts, the directional read is often reasonable but the magnitudes drift — particularly inflating credit for branded and remarketing terms that capture existing demand.

What to actually do

— Use DDA for intra-platform optimization where it's strongest: relative bidding across keywords and audiences.
— Do not use it for cross-channel budget splits — that's outside its field of view.
— Sanity-check its biggest claims against an experiment at least twice a year.

Bottom line for practitioners: DDA is a good optimizer and a poor oracle. Trust it inside the platform's walls; verify it everywhere else.
Where did the 40-20-40 attribution split come from — and is there any science in it?

A fair question, because U-shaped and W-shaped models are everywhere and almost no one can defend the weights.

The models

U-shaped (position-based): 40% to first touch, 40% to last touch, 20% spread across the middle.
W-shaped: adds a weighted bump for the lead-creation or opportunity stage, common in long B2B cycles.
Time-decay: credit grows exponentially toward touches nearer the conversion, governed by a half-life you pick.

The uncomfortable truth

These weights are heuristics, not measurements. The 40-20-40 split encodes a belief — that introduction and closing matter most — but it is not estimated from your data. Two practitioners can pick different half-lives and produce opposite channel rankings from identical paths.
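A small sketch of both rules, plus the half-life problem in action. The path, the day offsets, and the two half-lives are invented for illustration.

```python
def position_based(n_touches, first=0.4, last=0.4):
    """U-shaped weights: fixed shares for the ends, the rest split evenly."""
    if n_touches == 1:
        return [1.0]
    if n_touches == 2:
        total = first + last  # no middle touches, so renormalize the two ends
        return [first / total, last / total]
    middle = (1.0 - first - last) / (n_touches - 2)
    return [first] + [middle] * (n_touches - 2) + [last]


def time_decay(days_before_conversion, half_life_days):
    """Weights that halve for every `half_life_days` of distance from conversion."""
    raw = [0.5 ** (d / half_life_days) for d in days_before_conversion]
    total = sum(raw)
    return [r / total for r in raw]


def channel_credit(path, half_life_days):
    """Sum time-decay weights per channel for one (channel, days_before) path."""
    weights = time_decay([days for _, days in path], half_life_days)
    credit = {}
    for (channel, _), w in zip(path, weights):
        credit[channel] = credit.get(channel, 0.0) + w
    return credit


path = [("display", 10), ("display", 8), ("search", 1)]
short = channel_credit(path, half_life_days=2)
long = channel_credit(path, half_life_days=30)
print("U-shaped weights, 4 touches:", position_based(4))
print("half-life 2 days: ", short)
print("half-life 30 days:", long)
```

Same path, same touches: a 2-day half-life hands the win to search, a 30-day half-life hands it to display.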

The nuance

That doesn't make rule-based models worthless. They are transparent, stable, and impose a deliberate prior. A time-decay model with a sensibly chosen half-life can be a reasonable default for a short, impulse-driven funnel. The error is treating a chosen heuristic as a discovered fact.

What to actually do

— Pick a rule-based model as a communication and stability layer, and document that the weights are assumptions.
— If you want the weights to reflect reality, you need a data-driven model or an experiment — not a prettier curve.

Bottom line for practitioners: position-based models are opinions made of arithmetic. Useful as a transparent default, dangerous when mistaken for evidence about how your customers actually behave.
Should a view-through conversion count at all?

It's one of the most consequential definitional choices in display and video, and it's usually made by an ad platform on your behalf.

What it is

A view-through conversion (VTC) credits an impression that was served but not clicked, when the user later converts within a lookback window. It is a central justification for upper-funnel display and many programmatic video budgets.

Why it's contested

The central problem is unfalsifiable credit. If you served an impression to someone already likely to buy, VTC will record a "conversion" whether or not the ad mattered. With wide windows and broad targeting, you can manufacture impressive VTC counts by spraying impressions at high-intent audiences — pure correlation, formatted as performance.

What experiments find

Ghost-ad and PSA studies consistently show view-through lift far smaller than reported VTCs, and sometimes statistically indistinguishable from zero — especially for retargeting display, where the audience was going to convert anyway.

The nuance

View-through effects are real for genuine prospecting against cold audiences; they're largely illusory for warm retargeting. The metric isn't worthless, but it's the most over-claimed number in the stack.

What to actually do

— Never report VTCs at full face value next to click conversions; they are not the same evidentiary tier.
— Validate any meaningful VTC line against a holdout before scaling spend on it.
— Shorten lookback windows for retargeting to limit demand-capture inflation.
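A sketch of the holdout check, using a plain difference in conversion rates with a normal-approximation 95% interval. The user counts, conversions, and the reported VTC figure are all hypothetical.

```python
import math


def lift_with_interval(conv_treat, n_treat, conv_ctrl, n_ctrl, z=1.96):
    """Absolute lift in conversion rate with a normal-approximation interval."""
    p_t = conv_treat / n_treat
    p_c = conv_ctrl / n_ctrl
    diff = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_treat + p_c * (1 - p_c) / n_ctrl)
    return diff, diff - z * se, diff + z * se


n_treat, n_ctrl = 100_000, 100_000
diff, low, high = lift_with_interval(1050, n_treat, 1000, n_ctrl)

reported_vtcs = 600                      # what the platform report claims
incremental = diff * n_treat             # what the holdout supports
print(f"incremental conversions: {incremental:.0f} of {reported_vtcs} reported")
print(f"lift interval: ({low:.5f}, {high:.5f})")
```

In this made-up case the holdout supports about 50 of the 600 reported conversions, and the interval still includes zero, so even the 50 is a hypothesis.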

Bottom line for practitioners: view-through is where attribution is most easily gamed, often unintentionally. Treat every VTC as a hypothesis until an experiment says otherwise.
How much does your attribution depend on a number nobody debates — the lookback window?

More than almost any model choice, and it's usually set once and forgotten.

The mechanic

The lookback (or conversion) window defines how far back from a conversion you'll look for qualifying touchpoints — 1 day, 7 days, 28 days, 90 days. Every touch outside it is invisible to your model.

Why it quietly dominates results

Window length acts as a hidden weighting on channel mix. A short window structurally favors lower-funnel, click-heavy channels (search, retargeting) and erases upper-funnel touches that happened earlier. A long window inflates the credit of any always-on channel that's simply present in most journeys. You can change which channel "wins" by editing one dropdown, with no change to actual behavior.
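The "one dropdown" effect in miniature, assuming simple linear (equal-split) credit over whatever touches fall inside the window. The three paths are invented.

```python
# Each path lists (channel, days before conversion); hypothetical data.
paths = [
    [("video", 40), ("social", 20), ("search", 1)],
    [("video", 35), ("search", 2)],
    [("video", 50), ("social", 12), ("retargeting", 1)],
]


def linear_credit(paths, window_days):
    """Split each conversion equally across touches inside the window."""
    credit = {}
    for path in paths:
        visible = [ch for ch, days in path if days <= window_days]
        for ch in visible:
            credit[ch] = credit.get(ch, 0.0) + 1.0 / len(visible)
    return credit


for window in (7, 30, 90):
    credit = linear_credit(paths, window)
    winner = max(credit, key=credit.get)
    print(f"{window:2d}-day window: winner={winner}, credit={credit}")

credit_7 = linear_credit(paths, 7)
credit_90 = linear_credit(paths, 90)
```

Under a 7-day window video does not exist; under a 90-day window it is the top channel. Nothing about the customers changed.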

The nuance

There is no universally correct window — it should match your real purchase-consideration cycle. An impulse e-commerce buy and a six-month B2B deal cannot share a 7-day window honestly. Mismatched windows are a major reason cross-channel reports contradict each other.

What to actually do

— Derive the window empirically: look at the distribution of time-to-conversion across your actual paths, and set it near the 80th-90th percentile of genuine consideration time.
— Run sensitivity analysis: rerun your model at 7, 14, 30, 90 days and see which conclusions survive. Conclusions that flip with the window aren't robust.
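For the empirical step, a minimal sketch using a nearest-rank percentile over observed days from first touch to conversion. The sample is hypothetical and far smaller than anything you would use in practice.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observed value with at least `pct` percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical days from first touch to conversion, one entry per converting path.
days_to_convert = [0, 0, 1, 1, 2, 3, 3, 5, 8, 13, 21, 34]

p80 = percentile_nearest_rank(days_to_convert, 80)
p90 = percentile_nearest_rank(days_to_convert, 90)
print(f"80th percentile: {p80} days, 90th percentile: {p90} days")
```

One caveat on the input: time-to-conversion measured inside an existing window is truncated by that window, so pull the distribution from the longest history you have.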

Bottom line for practitioners: the lookback window is a modeling assumption disguised as a setting. Test its sensitivity before you trust any channel ranking built on top of it.
Why do so many geo holdout tests come back "inconclusive"?

Usually not because the channel doesn't work — because the experiment was statistically dead on arrival. Geo testing lives or dies on design, not execution.

The setup

A geo holdout splits regions into treatment (gets the campaign) and control (doesn't), then compares outcomes. When regions are assigned at random, it is a randomized experiment at the market level, and it sidesteps user-level tracking entirely: its great advantage in a privacy-constrained world.

The three things that kill it

Too few geos. Markets are heterogeneous and few in number. With a handful of regions, between-market variance swamps any treatment effect. You need enough units, and matched ones.
Underpowered spend. If the incremental lift you're hunting is 3% and your test can only detect 10%, a null result tells you nothing. Power analysis must precede the test, not follow the disappointment.
Spillover. Neighboring or online-overlapping markets contaminate the control, biasing the effect toward zero.
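The first two failure modes can be put into one number before launch. This is a rough sketch using a two-sample normal approximation that ignores matching and pre-period covariates; `cv` (the coefficient of variation of the outcome across geos) and the geo counts are assumptions.

```python
import math
from statistics import NormalDist


def relative_mde(n_geos_per_arm, cv, alpha=0.05, power=0.80):
    """Smallest relative lift detectable when comparing raw geo averages."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * cv * math.sqrt(2.0 / n_geos_per_arm)


# Hypothetical: sales vary by 20% of the mean from market to market.
for n in (5, 10, 40):
    print(f"{n:2d} geos per arm -> minimum detectable lift {relative_mde(n, 0.20):.1%}")
```

With ten markets per arm and that much between-market noise, a raw comparison cannot see anything below roughly a 25% lift, so a 3% effect is invisible by construction.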

The nuance

Matched-market methods and synthetic control (constructing a weighted blend of control geos that mimics the treated region's pre-period trend) materially improve power without more markets. They're the difference between a usable read and noise.
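A deliberately tiny sketch of the synthetic-control idea: two control geos, one weight, chosen by grid search to track the treated geo before the campaign. Production methods fit many non-negative donor weights with a constrained optimizer; the series below are invented.

```python
def fit_weight(treated_pre, control_a_pre, control_b_pre, steps=1000):
    """Weight on geo A (rest on geo B) minimizing pre-period squared error."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        err = sum(
            (t - (w * a + (1 - w) * b)) ** 2
            for t, a, b in zip(treated_pre, control_a_pre, control_b_pre)
        )
        if err < best_err:
            best_w, best_err = w, err
    return best_w


# Hypothetical weekly sales. Pre-period: before the campaign starts.
control_a_pre, control_b_pre = [100, 110, 105, 120], [200, 190, 210, 205]
treated_pre = [170, 166, 178.5, 179.5]

# Post-period: campaign live in the treated geo only.
control_a_post, control_b_post = [115, 118], [208, 212]
treated_post = [190.1, 193.8]

w = fit_weight(treated_pre, control_a_pre, control_b_pre)
synthetic_post = [w * a + (1 - w) * b for a, b in zip(control_a_post, control_b_post)]
lifts = [t - s for t, s in zip(treated_post, synthetic_post)]
avg_lift = sum(lifts) / len(lifts)
print(f"weight on geo A: {w:.2f}, average weekly lift: {avg_lift:.1f}")
```

The estimated lift is the gap between the treated geo and its synthetic twin after launch, which is only as credible as the pre-period fit.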

What to actually do

— Run a power calculation first; if the minimum detectable effect exceeds your plausible lift, don't run it.
— Use synthetic control rather than raw region averages.
— Hold the test long enough to clear adstock decay, not just the flight.

Bottom line for practitioners: an inconclusive geo test is usually a design failure, not a verdict on the channel. Power it, match it, and protect the control from spillover before you spend a dollar.
Can one channel beat another in every audience segment yet lose to it overall, or the reverse?

Yes, and it happens more in attribution data than practitioners realize. The culprit is Simpson's paradox, and it quietly corrupts channel decisions.

The phenomenon

Simpson's paradox is when an association that holds within every subgroup reverses or vanishes when the subgroups are pooled — because a lurking variable is distributed unevenly across them.

How it bites attribution

Suppose retargeting shows strong ROAS in aggregate. Split by audience: it's mediocre for genuinely new prospects and inflated only among already-intending buyers, who happen to dominate the pool. The aggregate number is real arithmetic but a false guide, because the mix is doing the work, not the channel. Conversely, a prospecting channel can look weak overall while being your only true demand-creator within the cold-audience segment that actually matters.
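A worked version with made-up counts, where the audience mix alone flips the comparison between two channels:

```python
# (clicks, conversions) per channel and audience segment; hypothetical numbers.
data = {
    "prospecting": {"cold": (9000, 180), "warm": (1000, 120)},
    "retargeting": {"cold": (1000, 10), "warm": (9000, 900)},
}


def segment_rate(channel, segment):
    """Conversion rate for one channel within one audience segment."""
    clicks, conversions = data[channel][segment]
    return conversions / clicks


def pooled_rate(channel):
    """Conversion rate for one channel with all segments lumped together."""
    clicks = sum(c for c, _ in data[channel].values())
    conversions = sum(v for _, v in data[channel].values())
    return conversions / clicks


for segment in ("cold", "warm"):
    print(segment, {ch: f"{segment_rate(ch, segment):.1%}" for ch in data})
print("pooled", {ch: f"{pooled_rate(ch):.1%}" for ch in data})
```

Prospecting converts better than retargeting in both segments, yet looks three times worse when pooled, purely because most of its traffic is cold.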

The nuance

This is the same disease as confounding, surfaced through aggregation. Blended ROAS, blended CPA, blended attribution credit — any pooled metric over heterogeneous audiences is a candidate for reversal. The fix isn't more decimals; it's the right conditioning variable.

What to actually do

— Always decompose channel metrics by audience temperature (cold prospect vs warm vs existing customer) before judging them.
— Be suspicious of any blended metric that aggregates over groups with very different baseline conversion rates.
— When subgroup and aggregate disagree, the subgroup view conditioned on the confounder is usually the decision-relevant one.

Bottom line for practitioners: the most dangerous attribution numbers are the cleanest-looking aggregates. Condition on audience intent before you reallocate, or the mix will fool you.
How much of your branded search spend is actually buying you conversions you'd have gotten for free?

A pointed question, because branded search is consistently the most over-credited line in any attribution report — and one of the most testable.

The setup

When someone searches your brand name, they've already decided to seek you out. Bidding on that term means paying for a click that, in many cases, would have arrived organically just below the ad.

What experiments have found

The landmark evidence is eBay's large-scale search experiment, which paused paid search and found that the vast majority of attributed sales were not incremental — most clicks substituted for free organic visits. Subsequent brand-keyword pause tests across many advertisers replicate the pattern: incremental lift on pure branded terms is often low and sometimes near zero, while last-click cheerfully assigns it a glowing ROAS.

The nuance

It is not universally zero. Branded incrementality rises when competitors bid on your brand, when your organic listing is buried or absent, or when the brand term is broad enough to capture genuine consideration. The answer is account-specific and changes with the SERP.

What to actually do

— Run a brand-term pause test (geo-split or time-split) rather than trusting attributed ROAS. It's one of the cheapest, highest-leverage experiments available.
— Re-test when competitors start or stop bidding on your brand, since incrementality is conditional on the auction.
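The arithmetic behind a pause test, as a sketch. It assumes the on and off periods (or geos) are otherwise comparable, with no trend or seasonal gap between them; all click counts and the spend figure are hypothetical.

```python
def brand_incrementality(paid_clicks_on, organic_clicks_on, organic_clicks_off):
    """Share of paid brand clicks that did not simply replace organic clicks."""
    total_on = paid_clicks_on + organic_clicks_on
    incremental = total_on - organic_clicks_off
    return incremental / paid_clicks_on


paid_clicks = 1000
spend = 500.0  # brand campaign cost while ads were on

share = brand_incrementality(
    paid_clicks_on=paid_clicks, organic_clicks_on=400, organic_clicks_off=1300
)
nominal_cpc = spend / paid_clicks
cost_per_incremental_click = spend / (share * paid_clicks)
print(f"incremental share of paid clicks: {share:.0%}")
print(f"nominal CPC {nominal_cpc:.2f} vs cost per incremental click "
      f"{cost_per_incremental_click:.2f}")
```

If organic absorbs most of the paused traffic, the true price of each incremental click is a multiple of the CPC in the report.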

Bottom line for practitioners: branded search ROAS is where correlation masquerades most convincingly as causation. The pause test is the only honest way to know what you're really buying.