Credit Where Due
Deep, careful dives into marketing attribution — the models, the studies, the math, and why your 'last click' is lying to you.
Why does your lift test 'reach significance' if you just keep watching it?

The question: you're running an incrementality test, checking the dashboard daily, and one morning it crosses statistical significance. You call it and ship. Did you find a real effect — or did you find the inevitable consequence of looking many times?

What the statistics say: this is peeking (or optional stopping), and it's one of the most common ways attribution experiments lie. A fixed-sample significance test assumes you look once, at a pre-planned sample size. Every additional peek gives the noise another chance to cross the threshold. Check daily for a month and your real false-positive rate isn't 5%; simulations typically put it in the 20-30% range, and it keeps climbing the more often you look. The test didn't get more sensitive; you gave randomness more shots at the goal.
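A minimal simulation sketch of that inflation, under simplifying assumptions that are ours rather than the post's: no true effect, equal-sized daily batches, and a z-statistic checked against the usual 1.96 threshold once per day for 30 days. The function name and parameters are illustrative.

```python
import math
import random

def simulate_peeking(n_tests=5000, n_days=30, z_crit=1.96, seed=0):
    """Simulate A/A tests (no true effect) that are checked once per day.

    With equal-sized daily batches, the cumulative z-statistic on day t
    under the null is a sum of t standard normal increments divided by
    sqrt(t). This is a modelling shortcut, not a full user-level simulation.
    """
    rng = random.Random(seed)
    crossed_any_day = 0   # "significant" on at least one daily look
    crossed_last_day = 0  # "significant" at the single planned look
    for _ in range(n_tests):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for day in range(1, n_days + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(day)
            if abs(z) > z_crit:
                crossed = True
        crossed_any_day += crossed
        crossed_last_day += abs(z) > z_crit
    return crossed_any_day / n_tests, crossed_last_day / n_tests

peeking_rate, fixed_rate = simulate_peeking()
print(f"false-positive rate, stop at first daily 'win': {peeking_rate:.1%}")
print(f"false-positive rate, one pre-planned look:      {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the stop-at-first-win rate comes out several times higher even though nothing real is being measured.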

Why it's epidemic in marketing: ad platforms surface live 'significance' indicators that practically invite peeking, and the pressure to call a winner early is enormous. The result is a literature of 'wins' that don't replicate — classic correlation (a lucky run) mistaken for causation (a real effect).

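The nuance: you can look continuously, if you use methods built for it. Sequential methods (group-sequential designs with alpha-spending boundaries, always-valid p-values) adjust the threshold for repeated looks. Bayesian approaches with proper priors and a pre-specified stopping rule are another option, though they don't by themselves guarantee a 5% false-positive rate. The sin isn't watching; it's watching with a one-look test.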

What to actually do: pre-register your sample size and analysis date, or switch to a sequential/always-valid framework explicitly. Don't stop on the first green light.

Bottom line for practitioners: a lift test you peek at is a slot machine with a significance badge. Fix the stopping rule before you start, or your wins won't survive contact with reality.
Beyond A/B tests, what causal tools belong in a marketer's kit?

Most attribution discourse stops at randomized experiments and holdouts. The quasi-experimental toolkit from econometrics is underused in marketing and often fits situations where randomization isn't possible.

Three tools worth knowing

Difference-in-differences (DiD): compare the before/after change in a treated group against the before/after change in an untreated group. It nets out shared trends, so it's the backbone of most geo-style reads. Its core assumption, parallel trends absent treatment, can be probed in the pre-period (diverging pre-trends are a red flag) but never fully proven, because it concerns what would have happened after treatment. It must be checked, not assumed.
Regression discontinuity (RD): when treatment is assigned by a threshold (a loyalty tier at a spend cutoff, a free-shipping minimum), compare units just above and just below the line. They're nearly identical except for treatment, giving near-experimental credibility at the boundary.
Instrumental variables (IV): when something randomly nudges exposure without directly affecting the outcome (an ad-server delivery hiccup, a platform auction quirk), it can isolate causal effect from confounding. Powerful but fragile — a weak or invalid instrument quietly reintroduces the bias it was meant to remove.
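Of the three, DiD is the easiest to show in a few lines. The sketch below uses made-up weekly conversion counts for a treated and a control group; the numbers and the function name are hypothetical, and a real analysis would add standard errors and a pre-trend check.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical weekly conversions, four weeks before and after launch.
treated_pre = [100, 104, 98, 102]
treated_post = [120, 124, 118, 122]
control_pre = [80, 82, 78, 80]
control_post = [88, 90, 86, 88]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"DiD estimate: {effect} extra weekly conversions")
```

The treated group rose by 20 per week, but the control rose by 8 without any campaign, so only the remaining 12 is attributed to treatment, and only if the two groups would otherwise have moved in parallel.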

The nuance

These aren't magic. Each rests on assumptions that fail silently: parallel trends, no manipulation around the cutoff, instrument validity. Used carelessly, they produce confident causal claims with no more warrant than the correlation they replaced.

What to actually do

— Reach for DiD when you have a natural treated/control split and a clean pre-period.
— Look for RD anywhere a threshold drives treatment; these moments are causal gold and routinely ignored.
— State and test the identifying assumption out loud before believing the estimate.

Bottom line for practitioners: randomized tests are the gold standard, but quasi-experimental methods extract causation from situations you can't randomize. Their power is entirely contingent on assumptions you must verify, not invoke.
Myth: kill the lowest-attributed channel and reallocate its budget for free upside

The question: your attribution report ranks channels by credited conversions; the bottom one looks like dead weight. Cut it, move the money up the list, and gain efficiency — sound logic?

Why it backfires: attributed credit measures observed last-mile (or model-weighted) presence, not marginal causal contribution. A low-credited channel is often an upper-funnel one whose effect shows up later, under a different touch that grabbed the credit. Cutting it can collapse demand that the high-credited channels were merely harvesting. Several documented cases of pausing 'low-value' prospecting or upper-funnel video saw the supposedly-strong brand and retargeting channels fall too — because there was less demand for them to capture. The credit moved; the conversions didn't.

The nuance: this is the harvesting-versus-generating distinction. Attribution conflates the two — it credits whoever stood closest to the conversion, which is usually the harvester. Reallocating from generators to harvesters can look efficient for a few weeks and then erode the funnel that fed everything. The interaction is exactly what observational credit-splitting cannot see.

Bottom line for practitioners: before cutting any channel on attribution alone, run a real holdout on it (pause in matched geos and watch total conversions, not just its own attributed ones). If total conversions hold, it was harvestable and you can trim. If total conversions drop even though its attributed credit was low, you just found a generator — and the model was lying to you about it.


When Shapley and Markov disagree, who's right?

A question worth sitting with, because both are sold as the rigorous alternative to last-click. They are not interchangeable.

What the methods actually do

— Shapley value borrows from cooperative game theory: it asks how much each channel adds on average across every possible ordering of the channels in a path. It distributes credit by marginal contribution.
— Markov chains model the journey as transition probabilities between states. Credit comes from the removal effect: delete a channel, see how much the conversion probability drops.
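To make the Shapley side concrete, here is a small sketch that averages each channel's marginal contribution over every ordering. The three channels and the conversions assigned to each channel combination are invented for illustration; in practice that table has to be estimated from path data, which is where most of the difficulty lives.

```python
from itertools import permutations

def shapley_values(channels, coalition_value):
    """Average marginal contribution of each channel over all orderings."""
    totals = {channel: 0.0 for channel in channels}
    orderings = list(permutations(channels))
    for order in orderings:
        present = frozenset()
        for channel in order:
            with_channel = present | {channel}
            totals[channel] += coalition_value[with_channel] - coalition_value[present]
            present = with_channel
    return {channel: total / len(orderings) for channel, total in totals.items()}

# Hypothetical conversions observed for each combination of channels.
coalition_value = {
    frozenset(): 0,
    frozenset({"search"}): 40,
    frozenset({"display"}): 10,
    frozenset({"email"}): 20,
    frozenset({"search", "display"}): 60,
    frozenset({"search", "email"}): 70,
    frozenset({"display", "email"}): 35,
    frozenset({"search", "display", "email"}): 100,
}

credit = shapley_values(["search", "display", "email"], coalition_value)
for channel, value in credit.items():
    print(f"{channel}: {value:.1f}")
```

The credits always sum to the value of the full channel set (100 here), which is the property that makes Shapley attractive for splitting a total.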

The nuance

Shapley is order-agnostic by construction — it averages permutations, so it tends to flatten sequencing. Markov keeps the path structure, so a channel that's a frequent bridge between other touches scores higher than its raw frequency suggests. On the same dataset they routinely disagree by 15-30% on a given channel's share, especially for mid-funnel display and retargeting.
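The Markov removal effect can be sketched on a toy chain. The transition probabilities below are invented, the state names "start", "conversion" and "null" are our own convention, and a removed channel is treated as a dead end (traffic sent to it is lost).

```python
def conversion_probability(transitions, removed=None, n_iter=200):
    """Probability of eventually reaching 'conversion' from 'start'.

    Solved by repeated substitution, which converges for absorbing chains.
    A removed channel keeps probability 0, so paths through it are lost.
    """
    prob = {state: 0.0 for state in transitions}
    prob["conversion"] = 1.0
    prob["null"] = 0.0
    for _ in range(n_iter):
        for state, edges in transitions.items():
            if state == removed:
                continue
            prob[state] = sum(p * prob[nxt] for nxt, p in edges.items())
    return prob["start"]

# Hypothetical transition probabilities estimated from journeys.
transitions = {
    "start": {"display": 0.6, "search": 0.4},
    "display": {"search": 0.5, "conversion": 0.2, "null": 0.3},
    "search": {"conversion": 0.5, "null": 0.5},
}

base = conversion_probability(transitions)
removal_effect = {
    channel: 1 - conversion_probability(transitions, removed=channel) / base
    for channel in ("display", "search")
}
total = sum(removal_effect.values())
shares = {channel: effect / total for channel, effect in removal_effect.items()}
print(base, removal_effect, shares)
```

In this toy chain, search earns a large removal effect partly because display paths route through it, which is the bridge behaviour described above.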

Neither is measuring causation. Both are sophisticated rules for splitting observed credit among channels that co-occur with conversions. A channel can earn high removal-effect credit purely because converters happen to pass through it — correlation dressed in matrix algebra.

What to actually do

— Run both. Where they agree, you have a robust read. Where they diverge sharply, that's your flag to investigate, not to average.
— Treat divergence on a channel as a hypothesis to test with a holdout, not a number to trust.

Bottom line for practitioners: model disagreement is signal, not noise. Use Shapley and Markov as a consistency check on each other, and reserve causal claims for experiments.
Why does your attribution model say one thing and your geo holdout say another?

Because they answer different questions, and conflating them is the most expensive mistake in measurement.

Two distinct questions

— Attribution asks: among the conversions that happened, how should credit be divided across touchpoints? It is a descriptive accounting exercise over observed paths.
— Incrementality asks: how many of those conversions would NOT have happened without this spend? That is a causal question, answerable only by comparing a treated group to a counterfactual.

What the studies show

Incrementality tests — geo holdouts, randomized PSA experiments, ghost-bid auctions — repeatedly find that channels rich in captured demand (branded search, retargeting, lower-funnel social) are heavily over-credited by attribution models. Meta's own conversion-lift work and multiple academic PSA studies show platform-reported conversions running several multiples above true incremental lift in many accounts.

The nuance

This does not make attribution useless. Attribution is a fast, daily, granular allocation tool. Incrementality is slow, coarse, and expensive but causally honest. They operate at different cadences and resolutions.

What to actually do

— Use incrementality to calibrate your attribution model: derive scaling factors per channel from experiments, then apply them to your daily attributed numbers.
— Re-run calibration quarterly; the scaling factors drift as saturation and creative fatigue shift.
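A minimal sketch of that calibration step, assuming you have attributed and experimentally measured incremental conversions for the same test period. All channel names and numbers are hypothetical, and a single ratio per channel is the simplest possible scaling rule.

```python
def calibration_factors(attributed, incremental):
    """Per-channel ratio of experimentally measured to attributed conversions."""
    return {
        channel: incremental[channel] / attributed[channel]
        for channel in attributed
        if attributed[channel] > 0 and channel in incremental
    }

def calibrate(daily_attributed, factors):
    """Scale daily attributed conversions; untested channels keep factor 1.0."""
    return {
        channel: conversions * factors.get(channel, 1.0)
        for channel, conversions in daily_attributed.items()
    }

# Hypothetical totals over the experiment window.
attributed = {"branded_search": 1000, "retargeting": 800, "prospecting": 300}
incremental = {"branded_search": 200, "retargeting": 240, "prospecting": 360}

factors = calibration_factors(attributed, incremental)
today = calibrate({"branded_search": 50, "retargeting": 40, "prospecting": 10}, factors)
print(factors)
print(today)
```

In this made-up example the daily report shows branded search far ahead, while the calibrated view puts all three channels at roughly the same incremental volume.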

Bottom line for practitioners: attribution divides the pie, incrementality tells you how big the pie actually is. You need both, and you should never let the cheap one overrule the expensive one on questions of true value.
Why does last-click refuse to die?

It has been declared obsolete for fifteen years and still anchors most reporting. That persistence deserves an honest analysis rather than another dismissal.

The case against it

Last-click — assigning 100% of credit to the final touch before conversion — systematically over-rewards harvesting channels (branded search, retargeting, coupon sites) and starves demand-generation. It is, structurally, a bias toward whatever sits closest to the wallet.

Why it survives anyway

— It is deterministic and auditable. Every stakeholder can trace exactly why a conversion was credited. Multi-touch models, by contrast, are often black boxes that finance teams won't sign off on.
— It is stable. Probabilistic models reshuffle credit as their training data shifts, which makes period-over-period comparison maddening.
— It is consistent across the industry, so benchmarking against competitors and affiliate payouts stays comparable.

The nuance

Last-click is a bad attribution model but a defensible accounting convention. Affiliate payouts, for instance, need a deterministic rule everyone agrees on in advance: fairness and the ability to settle disputes matter more than causal accuracy there.

What to actually do

— Keep last-click for settlement and payout where determinism is the requirement.
— Add a parallel data-driven or incrementality-calibrated view for budget decisions, and state which layer governs which decision so the two never compete in the same report.

Bottom line for practitioners: don't kill last-click, demote it. Use it where auditability is the job, and use causal methods where allocation is the job.
Media mix modeling is back — but is your model measuring causation or just spend?

MMM (regressing aggregate outcomes like sales on aggregate marketing inputs over time) is enjoying a revival because it needs no user-level tracking. The revival comes with a quiet correlation trap.

How it works

You fit a model (historically linear regression, now often Bayesian, as in Google's Meridian, or regularized, as in Meta's Robyn, which uses ridge regression) that attributes sales to channels while controlling for price, seasonality, and macro factors. Two refinements matter: adstock (advertising effects decay over time, not instantly) and saturation curves (diminishing returns at higher spend).
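The two refinements can be sketched as simple transforms applied to spend before it enters the regression. Geometric carryover and a Hill curve are common choices, but they are our assumption here; the decay rate, half-saturation point and shape parameter are illustrative values that a real model would estimate.

```python
def geometric_adstock(spend, decay):
    """Carry a fraction `decay` of the previous period's effect forward."""
    carried = 0.0
    adstocked = []
    for amount in spend:
        carried = amount + decay * carried
        adstocked.append(carried)
    return adstocked

def hill_saturation(x, half_saturation, shape=1.0):
    """Diminishing-returns response in [0, 1); equals 0.5 at half_saturation."""
    return x**shape / (x**shape + half_saturation**shape)

# One burst of spend keeps working in later weeks, at a decaying rate.
weekly_spend = [100.0, 0.0, 0.0, 0.0]
adstocked = geometric_adstock(weekly_spend, decay=0.5)
print(adstocked)

# Doubling spend does not double the response.
print(hill_saturation(100.0, half_saturation=100.0))
print(hill_saturation(200.0, half_saturation=100.0))
```

Both transforms add parameters the model must estimate from the same correlated data, which is part of why the identification problem below matters.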

The core danger

MMM is observational. If you always raise TV spend in Q4 when demand is already high, the model happily credits TV for seasonal sales it didn't cause. This is textbook confounding: spend correlates with the very demand it's supposed to explain.

The nuance

Bayesian MMM doesn't fix this by itself — priors and regularization reduce overfitting but cannot manufacture causal identification out of correlated regressors. Garbage collinearity in, confident-looking ROI out.

What to actually do

— Inject experimental priors: feed incrementality-test results into the model as informative priors on channel effects. This is the single biggest credibility upgrade.
— Deliberately vary spend (flighting, regional pulses) to create the variation the model needs to identify effects.
— Report credible intervals, not point ROAS. A channel whose interval spans zero is not a finding.

Bottom line for practitioners: MMM's value scales with how much real experimental variation you feed it. Without calibration, it's an elegant way to launder seasonality into channel credit.
What's the cleanest way to measure ad incrementality — and why do most teams get the control group wrong?

The question is really about counterfactual quality. A bad control silently destroys an otherwise rigorous test.

The flawed default

Many teams compare exposed users to unexposed users. This is hopelessly confounded: the people the algorithm chose to show ads to differ systematically from those it skipped. Exposed users were targeted precisely because they were likely to convert. You measure selection, not effect.

Two designs that fix it

PSA / placebo tests: the control group is shown an unrelated charity ad in the slot your ad would have occupied. Now both groups passed the same auction and targeting filter; the only difference is creative.
Ghost ads: the platform logs who would have won the auction for your ad in the control group without serving it. This is cheaper than PSAs and removes the cost of buying placebo impressions, while preserving the same selection mechanism.

The nuance

Ghost ads require platform cooperation and are only as trustworthy as the platform running them — the same entity grading its own homework. PSAs cost real money but are auditable by you.

What to actually do

— Insist the control be defined by the auction, not by exposure. "Would-have-been-shown" is the only valid counterfactual.
— Power the test before running it; underpowered lift studies produce confidently wrong zeros.

Bottom line for practitioners: incrementality is won or lost in the control group. If your unexposed group wasn't selected by the same machinery as your exposed group, you're measuring the algorithm, not your ads.
Google's Data-Driven Attribution gives you a number. Should you trust the mechanism behind it?

Worth interrogating, because DDA is now the default in many accounts and most users couldn't describe what it computes.

What it claims to do

DDA uses a counterfactual approach loosely related to Shapley value: it compares paths that converted to paths that didn't, and assigns fractional credit based on each touchpoint's apparent contribution to conversion probability. Conceptually sound — far better than last-click.

The nuance you don't see

— It is still a within-platform, observational model. It can only weigh touchpoints it observes, and it sees a shrinking slice of the journey as cross-site signals erode.
— The training is opaque. You cannot inspect the feature weights, validate the counterfactual construction, or reproduce the numbers. For a model making budget decisions, that's a meaningful audit gap.
— It optimizes toward conversions the platform can attribute to itself, which structurally favors the platform's own surfaces.

What the broader evidence says

When DDA outputs are checked against geo holdouts, the directional read is often reasonable but the magnitudes drift — particularly inflating credit for branded and remarketing terms that capture existing demand.

What to actually do

— Use DDA for intra-platform optimization where it's strongest: relative bidding across keywords and audiences.
— Do not use it for cross-channel budget splits — that's outside its field of view.
— Sanity-check its biggest claims against an experiment at least twice a year.

Bottom line for practitioners: DDA is a good optimizer and a poor oracle. Trust it inside the platform's walls; verify it everywhere else.
Where did the 40-20-40 attribution split come from — and is there any science in it?

A fair question, because U-shaped and W-shaped models are everywhere and almost no one can defend the weights.

The models

U-shaped (position-based): 40% to first touch, 40% to last touch, 20% spread across the middle.
W-shaped: adds a weighted bump for the lead-creation or opportunity stage, common in long B2B cycles.
Time-decay: credit grows exponentially toward touches nearer the conversion, governed by a half-life you pick.
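As a rough sketch, the U-shaped and time-decay rules reduce to a few lines of arithmetic. The function names, the handling of one- and two-touch paths, and the example path are our own assumptions; platforms differ in those edge cases.

```python
def position_based_weights(n_touches, first=0.4, last=0.4):
    """U-shaped credit: fixed shares for first and last, rest split evenly."""
    if n_touches == 1:
        return [1.0]
    if n_touches == 2:
        return [first / (first + last), last / (first + last)]
    middle = (1.0 - first - last) / (n_touches - 2)
    return [first] + [middle] * (n_touches - 2) + [last]

def time_decay_weights(days_before_conversion, half_life_days):
    """Credit halves for every `half_life_days` of distance from conversion."""
    raw = [0.5 ** (days / half_life_days) for days in days_before_conversion]
    total = sum(raw)
    return [weight / total for weight in raw]

u_shaped = position_based_weights(4)
decayed = time_decay_weights([14, 7, 0], half_life_days=7)
print(u_shaped)
print(decayed)
```

Changing `first`, `last` or `half_life_days` changes every channel's credit without touching the data, which is the sense in which these weights are chosen rather than measured.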

The uncomfortable truth

These weights are heuristics, not measurements. The 40-20-40 split encodes a belief — that introduction and closing matter most — but it is not estimated from your data. Two practitioners can pick different half-lives and produce opposite channel rankings from identical paths.

The nuance

That doesn't make rule-based models worthless. They are transparent, stable, and impose a deliberate prior. A time-decay model with a sensibly chosen half-life can be a reasonable default for a short, impulse-driven funnel. The error is treating a chosen heuristic as a discovered fact.

What to actually do

— Pick a rule-based model as a communication and stability layer, and document that the weights are assumptions.
— If you want the weights to reflect reality, you need a data-driven model or an experiment — not a prettier curve.

Bottom line for practitioners: position-based models are opinions made of arithmetic. Useful as a transparent default, dangerous when mistaken for evidence about how your customers actually behave.
Should a view-through conversion count at all?

It's one of the most consequential definitional choices in display and video, and it's usually made by an ad platform on your behalf.

What it is

A view-through conversion (VTC) credits an impression that was served but not clicked, when the user later converts within a lookback window. It is the entire justification for upper-funnel display and most programmatic video budgets.

Why it's contested

The central problem is unfalsifiable credit. If you served an impression to someone already likely to buy, VTC will record a "conversion" whether or not the ad mattered. With wide windows and broad targeting, you can manufacture impressive VTC counts by spraying impressions at high-intent audiences — pure correlation, formatted as performance.

What experiments find

Ghost-ad and PSA studies consistently show view-through lift far smaller than reported VTCs, and sometimes statistically indistinguishable from zero — especially for retargeting display, where the audience was going to convert anyway.

The nuance

View-through effects are real for genuine prospecting against cold audiences; they're largely illusory for warm retargeting. The metric isn't worthless, but it's the most over-claimed number in the stack.

What to actually do

— Never report VTCs at full face value next to click conversions; they are not the same evidentiary tier.
— Validate any meaningful VTC line against a holdout before scaling spend on it.
— Shorten lookback windows for retargeting to limit demand-capture inflation.

Bottom line for practitioners: view-through is where attribution is most easily gamed, often unintentionally. Treat every VTC as a hypothesis until an experiment says otherwise.
How much does your attribution depend on a number nobody debates — the lookback window?

More than almost any model choice, and it's usually set once and forgotten.

The mechanic

The lookback (or conversion) window defines how far back from a conversion you'll look for qualifying touchpoints — 1 day, 7 days, 28 days, 90 days. Every touch outside it is invisible to your model.

Why it quietly dominates results

Window length acts as a hidden weighting on channel mix. A short window structurally favors lower-funnel, click-heavy channels (search, retargeting) and erases upper-funnel touches that happened earlier. A long window inflates the credit of any always-on channel that's simply present in most journeys. You can change which channel "wins" by editing one dropdown, with no change to actual behavior.

The nuance

There is no universally correct window — it should match your real purchase-consideration cycle. An impulse e-commerce buy and a six-month B2B deal cannot share a 7-day window honestly. Mismatched windows are a major reason cross-channel reports contradict each other.

What to actually do

— Derive the window empirically: look at the distribution of time-to-conversion across your actual paths, and set it near the 80th-90th percentile of genuine consideration time.
— Run sensitivity analysis: rerun your model at 7, 14, 30, 90 days and see which conclusions survive. Conclusions that flip with the window aren't robust.
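A small sketch of both steps, using an invented list of days from first touch to conversion. The 85th percentile is just one point in the suggested range, and "coverage" here means the share of conversions whose whole path fits inside a candidate window, which is our simplification of a full model rerun.

```python
from statistics import quantiles

def empirical_window(days_to_convert, percentile=85):
    """Window length (days) covering the given percentile of journeys."""
    cut_points = quantiles(days_to_convert, n=100, method="inclusive")
    return cut_points[percentile - 1]

def coverage(days_to_convert, window_days):
    """Share of conversions whose first touch falls inside the window."""
    inside = sum(1 for days in days_to_convert if days <= window_days)
    return inside / len(days_to_convert)

# Hypothetical days between first touch and conversion.
days_to_convert = [0, 0, 1, 1, 2, 3, 3, 5, 6, 8, 10, 12, 14, 21, 30, 45]

window = empirical_window(days_to_convert)
print(f"suggested window: about {window:.0f} days")
for candidate in (7, 14, 30, 90):
    print(candidate, f"{coverage(days_to_convert, candidate):.0%}")
```

If a channel's ranking changes between the candidate windows, that is the instability the sensitivity analysis is meant to surface.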

Bottom line for practitioners: the lookback window is a modeling assumption disguised as a setting. Test its sensitivity before you trust any channel ranking built on top of it.
Why do so many geo holdout tests come back "inconclusive"?

Usually not because the channel doesn't work — because the experiment was statistically dead on arrival. Geo testing lives or dies on design, not execution.

The setup

A geo holdout splits regions into treatment (gets the campaign) and control (doesn't), then compares outcomes. When regions are assigned at random, it's a randomized experiment at the market level, which sidesteps user-level tracking entirely. That is its great advantage in a privacy-constrained world.

The three things that kill it

Too few geos. Markets are heterogeneous and few in number. With a handful of regions, between-market variance swamps any treatment effect. You need enough units, and matched ones.
Underpowered spend. If the incremental lift you're hunting is 3% and your test can only detect 10%, a null result tells you nothing. Power analysis must precede the test, not follow the disappointment.
Spillover. Neighboring or online-overlapping markets contaminate the control, biasing the effect toward zero.

The nuance

Matched-market methods and synthetic control (constructing a weighted blend of control geos that mimics the treated region's pre-period trend) materially improve power without more markets. They're the difference between a usable read and noise.

What to actually do

— Run a power calculation first; if the minimum detectable effect exceeds your plausible lift, don't run it.
— Use synthetic control rather than raw region averages.
— Hold the test long enough to clear adstock decay, not just the flight.
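A back-of-envelope power check can be sketched with the textbook two-sample formula. This assumes equal numbers of geos per arm, roughly normal outcomes, and no variance reduction from matching or synthetic control, so it is a pessimistic first pass rather than a full design tool; the 15% geo-to-geo variation is a made-up input.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(sd, n_per_arm, alpha=0.05, power=0.8):
    """Smallest true difference in means detectable with the given power.

    Uses the normal approximation for a two-sided, two-sample comparison
    with equal arm sizes and a common standard deviation `sd`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd * sqrt(2 / n_per_arm)

# Hypothetical: sales per geo vary by 15% of the mean across markets.
relative_sd = 0.15
mde_10 = minimum_detectable_effect(relative_sd, n_per_arm=10)
mde_40 = minimum_detectable_effect(relative_sd, n_per_arm=40)
print(f"10 geos per arm: MDE is about {mde_10:.1%} lift")
print(f"40 geos per arm: MDE is about {mde_40:.1%} lift")
```

With these inputs, neither design could reliably detect a 3% lift, so a null result would say nothing about the channel. Quadrupling the number of geos only halves the detectable effect, which is why variance reduction matters more than raw market count.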

Bottom line for practitioners: an inconclusive geo test is usually a design failure, not a verdict on the channel. Power it, match it, and protect the control from spillover before you spend a dollar.
Can a channel look profitable in every segment yet unprofitable overall — or the reverse?

Yes, and it happens more in attribution data than practitioners realize. The culprit is Simpson's paradox, and it quietly corrupts channel decisions.

The phenomenon

Simpson's paradox is when an association that holds within every subgroup reverses or vanishes when the subgroups are pooled — because a lurking variable is distributed unevenly across them.

How it bites attribution

Suppose retargeting shows strong ROAS in aggregate. Split by audience: it's mediocre for genuinely new prospects and inflated only among already-intending buyers, who happen to dominate the pool. The aggregate number is real arithmetic but a false guide, because the mix is doing the work, not the channel. Conversely, a prospecting channel can look weak overall while being your only true demand-creator within the cold-audience segment that actually matters.
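A worked example with invented numbers shows how the mix produces the reversal. Each cell holds (users reached, conversions); the channel and segment labels are hypothetical, and conversion rate stands in for whatever efficiency metric you report.

```python
def conversion_rate(cells):
    """Pooled conversion rate over an iterable of (users, conversions) cells."""
    cells = list(cells)
    users = sum(u for u, _ in cells)
    conversions = sum(c for _, c in cells)
    return conversions / users

# Hypothetical (users, conversions) by channel and audience temperature.
data = {
    "prospecting": {"warm": (1000, 60), "cold": (9000, 180)},
    "retargeting": {"warm": (9000, 450), "cold": (1000, 15)},
}

for segment in ("warm", "cold"):
    for channel in data:
        rate = conversion_rate([data[channel][segment]])
        print(f"{segment:>4} {channel:<12} {rate:.2%}")

pooled = {channel: conversion_rate(data[channel].values()) for channel in data}
for channel, rate in pooled.items():
    print(f"pooled {channel:<12} {rate:.2%}")
```

Prospecting converts better in both the warm and the cold segment, yet retargeting wins the pooled comparison because nine-tenths of its audience is warm.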

The nuance

This is the same disease as confounding, surfaced through aggregation. Blended ROAS, blended CPA, blended attribution credit — any pooled metric over heterogeneous audiences is a candidate for reversal. The fix isn't more decimals; it's the right conditioning variable.

What to actually do

— Always decompose channel metrics by audience temperature (cold prospect vs warm vs existing customer) before judging them.
— Be suspicious of any blended metric that aggregates over groups with very different baseline conversion rates.
— When subgroup and aggregate disagree, the subgroup view conditioned on the confounder is usually the decision-relevant one.

Bottom line for practitioners: the most dangerous attribution numbers are the cleanest-looking aggregates. Condition on audience intent before you reallocate, or the mix will fool you.
How much of your branded search spend is actually buying you conversions you'd have gotten for free?

A pointed question, because branded search is consistently the most over-credited line in any attribution report — and one of the most testable.

The setup

When someone searches your brand name, they've already decided to seek you out. Bidding on that term means paying for a click that, in many cases, would have arrived organically just below the ad.

What experiments have found

The landmark evidence is eBay's large-scale search experiment, which paused paid search and found that the vast majority of attributed sales were not incremental — most clicks substituted for free organic visits. Subsequent brand-keyword pause tests across many advertisers replicate the pattern: incremental lift on pure branded terms is often low and sometimes near zero, while last-click cheerfully assigns it a glowing ROAS.

The nuance

It is not universally zero. Branded incrementality rises when competitors bid on your brand, when your organic listing is buried or absent, or when the brand term is broad enough to capture genuine consideration. The answer is account-specific and changes with the SERP.

What to actually do

— Run a brand-term pause test (geo-split or time-split) rather than trusting attributed ROAS. It's one of the cheapest, highest-leverage experiments available.
— Re-test when competitors start or stop bidding on your brand, since incrementality is conditional on the auction.

Bottom line for practitioners: branded search ROAS is where correlation masquerades most convincingly as causation. The pause test is the only honest way to know what you're really buying.