Why does your lift test 'reach significance' if you just keep watching it?
The question: you're running an incrementality test, checking the dashboard daily, and one morning it crosses statistical significance. You call it and ship. Did you find a real effect — or did you find the inevitable consequence of looking many times?
What the statistics say: this is peeking (or optional stopping), and it's one of the most common ways attribution experiments lie. A fixed-sample significance test assumes you look once, at a pre-planned sample size. Every additional peek gives the noise another chance to cross the threshold. Check daily for a month and your real false-positive rate isn't 5%: simulations of a no-effect test typically put it somewhere in the 20-30% range, and it keeps rising the more often you look. The test didn't get more sensitive; you gave randomness more shots at the goal.
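A quick simulation makes the inflation concrete. The sketch below is an illustration under simplifying assumptions, not a model of any particular platform: there is no true effect, data arrives in equal-sized daily batches, and every look applies a two-sided z-test at the usual 1.96 threshold. The function name and its parameters are hypothetical.

```python
import math
import random

def peeking_false_positive_rate(n_looks, n_sims=4000, z_crit=1.96, seed=0):
    """Share of null experiments (no true effect) that cross |z| > z_crit
    at ANY of n_looks equally spaced looks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each batch adds one standard-normal increment,
            # so the cumulative z-statistic at this look is sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / math.sqrt(look)) > z_crit:
                false_positives += 1
                break  # the analyst "calls it" at the first significant look
    return false_positives / n_sims

single_look = peeking_false_positive_rate(n_looks=1)
daily_for_a_month = peeking_false_positive_rate(n_looks=30)
print(f"one planned look: {single_look:.3f}")
print(f"30 daily looks:   {daily_for_a_month:.3f}")
```

With one planned look the rate sits near the nominal 5%; with 30 looks it comes out several times higher, even though nothing about the underlying data has changed.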
Why it's epidemic in marketing: ad platforms surface live 'significance' indicators that practically invite peeking, and the pressure to call a winner early is enormous. The result is a literature of 'wins' that don't replicate: noise (a lucky run) mistaken for signal (a real effect).
The nuance: you can look continuously, if you use methods built for it. Group-sequential designs and always-valid p-values adjust the threshold for repeated looks. Bayesian approaches with a proper prior and a pre-specified decision rule are another route, though on their own they do not guarantee a frequentist false-positive rate. The sin isn't watching; it's watching with a one-look test.
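One simple and deliberately conservative way to see such an adjustment at work is to split the 5% error budget evenly across a pre-planned number of looks (a Bonferroni correction). This is a stand-in for illustration, not a proper group-sequential boundary such as Pocock or O'Brien-Fleming, which spend the same budget less wastefully. The simulation assumes no true effect, equal-sized batches, and a two-sided z-test; all names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def bonferroni_z_threshold(n_looks, alpha=0.05):
    """Two-sided z threshold when alpha is split evenly across n_looks looks."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_looks))

def false_positive_rate(n_looks, z_crit, n_sims=4000, seed=1):
    """Share of null experiments that cross |z| > z_crit at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one batch of null data
            if abs(running_sum / math.sqrt(look)) > z_crit:
                hits += 1
                break  # stop at the first "significant" look
    return hits / n_sims

looks = 30
adjusted_threshold = bonferroni_z_threshold(looks)
naive_rate = false_positive_rate(looks, z_crit=1.96)
adjusted_rate = false_positive_rate(looks, z_crit=adjusted_threshold)
print(f"threshold per look:        {adjusted_threshold:.2f}")
print(f"naive 1.96 at every look:  {naive_rate:.3f}")
print(f"Bonferroni-adjusted looks: {adjusted_rate:.3f}")
```

The adjusted rule keeps the overall false-positive rate at or below 5% no matter which look triggers the stop. The price is a much higher bar at each look, which is why purpose-built sequential boundaries are usually preferred in practice.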
What to actually do: pre-register your sample size and analysis date, or switch to a sequential/always-valid framework explicitly. Don't stop on the first green light.
Bottom line for practitioners: a lift test you peek at is a slot machine with a significance badge. Fix the stopping rule before you start, or your wins won't survive contact with reality.
Beyond A/B tests, what causal tools belong in a marketer's kit?
Most attribution discourse stops at randomized experiments and holdouts. The quasi-experimental toolkit from econometrics is underused in marketing and often fits situations where randomization isn't possible.
Three tools worth knowing
— Difference-in-differences (DiD): compare the before/after change in a treated group against the before/after change in an untreated group. It nets out shared trends, so it's the backbone of most geo-style reads. Its core assumption, parallel trends absent treatment, can never be proven outright, but pre-period trends can be inspected and must be checked, not assumed.
— Regression discontinuity (RD): when treatment is assigned by a threshold (a loyalty tier at a spend cutoff, a free-shipping minimum), compare units just above and just below the line. They're nearly identical except for treatment, giving near-experimental credibility at the boundary.
— Instrumental variables (IV): when something randomly nudges exposure without directly affecting the outcome (an ad-server delivery hiccup, a platform auction quirk), it can isolate causal effect from confounding. Powerful but fragile — a weak or invalid instrument quietly reintroduces the bias it was meant to remove.
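To make the DiD arithmetic concrete, here is a minimal sketch using made-up weekly conversion counts for one treated and one untreated geo. The numbers and the function name are hypothetical, and the estimate only means something if the parallel-trends assumption holds.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # Whatever changed in control is assumed to be the shared trend.
    return treated_change - control_change

# Hypothetical weekly conversions, four weeks before and four weeks after launch.
treated_pre = [100, 104, 98, 102]
treated_post = [121, 119, 125, 123]
control_pre = [80, 82, 79, 83]
control_post = [88, 90, 87, 91]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"estimated incremental conversions per week: {effect}")
```

The treated geo rose by 21 conversions a week and the control geo by 8, so the DiD estimate is 13: the part of the treated change that the shared trend does not explain.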
The nuance
These aren't magic. Each rests on assumptions that fail silently: parallel trends, no manipulation around the cutoff, instrument validity. Used carelessly, they produce confident causal claims with no more warrant than the correlation they replaced.
What to actually do
— Reach for DiD when you have a natural treated/control split and a clean pre-period.
— Look for RD anywhere a threshold drives treatment; these moments are causal gold and routinely ignored.
— State and test the identifying assumption out loud before believing the estimate.
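For DiD, one way to test the assumption out loud is to track the treated-minus-control gap across pre-period weeks: if trends are parallel, the gap should stay roughly flat. The sketch below fits a least-squares slope to that gap. The data and names are hypothetical, and a flat pre-period gap supports parallel trends without proving they would have continued.

```python
def gap_slope(treated, control):
    """Least-squares slope of the (treated - control) gap over time."""
    gaps = [t - c for t, c in zip(treated, control)]
    n = len(gaps)
    xs = range(n)
    x_mean = sum(xs) / n
    gap_mean = sum(gaps) / n
    numerator = sum((x - x_mean) * (g - gap_mean) for x, g in zip(xs, gaps))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical pre-period weekly conversions.
control = [80, 83, 86, 89, 92]
parallel_treated = [100, 103, 106, 109, 112]   # gap stays at 20
diverging_treated = [100, 105, 110, 115, 120]  # gap widens by 2 per week

flat = gap_slope(parallel_treated, control)
drift = gap_slope(diverging_treated, control)
print(f"parallel case slope:  {flat:.2f}")
print(f"diverging case slope: {drift:.2f}")
```

A slope near zero is what you want to see before trusting the estimate. A clearly non-zero slope means the groups were already drifting apart before treatment, and a DiD read would absorb that drift into the "effect".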
Bottom line for practitioners: randomized tests are the gold standard, but quasi-experimental methods extract causation from situations you can't randomize. Their power is entirely contingent on assumptions you must verify, not invoke.
Myth: kill the lowest-attributed channel and reallocate its budget for free upside
The question: your attribution report ranks channels by credited conversions; the bottom one looks like dead weight. Cut it, move the money up the list, and gain efficiency — sound logic?
Why it backfires: attributed credit measures observed last-mile (or model-weighted) presence, not marginal causal contribution. A low-credited channel is often an upper-funnel one whose effect shows up later, under a different touch that grabbed the credit. Cutting it can collapse demand that the high-credited channels were merely harvesting. Several documented cases of pausing 'low-value' prospecting or upper-funnel video saw the supposedly-strong brand and retargeting channels fall too, because there was less demand for them to capture. The credit had been booked to the harvesters, but the demand was coming from the channel that got cut.
The nuance: this is the harvesting-versus-generating distinction. Attribution conflates the two — it credits whoever stood closest to the conversion, which is usually the harvester. Reallocating from generators to harvesters can look efficient for a few weeks and then erode the funnel that fed everything. The interaction is exactly what observational credit-splitting cannot see.
Bottom line for practitioners: before cutting any channel on attribution alone, run a real holdout on it (pause in matched geos and watch total conversions, not just its own attributed ones). If total conversions hold, it was harvestable and you can trim. If total conversions drop even though its attributed credit was low, you just found a generator — and the model was lying to you about it.
—
More on marketing mix modeling: @measurement_brand_aff
When Shapley and Markov disagree, who's right?
A question worth sitting with, because both are sold as the rigorous alternative to last-click. They are not interchangeable.
What the methods actually do
— Shapley value borrows from cooperative game theory: it asks how much each channel adds on average across every possible ordering of the channels in a path. It distributes credit by marginal contribution.
— Markov chains model the journey as transition probabilities between states. Credit comes from the removal effect: delete a channel, see how much the conversion probability drops.
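The Shapley calculation can be sketched in a few lines. This is an illustration under one common simplifying assumption, not the method of any specific vendor: the "value" of a set of channels is the number of conversions from journeys that used only channels in that set. The data and function name are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_credit(conversions_by_set):
    """Shapley credit per channel, given conversions keyed by the set of channels used."""
    channels = sorted(set().union(*conversions_by_set))
    n = len(channels)

    def value(coalition):
        # Assumption: a coalition earns the conversions from journeys
        # that touched only channels inside it.
        return sum(c for used, c in conversions_by_set.items() if used <= coalition)

    credit = {}
    for channel in channels:
        others = [c for c in channels if c != channel]
        total = 0.0
        for size in range(n):
            # Probability that exactly `size` other channels precede this one
            # in a random ordering, spread over the subsets of that size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {channel}) - value(coalition))
        credit[channel] = total
    return credit

# Hypothetical conversions by channel set.
conversions_by_set = {
    frozenset({"search"}): 10,
    frozenset({"display"}): 2,
    frozenset({"search", "display"}): 8,
}
credit = shapley_credit(conversions_by_set)
print(credit)
```

Here search earns 14 and display 6, which together account for all 20 conversions. Note that the journey order never enters the calculation, only which channels appeared.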
The nuance
Shapley is order-agnostic by construction — it averages permutations, so it tends to flatten sequencing. Markov keeps the path structure, so a channel that's a frequent bridge between other touches scores higher than its raw frequency suggests. On the same dataset they routinely disagree by 15-30% on a given channel's share, especially for mid-funnel display and retargeting.
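The Markov removal effect can be sketched the same way. This is a minimal first-order chain built from hypothetical journeys; real implementations differ in chain order, handling of repeated touches, and how the chain is solved. All names here are assumptions for illustration.

```python
from collections import defaultdict

def transition_probs(paths):
    """First-order transition probabilities between start, channels, conv and null."""
    counts = defaultdict(lambda: defaultdict(int))
    for touches, converted in paths:
        states = ["start", *touches, "conv" if converted else "null"]
        for here, nxt in zip(states, states[1:]):
            counts[here][nxt] += 1
    return {
        here: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for here, row in counts.items()
    }

def conversion_prob(probs, removed=None, sweeps=500):
    """P(reach conv from start); transitions into `removed` are treated as lost."""
    value = defaultdict(float)  # states not yet solved, and null, count as 0
    value["conv"] = 1.0
    for _ in range(sweeps):  # plain fixed-point iteration, fine for small chains
        for here, row in probs.items():
            if here != removed:
                value[here] = sum(p * value[nxt] for nxt, p in row.items() if nxt != removed)
    return value["start"]

def removal_effects(paths):
    """Relative drop in conversion probability when each channel is deleted."""
    probs = transition_probs(paths)
    base = conversion_prob(probs)
    channels = [state for state in probs if state != "start"]
    return {ch: 1 - conversion_prob(probs, removed=ch) / base for ch in channels}

# Hypothetical journeys: (ordered touches, converted?)
paths = [
    (["search"], True),
    (["display", "search"], True),
    (["display"], False),
    (["social", "display", "search"], True),
    (["social"], False),
]
effects = removal_effects(paths)
total = sum(effects.values())
shares = {ch: effect / total for ch, effect in effects.items()}
print(effects)
print(shares)
```

In this toy data every converting journey ends on search, so deleting search wipes out all conversions and its removal effect is 1.0. Display scores well above social because it bridges into search, which is exactly the path structure Shapley averages away.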
Neither is measuring causation. Both are sophisticated rules for splitting observed credit among channels that co-occur with conversions. A channel can earn high removal-effect credit purely because converters happen to pass through it — correlation dressed in matrix algebra.
What to actually do
— Run both. Where they agree, you have a robust read. Where they diverge sharply, that's your flag to investigate, not to average.
— Treat divergence on a channel as a hypothesis to test with a holdout, not a number to trust.
Bottom line for practitioners: model disagreement is signal, not noise. Use Shapley and Markov as a consistency check on each other, and reserve causal claims for experiments.
Why does your attribution model say one thing and your geo holdout say another?
Because they answer different questions, and conflating them is the most expensive mistake in measurement.
Two distinct questions
— Attribution asks: among the conversions that happened, how should credit be divided across touchpoints? It is a descriptive accounting exercise over observed paths.
— Incrementality asks: how many of those conversions would NOT have happened without this spend? That is a causal question, answerable only by comparing a treated group to a counterfactual.
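The incrementality question reduces to simple arithmetic once you have a randomized holdout. The sketch below uses made-up numbers and a hypothetical function name, and it ignores sampling uncertainty, which a real read would need to report alongside the point estimate.

```python
def incremental_conversions(treated_conv, treated_n, control_conv, control_n):
    """Point estimate of incremental conversions and relative lift from a holdout."""
    control_rate = control_conv / control_n
    # Counterfactual: what the treated group would have done without the spend.
    expected_without_ads = control_rate * treated_n
    incremental = treated_conv - expected_without_ads
    lift = incremental / expected_without_ads
    return incremental, lift

# Hypothetical test: 100,000 users exposed, 10,000 held out.
incremental, lift = incremental_conversions(
    treated_conv=1200, treated_n=100_000,
    control_conv=100, control_n=10_000,
)
print(f"incremental conversions: {incremental:.0f}")
print(f"relative lift: {lift:.0%}")
```

Of the 1,200 conversions observed in the exposed group, about 1,000 would have happened anyway at the holdout's rate, leaving roughly 200 that the spend actually caused. An attribution report covering the same users could credit the channel with any share of the 1,200.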
What the studies show
Incrementality tests — geo holdouts, randomized PSA experiments, ghost-bid auctions — repeatedly find that channels rich in captured demand (branded search, retargeting, lower-funnel social) are heavily over-credited by attribution models. Meta's own conversion-lift work and multiple academic PSA studies show platform-reported conversions running several multiples above true incremental lift in many accounts.
The nuance
This does not make attribution useless. Attribution is a fast, daily, granular allocation tool. Incrementality is slow, coarse, and expensive but causally honest. They operate at different cadences and resolutions.
What to actually do
— Use incrementality to calibrate your attribution model: derive scaling factors per channel from experiments, then apply them to your daily attributed numbers.
— Re-run calibration quarterly; a channel's incremental lift drifts as saturation and creative fatigue change.
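The calibration step can be sketched as a per-channel ratio. This assumes each experiment reports incremental conversions and the attributed conversions for the same channel over the same window; channel names, figures, and function names are all hypothetical.

```python
def calibration_factors(experiments):
    """Per-channel factor: incremental conversions / attributed conversions."""
    return {
        channel: incremental / attributed
        for channel, (incremental, attributed) in experiments.items()
    }

def calibrate(daily_attributed, factors):
    """Scale daily attributed conversions by the experiment-derived factors."""
    return {channel: conv * factors[channel] for channel, conv in daily_attributed.items()}

# Hypothetical experiment results: (incremental, attributed) over the test window.
experiments = {
    "branded_search": (150, 600),
    "retargeting": (90, 300),
    "prospecting_video": (240, 200),
}
factors = calibration_factors(experiments)

# Hypothetical attributed conversions from today's report.
daily_attributed = {"branded_search": 80, "retargeting": 40, "prospecting_video": 25}
calibrated = calibrate(daily_attributed, factors)
print(factors)
print(calibrated)
```

A factor below 1 marks a channel that attribution over-credits, and a factor above 1 marks one it under-credits. In this made-up example the ranking flips: prospecting video goes from last on attributed conversions to first on calibrated ones.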
Bottom line for practitioners: attribution divides the pie, incrementality tells you how big the pie actually is. You need both, and you should never let the cheap one overrule the expensive one on questions of true value.
