A variant can win every segment and still lose overall — Simpson's Paradox in funnels.
If desktop and mobile each prefer B, but B happened to get more low-converting mobile traffic, the pooled number flips to A.
— Always check whether your randomizer balanced segment mix, not just totals.
— Report segment-weighted results, not raw pooled rates, when traffic mix differs by arm.
This is the same machinery as an SRM problem, one layer down.
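A minimal sketch of the flip, using made-up counts (every number below is hypothetical). B wins on desktop and on mobile, yet loses the pooled comparison because most of its traffic is low-converting mobile. Reweighting each segment by its share of all traffic restores the per-segment picture.

```python
# Hypothetical (users, conversions) per arm and device segment.
data = {
    "A": {"desktop": (8000, 800), "mobile": (2000, 40)},
    "B": {"desktop": (2000, 220), "mobile": (8000, 200)},
}

def segment_rate(arm, segment):
    users, conversions = data[arm][segment]
    return conversions / users

def pooled_rate(arm):
    # Raw rate: all conversions over all users, ignoring segment mix.
    users = sum(u for u, _ in data[arm].values())
    conversions = sum(c for _, c in data[arm].values())
    return conversions / users

def weighted_rate(arm):
    # Weight each segment by its share of ALL traffic (both arms combined),
    # so both arms are compared on the same segment mix.
    total = sum(u for arm_data in data.values() for u, _ in arm_data.values())
    rate = 0.0
    for segment in data[arm]:
        segment_users = sum(data[a][segment][0] for a in data)
        rate += (segment_users / total) * segment_rate(arm, segment)
    return rate

for arm in data:
    print(arm, round(pooled_rate(arm), 4), round(weighted_rate(arm), 4))
```

With these counts the pooled rate favors A while the weighted rate favors B, which matches what each segment says on its own.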
Read the number, not the story. [segment-weighted · device split]
Persistent help text under a field lifts completion ~2.3% over hover tooltips on the same form.
From 5 paired tests; effect concentrated on mobile, where hover doesn't exist.
Tooltips hide information behind an interaction users may never trigger. On touch devices, hover is a tap that competes with the field itself.
— Tooltip ▮▮▮▮ completion
— Persistent text ▮▮▮▮▮▮ completion
Reserve tooltips for desktop-only, low-stakes fields. Use inline text wherever mobile traffic is meaningful.
Read the number, not the story. [n=5 tests · mobile-skewed]
Roughly 6-8% of running A/B tests carry a Sample Ratio Mismatch — and most teams never check.
You split 50/50 but observe 50.8/49.2 on 80k users. Feels like noise. Run a chi-square test: a 0.8pp skew at that volume is wildly unlikely by chance, meaning your randomizer, redirect, or bot filtering is broken.
— Any SRM (the assignment ratio doesn't match what you set) invalidates the whole result, win or loss.
— Check it before you read the conversion number, not after.
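The check itself is a few lines. A sketch using only the standard library, assuming a two-arm test and treating the 50.8/49.2 split on 80k users as 40,640 vs 39,360 (the function name is made up for illustration). For one degree of freedom the chi-square tail probability can be written with erfc.

```python
import math

def srm_check(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test for a two-arm assignment split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = srm_check(40_640, 39_360)
print(chi2, p, p < 0.001)
```

Here the statistic comes out near 20.5 and the p-value is far below the 0.001 threshold, so the split should be treated as broken.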
Read the number, not the story. [SRM threshold p<0.001]
Stopping a test the first day it hits p<0.05 can inflate your false-positive rate from 5% to ~26% over a few weeks of daily checks.
That's the cost of peeking — checking significance repeatedly and stopping at the first green light. Each look is another lottery ticket for a fluke.
— Fix one: fix the sample size in advance, look once at the end.
— Fix two: use a sequential method (mSPRT, group-sequential) built to allow continuous monitoring.
A p-value is the chance of seeing a result at least this extreme when there is no real difference, and the 5% error guarantee only holds if you looked exactly once.
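You can watch the inflation happen in a simulated A/A test. This sketch assumes equally spaced looks and a z-statistic with known variance, so each look adds one standard-normal increment to the running sum; the function name and defaults are illustrative only.

```python
import random

def false_positive_rate(looks, experiments=20_000, seed=7):
    """Share of A/A experiments that ever cross |z| > 1.96 across `looks` peeks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        running_sum = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # one more equal-sized batch of null data
            z = running_sum / k ** 0.5           # z-statistic on all data so far
            if abs(z) > 1.96:                    # "significant": stop and ship
                hits += 1
                break
    return hits / experiments

for looks in (1, 5, 20):
    print(looks, false_positive_rate(looks))
```

Under these assumptions the rate sits near 5% for one look, around 14% for five, and around 25% for twenty.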
Read the number, not the story. [α inflation 5%→~14% at 5 looks · ~25% at 20 looks]
Halving the effect you can detect quadruples the traffic you need.
MDE (minimum detectable effect) scales with 1/√n, so the sample you need scales with 1/MDE². Chasing a 1% lift instead of a 2% lift isn't twice the work, it's four times.
— Baseline 3% conversion, want to catch a 2% relative lift at 80% power (two-sided α=0.05): ~1.3M users per arm.
— Same setup for a 5% lift: ~210k per arm.
Most teams pick the MDE last. Pick it first, then ask if you have the traffic to honor it.
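A sketch of the arithmetic, assuming the standard normal-approximation formula for comparing two proportions with a two-sided test (other calculators use slightly different approximations and will differ by a few percent). The function name is made up for illustration.

```python
from statistics import NormalDist

def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect a relative lift in conversion."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2

for lift in (0.01, 0.02, 0.05):
    print(f"{lift:.0%} lift: {users_per_arm(0.03, lift):,.0f} users per arm")
```

Under this formula the 2% case lands near 1.3M and the 5% case near 210k, and going from a 2% to a 1% lift multiplies the requirement by roughly four.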
Read the number, not the story. [80% power · 95% CI]
~35% of "winning" UI changes lose half their lift within three weeks.
That's the novelty effect — returning users react to change, not to the change being better. The curve decays toward the control.
— Segment new vs returning visitors: if the win lives only in returning users early on, suspect novelty.
— Hold the test one full purchase cycle before you trust the number.
First-week lift ▮▮▮▮▮▮ → week-three lift ▮▮▮.
Read the number, not the story. [new vs returning split · 21d window]
CUPED can cut the traffic you need by 30-50% with zero UX change.
It's a variance-reduction trick: use each user's pre-experiment behavior as a covariate to strip out predictable noise from the conversion metric.
— Works best when pre-period behavior strongly predicts the outcome (repeat buyers, logged-in users).
— Near-useless for first-touch anonymous traffic with no history.
Same test, narrower confidence interval, faster decision.
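A minimal sketch of the adjustment on synthetic data. It assumes one pre-period value and one in-experiment value per user, with theta estimated as cov(pre, post) / var(pre); in a real test theta is usually estimated on both arms pooled and then applied to each. All names and numbers are illustrative.

```python
import random
from statistics import mean, pvariance

def cuped_adjust(post, pre):
    """Return post-period values with the pre-period covariate regressed out."""
    pre_mean, post_mean = mean(pre), mean(post)
    covariance = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post)) / len(pre)
    theta = covariance / pvariance(pre)
    # Subtracting theta * (pre - mean(pre)) leaves the mean unchanged but removes
    # the part of the variance that pre-period behavior already explains.
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]

# Synthetic users whose in-experiment spend tracks their pre-experiment spend.
rng = random.Random(1)
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [0.8 * x + rng.gauss(0, 2) for x in pre]

adjusted = cuped_adjust(post, pre)
print(pvariance(post), pvariance(adjusted))
```

The weaker the link between pre and post, the closer theta gets to zero and the less the adjustment buys you, which is why anonymous first-touch traffic gains almost nothing.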
Read the number, not the story. [variance −40% typical · logged-in cohort]
Average order value tests fail significance checks because the math is wrong, not the metric.
AOV and revenue-per-session are ratio metrics: the denominator (orders or sessions) is itself random. A naive t-test that treats each order or session as independent understates variance and hands you false wins.
— Use the delta method or bootstrap to get honest confidence intervals on ratios.
— Revenue is also heavy-tailed: one whale moves the mean, so winsorize or cap.
The lift looks real until you price in the whale.
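A sketch of the delta-method standard error for a ratio of means, assuming randomization by user and one (revenue, sessions) pair per user. The function name and the five sample users are hypothetical.

```python
def ratio_with_delta_se(numerators, denominators):
    """Ratio of totals (e.g. revenue / sessions) with a delta-method standard error."""
    n = len(numerators)
    mean_x = sum(numerators) / n
    mean_y = sum(denominators) / n
    ratio = mean_x / mean_y
    var_x = sum((x - mean_x) ** 2 for x in numerators) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in denominators) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    # First-order Taylor expansion of X/Y around the means.
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    return ratio, var_ratio ** 0.5

# Hypothetical per-user totals: revenue and number of sessions.
revenue = [10.0, 0.0, 30.0, 5.0, 0.0]
sessions = [2, 1, 5, 1, 3]
ratio, se = ratio_with_delta_se(revenue, sessions)
print(ratio, se)
```

The covariance term is the part a naive per-session t-test drops. Capping revenue at a high percentile before this step is a separate choice and is not shown here.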
Read the number, not the story. [delta method · winsorized 99th pct]
One in five "successful" CRO wins quietly damages a metric nobody was watching.
Aggressive checkout simplification lifts conversion but raises refund and chargeback rates — the cost lands in a different team's dashboard.
— Define guardrail metrics (refunds, returns, support tickets, churn) before launch.
— A conversion win only counts once the guardrails have held flat.
Conversion ▮▮▮▮▮▮ up, refunds ▮▮▮▮ also up = net negative.
Read the number, not the story. [guardrails: refund · CAC · churn]
Testing 20 independent metrics at a 95% confidence level gives you a ~64% chance of at least one false win.
Every extra metric you eyeball is another 5% shot at a false positive. That is the multiple-comparisons problem.
— Declare one primary metric before launch. Everything else is exploratory and needs replication.
— If you must track many, correct the threshold (Bonferroni, Benjamini-Hochberg).
The "interesting secondary finding" is usually the false positive you went looking for.
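A short sketch of the arithmetic and the two corrections named above. The family-wise figure assumes independent metrics; the p-values in the example are made up.

```python
def family_wise_error(metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** metrics

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for index in order[:cutoff]:
        rejected[index] = True
    return rejected

p_values = [0.001, 0.012, 0.025, 0.039, 0.20]   # hypothetical secondary metrics
print(round(family_wise_error(20), 3))
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false win and is the stricter of the two; Benjamini-Hochberg controls the expected share of false wins among the metrics you call significant, so it lets more through.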
Read the number, not the story. [1−0.95^20 ≈ 64%]
Running two A/B tests on the same page at once can silently corrupt both.
If the button-color test and the headline test interact, each contaminates the other's control group. You read two clean wins that don't replicate together.
— Either run a proper multivariate test that models the interaction, or fully isolate traffic.
— Most pairs of changes don't interact — but you only learn that by checking, not assuming.
Clean-looking A/B + clean-looking A/B ≠ a clean combined launch.
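One way to check, sketched under the assumption that both tests randomize independently so every user falls into one of four cells. The interaction is the difference between the button lift with the new headline and the button lift with the old one; the counts are hypothetical and the z-score uses a plain normal approximation.

```python
def interaction_estimate(cells):
    """cells[(button, headline)] = (users, conversions); 0 = control, 1 = variant."""
    rate = {key: conversions / users for key, (users, conversions) in cells.items()}
    variance = {key: rate[key] * (1 - rate[key]) / cells[key][0] for key in cells}
    button_lift_new_headline = rate[1, 1] - rate[0, 1]
    button_lift_old_headline = rate[1, 0] - rate[0, 0]
    diff_in_diff = button_lift_new_headline - button_lift_old_headline
    standard_error = sum(variance.values()) ** 0.5   # the four cells are independent
    return diff_in_diff, diff_in_diff / standard_error

cells = {
    (0, 0): (10_000, 500),   # old button, old headline
    (1, 0): (10_000, 560),   # new button, old headline
    (0, 1): (10_000, 550),   # old button, new headline
    (1, 1): (10_000, 540),   # new button, new headline
}
diff_in_diff, z = interaction_estimate(cells)
print(diff_in_diff, z)
```

In this made-up example each change looks like a win on its own, while the combined cell does worse than either. Interaction terms are noisy, so detecting one usually needs considerably more traffic than detecting a main effect.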
Read the number, not the story. [2×2 factorial · interaction term]
Underpowered tests don't just miss real wins — when they hit, they exaggerate by 2-3x.
This is the winner's curse (Type M / magnitude error). With low power, only the luckily-large estimates clear significance, so every "win" is inflated.
— At 30% power a significant estimate averages about 1.8x the true effect; at roughly 17% power it is about 2.5x, so an observed +20% points to a true lift nearer +8%.
— Symptom: big lifts on small samples that shrink on re-test.
The smaller the sample behind a giant lift, the more you should discount it.
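The exaggeration factor can be computed directly. This sketch assumes a normally distributed estimate and a two-sided test, and asks how large the average significant estimate is relative to the true effect; the function name is made up.

```python
from statistics import NormalDist

def exaggeration_ratio(power, alpha=0.05):
    """Expected |estimate| among significant results, divided by the true effect."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    # True effect in standard-error units that gives (approximately) this power.
    d = crit + z.inv_cdf(power)
    p_right = 1 - z.cdf(crit - d)     # significant with the correct sign
    p_wrong = z.cdf(-crit - d)        # significant with the wrong sign
    # Partial expectations of |estimate| over each rejection region.
    e_right = d * p_right + z.pdf(crit - d)
    e_wrong = -d * p_wrong + z.pdf(-crit - d)
    return (e_right + e_wrong) / (p_right + p_wrong) / d

for power in (0.80, 0.30, 0.17):
    print(power, round(exaggeration_ratio(power), 2))
```

Under these assumptions a well-powered test barely exaggerates, 30% power gives a factor a little under 2, and the factor reaches about 2.5 once power drops below 20%.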
Read the number, not the story. [power ≈17% · Type-M ~2.5x]
Counting conversions per session instead of per user can swing a result by 10%+.
Users with multiple sessions get counted multiple times, so a handful of heavy users can dominate the session count of whichever arm they were assigned to.
— Pick the randomization unit (user) and the analysis unit (user) to match. De-duplicate.
— Session-level conversion rates flatter high-frequency cohorts and break independence assumptions.
The denominator decides the answer before the variant does.
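A small sketch of the two denominators side by side, assuming a session log with one row per session and a fixed arm per user. The row layout and the sample data are hypothetical.

```python
from collections import defaultdict

def conversion_by_unit(sessions):
    """sessions: list of (user_id, arm, converted) rows, one row per session."""
    session_counts = defaultdict(lambda: [0, 0])   # arm -> [sessions, converting sessions]
    user_converted = {}                            # (arm, user_id) -> converted at least once
    for user_id, arm, converted in sessions:
        session_counts[arm][0] += 1
        session_counts[arm][1] += int(converted)
        key = (arm, user_id)
        user_converted[key] = user_converted.get(key, False) or bool(converted)

    user_counts = defaultdict(lambda: [0, 0])      # arm -> [users, converting users]
    for (arm, _), did_convert in user_converted.items():
        user_counts[arm][0] += 1
        user_counts[arm][1] += int(did_convert)

    per_session = {arm: c / n for arm, (n, c) in session_counts.items()}
    per_user = {arm: c / n for arm, (n, c) in user_counts.items()}
    return per_session, per_user

rows = [
    ("u1", "A", True), ("u2", "A", False),
    ("u3", "B", False), ("u3", "B", False), ("u3", "B", True), ("u3", "B", False),
    ("u4", "B", False),
]
per_session, per_user = conversion_by_unit(rows)
print(per_session, per_user)
```

In this toy log both arms convert half of their users, but one four-session user drags arm B's per-session rate down to 20%.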
Read the number, not the story. [unit: user · de-duped]
