A reported "+8% lift" with a confidence interval of −2% to +18% is inconclusive, not a win.
The point estimate is the least informative number in the result. The interval tells you what you actually know.
— If the CI crosses zero, you cannot rule out that the change did nothing.
— A tight +1% beats a wide +8% for a launch decision.
Ship on the lower bound, not the headline.
Read the number, not the story. [95% CI · lower-bound rule]
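As a rough sketch of the lower-bound rule, the snippet below computes a 95% interval for the difference in conversion rate between two arms and ships only if the lower bound clears a threshold. The counts, the function names `lift_ci` and `should_ship`, and the choice of a simple Wald interval on the absolute difference are all assumptions made for illustration, not something the card specifies.

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Wald interval for the absolute difference in conversion rate (B minus A).

    z=1.96 corresponds to a 95% interval. Returns (lower, point_estimate, upper).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z * se, diff, diff + z * se

def should_ship(lower_bound, min_worthwhile=0.0):
    """Decide on the lower bound of the interval, not on the point estimate."""
    return lower_bound > min_worthwhile

# Hypothetical counts: a wide "+8% relative" result (10.0% -> 10.8%).
wide = lift_ci(1_000, 10_000, 1_080, 10_000)
# Hypothetical counts: a tight "+1% relative" result (10.0% -> 10.1%) on 100x the traffic.
tight = lift_ci(100_000, 1_000_000, 101_000, 1_000_000)

for name, (lower, diff, upper) in (("wide +8%", wide), ("tight +1%", tight)):
    print(f"{name}: diff={diff:+.4f}, CI=({lower:+.4f}, {upper:+.4f}), ship={should_ship(lower)}")
```

With these made-up counts the wide result has an interval that crosses zero and the tight one does not, which is the point of the second bullet.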
A variant can win every segment and still lose overall — Simpson's Paradox in funnels.
If desktop and mobile each prefer B, but B happened to get more low-converting mobile traffic, the pooled number flips to A.
— Always check whether your randomizer balanced segment mix, not just totals.
— Report segment-weighted results, not raw pooled rates, when traffic mix differs by arm.
This is the same machinery as an SRM problem, one layer down.
Read the number, not the story. [segment-weighted · device split]
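A small worked example of the flip described above, using made-up counts in which B wins on desktop and on mobile but received most of the low-converting mobile traffic. The data, the helper names, and the choice of weights (each segment's share of total traffic across both arms) are assumptions for illustration only.

```python
# Hypothetical counts per arm and segment: (users, conversions).
data = {
    "A": {"desktop": (8_000, 800), "mobile": (2_000, 40)},
    "B": {"desktop": (2_000, 220), "mobile": (8_000, 200)},
}

def pooled_rate(arm):
    """Raw conversion rate, ignoring which segment the users came from."""
    users = sum(u for u, _ in arm.values())
    conversions = sum(c for _, c in arm.values())
    return conversions / users

def segment_weighted_rate(arm, weights):
    """Per-segment rates combined with the same segment weights for every arm."""
    return sum(weights[seg] * (c / u) for seg, (u, c) in arm.items())

# Shared weights: each segment's share of all traffic, both arms combined.
total_users = sum(u for arm in data.values() for u, _ in arm.values())
weights = {
    seg: sum(data[name][seg][0] for name in data) / total_users
    for seg in ("desktop", "mobile")
}

for name, arm in data.items():
    per_segment = {seg: c / u for seg, (u, c) in arm.items()}
    print(name, per_segment,
          f"pooled={pooled_rate(arm):.4f}",
          f"weighted={segment_weighted_rate(arm, weights):.4f}")
```

Here the pooled rate favors A while both the per-segment rates and the segment-weighted rate favor B, because the arms did not see the same device mix.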
Persistent help text under a field lifts completion ~2.3% over hover tooltips on the same form.
From 5 paired tests; effect concentrated on mobile, where hover doesn't exist.
Tooltips hide information behind an interaction users may never trigger. On touch devices, hover is a tap that competes with the field itself.
— Tooltip ▮▮▮▮ completion
— Persistent text ▮▮▮▮▮▮ completion
Reserve tooltips for desktop-only, low-stakes fields. Use inline text wherever mobile traffic is meaningful.
Read the number, not the story. [n=5 tests · mobile-skewed]
Roughly 6-8% of running A/B tests carry a Sample Ratio Mismatch — and most teams never check.
You split 50/50 but observe 50.8/49.2 on 80k users. Feels like noise. Run a chi-square test: a 0.8pp skew at that volume is wildly unlikely by chance, meaning your randomizer, redirect, or bot filtering is broken.
— Any SRM (the assignment ratio doesn't match what you set) invalidates the whole result, win or loss.
— Check it before you read the conversion number, not after.
Read the number, not the story. [SRM threshold p<0.001]
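The 50.8/49.2 example above can be checked with a chi-square goodness-of-fit test. This is a minimal standard-library sketch: the function name `srm_p_value` is hypothetical, and the user counts are simply 50.8% and 49.2% of 80,000. For one degree of freedom, the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`, so no external package is needed.

```python
import math

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test of observed arm sizes against the planned split.

    Returns (chi2, p_value). Two arms give one degree of freedom.
    """
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# 50.8% / 49.2% of 80,000 users.
chi2, p = srm_p_value(40_640, 39_360)
print(f"chi2={chi2:.2f}, p={p:.2e}, SRM flagged={p < 0.001}")
```

For these counts the statistic is 20.48 and the p-value falls far below the 0.001 threshold, so the split is flagged before anyone looks at conversion.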
Stopping a test the first day it hits p<0.05 inflates your false-positive rate from 5% to ~26%.
That's the cost of peeking — checking significance repeatedly and stopping at the first green light. Each look is another lottery ticket for a fluke.
— Fix one: fix the sample size in advance, look once at the end.
— Fix two: use a sequential method (mSPRT, group-sequential) built to allow continuous monitoring.
A p<0.05 cutoff only caps false positives at 5% if you looked exactly once. (A p-value is the chance of a result at least this extreme when the change truly does nothing.)
Read the number, not the story. [α inflation 5%→26%, ~25 looks]
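The cost of peeking can be seen in a small simulation of A/A tests, where any "significant" result is a false positive by construction. This is a sketch under simplifying assumptions that are mine, not the card's: equal-sized batches of data between looks, a two-sided z-test with known variance, and stopping at the first look where |z| exceeds 1.96. The function name `false_positive_rate` is hypothetical.

```python
import math
import random

def false_positive_rate(looks, sims=20_000, z_crit=1.96, seed=0):
    """Share of simulated no-effect tests declared significant at any of `looks` peeks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        running = 0.0
        for k in range(1, looks + 1):
            # Each batch contributes one standard-normal increment under the null.
            running += rng.gauss(0.0, 1.0)
            # z-statistic on all data collected so far.
            if abs(running / math.sqrt(k)) > z_crit:
                hits += 1
                break  # stop at the first green light
    return hits / sims

for looks in (1, 5, 10, 20):
    print(f"{looks:>2} looks: false-positive rate ~ {false_positive_rate(looks):.3f}")
```

A single look stays near the nominal 5%, and the rate climbs with every additional look, which is why the fixed-sample and sequential fixes above exist.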
