Conversion Lab Notes
Hard numbers on what actually lifts conversion rates — uplift benchmarks, statistical reads, and the math behind every CRO claim you've seen on Twitter.
Neighbor spotlight: @SplitTestStreet. They go deep on A/B testing — the kind of channel you actually keep notifications on for.
~35% of "winning" UI changes lose half their lift within three weeks.
That's the novelty effect — returning users react to change, not to the change being better. The curve decays toward the control.
— Segment new vs returning visitors: if the win lives only in returning users early on, suspect novelty.
— Hold the test one full purchase cycle before you trust the number.
First-week lift ▮▮▮▮▮▮ → week-three lift ▮▮▮.
Read the number, not the story. [new vs returning split · 21d window]
CUPED can cut the traffic you need by 30-50% with zero UX change.
It's a variance-reduction trick: use each user's pre-experiment behavior as a covariate to strip out predictable noise from the conversion metric.
— Works best when pre-period behavior strongly predicts the outcome (repeat buyers, logged-in users).
— Near-useless for first-touch anonymous traffic with no history.
Same test, narrower confidence interval, faster decision.
Read the number, not the story. [variance −40% typical · logged-in cohort]
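A minimal sketch of the CUPED adjustment described above, assuming one pre-period value and one in-experiment value per user. The function name `cuped_adjust` and the synthetic data are illustrative only; in a real test the coefficient is usually estimated on both arms pooled.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return y with the part predictable from pre-period covariate x removed."""
    x_bar, y_bar = mean(x), mean(y)
    # Sample covariance between the covariate and the outcome.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)  # regression slope of y on x
    # Centering x keeps the mean of y unchanged; only the variance shrinks.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# Synthetic, hypothetical data: outcome strongly driven by pre-period behavior.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(2000)]
post = [0.8 * p + rng.gauss(0, 2) for p in pre]
adjusted = cuped_adjust(post, pre)

print(f"variance before: {variance(post):.2f}, after: {variance(adjusted):.2f}")
```

The weaker the link between pre-period and in-experiment behavior, the closer `theta` gets to zero and the less the adjustment buys you.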
Average order value tests get significance wrong because the math is off, not the metric.
AOV and revenue-per-session are ratio metrics: the denominator (orders or sessions) is itself random when you randomize by user. A naive t-test that treats every order or session as independent understates variance and hands you false wins.
— Use the delta method or bootstrap to get honest confidence intervals on ratios.
— Revenue is also heavy-tailed: one whale moves the mean, so winsorize or cap.
The lift looks real until you price in the whale.
Read the number, not the story. [delta method · winsorized 99th pct]
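A rough sketch of the two fixes above, assuming one row per user with that user's total revenue and total sessions. The helper names and the tiny data set are made up for illustration, and the percentile cap is a crude nearest-rank version.

```python
from math import sqrt
from statistics import mean

def winsorize(values, pct=0.99):
    """Cap values at a nearest-rank percentile so one whale cannot move the mean."""
    s = sorted(values)
    cap = s[int(pct * (len(s) - 1))]
    return [min(v, cap) for v in values]

def ratio_with_delta_se(num, den):
    """Ratio of means and its delta-method standard error (one entry per user)."""
    n = len(num)
    mx, my = mean(num), mean(den)
    vx = sum((a - mx) ** 2 for a in num) / (n - 1)
    vy = sum((b - my) ** 2 for b in den) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(num, den)) / (n - 1)
    ratio = mx / my
    # First-order Taylor expansion of mean(num) / mean(den).
    var = (vx - 2 * ratio * cxy + ratio ** 2 * vy) / (n * my ** 2)
    return ratio, sqrt(max(var, 0.0))

# Hypothetical per-user totals for one arm.
revenue = [0, 40, 0, 120, 35, 0, 60, 0]
sessions = [1, 2, 1, 5, 2, 1, 3, 1]
rps, se = ratio_with_delta_se(winsorize(revenue), sessions)
print(f"revenue per session: {rps:.2f} +/- {1.96 * se:.2f}")
```

Run it once per arm and compare the intervals; a bootstrap over users should land in roughly the same place.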
One in five "successful" CRO wins quietly damages a metric nobody was watching.
Aggressive checkout simplification lifts conversion but raises refund and chargeback rates — the cost lands in a different team's dashboard.
— Define guardrail metrics (refunds, returns, support tickets, churn) before launch.
— A win that needs the guardrail to stay flat isn't a win until it does.
Conversion ▮▮▮▮▮▮ up, refunds ▮▮▮▮ also up = possibly net negative once the refund cost is priced in.
Read the number, not the story. [guardrails: refund · CAC · churn]
Testing 20 independent metrics at 95% confidence gives you a ~64% chance of at least one false win, even when nothing actually changed.
Every extra metric you eyeball is another 5% shot at a false positive: the multiple-comparisons problem.
— Declare one primary metric before launch. Everything else is exploratory and needs replication.
— If you must track many, correct the threshold (Bonferroni, Benjamini-Hochberg).
The "interesting secondary finding" is usually the false positive you went looking for.
Read the number, not the story. [1−0.95^20 ≈ 64%]
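A small sketch of the arithmetic and the two corrections named above, assuming independent metrics. The function names and the p-values in the usage lines are hypothetical.

```python
def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive across m independent null metrics."""
    return 1 - (1 - alpha) ** m

def bonferroni(pvals, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    return [p <= alpha / len(pvals) for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank  # largest rank that clears its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(round(familywise_error(20), 2))
pvals = [0.001, 0.012, 0.025, 0.041, 0.20]  # hypothetical secondary metrics
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg lets more findings through in exchange for tolerating a controlled share of false discoveries.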
Running two A/B tests on the same page at once can silently corrupt both.
If the button-color test and the headline test interact, each contaminates the other's control group. You read two clean wins that don't replicate together.
— Either run a proper multivariate test that models the interaction, or fully isolate traffic.
— Most pairs of changes don't interact — but you only learn that by checking, not assuming.
Clean-looking A/B + clean-looking A/B ≠ a clean combined launch.
Read the number, not the story. [2×2 factorial · interaction term]
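One way to check the interaction term in a 2×2 layout, sketched on the additive (difference in conversion rate) scale with a normal approximation. The function name `interaction_z` and the cell counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def interaction_z(cells):
    """cells maps (a, b) with a, b in {0, 1} to (conversions, users)."""
    rate = {k: c / n for k, (c, n) in cells.items()}
    var = {k: rate[k] * (1 - rate[k]) / cells[k][1] for k in cells}
    # Effect of change B when A is on, minus effect of B when A is off.
    interaction = (rate[1, 1] - rate[1, 0]) - (rate[0, 1] - rate[0, 0])
    se = sqrt(sum(var.values()))  # the four cells are independent samples
    z = interaction / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return interaction, z, p

# Hypothetical cells: each change wins alone, the combination underdelivers.
cells = {
    (0, 0): (100, 1000),  # control
    (1, 0): (120, 1000),  # button color only
    (0, 1): (110, 1000),  # headline only
    (1, 1): (80, 1000),   # both
}
print(interaction_z(cells))
```

An interaction near zero means the two lifts simply add; a large one means the combined launch needs its own read.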
Underpowered tests don't just miss real wins — when they hit, they exaggerate by 2-3x.
This is the winner's curse (Type M / magnitude error). With low power, only the luckily-large estimates clear significance, so every "win" is inflated.
— A test with only ~17% power that shows +20% is more likely a true +8% or so (20 / 2.5).
— Symptom: big lifts on small samples that shrink on re-test.
The smaller the sample behind a giant lift, the more you should discount it.
Read the number, not the story. [power ≈17% · Type-M ~2.5x]
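To put a number on the discount, here is a sketch of the expected exaggeration of a significant estimate at a given power. It assumes a normally distributed estimate and a two-sided test at alpha = 0.05; the function names are hypothetical.

```python
from statistics import NormalDist

N = NormalDist()

def power_for(d, alpha=0.05):
    """Two-sided power when the true effect is d standard errors."""
    z = N.inv_cdf(1 - alpha / 2)
    return (1 - N.cdf(z - d)) + N.cdf(-z - d)

def effect_for_power(power, alpha=0.05):
    """Bisection for the true effect (in SE units) that yields the given power."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if power_for(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exaggeration_ratio(power, alpha=0.05):
    """Expected |estimate| among significant results, divided by the true effect."""
    d = effect_for_power(power, alpha)
    z = N.inv_cdf(1 - alpha / 2)
    p_pos, p_neg = 1 - N.cdf(z - d), N.cdf(-z - d)
    # Partial expectations of a N(d, 1) estimate beyond each significance cutoff.
    m_pos = d * p_pos + N.pdf(z - d)
    m_neg = -d * p_neg + N.pdf(z + d)
    return (m_pos + m_neg) / ((p_pos + p_neg) * d)

for power in (0.8, 0.5, 0.3, 0.17):
    print(f"power {power:.0%}: significant estimates overstate by ~{exaggeration_ratio(power):.1f}x")
```

Divide an observed significant lift by the ratio for your test's power to get a rough, deflated expectation for the re-test.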
Counting conversions per session instead of per user can swing a result by 10%+.
Users with multiple sessions get counted multiple times, so a handful of heavy users can tilt whichever arm they happened to land in.
— Pick the randomization unit (user) and the analysis unit (user) to match. De-duplicate.
— Session-level conversion rates flatter high-frequency cohorts and break independence assumptions.
The denominator decides the answer before the variant does.
Read the number, not the story. [unit: user · de-duped]
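A sketch of how the two denominators diverge on the same log, assuming one row per session and user-level randomization. The row layout `(user_id, arm, converted)` and the sample rows are hypothetical.

```python
from collections import defaultdict

def conversion_rates(events):
    """events: (user_id, arm, converted) rows, one per session."""
    sessions = defaultdict(lambda: [0, 0])  # arm -> [conversions, sessions]
    users = defaultdict(dict)               # arm -> {user_id: converted at least once}
    for user_id, arm, converted in events:
        sessions[arm][0] += int(converted)
        sessions[arm][1] += 1
        # De-duplicate: a user counts once, however many sessions they have.
        users[arm][user_id] = users[arm].get(user_id, False) or bool(converted)
    out = {}
    for arm, (conv, n) in sessions.items():
        per_user = sum(users[arm].values()) / len(users[arm])
        out[arm] = {"per_session": conv / n, "per_user": per_user}
    return out

# Hypothetical log: u1 is a heavy user who converts once in four sessions.
events = [
    ("u1", "A", False), ("u1", "A", False), ("u1", "A", True), ("u1", "A", False),
    ("u2", "A", False),
    ("u3", "B", True),
    ("u4", "B", False),
]
print(conversion_rates(events))
```

Both arms convert half their users here, yet the per-session view makes arm A look far worse purely because of one frequent visitor.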
Splitting one long checkout into 3 short steps changes conversion by roughly 0%.
The step count is mostly a wash; what moves the number is perceived progress and how many fields survive, not how many pages they sit on.
— A progress indicator on a single page often matches a multi-step flow.
— Fewer total inputs beats fewer total screens, every time we've measured it.
One page ▮▮▮▮▮ vs three pages ▮▮▮▮▮ — call it a tie.
Read the number, not the story. [field-count dominant · n≈25 tests]