I kept a 5% holdout. It exposed my fake gains.
This is the move that changed how I trust my own dashboard.
All year I shipped 'winners.' Dashboard said I was up huge, cumulatively. Felt like a genius.
Then I checked my 5% holdout — a slice of traffic that NEVER got any of my changes, frozen on the original page.
My 'winners' vs the holdout? The real cumulative lift was about 40% of what the individual tests claimed. Wins overlap, decay, and interact. The dashboard double-counted.
The holdout is the only honest mirror you've got. It tells you what your work was ACTUALLY worth.
— Carve out a small permanent holdout that gets zero changes
— Compare your live page to it quarterly
Go set up a holdout group today. Brace yourself. Report back.
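Rough sketch of the comparison, assuming you can export visitors and conversions for both the live page and the holdout. The function names and every number below are hypothetical, not from my account, and stacking lifts multiplicatively is just one way to define what the dashboard "claimed."

```python
from math import prod

def compounded_claimed_lift(test_lifts):
    """Lift you'd expect if every shipped win stacked perfectly (multiplicatively)."""
    return prod(1 + lift for lift in test_lifts) - 1

def holdout_lift(live_conv, live_visitors, holdout_conv, holdout_visitors):
    """Relative lift of the live page over the frozen holdout."""
    live_rate = live_conv / live_visitors
    holdout_rate = holdout_conv / holdout_visitors
    return live_rate / holdout_rate - 1

# Hypothetical numbers for illustration only
claimed = compounded_claimed_lift([0.12, 0.08, 0.07, 0.05])
real = holdout_lift(live_conv=2_622, live_visitors=95_000,
                    holdout_conv=120, holdout_visitors=5_000)
print(f"claimed: {claimed:.1%}, holdout says: {real:.1%}, "
      f"realized share: {real / claimed:.0%}")
```

A 5% holdout is a small sample, so expect a wide margin of error on the holdout rate. That's one reason a quarterly check works better than a weekly one.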
I optimized clicks and accidentally tanked revenue.
Product grid test. New layout pushed cheaper items up top. Add-to-carts jumped 14%. I cheered.
Guardrail metric: average order value. Down 22%. People bought, just bought the cheap junk I'd promoted to the top.
Net revenue per visitor: negative. My "win" lost money.
Now every test has one primary metric AND a guardrail it's not allowed to crater.
— Pick your one true north metric (usually revenue/visitor)
— Set guardrails: AOV, refund rate, churn, support tickets
— A primary win that breaks a guardrail is a loss
Go add a revenue guardrail to your running test before you call it. Report back.
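Here's a minimal sketch of the guardrail rule. It ASSUMES orders rose in step with add-to-carts (+14%), which the test didn't actually confirm, and the 5% tolerance is a made-up threshold. The function names are hypothetical. For guardrails where up is bad (refund rate, churn, support tickets), flip the sign before passing them in.

```python
def evaluate_test(primary_lift, guardrail_lifts, max_drop=0.05):
    """Return 'ship' only if the primary metric is up and no guardrail
    fell by more than max_drop (relative change)."""
    broken = {name: lift for name, lift in guardrail_lifts.items()
              if lift < -max_drop}
    if primary_lift > 0 and not broken:
        return "ship", broken
    return "loss", broken

def revenue_per_visitor_change(order_rate_lift, aov_lift):
    """Relative change in revenue/visitor, since revenue/visitor = order rate x AOV."""
    return (1 + order_rate_lift) * (1 + aov_lift) - 1

# Assumption: order rate moved with add-to-carts (+14%); AOV fell 22%
rpv = revenue_per_visitor_change(order_rate_lift=0.14, aov_lift=-0.22)
verdict, broken = evaluate_test(primary_lift=0.14, guardrail_lifts={"aov": -0.22})
print(f"revenue/visitor change: {rpv:.1%}, verdict: {verdict}, broken: {broken}")
```

Under that assumption revenue per visitor falls about 11%, because a 22% smaller basket outweighs 14% more orders.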
Day 2 of the test and I peeked. Big mistake.
Variant B was crushing. +22%. I texted my partner 'we found it.'
Day 5: +3%. Day 8: dead even.
That early spike was noise. Small samples swing wild. If you call it on day 2 you're just gambling on randomness and calling it skill.
The fix that saved me: I now set a fixed sample size BEFORE the test starts. No reading results until I hit it. I literally hide the dashboard.
If you must peek for sanity, use a sequential test (always-valid p-values) so early looks don't inflate your false positives.
— Decide your sample size first
— Don't call a winner before you hit it
Go set a stop number on your running test. Then close the tab.
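If you don't know what your stop number should be, here's a sketch using the standard two-proportion sample size approximation. The 3% baseline, the 10% minimum lift, and the 5% / 80% settings are assumptions to swap for your own. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, min_relative_lift,
                            alpha=0.05, power=0.80):
    """Visitors needed PER VARIANT to detect the given relative lift
    with a two-sided test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: 3% baseline conversion, smallest lift worth catching = 10%
n = sample_size_per_variant(baseline_rate=0.03, min_relative_lift=0.10)
print(f"stop number: {n:,} visitors per variant")
```

With these inputs the answer lands in the tens of thousands per variant. That's why a +22% on day 2 means so little.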
Testing a pricing page without nuking revenue
Pricing tests scare people because you can torch real money. I run them in a safe order, lowest risk to highest:
— Start with layout, not price. 3 tiers vs 2. Toggle position. Zero revenue risk.
— Test the 'most popular' badge placement. Anchoring is free money.
— Test annual-vs-monthly default toggle. Defaulting to annual lifted my AOV without changing a price.
— Test feature framing and ordering inside each tier.
— Test the order of tiers (high-to-low anchors differently than low-to-high).
— ONLY then touch actual numbers, and cap exposure to 50% of traffic.
Never change price AND layout in one test. You won't know what moved.
Go test your 'most popular' badge first. It's free, it's fast, it anchors.
—
There's plenty more of this cbo vs abo scaling stuff over at @ScaleOrStall
I almost shipped a fake winner
This week I ran a hero test. Variant B up 14%. I was reaching for the ship button.
Then I checked the split. 52/48. Supposed to be 50/50.
That gap is sample ratio mismatch. Means something upstream broke the randomization — a redirect, a cache, a bot bucket. The 14% was probably an artifact, not a win.
Killed the test. Found a cached page serving B to returning users only. Of course B looked better. It was talking to warm traffic.
Day 1 lesson I relearn every quarter: before you read the lift, read the counts.
— Pull your variant traffic numbers right now
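— If the split is off by more than ~1% on real volume, treat the result as garbage until you find the cause (a chi-square test on the counts tells you whether the gap is bigger than chance)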
Go check the SRM on your live test. Report back.
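A minimal SRM check, sketched as a chi-square goodness-of-fit test on the raw counts using only the standard library. The counts are hypothetical (I only quoted the 52/48 ratio), and the p < 0.001 bar is a common convention, not a law.

```python
from math import erfc, sqrt

def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """p-value for 'the observed split matches the intended split'.
    Small p means the randomization is probably broken."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square distribution with 1 degree of freedom
    return erfc(sqrt(chi2 / 2))

# Hypothetical counts: same 52/48 ratio at two traffic levels
p_big = srm_p_value(5_200, 4_800)
p_small = srm_p_value(520, 480)
print(f"10k visitors: p = {p_big:.6f}   1k visitors: p = {p_small:.3f}")
print("SRM detected" if p_big < 0.001 else "split looks fine")
```

Same ratio, different verdicts: at 10k visitors a 52/48 split is a red flag, at 1k it can still be chance. That's why the counts matter more than the percentage.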
This week: button color vs button verb
Everyone wants to test green vs orange. Snooze.
I tested the VERB instead. Same button, same color. Just changed 'Get Started' to 'Get My Free Audit.'
Setup: landing page for a CPA offer, ~6k clicks split over 9 days.
Result: the specific-value verb pulled +19% clicks to the form. The color test I ran last month? Flat. Couldn't tell them apart with a microscope.
The lesson: color is decoration. The verb is the promise. 'Get My Free Audit' tells them what they walk away with. 'Get Started' tells them about work they have to do.
— Open your hero CTA
— Swap the generic verb for one that names the payoff
Go change your CTA verb to claim the reward, not start the chore. Report back.
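Before you trust a lift like that, a quick significance check helps. This sketch runs a pooled two-proportion z-test. The per-variant counts are hypothetical, picked only to match ~6k clicks and a +19% lift. The real baseline rate isn't something I shared.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift of B over A, two-sided p-value)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b / rate_a - 1, p_value

# Hypothetical: 3,000 clicks per variant, 10% baseline click-through to the form
lift, p_value = two_proportion_test(conv_a=300, n_a=3_000, conv_b=357, n_b=3_000)
print(f"lift: {lift:+.1%}, p-value: {p_value:.3f}")
```

Run it once, at the sample size you planned. Re-running it every day until it dips under 0.05 is just peeking with extra steps.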
I tested doing nothing. It won.
Weird one. Day 3 of cleaning up a checkout flow.
We had an exit-intent popup begging people to stay. Classic. I ran a test: popup vs NO popup.
No popup won. Conversions up 7%. The popup was annoying people who were just tab-switching to grab their card.
The takeaway nobody posts about: the 'remove a feature' variant is the most underrated test you can run. Adding stuff is sexy. Subtracting stuff actually moves money sometimes.
I keep a running list now: every popup, every field, every badge — what happens if it's just gone?
— Pick one 'helpful' element on your page
— Test the version where it doesn't exist
Go delete something and split test it. Report back.
Quick rec — @AboveTheFoldHeresy keeps a tight feed on Landing page optimization. If today's post landed, that one's for you.
Day 30: I'd shipped 9 tests. My buddy shipped 2.
He runs perfect experiments. 95% confidence, clean writeups, beautiful charts. Two a month.
I run fast and slightly dirty. Nine a month. Some are junk. But three of mine were real wins compounding on top of each other.
Here's the math nobody likes: test velocity beats test precision when your baseline is mediocre. Early on you have so much low-hanging fruit that just SHIPPING more swings finds more wins than perfecting fewer.
Precision matters later, when you're optimizing a page that's already good and the wins are tiny.
— Count how many tests you actually launched last month
— If it's under 4, your problem is speed, not stats
Go launch your scrappiest test today. Don't wait for it to be perfect. Report back.
This week I climbed the specificity ladder on one headline
Started vague, tested my way down to a number.
Rung 1: 'Boost your conversions' (baseline)
Rung 2: 'Boost conversions with A/B testing'
Rung 3: 'Lift conversions 19% in 14 days'
Same page, same offer, rotated over three sprints.
Each rung up beat the one below. Rung 3 pulled +27% over the vague baseline. The number did the heavy lifting — specificity reads as proof, even before anyone checks it.
Vague headlines feel safe. They're not. They're forgettable. A concrete number forces the reader to react.
— Look at your hero headline
— If it has a verb but no number, you're on rung 2
Go add a real, specific number to your headline. Test it against the vague one. Report back.
I stacked two winners. They canceled out.
Rookie move, year three, still did it.
Test A: new headline, +12%. Test B (separate week): new hero image, +8%. Both clean wins.
So I shipped both together. Logic: 12 + 8 = a big number, right?
Wrong. The combined page performed WORSE than either change did on its own. The new headline promised speed; the new image showed a complex dashboard. They fought each other.
That's an interaction effect. Two elements that win solo can clash when combined because they tell different stories.
Now when I stack proven winners, I run ONE more confirmation test on the combo before trusting it.
— List your last two shipped winners
— Did you ever test them together?
Go verify your stacked winners actually still win combined. Report back.
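One way to sketch the confirmation test is a 2x2 setup: control, headline only, image only, both. Then compare the combo against what you'd expect if the two wins simply stacked. The conversion rates below are hypothetical, chosen to echo the +12% and +8%, and the function name is made up.

```python
def interaction_check(control, headline_only, image_only, both):
    """Compare the observed combo against the 'wins just multiply' expectation."""
    lift_headline = headline_only / control - 1
    lift_image = image_only / control - 1
    expected_both = control * (1 + lift_headline) * (1 + lift_image)
    return {
        "expected_combined_lift": expected_both / control - 1,
        "observed_combined_lift": both / control - 1,
        # Negative means the two changes hurt each other
        "interaction": both / expected_both - 1,
    }

# Hypothetical conversion rates per cell
result = interaction_check(control=0.0500, headline_only=0.0560,
                           image_only=0.0540, both=0.0490)
for name, value in result.items():
    print(f"{name}: {value:+.1%}")
```

If the interaction number is clearly negative, ship whichever solo winner is stronger and leave the other out.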
'CTA above the fold' cost me conversions this week
Everyone repeats it like scripture. I tested it anyway.
Long-form sales page for a high-ticket tool. Variant A: CTA up top, above the fold. Variant B: CTA only after the full pitch, way down the page.
B won. +16% on the actual purchase.
Why: high-consideration offers need the argument BEFORE the ask. Putting the button up top just gave cold visitors a way to bounce before I'd convinced them.
The 'rule' works for low-friction stuff — newsletter signups, free tools. It backfires when the buyer needs persuading.
— Check how expensive/complex your offer is
— High-ticket? The early CTA might be hurting you
Go test your CTA AFTER the full pitch, not before. Report back.
Day 11: my big swing lost 18%. Worth it.
I rebuilt the whole hero. New angle, new image, new CTA, new social proof block. Total gut rewrite.
It got smoked. Down 18%. Embarrassing.
But here's why I'm not mad: that one test taught me more than ten safe tweaks. The new angle was the killer — I isolated it after, and the angle alone was the thing tanking trust.
Now I know that angle is poison for this audience. That's data I'd never get from testing button shades.
Big swings have a worse hit rate but a higher ceiling AND they teach you the boundaries fast. Run a few on purpose.
— Allocate ~20% of your tests to bold swings
— Expect most to lose. Mine the losers for why.
Go schedule one reckless big-swing test this month. Report back.
My winner was fake. Blame returning visitors.
Day 6: new nav design, +14% engagement. Looked legit.
Then I segmented new vs returning. New visitors: flat. Returning visitors: +40%.
That's the novelty effect. Your regulars notice the change, poke at it because it's new and shiny, and inflate your numbers. It fades in a week or two.
The real test of any UI change is how NEW visitors react — they have no 'ooh what's different' bias.
Now for any design test I segment by new vs returning, and I weight new-visitor behavior way heavier for permanent decisions.
— Segment your current test: new vs returning
— If the win is all in returning users, wait it out
Go split your test results by visitor type. Report back.
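A minimal sketch of the split, assuming your export has one row per segment and variant with visitors and engaged counts. The row layout and the counts are hypothetical, built so the blended lift is +14% while new visitors are flat and returning visitors are +40%.

```python
def lift_by_segment(rows):
    """rows: (segment, variant, visitors, engaged). Returns B-over-A lift per segment."""
    rates = {(segment, variant): engaged / visitors
             for segment, variant, visitors, engaged in rows}
    segments = sorted({segment for segment, _ in rates})
    return {s: rates[(s, "B")] / rates[(s, "A")] - 1 for s in segments}

def overall_lift(rows):
    """Blended B-over-A lift across all segments."""
    totals = {"A": [0, 0], "B": [0, 0]}
    for _, variant, visitors, engaged in rows:
        totals[variant][0] += visitors
        totals[variant][1] += engaged
    rate = {v: engaged / visitors for v, (visitors, engaged) in totals.items()}
    return rate["B"] / rate["A"] - 1

# Hypothetical counts
rows = [
    ("new", "A", 6_500, 1_300), ("new", "B", 6_500, 1_300),
    ("returning", "A", 3_500, 700), ("returning", "B", 3_500, 980),
]
lifts = lift_by_segment(rows)
blended = overall_lift(rows)
print(f"blended: {blended:+.1%}")
for segment, lift in lifts.items():
    print(f"{segment}: {lift:+.1%}")
```

The blended number hides the story. A third of the traffic is doing all the lifting.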
This week: I deleted a form field and made money
Lead-gen form for an offer. Five fields: name, email, phone, company, budget.
I tested killing the phone field. Just gone.
Result: form completions +23%, lead quality basically unchanged when sales followed up. The phone field was scaring people off and adding almost nothing.
Rule of thumb I trust now: every field you add is a tax on completion. Phone and 'company size' are the worst offenders for friction.
The trick is testing field removal one at a time so you know exactly which one was the anchor.
— Count the fields on your highest-value form
— Pick the most invasive one (phone, budget, job title)
Go test the version with that field removed. Report back.
Day 3: specific social proof beat the big number
Counterintuitive one.
Variant A: 'Join 50,000+ users.'
Variant B: 'Join 2,847 marketers this month.'
I bet hard on A. Bigger number, bigger flex.
B won. +11% signups.
Why: 50,000 is a round, suspiciously clean brag — it reads like marketing. 2,847 is oddly specific and recent, so it reads like a real live count. And 'marketers' matched the visitor's identity better than generic 'users.'
Specific and recent beats big and round. Believability converts harder than scale.
— Look at your social proof line
— If it's a round number with a plus sign, it might read as fake
Go test an oddly-specific, recent, identity-matched number against your round brag. Report back.
