Hub-page architecture: the spine that holds 10k leaf pages
Leaf pages without a hub structure are an orphan farm. Design the spine first; generate leaves into it.
The three-layer spine:
— Layer 1 — Pillar (1 per major dimension): /loans/. Owner-action: editorial, hand-built, links down to all hubs. Gate: must exist before any leaf.
— Layer 2 — Hubs (1 per category value): /loans/{type}/. Auto-generated, lists and links to its leaves, links up to the pillar. Gate: a hub with under 5 qualifying leaves should not exist — fold it up.
— Layer 3 — Leaves: /loans/{type}/{state}/. Each links up to its hub and across to siblings.
Rules:
☐ Every leaf is reachable from the pillar in ≤2 clicks.
☐ Hubs paginate at 100 leaves with rel sequencing, not infinite scroll that hides links from crawlers.
☐ A hub's own content includes a real summary, not just the link list, or it's a thin page itself.
☐ Empty hubs 404 or 301 up — never a blank index.
Ship gate: don't publish until all boxes are checked.
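A minimal sketch of the hub gate, assuming leaves arrive as a list of slugs per hub. The function name plan_hub and the returned dict shape are hypothetical; the thresholds (5 leaves, 100 per page) come from the rules above, and "fold up" is read here as redirecting to the pillar.

```python
from math import ceil

MIN_LEAVES = 5    # hubs with fewer qualifying leaves should not exist
PAGE_SIZE = 100   # hubs paginate at 100 leaves

def plan_hub(hub_slug, leaf_slugs):
    """Decide whether a hub is published and how many pages it spans."""
    if len(leaf_slugs) < MIN_LEAVES:
        # Too thin to stand alone: fold it up (assumed to mean a 301 to the pillar).
        return {"hub": hub_slug, "action": "fold_up", "pages": 0}
    pages = ceil(len(leaf_slugs) / PAGE_SIZE)
    return {"hub": hub_slug, "action": "publish", "pages": pages}
```

Run it over every category value before rendering, so a hub that fails the gate never gets a route.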
Data-source validation runs before the template, not after
Your pages are only as trustworthy as the feed behind them. Treat the data source as the first QA gate.
Pre-ingest checklist for any new feed:
☐ Step 1 — Freshness stamp. Every record carries a last_verified date. Gate: reject the feed if more than 20% of records are older than your refresh SLA.
☐ Step 2 — Null-rate ceiling per field. Gate: any field above 30% null is demoted to optional and removed from titles/H1s.
☐ Step 3 — Outlier clamp. Numeric fields get min/max bounds. Gate: a $0 or $9,999,999 price flags the record for hold, not publish.
☐ Step 4 — Canonical naming. Map source values to your controlled vocabulary (state codes, currency, units) before render. Gate: fail any unmapped enum.
☐ Step 5 — Dedupe key. Define the composite key that makes a record unique. Gate: fail the batch if duplicate keys exist.
Guardrail: ingest writes to a staging table. Production reads only records that passed all five. A failed record never silently becomes a thin page.
Ship gate: don't publish until all boxes are checked.
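The batch-level gates can be sketched roughly as below, assuming records are dicts and last_verified is a datetime.date. The function names and the example bounds passed in by the caller are hypothetical; only the 20% and 30% ceilings come from the checklist.

```python
from datetime import date, timedelta

def demoted_fields(records, fields, ceiling=0.30):
    """Step 2: fields whose null rate exceeds the ceiling."""
    n = len(records)
    return [f for f in fields
            if n and sum(r.get(f) is None for r in records) / n > ceiling]

def batch_problems(records, key_fields, sla_days, today, stale_ceiling=0.20):
    """Steps 1 and 5: stale-feed and duplicate-key gates for one batch."""
    problems = []
    cutoff = today - timedelta(days=sla_days)
    stale = sum(r["last_verified"] < cutoff for r in records)
    if records and stale / len(records) > stale_ceiling:
        problems.append("stale_feed")
    keys = [tuple(r[k] for k in key_fields) for r in records]
    if len(keys) != len(set(keys)):
        problems.append("duplicate_keys")
    return problems

def held_for_outliers(records, field, lo, hi):
    """Step 3: records whose numeric field falls outside [lo, hi]."""
    return [r for r in records
            if r.get(field) is not None and not lo <= r[field] <= hi]
```

A non-empty batch_problems result rejects the whole batch; held records stay in staging.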
Failure mode: pages reachable only via the sitemap
Programmatic sets often have no internal path to the individual page, so the sitemap is the only entry point. Google tends to treat sitemap-only URLs as low priority and may crawl them slowly or not at all. They look orphaned because they are.
Internal-link gate:
— Step 1. Owner: dev. Every generated page must receive at least 3 internal links from other crawlable pages (hub, siblings, related-by-data).
— Step 2. Owner: SEO. Run a crawl starting from the homepage only — no sitemap. Gate: every published URL is discoverable within 4 clicks.
— Step 3. Owner: dev. Build sibling/related modules from the data (same category, nearby geo, similar attribute), not random.
— Step 4. Owner: SEO. Report orphan count after each deploy. Gate: orphan count = 0.
Guardrail: the sitemap is a hint, never the primary discovery path.
Ship gate: don't publish until all boxes are checked.
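Steps 1, 2 and 4 can be checked against a link graph exported from your crawler. This is a sketch under the assumption that the graph is a dict mapping each page to the pages it links to; the function names are hypothetical.

```python
from collections import deque

def click_depths(links, start="/"):
    """Breadth-first search from the homepage; returns clicks needed per URL."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def orphans(published, links, start="/", max_clicks=4):
    """Published URLs not discoverable within max_clicks of the homepage."""
    depths = click_depths(links, start)
    return sorted(u for u in published if depths.get(u, max_clicks + 1) > max_clicks)

def under_linked(published, links, minimum=3):
    """Published URLs with fewer than `minimum` distinct inbound linking pages."""
    counts = {u: 0 for u in published}
    for src, targets in links.items():
        for t in set(targets):
            if t in counts and t != src:
                counts[t] += 1
    return sorted(u for u, c in counts.items() if c < minimum)
```

The deploy gate is then simply that orphans(...) returns an empty list.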
—
Related: @CrawlAndRender (covers crawl efficiency)
Staged rollout to starve thin pages before they ship
Don't release a full template set at once. Use a 3-tier release valve so weak pages never reach the index.
— Tier 1 (data-rich): pages where your source has 8+ populated fields. Action: publish immediately, submit in sitemap.
— Tier 2 (partial): 4-7 fields. Action: publish with noindex,follow. They pass link equity but stay out of the index until enriched.
— Tier 3 (sparse): under 4 fields. Action: do not render. Return 404 or fold into a parent hub.
The guardrail is a single field-count function in your template:
☐ Count non-null, non-boilerplate fields per record.
☐ Map count to tier.
☐ Tier drives the robots directive automatically — no human decides per page.
Weekly job: re-score Tier 2 pages. When a record crosses into 8+ fields, flip to index and add to the next sitemap. Rollback path: any page dropping below threshold flips back to noindex same day.
Ship gate: don't publish until all boxes are checked.
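The field-count function might look like the sketch below. The BOILERPLATE placeholder set and the function names are assumptions; the tier thresholds (8+, 4-7, under 4) and the robots values are taken from the tiers above.

```python
from typing import Optional, Tuple

# Assumed placeholder values that should not count as populated.
BOILERPLATE = {"", "N/A", "TBD"}

def field_count(record: dict) -> int:
    """Count non-null, non-boilerplate fields in a record."""
    return sum(1 for v in record.values()
               if v is not None and str(v).strip() not in BOILERPLATE)

def tier_for(record: dict) -> Tuple[int, Optional[str]]:
    """Map field count to (tier, robots directive); None means do not render."""
    n = field_count(record)
    if n >= 8:
        return 1, "index,follow"
    if n >= 4:
        return 2, "noindex,follow"
    return 3, None
```

The weekly re-score is the same call run again: a changed return value is the flip.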
URL pattern spec: lock it before the first page exists
URL changes after launch are the most expensive rollback in pSEO. Write the spec once, freeze it.
The pattern contract:
☐ Step 1 — One variable per path segment. /loan/{type}/{state}, never /loan/{type}-in-{state}. Gate: fail if a segment encodes two dimensions.
☐ Step 2 — Slug source is immutable. Derive from a stable ID, not the display name. When "New-York City" becomes "NYC" in your data, the URL must not move. Gate: fail if slug derives from a mutable field.
☐ Step 3 — Casing and separators fixed: lowercase, hyphen, no trailing slash. Gate: fail any uppercase or underscore.
☐ Step 4 — Reserved-word guard. Strip values that collide with existing routes (/about, /api). Gate: fail on collision.
☐ Step 5 — Max one optional segment, and it must 301 to the canonical short form.
Guardrail: a unit test that generates 1,000 slugs from sample data and asserts zero duplicates and zero reserved-word hits.
Ship gate: don't publish until all boxes are checked.
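The guardrail test could start from a checker like this one. It is a sketch: the regex is one assumed way to encode "lowercase, hyphen", the RESERVED set only holds the two routes named above, and slug_errors is a hypothetical name.

```python
import re

RESERVED = {"about", "api"}  # extend with your real route table
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")  # lowercase, hyphen-separated

def slug_errors(slugs):
    """Return a list of format, reserved-word and duplicate violations."""
    errors, seen = [], set()
    for s in slugs:
        if not SLUG_RE.match(s):
            errors.append("bad format: " + s)
        if s in RESERVED:
            errors.append("reserved: " + s)
        if s in seen:
            errors.append("duplicate: " + s)
        seen.add(s)
    return errors
```

In the unit test, generate 1,000 slugs from sample data and assert the returned list is empty.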
The internal-link injection SOP (deterministic, not random)
Random "related links" blocks leak crawl budget and link to dead-ends. Make linking a deterministic function of your data graph.
For each generated page, inject links in this fixed order:
— 1 link up to the parent hub (the {category} index). Owner: template. Gate: must exist.
— 3 sibling links to the nearest neighbors on your primary dimension (e.g. adjacent price tiers, same city). Owner: a ranked-neighbor query. Gate: siblings must themselves be indexable.
— 2 cross-dimension links (same {type}, different {region}). Gate: skip any target that is noindex.
— 1 link to the highest-authority page in the cluster (your money page).
Rules that keep it clean:
☐ Never link to a page that links back identically — break reciprocal loops.
☐ Anchor text pulls the target's H1 token, not a generic "click here."
☐ Cap total in-template links at 7 to avoid dilution.
☐ Run a monthly orphan report; any page with under 3 inbound internal links gets force-added to a neighbor's block.
Ship gate: don't publish until all boxes are checked.
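One way to make the block deterministic is a pure function over pre-ranked inputs. This sketch assumes siblings and cross-dimension candidates arrive already sorted nearest-first and that `indexable` is a set of URLs currently allowed in the index; build_links and its parameters are hypothetical names.

```python
def build_links(page, hub, siblings, cross, money_page, indexable, cap=7):
    """Assemble links in fixed order: hub, 3 siblings, 2 cross-dimension, money page."""
    links = [hub]
    links += [s for s in siblings if s in indexable][:3]   # skip noindex siblings
    links += [c for c in cross if c in indexable][:2]      # skip noindex targets
    links.append(money_page)
    # Drop self-links and duplicates while preserving order, then apply the cap.
    seen, out = {page}, []
    for url in links:
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out[:cap]
```

Same inputs always give the same block, so a diff between deploys shows exactly which links moved.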
Pairs well with this channel
@OverviewHotTake — Strong, unfiltered opinions on AI Overviews and generative search — where it's… Quietly one of the better feeds in the space.
Crawl-budget release SOP for 100k+ page sets
Dumping 100,000 URLs into one sitemap teaches Googlebot nothing about priority. Meter the release.
The rollout schedule:
☐ Step 1 — Split sitemaps by tier, not by alphabet. sitemap-priority.xml (proven-demand pages) ships first and alone.
☐ Step 2 — Cap week-one exposure at the count your log files show Googlebot already crawls daily, times 5. Gate: don't exceed it.
☐ Step 3 — Watch the index ratio: indexed URLs (from Search Console) ÷ crawled URLs (from your logs). Gate: hold the next batch until the ratio is above 0.7.
☐ Step 4 — Release subsequent tiers only when the prior tier's index rate stabilizes for 7 days.
☐ Step 5 — Keep a lastmod that is honest. Faking it to trigger recrawl burns trust and crawl budget.
Guardrail: a daily job parses access logs and alerts if crawl requests to thin tiers exceed 15% of bot hits — a sign Google is wasting budget on pages you should have gated.
Ship gate: don't publish until all boxes are checked.
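The three numeric gates reduce to a few lines of arithmetic. A sketch, assuming crawl counts are parsed from access logs and indexed counts come from a separate source such as Search Console; all function names are hypothetical.

```python
def week_one_cap(daily_googlebot_hits, multiplier=5):
    """Step 2: maximum URLs to expose in week one."""
    return daily_googlebot_hits * multiplier

def may_release_next(indexed, crawled, threshold=0.7):
    """Step 3: release the next batch only when indexed/crawled is above threshold."""
    if crawled == 0:
        return False
    return indexed / crawled > threshold

def thin_tier_alert(thin_hits, total_bot_hits, ceiling=0.15):
    """Guardrail: alert when thin-tier crawl requests exceed the ceiling share."""
    return total_bot_hits > 0 and thin_hits / total_bot_hits > ceiling
```

For example, 2,000 daily Googlebot requests gives a week-one cap of 10,000 URLs.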
Guardrail: the duplicate-title scanner
At scale, near-duplicate titles are the quiet killer — "Best {x} in {city}" times 5,000 reads as one page to a clustering algorithm. Install a hard guard.
The title-uniqueness routine:
☐ Step 1 — Generate all titles in a dry run, no publish.
☐ Step 2 — Strip the variable tokens, hash the static skeleton. Gate: if 100% of titles share one skeleton with only the city swapped, the template fails. Inject a second varying data point (rating, count, year).
☐ Step 3 — Levenshtein-cluster the full title strings. Gate: fail any cluster where more than 50 titles sit within edit-distance 5 of each other.
☐ Step 4 — Enforce a length band of 50-60 characters AFTER token substitution, using the longest real value, not the average. Gate: fail if the max-length value truncates.
☐ Step 5 — Meta descriptions get the same scan, with a 30-character minimum unique span per page.
Guardrail: this scanner runs in CI on every template change, not just at launch.
Ship gate: don't publish until all boxes are checked.
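Steps 2 and 4 can be sketched as follows, assuming each dry-run row is a (title, tokens) pair where tokens maps token names to the string values substituted in. The string replace is deliberately naive and will misfire if a token value also appears in the static text; all names here are hypothetical. The Levenshtein clustering in Step 3 is not covered.

```python
import hashlib

def skeleton(title, tokens):
    """Swap each token value back to its placeholder (longest values first)."""
    for name, value in sorted(tokens.items(), key=lambda kv: -len(kv[1])):
        title = title.replace(value, "{" + name + "}")
    return title

def skeleton_hash(title, tokens):
    """Hash of the static skeleton, used to group titles by template."""
    return hashlib.sha256(skeleton(title, tokens).encode("utf-8")).hexdigest()

def template_fails(rows):
    """Fail when every title shares one skeleton and at most one token varies."""
    hashes = {skeleton_hash(title, tokens) for title, tokens in rows}
    names = {name for _, tokens in rows for name in tokens}
    varying = {n for n in names
               if len({tokens.get(n) for _, tokens in rows}) > 1}
    return len(hashes) == 1 and len(varying) <= 1

def length_ok(title, lo=50, hi=60):
    """Step 4: rendered title must sit inside the length band."""
    return lo <= len(title) <= hi
```

Feed length_ok the title rendered with the longest real value for each token, not a typical one.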
The promotion ladder: how a page earns the index
Indexing should be earned, not granted at birth. Run every generated page up a ladder.
The rungs (a page sits on the highest rung it qualifies for):
— Rung 0 — Rendered, noindex,follow, not in sitemap. Default for every new page. It passes link equity, collects internal links, stays invisible to search.
— Rung 1 — Promote to index + sitemap when: required fields complete AND (at least 1 organic impression OR 3 internal inbound links). Gate: both data and demand signals.
— Rung 2 — Add to priority sitemap when the page holds page-2 visibility for any query for 14 days. Gate: sustained, not a one-day spike.
— Rung 3 — Link from the money page when it converts or ranks top-10.
Demotion is automatic:
☐ Field count drops below required → back to Rung 0.
☐ Zero impressions in 90 days → back to Rung 0, remove from sitemap.
Guardrail: the ladder is a nightly job, not a manual review. No page promotes itself.
Ship gate: don't publish until all boxes are checked.
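A sketch of the nightly scoring function, under several assumptions: the PageStats fields are hypothetical names for signals you would pull from your data store and Search Console, the 90-day demotion clock is assumed to start at promotion, and the function returns the highest rung whose condition holds, which is one reading of the ladder.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    fields_complete: bool       # all required fields populated
    impressions_90d: int        # organic impressions, last 90 days
    inbound_links: int          # internal inbound links
    days_on_page_two: int       # consecutive days with page-2 visibility
    top10_or_converts: bool     # ranks top-10 or has converted
    days_since_promotion: int   # 0 for pages never promoted

def rung(p: PageStats) -> int:
    """Return the rung (0-3) a page should sit on tonight."""
    if not p.fields_complete:
        return 0                                  # demotion: data gate failed
    if p.days_since_promotion >= 90 and p.impressions_90d == 0:
        return 0                                  # demotion: no demand in 90 days
    if not (p.impressions_90d >= 1 or p.inbound_links >= 3):
        return 0                                  # no demand signal yet
    if p.top10_or_converts:
        return 3
    if p.days_on_page_two >= 14:
        return 2
    return 1
```

Because the function is pure, the nightly job just compares tonight's rung to the stored one and applies the difference.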
SOP: generate schema from the same record that renders the page
Hand-written JSON-LD drifts from visible content at scale, and markup that does not match the page can draw a structured-data manual action or cost rich-result eligibility. Bind schema to the source.
The binding rules:
☐ Step 1 — One serializer per entity type. The Product page and its Product schema read the same record object. Gate: no field appears in JSON-LD that isn't on the page.
☐ Step 2 — Null-safe by construction. A missing price omits the offers node entirely. Gate: never emit "price": null or a placeholder.
☐ Step 3 — No invented review counts or ratings. Gate: aggregateRating renders only when real review data exists for that record.
☐ Step 4 — Validate in CI. Run the structured-data test on 20 sampled records per template. Gate: fail the build on any error, not warning-and-ship.
☐ Step 5 — Type honesty. A list page is ItemList or CollectionPage, not Article. Gate: type must match page intent.
Guardrail: a monthly diff comparing visible fields to schema fields flags drift before Google does.
Ship gate: don't publish until all boxes are checked.
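A null-safe serializer for Steps 2 and 3 might look like this. The record keys (name, price, currency, rating, review_count) are assumed field names, not a known feed format; the schema.org types and properties are the standard Product, Offer and AggregateRating ones.

```python
import json

def product_jsonld(record: dict) -> dict:
    """Build Product JSON-LD from the same record that renders the page."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": record["name"],
    }
    # Step 2: omit the whole offers node unless price and currency are real.
    if record.get("price") is not None and record.get("currency"):
        data["offers"] = {
            "@type": "Offer",
            "price": record["price"],
            "priceCurrency": record["currency"],
        }
    # Step 3: emit aggregateRating only when real review data exists.
    if record.get("review_count") and record.get("rating") is not None:
        data["aggregateRating"] = {
            "@type": "AggregateRating",
            "ratingValue": record["rating"],
            "reviewCount": record["review_count"],
        }
    return data

def render_script_body(record: dict) -> str:
    """Serialized JSON for the page's ld+json script tag."""
    return json.dumps(product_jsonld(record))
```

Passing the template and the serializer the same record object is what keeps the two from drifting.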
Faceted navigation: the index-bloat firewall
Filters and sorts multiply URLs combinatorially. Three filters with ten options each is a thousand crawlable variants of one page. Build the firewall before you build the filters.
The control matrix — decide per parameter, once:
☐ Indexable facets: the 1-2 dimensions with real search demand (e.g. {category}, {city}). Clean path URLs, indexed, in sitemap.
☐ Non-indexable facets: sort, view, page-size, in-stock toggles. Action: query string + noindex + canonical to the unfiltered version. Gate: never a crawlable path.
☐ Combination cap: indexable only for single-facet and the 5 highest-demand two-facet pairs. Everything else canonicalizes up. Gate: fail any 3-facet URL that returns 200 and is indexable.
☐ Parameter order: enforce a canonical order so ?a=1&b=2 and ?b=2&a=1 don't become two URLs.
☐ Internal links never point at non-indexable facet URLs.
Guardrail: a crawl of your own filters that asserts the indexable URL count matches your matrix, not the combinatorial total.
Ship gate: don't publish until all boxes are checked.
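Parameter ordering and stripping of non-indexable facets can share one canonicalizer. A sketch using only the standard library; the parameter names in NON_INDEXABLE are assumed spellings of the sort, view, page-size and in-stock toggles.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed query-parameter names for the non-indexable facets.
NON_INDEXABLE = {"sort", "view", "page_size", "in_stock"}

def canonical_url(url: str) -> str:
    """Drop non-indexable parameters and sort the rest into a fixed order."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in NON_INDEXABLE]
    params.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))
```

Use the result both as the rel=canonical target and as the key when counting indexable URLs in your own crawl.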
The decay-detection SOP for generated page sets
pSEO pages rot silently — a data feed goes stale, a competitor refreshes, rankings slide across 5,000 pages at once. Run scheduled detection.
Monthly decay job:
☐ Step 1 — Pull 90-day-over-90-day clicks and impressions per URL pattern, grouped by template. Gate: flag any template losing both metrics by 20%+.
☐ Step 2 — Cross-reference data freshness. Gate: if decay correlates with stale last_verified dates, the fix is the feed, not the copy.
☐ Step 3 — Sample the SERP for 10 decayed pages. Gate: if a SERP feature (AI overview, pack) now owns the query, mark the template for format change, not refresh.
☐ Step 4 — Triage. Refresh data, rewrite the differentiation block, or retire to noindex. Each decayed page gets exactly one disposition.
☐ Step 5 — Re-submit only refreshed pages in the sitemap with an honest lastmod.
Guardrail: track "pages refreshed vs pages decayed" as a rolling ratio. If you're refreshing slower than decay, freeze new generation until you catch up.
Ship gate: don't publish until all boxes are checked.
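Step 1 and the guardrail are simple comparisons once the data is grouped. A sketch, assuming you have already aggregated clicks and impressions per template for two consecutive 90-day windows; the function names and input shapes are hypothetical.

```python
def decayed_templates(current, previous, drop=0.20):
    """Flag templates that lost at least `drop` of both clicks and impressions.

    current / previous: {template: (clicks, impressions)} per 90-day window.
    """
    flagged = []
    for tpl, (prev_clicks, prev_impr) in previous.items():
        clicks, impr = current.get(tpl, (0, 0))
        if prev_clicks == 0 or prev_impr == 0:
            continue  # no baseline to compare against
        lost_clicks = clicks <= prev_clicks * (1 - drop)
        lost_impr = impr <= prev_impr * (1 - drop)
        if lost_clicks and lost_impr:
            flagged.append(tpl)
    return sorted(flagged)

def freeze_generation(refreshed, decayed):
    """Guardrail: stop generating new pages while refreshes trail decay."""
    return refreshed < decayed
```

A template that loses clicks but holds impressions is not flagged here; that pattern points at the snippet or SERP layout rather than decay.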
The differentiation budget: how much must be unique per page
"Unique content" is too vague to gate on. Set a numeric budget and enforce it in the template.
The per-page budget:
— At least 35% of rendered words must come from record-specific fields, not the shared skeleton. Gate: a word-source tagger fails any page below 35%.
— At least 1 computed value that no sibling shares: a ratio, a delta vs. category average, a rank. Gate: this number must change across pages or the block is boilerplate.
— At least 1 data point in the H1 or first 100 words. Gate: the opening must not be identical across the set.
How to enforce:
☐ Tag every template token as STATIC or DYNAMIC at build time.
☐ Render 100 sample pages, measure the DYNAMIC ratio.
☐ If a template can't clear 35%, the dimension is too thin — merge records into fewer, richer pages instead of generating many empty ones.
☐ Re-measure on every template edit; static additions silently erode the ratio.
Ship gate: don't publish until all boxes are checked.
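The DYNAMIC ratio can be measured on rendered output if the template tags each chunk at build time. A sketch, assuming a rendered page is available as a list of (text, kind) pairs; the function names are hypothetical and the 35% floor comes from the budget above.

```python
def dynamic_ratio(chunks):
    """Share of rendered words that come from DYNAMIC (record-specific) chunks."""
    total = sum(len(text.split()) for text, _ in chunks)
    dynamic = sum(len(text.split()) for text, kind in chunks if kind == "DYNAMIC")
    return dynamic / total if total else 0.0

def passes_budget(chunks, minimum=0.35):
    """Gate: fail any page below the differentiation budget."""
    return dynamic_ratio(chunks) >= minimum
```

Run it on the 100-page sample and fail the template if any page, not just the average, falls under the floor.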
Sitemap hygiene SOP for dynamic page sets
A sitemap that lists noindex pages, 404s, or redirects sends conflicting signals at scale. Treat it as a managed artifact, regenerated nightly.
The generation contract:
☐ Step 1 — Source of truth is the index ladder, not the route table. Only Rung-1+ pages get listed. Gate: a noindex URL in the sitemap fails CI.
☐ Step 2 — Status pre-check. Sample-fetch entries; any non-200 is excluded. Gate: fail the build if more than 1% of a sample returns errors.
☐ Step 3 — Shard at 45,000 URLs (under the 50k limit, with headroom) and register all shards in a sitemap index.
☐ Step 4 — Honest lastmod from the record's real update timestamp. Gate: no blanket "today" stamps.
☐ Step 5 — Diff vs. yesterday. Log added/removed URLs. A sudden 10k drop should alert, not ship silently.
Guardrail: monthly, reconcile sitemap URL count against Search Console's indexed count. A widening gap means the firewall upstream is leaking thin pages.
Ship gate: don't publish until all boxes are checked.
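Steps 1, 3 and 5 can be sketched as a filter, a shard and a diff. The page dict keys (url, rung, robots) and the function names are assumptions; the 45,000 shard size is from Step 3. Fetching status codes (Step 2) and writing XML are left out.

```python
def shard_sitemap(pages, shard_size=45_000):
    """List only Rung-1+ pages, fail on any noindex entry, split into shards."""
    listed = [p for p in pages if p["rung"] >= 1]
    bad = [p["url"] for p in listed if "noindex" in p["robots"]]
    if bad:
        # Step 1 gate: a noindex URL in the sitemap should fail CI.
        raise ValueError("noindex URLs in sitemap: %d" % len(bad))
    return [listed[i:i + shard_size] for i in range(0, len(listed), shard_size)]

def diff_urls(today, yesterday):
    """Step 5: URLs added and removed since the previous build."""
    today, yesterday = set(today), set(yesterday)
    return sorted(today - yesterday), sorted(yesterday - today)
```

Alert on the size of the removed list before publishing the new shards, so a large drop is reviewed rather than shipped.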