Smartphone vs desktop Googlebot in your logs
Mobile-first indexing means the smartphone crawler should dominate your logs. References to check the split.
→ Google's crawler overview — the exact smartphone vs desktop Googlebot UA strings.
Takeaway: 'Googlebot/2.1' with a mobile token is the one that matters now.
⭐ Pick of the week: the ratio check many SEOs run — grep both UAs, compare counts.
grep -c 'Mobile.*Googlebot' access.log vs the desktop variant.
Takeaway: heavy desktop-crawler share can signal you're not seen as mobile-first.
→ Search Engine Journal's explainer on reading the mobile-first shift in logs.
→ Google's note that desktop crawling continues at low volume — so it's a ratio, not zero.
Credit to Google Search Central and SEJ.
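The ratio check can be sketched roughly like this. It assumes the smartphone UA carries a "Mobile" token ahead of "Googlebot/2.1" and the desktop UA does not; the function name googlebot_split is made up for illustration, and UA matching alone does not prove a hit came from Google.

```shell
# Count smartphone vs desktop Googlebot hits in one access log.
# Assumption: "Mobile" appears before "Googlebot/2.1" only in the
# smartphone crawler UA. UA strings can be spoofed, so read the
# result as a rough ratio, not a verified count.
googlebot_split() {
    log="$1"
    # grep -c exits non-zero on zero matches, hence the "|| true".
    mobile=$(grep -c 'Mobile.*Googlebot/2\.1' "$log" || true)
    total=$(grep -c 'Googlebot/2\.1' "$log" || true)
    desktop=$((total - mobile))
    echo "smartphone=$mobile desktop=$desktop"
}
```

Run it as googlebot_split access.log and compare the two numbers week over week rather than reading a single day in isolation.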
—
Neighbouring channel in the network: @affcareers_remote
Verifying Googlebot in your logs: 5 sources that get it right
User-Agent spoofing is rampant, and a bare reverse DNS lookup can be faked too, so here's the canon on confirming a hit is really Google.
→ Google Search Central — the official reverse-then-forward DNS recipe (host on the IP, then confirm it resolves back). Boring, authoritative, the baseline everyone else builds on.
→ Google's exported IP ranges — googlebot.json and special-crawlers.json let you match by CIDR instead of slow DNS lookups. Cache them; they change.
★ Pick of the week — Stephan Boyer's writeup on rDNS pitfalls — explains why a forward-confirmed PTR is non-negotiable and how attackers fake the User-Agent string alone.
→ Cloudflare Radar docs — good primer on verified-bot signatures if you sit behind a proxy and lose the real IP.
→ iplists community lists — handy when you need Bingbot and others too.
Takeaway: never trust the UA string. Confirm the IP, or you're counting impostors.
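A minimal sketch of the reverse-then-forward check, assuming the host utility is installed and that valid PTR names end in googlebot.com or google.com (check that list against Google's current docs). Both function names are hypothetical.

```shell
# Pure check: does a PTR hostname sit under a Google crawler domain?
# The domain list is an assumption; confirm it against Google's docs.
is_google_hostname() {
    case "${1%.}" in                      # drop the trailing dot from PTR output
        *.googlebot.com|*.google.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Reverse-then-forward lookup with the `host` utility.
# Succeeds only if the PTR name is under a Google domain AND that
# name resolves back to the original IP.
verify_googlebot_ip() {
    ip="$1"
    ptr=$(host "$ip" 2>/dev/null | awk '/domain name pointer/ { print $NF; exit }')
    [ -n "$ptr" ] || return 1
    is_google_hostname "$ptr" || return 1
    # Forward-confirm: some address record for the name must equal the IP.
    host "${ptr%.}" 2>/dev/null \
        | awk -v ip="$ip" '$NF == ip { found = 1 } END { exit !found }'
}
```

Something like verify_googlebot_ip 66.249.66.1 && echo verified fits a one-off check. For bulk log work, matching against the cached CIDR ranges is likely the better fit, since two DNS lookups per IP get slow.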
Status-code mining with one-liners: the awk greatest hits
Four snippets worth pinning above your terminal, all credited to where I first saw them.
🔗 Julia Evans (wizardzines) — her awk explainers make awk '{print $9}' | sort | uniq -c | sort -rn finally click. Best mental model for field-splitting logs.
🔗 nixCraft — classic tutorial for isolating 5xx by hour: awk '$9~/^5/{print substr($4,14,2)}'. Find the spike, then the cause.
★ Pick of the week — Brendan Gregg's "awk and the log" notes — treats logs as columnar data and shows running-tally tricks (associative arrays) that beat piping to sort.
🔗 Greg's Wiki (BashFAQ) — the quoting and field-number gotchas that save you from off-by-one column bugs.
Takeaway: 90% of log triage is field 9 (status) grouped by something. Master that, skip the heavy tools.
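The "field 9 grouped by something" idea, sketched with an awk associative array. This assumes the combined log format, where $4 holds the timestamp and $9 the status; status_by_hour is a made-up name, not taken from any of the sources above.

```shell
# Tally status codes per hour in a single awk pass.
# Assumes combined log format: $4 is "[dd/Mon/yyyy:HH:MM:SS", $9 is the status.
status_by_hour() {
    awk '{
        hour = substr($4, 14, 2)     # HH from the timestamp
        count[hour " " $9]++         # running tally keyed by "hour status"
    }
    END {
        for (key in count) print key, count[key]
    }' "$@" | sort
}
```

Each output line reads hour, status, count. The final sort only orders the already-aggregated rows, so it stays cheap even on large logs.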
Crawl budget, measured from logs (not guessed)
Everyone talks crawl budget; these sources actually quantify it from server data.
→ Google's Gary Illyes (crawl budget post) — the source definition: crawl rate limit + crawl demand. Read it before any tool's dashboard.
→ OnCrawl's log studies — ties Googlebot hit frequency to indexation lag with real charts. Their "crawl-to-index ratio" framing is genuinely useful.
★ Pick of the week — JetOctopus crawl-frequency breakdown — shows how to bucket URLs by hits/30 days and spot the long tail Google ignores. Concrete percentile thresholds, not vibes.
→ Screaming Frog log analyser docs — solid for matching crawled URLs against your sitemap to find orphan hits.
Takeaway: crawl budget waste = bot hits on parameter/faceted URLs. Logs are the only place you see it before it costs you.
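One rough way to put a number on that waste: the share of Googlebot hits landing on URLs with a query string. This sketch assumes combined log format ($7 is the request path) and matches on the UA string only; param_crawl_share is a hypothetical name.

```shell
# Share of Googlebot hits that land on parameterised URLs (anything with "?").
# Assumes combined log format, where $7 is the request path. UA matching is
# a shortcut; verify IPs first if spoofed traffic is a concern.
param_crawl_share() {
    awk '/Googlebot/ {
        total++
        if (index($7, "?") > 0) param++
    }
    END {
        if (total == 0) { print "no Googlebot hits"; exit }
        printf "param=%d total=%d share=%.1f%%\n", param, total, 100 * param / total
    }' "$@"
}
```

A query string is only a proxy for faceted or parameter URLs. Sites with path-based facets would need a different test on $7.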
Pairs well with this channel
@LinkEquityHeat — Strong, unfiltered opinions on internal linking — why your 'related posts' widget is… Quietly one of the better feeds in the space.
Log sampling done right: when a slice lies to you
Sampling huge logs is fine — until it quietly drops your rare-but-important events.
→ Honeycomb's dynamic sampling docs — the clearest explanation of why uniform sampling buries 5xx and bot edge cases, and how head/tail sampling fixes it.
★ Pick of the week — Liz Fong-Jones on observability sampling — argues for keeping 100% of errors and sampling the boring 200s. The principle transfers straight to crawl logs: never sample away Googlebot's 4xx hits.
→ Elastic's sampling guide — practical knobs if you're in the ELK stack.
→ VividCortex's "approximate" series — good stats intuition on confidence intervals for sampled counts.
Takeaway: sample your 200-OK noise, keep every error and every bot 3xx/4xx/5xx. Those are the rows you analyse logs for.
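The keep-errors, sample-the-200s principle can be sketched in awk. This is a deterministic 1-in-N filter chosen for reproducibility, not the algorithm used by Honeycomb or Elastic; it assumes combined log format ($9 is the status), and sample_log is a made-up name.

```shell
# Keep every non-200 row; keep only every Nth 200 row.
# Assumes combined log format ($9 = status). Deterministic 1-in-N
# sampling, used here so repeated runs give the same slice.
sample_log() {
    n="$1"; shift
    awk -v n="$n" '
        $9 != 200 { print; next }      # errors and redirects always survive
        (++seen % n) == 0 { print }    # 1 in n of the 200s
    ' "$@"
}
```

With sample_log 100 access.log > sampled.log, counts of 200s in the output need multiplying by 100, while error counts can be read as-is.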
GoAccess and friends: real-time log dashboards without a stack
For when you want a live view, not a data warehouse.
🔗 GoAccess official docs — single binary, terminal or HTML dashboard, parses combined format out of the box. The benchmark for zero-setup.
🔗 Allan Barizo's GoAccess + geoip tutorial — adds country breakdown so you can spot scraper farms by region.
★ Pick of the week — Gerardo Orellana's custom log-format guide (GoAccess author) — how to write a --log-format string for any weird Nginx or CDN log. Stops the "token not found" frustration cold.
🔗 lnav (Log File Navigator) — underrated TUI that auto-detects formats and does SQL over logs. Great for ad-hoc digging.
Takeaway: GoAccess for the live wall-display, lnav for interactive spelunking. Neither needs a database.
Log the right fields: Nginx/Apache format tweaks for SEO
Default log formats omit data you'll wish you had. These sources fix that.
→ Nginx log_format docs — add $request_time and $upstream_response_time so you can see which URLs make Googlebot wait.
→ Apache mod_log_config manual — %D (microseconds) and %{Host}i for multi-site servers. Without Host you can't split bot traffic per domain.
★ Pick of the week — Barry Adams on "log everything Googlebot sees" — argues for capturing full UA, response time, and bytes, because that's how you prove slow pages throttle crawl rate.
→ Cloudflare Logpush field reference — if a CDN fronts you, your origin logs lie; pull edge logs instead.
Takeaway: add response time and Host now. You can't analyse fields you never logged.
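Once response time is in the log, a sketch like this surfaces the URLs that keep Googlebot waiting. It assumes the format was extended so that the last field on each line is Nginx $request_time in seconds, with $7 still the request path; that layout and the name slow_for_googlebot are assumptions, so adjust the field references to your own format.

```shell
# Mean response time per URL for Googlebot hits, slowest first.
# Assumption: the LAST field on each line is $request_time (seconds)
# and $7 is the request path, as in an extended combined format.
slow_for_googlebot() {
    awk '/Googlebot/ {
        sum[$7] += $NF
        hits[$7]++
    }
    END {
        for (url in sum)
            printf "%.3f %d %s\n", sum[url] / hits[url], hits[url], url
    }' "$@" | sort -rn
}
```

Output columns are mean seconds, hit count, and URL. Piping into head -20 keeps the list to the worst offenders.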
grep for logs: the patterns I keep reaching for
Not a beginner intro — these are the references that sharpened my filtering.
🔗 Julia Evans' grep zine — grep -P for Perl regex and why -F (fixed strings) is faster when you just want a literal IP.
🔗 ripgrep (BurntSushi) docs — rg over gzipped logs with -z is dramatically faster than zcat | grep on big archives.
★ Pick of the week — "Use The Index, Luke"-style log thinking by Jon Henshaw — combining grep with grep -v chains to strip your own monitoring/uptime bots before counting real crawl. The cleanup nobody documents.
🔗 Greg's Wiki on grep+zcat pipelines — handling rotated .gz logs without decompressing to disk.
Takeaway: learn ripgrep's -z and the grep -v exclusion chain. Faster searches, cleaner counts.
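A small sketch of the exclusion chain over rotated logs, using plain gzip and grep so it runs without ripgrep. The patterns UptimeRobot and Pingdom are examples only (swap in whatever monitors hit your site), and crawler_hit_count is a hypothetical name.

```shell
# Count crawler-looking hits across plain and rotated .gz logs,
# after dropping monitoring traffic. A loose pattern like "bot" also
# matches "UptimeRobot", which is why the exclusion runs first.
# "UptimeRobot" and "Pingdom" are example patterns, not a complete list.
crawler_hit_count() {
    gzip -dcf "$@" \
        | grep -v -e 'UptimeRobot' -e 'Pingdom' \
        | grep -ci 'bot' || true       # grep -c exits non-zero on a zero count
}
```

Usage would look like crawler_hit_count access.log access.log.*.gz. gzip -dcf streams both compressed and uncompressed files, so nothing is decompressed to disk.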
