Pairs well with this channel
@LinkEquityHeat — Strong, unfiltered opinions on internal linking — why your 'related posts' widget is… Quietly one of the better feeds in the space.
Log sampling done right: when a slice lies to you
Sampling huge logs is fine — until it quietly drops your rare-but-important events.
→ Honeycomb's dynamic sampling docs — the clearest explanation of why uniform sampling buries 5xx and bot edge cases, and how head/tail sampling fixes it.
★ Pick of the week — Liz Fong-Jones on observability sampling — argues for keeping 100% of errors and sampling the boring 200s. The principle transfers straight to crawl logs: never sample away Googlebot's 4xx hits.
→ Elastic's sampling guide — practical knobs if you're in the ELK stack.
→ VividCortex's "approximate" series — good stats intuition on confidence intervals for sampled counts.
Takeaway: sample your 200-OK noise, keep every error and every bot 3xx/4xx/5xx. Those are the rows you analyse logs for.
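A minimal sketch of that takeaway, assuming you have already parsed each line into a status code and a bot flag. The function name keep_line and the sample_rate parameter are hypothetical, not taken from any of the tools linked above.

```python
import random


def keep_line(status: int, is_bot: bool, sample_rate: float, rng: random.Random) -> bool:
    """Decide whether one parsed log line survives sampling.

    Assumption: `status` is the HTTP status code and `is_bot` was set
    upstream by whatever bot classification you trust.
    """
    if status >= 400:
        # Every 4xx/5xx is kept, bot or human.
        return True
    if is_bot and status >= 300:
        # Bot redirects are kept as well.
        return True
    # Everything else (mostly 200s) is sampled at the given rate.
    return rng.random() < sample_rate


if __name__ == "__main__":
    rng = random.Random(42)
    rows = [(200, False), (200, True), (301, True), (404, False), (503, True)]
    kept = [row for row in rows if keep_line(row[0], row[1], 0.1, rng)]
    print(kept)
```

If you count from the sampled file later, remember that each surviving sampled row stands for roughly 1 / sample_rate original rows, while the always-kept rows stand for exactly one.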
GoAccess and friends: real-time log dashboards without a stack
For when you want a live view, not a data warehouse.
🔗 GoAccess official docs — single binary, terminal or HTML dashboard, parses combined format out of the box. The benchmark for zero-setup.
🔗 Allan Barizo's GoAccess + geoip tutorial — adds country breakdown so you can spot scraper farms by region.
★ Pick of the week — Gerardo Orellana's custom log-format guide (GoAccess author) — how to write a --log-format string for any weird Nginx or CDN log. Stops the "token not found" frustration cold.
🔗 lnav (Log File Navigator) — underrated TUI that auto-detects formats and does SQL over logs. Great for ad-hoc digging.
Takeaway: GoAccess for the live wall-display, lnav for interactive spelunking. Neither needs a database.
Log the right fields: Nginx/Apache format tweaks for SEO
Default log formats omit data you'll wish you had. These sources fix that.
→ Nginx log_format docs — add $request_time and $upstream_response_time so you can see which URLs make Googlebot wait.
→ Apache mod_log_config manual — %D (microseconds) and %{Host}i for multi-site servers. Without Host you can't split bot traffic per domain.
★ Pick of the week — Barry Adams on "log everything Googlebot sees" — argues for capturing full UA, response time, and bytes, because that's how you prove slow pages throttle crawl rate.
→ Cloudflare Logpush field reference — if a CDN fronts you, your origin logs lie; pull edge logs instead.
Takeaway: add response time and Host now. You can't analyse fields you never logged.
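Once those fields exist, using them is short work. The sketch below assumes a hypothetical Nginx format: the standard combined format with $request_time and $host appended at the end of each line. The regex and the function name slowest_googlebot_paths are illustrative, so adjust both to whatever log_format you actually define.

```python
import re
from collections import defaultdict

# Assumed layout: combined format followed by " $request_time $host".
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)" '
    r'(?P<request_time>[\d.]+) (?P<host>\S+)$'
)


def slowest_googlebot_paths(lines, top=5):
    """Return the (host, path) pairs with the highest mean request time.

    Only lines whose user agent mentions Googlebot are counted; the UA
    string alone does not prove the hit came from Google.
    """
    times = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or "Googlebot" not in m["ua"]:
            continue  # skip unparseable lines and non-Googlebot traffic
        times[(m["host"], m["path"])].append(float(m["request_time"]))
    averages = {key: sum(vals) / len(vals) for key, vals in times.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    import sys

    for (host, path), seconds in slowest_googlebot_paths(sys.stdin):
        print(f"{seconds:.3f}s  {host}{path}")
```

Grouping by host as well as path is the reason for logging Host: on a multi-site server the same path can be fast on one domain and slow on another.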
grep for logs: the patterns I keep reaching for
Not a beginner intro — these are the references that sharpened my filtering.
🔗 Julia Evans' grep zine — grep -P for Perl regex and why -F (fixed strings) is faster when you just want a literal IP.
🔗 ripgrep (BurntSushi) docs — rg over gzipped logs with -z is dramatically faster than zcat | grep on big archives.
★ Pick of the week — "Use The Index, Luke"-style log thinking by Jon Henshaw — combining grep with grep -v chains to strip your own monitoring/uptime bots before counting real crawl. The cleanup nobody documents.
🔗 Greg's Wiki on grep+zcat pipelines — handling rotated .gz logs without decompressing to disk.
Takeaway: learn ripgrep's -z and the grep -v exclusion chain. Faster searches, cleaner counts.
Screaming Frog Log File Analyser: the under-read docs
Most people own it for the crawler; the log tool is quietly excellent.
→ Screaming Frog's official log analyser guide — import, then the "URLs not in crawl" view instantly surfaces orphan pages bots find but your site doesn't link.
★ Pick of the week — Dan Sharp's tutorial on matching logs to crawl data — the killer move: import a crawl AND logs, then filter for URLs Googlebot hits that return non-200. Bot-discovered errors, ranked.
→ Aleyda Solis on log-driven prioritization — using crawl frequency to decide which fixes Google will actually re-crawl soon.
→ Glenn Gabe's case notes — real examples of log analysis catching a spike of bot hits on a redirect loop.
Takeaway: combine crawl export + logs. The intersection (bot-hit + broken) is your fix list.
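If you want the same intersection outside the tool, it is a few lines of set logic. This is a rough sketch, assuming you have exported the crawled URLs as a list and reduced the logs to (url, status) pairs for Googlebot hits. The function name and the output fields are hypothetical and not Screaming Frog's export format.

```python
from collections import Counter


def bot_discovered_errors(bot_hits, crawled_urls):
    """Rank non-200 URLs that the bot requested, most-hit first.

    bot_hits: iterable of (url, status) tuples taken from the logs.
    crawled_urls: iterable of URLs found by your own site crawl.
    """
    crawled = set(crawled_urls)
    counts = Counter((url, status) for url, status in bot_hits if status != 200)
    return [
        {
            "url": url,
            "status": status,
            "hits": hits,
            # False means the bot found a URL your crawl did not: an orphan.
            "in_crawl": url in crawled,
        }
        for (url, status), hits in counts.most_common()
    ]


if __name__ == "__main__":
    hits = [("/a", 200), ("/b", 404), ("/b", 404), ("/c", 301)]
    for row in bot_discovered_errors(hits, ["/a", "/b"]):
        print(row)
```

Rows with in_crawl set to True are broken pages you still link to; rows with it set to False are orphans the bot keeps requesting.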
Spotting fake crawlers in your logs: 5 solid reads
Scrapers love wearing a Googlebot costume. Here's how the pros unmask them.
→ Google's "verifying Googlebot" page — rDNS is step one, always.
★ Pick of the week — Cloudflare's bot-management writeup — breaks down the tells: UA claims Googlebot but the IP is a residential proxy or a datacenter ASN Google doesn't own. ASN lookup beats DNS for speed.
→ Detectify / ipinfo ASN guides — map an IP to its owning network; "Googlebot" from Hetzner or DigitalOcean is fake, period.
→ DataDome's bot-traffic reports — useful baselines for what fraction of "Googlebot" hits are typically spoofed.
→ AbuseIPDB — cross-check noisy IPs you keep seeing.
Takeaway: real Googlebot lives in Google's ASNs (e.g. AS15169). Wrong ASN = impostor, no matter the UA.
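For the slower but authoritative check, here is a sketch of the reverse-then-forward DNS procedure from Google's verification page, using only Python's standard socket module. The function name is hypothetical, and the resolver arguments are injectable only so the logic can be exercised without network access. Note that socket.gethostbyname_ex resolves IPv4 only, so IPv6 crawler addresses would need a different forward lookup.

```python
import socket

# Domains Google documents for its crawlers' reverse DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Return True if `ip` passes reverse DNS plus forward confirmation."""
    try:
        hostname = reverse(ip)[0]  # PTR lookup
    except OSError:
        return False  # no reverse record at all
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # reverse name is not under a Google crawler domain
    try:
        addresses = forward(hostname)[2]  # forward lookup of that name
    except OSError:
        return False
    # The forward lookup must point back at the original IP; otherwise
    # the PTR record could simply be forged by whoever owns the IP block.
    return ip in addresses


if __name__ == "__main__":
    import sys

    for candidate in sys.argv[1:]:
        print(candidate, is_verified_googlebot(candidate))
```

DNS lookups are slow at log scale, so cache results per IP and run the check only on addresses whose user agent claims to be Googlebot.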
Don't lose the evidence: log rotation and retention sources
The best log analysis fails if logrotate ate last month's data first.
🔗 logrotate man page — rotate, compress, dateext. Set dateext so files are access.log-20240115 not .1, .2 — far easier to script against.
🔗 DigitalOcean's logrotate tutorial — the cleanest walkthrough of a custom config and testing with logrotate -d (debug, no changes).
★ Pick of the week — Will Critchlow on keeping 90+ days of logs for SEO — argues crawl patterns only reveal themselves over months; default 7-day retention blinds you to seasonal bot behavior.
🔗 GoAccess persistence docs — incremental processing so you keep aggregates even after raw logs rotate away.
Takeaway: enable dateext, keep 90 days compressed, and aggregate before deletion.
ELK for access logs: a curated starting path
Elasticsearch + Kibana is overkill for some, perfect for others. Five honest references.
→ Elastic's Filebeat nginx module docs — pre-built parsing and dashboards; you're querying bots in an hour, not a week.
→ grok debugger (Elastic) — test your log-line pattern before it silently drops malformed lines.
★ Pick of the week — Daniel Berman's "parsing access logs with Logstash" — the cleanest grok pattern for combined format plus a geoip + user-agent filter chain that classifies bots on ingest.
→ Kibana Lens tutorials — building a "Googlebot hits per URL path" viz without writing query DSL.
→ Elastic's data-stream + ILM guide — auto-roll indices so storage doesn't explode.
Takeaway: use the Filebeat module's defaults first. Only hand-roll grok when your format is non-standard.
GSC Crawl Stats vs raw logs: when each one wins
The Search Console report is free and lies a little. Sources on reading it honestly.
→ Google's Crawl Stats report help — what the host-status and response-breakdown charts actually mean. Read the fine print on the 90-day window.
★ Pick of the week — Lily Ray on Crawl Stats blind spots — explains why GSC aggregates hide per-URL detail and groups CSS/JS oddly, so a log file is still the ground truth for "did Googlebot fetch THIS page."
→ OnCrawl's "GSC vs logs" comparison — side-by-side of where the numbers diverge and why (sampling, grouping, timezone).
→ Google's crawl-budget doc — pairs with the report to interpret "average response time" spikes.
Takeaway: GSC for the trend, raw logs for the specific URL. Use both; trust logs when they disagree.
