Pingback Clinic
Your uptime and monitoring questions answered. 'Why do I get false downtime alerts?' 'What's a good check interval?' -- ask, and we explain it clearly.
Q: What does "99.9% uptime" actually allow in downtime?

A: Here's the math you can keep in your head. Per year:
— 99% = about 3.65 days of allowed downtime
— 99.9% ("three nines") = about 8.77 hours
— 99.99% ("four nines") = about 52.6 minutes
— 99.999% ("five nines") = about 5.26 minutes

Every extra nine cuts your allowed downtime by 10x, and the cost of getting there usually climbs much faster than that.

The catch most people miss: nines are meaningless without a measurement window. Measured per year, 99.9% still tolerates a single outage of almost 9 hours. Measured per month (30 days), it allows only ~43 minutes in total. Always pin down whether the SLA is measured monthly or annually before you sign or promise it.
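
If you'd rather compute the budget than memorize it, here is a minimal Python sketch of the arithmetic. The function name and the 365.25-day year and 30-day month are assumptions chosen for illustration; your SLA may define its window differently.

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float, window: timedelta) -> timedelta:
    """Return the downtime budget for an SLA target over a measurement window."""
    return window * (1 - sla_percent / 100)


YEAR = timedelta(days=365.25)  # assumption: average year including leap days
MONTH = timedelta(days=30)     # assumption: 30-day billing month

for target in (99.0, 99.9, 99.99, 99.999):
    yearly = allowed_downtime(target, YEAR)
    monthly = allowed_downtime(target, MONTH)
    print(f"{target}%: {yearly.total_seconds() / 3600:.2f} h/year, "
          f"{monthly.total_seconds() / 60:.1f} min/month")
```

Same target, very different budgets depending on the window, which is the whole point of asking how the SLA is measured.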

Got a question? Drop it in the comments.
Q: My monitor says 200 OK but the page is actually broken. What gives?

A: A 200 status only means the server answered, not that it answered correctly. Apps love to return 200 on an error page, a maintenance screen, or a blank shell where the database call quietly failed.

Upgrade from a status-code check to a content check. Tell your monitor to look for a specific string that only appears when the page truly works, like a product price, a logged-in element, or a footer copyright line. If that keyword is missing, treat it as down even with a 200.

Better still, also flag if a known error phrase appears, like "Exception" or "Service Unavailable."

Natural follow-up: pick a keyword that's stable across deploys, so a harmless copy change doesn't trigger a false outage.
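
To make the content check concrete, here is a rough Python sketch of the decision logic only (fetching the page is left out). The function name, the sample keyword, and the default error phrases are illustrative assumptions, not any particular monitor's API.

```python
from __future__ import annotations

from typing import Iterable


def evaluate_response(
    status: int,
    body: str,
    required: str,
    error_phrases: Iterable[str] = ("Exception", "Service Unavailable"),
) -> tuple[bool, str]:
    """Decide up/down from status code AND page content.

    Returns (is_up, reason). A 200 alone is not enough to count as up.
    """
    if status != 200:
        return False, f"unexpected status {status}"
    # A known error phrase means the page is broken even with a 200.
    for phrase in error_phrases:
        if phrase in body:
            return False, f"error phrase found: {phrase!r}"
    # The keyword should only appear when the page truly rendered.
    if required not in body:
        return False, f"required keyword missing: {required!r}"
    return True, "ok"


# Hypothetical usage: a blank shell returned with 200 is treated as down.
print(evaluate_response(200, "<html><body></body></html>", "Add to cart"))
```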

Q: How do I stop monitors from paging during planned maintenance?

A: Use a scheduled maintenance window (most monitors call it exactly that) to suppress alerts for the affected checks during a set time. It mutes paging but, importantly, keeps recording state so your timeline stays accurate.

Two things that separate a clean setup from a messy one:
— Suppress alerts, don't pause the monitor entirely. A paused monitor records no data, so a real outage that overruns your window goes completely unseen.
— Exclude maintenance windows from your SLA/uptime calculation, so planned work doesn't unfairly dent your numbers.

Likely follow-up: set the window slightly wider than your planned work, maybe 15 minutes of buffer on each side, since deploys tend to run long. And remember to update your status page so customers aren't surprised.
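
Here is one way the "suppress, don't pause" idea could look in Python. This is a sketch under assumptions: the class and function names are made up for illustration, and real monitors configure this through their own UI or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MaintenanceWindow:
    start: datetime
    end: datetime
    buffer: timedelta = timedelta(minutes=15)  # padding on each side

    def covers(self, moment: datetime) -> bool:
        return self.start - self.buffer <= moment <= self.end + self.buffer


def should_page(is_down: bool, moment: datetime, windows) -> bool:
    """Checks keep running and recording; only the page is suppressed."""
    return is_down and not any(w.covers(moment) for w in windows)


def uptime_percent(results, windows) -> float:
    """results: list of (moment, is_up). Results inside a window are kept
    in the timeline but excluded from the SLA calculation."""
    counted = [up for moment, up in results
               if not any(w.covers(moment) for w in windows)]
    if not counted:
        return 100.0
    return 100.0 * sum(counted) / len(counted)
```

Because the check still runs inside the window, a failure that continues past the padded end time pages as usual.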

Q: My site is "down" but the server is clearly running. Where do I even look?

A: When the box is healthy but checks fail, the problem is almost always between the user and your server, not the server itself. Walk the path in order:

— DNS: is the domain resolving? An expired domain or a botched DNS change takes you down while the server hums along happily.
— TLS: did the certificate expire or fail to deploy? Browsers refuse the connection before any page loads.
— CDN/proxy: Cloudflare or your load balancer can return a 5xx while origin is fine.
— Then origin last.

This is why a good monitor reports the failure stage, not just "down." Knowing it failed at DNS resolution versus connection versus content tells you which team to wake. Configure your checks to log that detail, and outages get a lot less mysterious.
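
As a rough illustration of "report the failure stage," the Python sketch below walks DNS, TCP connect, TLS, and HTTP in order and names the first stage that fails. The function names are hypothetical, port 443 is assumed, and the HTTP stage cannot tell a CDN error from an origin error on its own; for that you would need a separate check that hits origin directly.

```python
import socket
import ssl
from http.client import HTTPSConnection
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, Callable[[], None]]


def first_failing_stage(stages: Sequence[Stage]) -> Optional[Tuple[str, str]]:
    """Run probes in order; return (stage, error) for the first failure."""
    for name, probe in stages:
        try:
            probe()
        except Exception as exc:
            return name, f"{type(exc).__name__}: {exc}"
    return None  # every stage passed


def build_stages(host: str, timeout: float = 5.0) -> List[Stage]:
    def check_dns() -> None:
        socket.getaddrinfo(host, 443)

    def check_connect() -> None:
        socket.create_connection((host, 443), timeout=timeout).close()

    def check_tls() -> None:
        # The default context verifies the chain, hostname, and expiry.
        context = ssl.create_default_context()
        with socket.create_connection((host, 443), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                pass

    def check_http() -> None:
        conn = HTTPSConnection(host, timeout=timeout)
        try:
            conn.request("GET", "/")
            status = conn.getresponse().status
        finally:
            conn.close()
        if status >= 500:
            raise RuntimeError(f"HTTP {status} from CDN, proxy, or origin")

    return [("dns", check_dns), ("connect", check_connect),
            ("tls", check_tls), ("http", check_http)]
```

Calling `first_failing_stage(build_stages("example.com"))` returns `None` when everything passes, or a pair like `("tls", "...")` that tells you where to start looking.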

Q: How do I monitor a cron job or backup that has no URL to ping?

A: Flip the direction with heartbeat monitoring (also called dead-man's-switch checks). Instead of you pinging the job, the job pings the monitor when it finishes. If that expected ping doesn't arrive on schedule, the monitor alerts.

So a nightly backup hits a unique heartbeat URL on success. No ping by, say, 3:15am means the backup didn't run or didn't complete, and you find out the same morning, not three weeks later when you need to restore.

Key detail: only ping on success, and ping at the very end of the job. A heartbeat sent at the start tells you the job launched, not that it actually worked.

Follow-up: add a grace period matching the job's normal runtime variance, so a slow-but-fine run doesn't false-alarm.
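
Both halves of the heartbeat can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the heartbeat URL is a placeholder, and the overdue rule (time since last ping exceeds period plus grace) is one simple way a monitor might decide, not a specific product's behavior.

```python
import urllib.request
from datetime import datetime, timedelta

# Hypothetical URL; your monitor issues a unique one per job.
HEARTBEAT_URL = "https://monitor.example/heartbeat/your-job-id"


def send_ping() -> None:
    urllib.request.urlopen(HEARTBEAT_URL, timeout=10).close()


def run_job_with_heartbeat(job, ping=send_ping) -> None:
    """Job side: ping only after job() returns without raising."""
    job()   # an exception here propagates and skips the ping
    ping()  # last step, so a ping means the job finished


def is_overdue(last_ping: datetime, period: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """Monitor side: alert once the gap exceeds period plus grace."""
    return now - last_ping > period + grace
```

With a 24-hour period and a 10-minute grace, a backup that last pinged at 3:05am is still fine at 3:10am the next day and overdue by 3:20am.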

Q: My site isn't down, it's just painfully slow. Should that even alert?

A: Yes, because slow is the outage your users notice before "down" ever happens. A page taking 12 seconds drives people away as effectively as a 500 error, but a simple up/down check sails right past it.

Add a response-time threshold to your checks. Pick a number from your real baseline, not a guess, then alert when response time crosses it for several consecutive checks (consecutive matters, so one slow blip doesn't page you).

Two tiers work well:
— Warn at, say, 2x your normal response time (chat channel)
— Page when it's both slow and sustained, since that usually precedes a full outage

Likely follow-up: measure at a percentile like p95, not the average. Averages hide the slow tail where your unhappiest users live.
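
Here is a small Python sketch of the two ideas above: a nearest-rank percentile for the baseline, and a consecutive-breach rule for the two tiers. The function names, the 2x factor, and the three-check streak are illustrative assumptions; tune them to your own data.

```python
import math


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of response times."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify(recent_ms, baseline_ms: float,
             warn_factor: float = 2.0, page_after: int = 3) -> str:
    """recent_ms is ordered oldest to newest. Returns 'ok', 'warn' or 'page'."""
    threshold = warn_factor * baseline_ms
    streak = 0
    for value in reversed(recent_ms):  # count the latest unbroken run of slow checks
        if value <= threshold:
            break
        streak += 1
    if streak >= page_after:
        return "page"  # slow and sustained
    if streak >= 1:
        return "warn"  # slow, but could still be a blip
    return "ok"


# 90 fast requests and 10 very slow ones: the mean looks tolerable, p95 does not.
samples = [200] * 90 + [3000] * 10
print(sum(samples) / len(samples), percentile(samples, 95))
```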

Q: What happens if the person on call sleeps through the alert?

A: That's what an escalation policy is for, and not having one is how outages quietly run for hours. The policy defines who gets notified next, and when, if the first person doesn't acknowledge.

A solid chain:
— Page primary on-call. Wait 5 minutes for acknowledgment.
— No ack? Escalate to secondary on-call.
— Still no ack after another 5-10 minutes? Notify the team lead or manager.

The magic word is acknowledgment. The alert must keep escalating until a human actively confirms they're on it, not until it's merely been delivered. Delivered-but-ignored is exactly how things fall through.

Follow-up you'll hit: rotate the schedule so the same person isn't permanently on call, and test the chain quarterly. An untested escalation policy fails on the night you need it.
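
The chain above can be modeled in Python roughly like this. It is a sketch, not how any particular paging tool works: the names are hypothetical, and the 10-minute second wait is just one pick from the 5-10 minute range.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Step:
    target: str
    wait_minutes: int  # how long to wait for an ack before the next step


POLICY = (
    Step("primary on-call", 5),
    Step("secondary on-call", 10),  # assumption: upper end of 5-10 minutes
    Step("team lead", 0),
)


def targets_paged(policy: Sequence[Step], minutes_since_alert: float,
                  acknowledged_at: Optional[float] = None) -> list[str]:
    """Everyone who should have been paged so far.

    Escalation keeps going until someone acknowledges, then stops.
    """
    elapsed = minutes_since_alert
    if acknowledged_at is not None:
        elapsed = min(elapsed, acknowledged_at)
    paged = []
    due_at = 0
    for step in policy:
        if elapsed < due_at:
            break
        paged.append(step.target)
        due_at += step.wait_minutes
    return paged


print(targets_paged(POLICY, 7))  # no ack after 7 minutes
```

Note that delivery never appears in the logic. Only `acknowledged_at` stops the chain.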

Q: Should my uptime check just fetch the homepage HTML, or load the whole page?

A: Depends what you're protecting. A basic check fetches the initial HTML and measures time-to-first-byte. That's cheap, fast, and perfect for "is the server alive," but it lies about real experience, because it never loads your JavaScript, images, or third-party scripts.

For critical user journeys, run a browser-based (full-page) check that actually renders the page like a real browser. It catches a broken bundle, a hung analytics tag, or an API that returns blank data, none of which a raw HTML fetch would notice.

The sensible split:
— Lightweight HTML/status checks every minute for broad uptime
— Heavier browser checks every few minutes on your top 2-3 flows (login, checkout)

Browser checks cost more and run slower, so reserve them for journeys that actually make you money.
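
For a feel of what the lightweight check actually measures, here is a minimal Python sketch using the standard library's `http.client`. The function name is hypothetical, and the timing includes connection setup because the connection opens on the first request. It fetches only the HTML, so it shares the blind spots described above.

```python
import time


def measure_ttfb(conn, path: str = "/"):
    """Fetch one page over an http.client-style connection.

    Returns (status, seconds until headers arrived, HTML size in bytes).
    No JavaScript, images, or third-party scripts are loaded.
    """
    try:
        start = time.perf_counter()
        conn.request("GET", path)
        response = conn.getresponse()  # returns once status and headers are read
        ttfb = time.perf_counter() - start
        html = response.read()
    finally:
        conn.close()
    return response.status, ttfb, len(html)


# Hypothetical usage against a real site:
#   from http.client import HTTPSConnection
#   print(measure_ttfb(HTTPSConnection("example.com", timeout=10)))
```

A browser-based check needs a real rendering engine, which is why it costs more and is left out of this sketch.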

Q: Should I post small outages on my public status page, or will that scare customers?

A: Post them. Counterintuitively, a status page that occasionally shows yellow builds more trust than one that's suspiciously always green. Customers already know you have incidents; what they're judging is whether you're honest about them.

What actually builds confidence:
— Acknowledge fast, even before you have a root cause ("investigating elevated errors")
— Update on a steady rhythm, every 30 minutes during an incident, even just to say "still working on it"
— Post a brief post-incident note explaining what happened and what you changed

The status page that erodes trust is the one frozen on "All Systems Operational" while users sit in a support queue. Silence reads as either incompetence or a cover-up.

Follow-up: keep the writing plain and blame-free. Customers want clarity, not engineering jargon or excuses.
