Cloudflare has published a detailed breakdown of the regex incident on their blog.
We're discussing the postmortem in the CatOps chat
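For a quick illustration of the failure mode behind it (catastrophic backtracking), here is a toy Python sketch; the pattern is deliberately silly and is not Cloudflare's actual rule:

```python
import re
import time

# Toy pattern with a nested quantifier. This is NOT Cloudflare's rule, just
# the same failure mode: the engine tries exponentially many ways to split
# the input once the overall match fails.
pattern = re.compile(r'^(a+)+$')

for n in (20, 22, 24):
    subject = 'a' * n + '!'      # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    pattern.match(subject)
    print(f'n={n}: {time.perf_counter() - start:.2f}s')

# Each extra character roughly doubles the runtime; a WAF running such a
# pattern against every request can pin its CPUs, which is what the
# postmortem describes.
```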
#postmortem
The Cloudflare Blog
Details of the Cloudflare outage on July 2, 2019
Almost nine years ago, Cloudflare was a tiny company and I was a customer not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site, jgc.org, didn’t seem to have working DNS any more.
Tips on preparing postmortems, with examples.
They work both for public and for private incident reviews. The examples, of course, come from public ones.
Among other things:
- Use visualization (graphs, for example)
- Try to dig down to the essence of what happened and why (personally, I don't believe that "there is always a single root cause", so I deliberately avoid that term)
- Don't postpone the postmortem: the sooner you start the review, the fresher everyone's memory is
- Blameless
- Tell a story: this applies mostly to public postmortems, but if people from a different context are present at the review (managers, engineers from teams that weren't directly involved in solving the problem, etc.), the advice can work for private postmortems as well
#postmortem #culture
Blameless
Blog | Blameless Resources
The Internet has two problems: BGP and DNS (c)
Cloudflare wrote a good postmortem on the CenturyLink problems that made connectivity shaky in Europe, and partially in the US, on Sunday.
Although, most likely, you have already seen it.
#postmortem
The Cloudflare Blog
August 30th 2020: Analysis of CenturyLink/Level(3) outage
Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet.
A postmortem on the recent AWS incident.
It also describes some dependencies between the company's internal services.
#aws #postmortem
Recent Google incident post-mortem: https://status.cloud.google.com/incident/zall/20013#20013004
tl;dr: wrong quota applied to the Google User ID Service
#postmortem
Sometimes it is worth getting your head out of the clouds and back down to Earth.
Here is a great post-mortem story of a failed Ceph cluster.
The investigation led them down to more “invisible” underlying layers rather than just Ceph itself, but I won’t spoil more. This is an interesting and not that long read, so you can go through it yourself. Also, at least for me, every post-mortem looks like a detective story, not just a technical article.
P.S. I haven’t worked much with Ceph myself. When I was a very junior engineer, we had a few small Ceph clusters in a company I worked for. I was not involved in that project, though. However, I remember that once we had an issue with one of the clusters and my colleague spent a night fixing it.
The next day he said: “We didn’t quite lose the data. We just cannot retrieve it.” I think that phrase has been my strong association with Ceph ever since, even though Ceph itself is usually not the culprit.
#postmortem #ceph #linux
Amazon has published a public postmortem for the recent issues on Friday. However, it went somewhat unnoticed because of the Log4j story (see one of the previous posts).
The original issue turned out to be a cascading failure that led to congestion in AWS's internal network. This is the interesting part, because it sheds some light on AWS internals.
The internal monitoring system, as well as parts of the EC2 control plane, live in that internal network, which was exactly the one having issues. That's why the AWS team was operating with only partial visibility into their systems, which slowed down the resolution.
Customer workloads were still running, but the control APIs were impacted. For example, your existing EC2 machines were still there, but you could neither describe them nor start new ones. This turned out to be more critical for certain AWS services like API Gateway and Amazon Connect.
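As a side note, that is a good argument for checking the data plane and the control plane independently in your own monitoring. A minimal Python sketch of the idea using boto3; the endpoint, instance ID, and region are made-up placeholders:

```python
import urllib.request

import boto3
from botocore.exceptions import BotoCoreError, ClientError

APP_HEALTH_URL = "https://app.example.com/healthz"  # hypothetical data-plane check
INSTANCE_ID = "i-0123456789abcdef0"                 # hypothetical instance
REGION = "us-east-1"


def data_plane_ok() -> bool:
    """Is the workload itself still answering?"""
    try:
        with urllib.request.urlopen(APP_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def control_plane_ok() -> bool:
    """Can we still talk to the EC2 API about that workload?"""
    try:
        ec2 = boto3.client("ec2", region_name=REGION)
        ec2.describe_instances(InstanceIds=[INSTANCE_ID])
        return True
    except (BotoCoreError, ClientError):
        return False


if __name__ == "__main__":
    # During an event like this one the first check can stay green while the
    # second one fails, so alerting on only one of them gives a skewed picture.
    print("data plane   :", "ok" if data_plane_ok() else "degraded")
    print("control plane:", "ok" if control_plane_ok() else "degraded")
```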
The interesting thing is that these events were caused by code that had been there for years (according to AWS). Unfortunately, its unexpected behavior only revealed itself during an automated scaling event.
To mitigate such issues in the future, AWS switched off that automatic scaling in us-east-1; they claim they already have enough capacity there. They are also working on a fix for the piece of code that caused the congestion in the first place. I assume there are many other internal action items from this outage as well.
#aws #postmortem
If you haven’t read Roblox’s postmortem on October’s 73-hour outage, you definitely should!
Even though the event happened back in October, the postmortem was released just a few days ago. In this case, that was a very good decision, because the write-up provides a detailed analysis of what happened and of the chain of events that caused it.
It’s cool to read a postmortem the day after an outage - we are all curious human beings. Unfortunately, those postmortems usually lack many details. This is understandable: there hasn’t been enough time for a thorough analysis, and the team is probably exhausted.
In this case, though, you get a detailed overview of what happened as well as plans to prevent this chain of events from happening again, with some of those plans already implemented.
It’s a pity that not many companies do similar postmortems, and I’d say that is probably to their disadvantage. After reading this document I have a feeling that Roblox is a cool place to work, TBH.
#postmortem #hashicorp #consul
Roblox
Roblox Return to Service | Roblox
On Thursday, November 18, 2021, Dropbox did not go down. It sounds like the beginning of some modernist novel, yet this is rather a post-mortem on an incident that never happened.
This is a great story of leadership and dedication that spanned a few years. The result? Dropbox was able to literally pull the network cord on a data center and “not go down on November 18, 2021”. And not just any data center, but their main one.
I really enjoyed this story because it once again proves a few basic things that we prefer not to think about:
- Any project takes time and effort. Big projects take a lot of time and effort. If you’re looking into rebuilding your system from scratch, it won’t take a week or two.
- Big projects require dedication. You cannot just add them as a side hustle for your existing team and expect it to deliver everything at the highest quality.
- Iterative improvements. Apollo 11 was the first mission to land on the Moon, and not because 11 is a pretty number.
- Test and exercise. It’s not enough just to “implement the best practices”; you have to validate that they actually work as expected. And if there’s a process involved, you have to repeat it frequently enough not to get rusty.
#infrastructure #culture #postmortem
dropbox.tech
That time we unplugged a data center to test our disaster readiness
I love reading postmortems. A good postmortem usually unveils a set of problems, some of which you may well have in your own company too. As they say: there is never a single root cause.
Here is a postmortem from Reddit about their Pi-day outage.
It has everything you love: complex systems, legacy software, processes that were not tested all that well, tribal knowledge that is long gone, etc.
Don’t get me wrong, I’m not saying this to shame Reddit. In fact, they did a great job highlighting all these problems. That is much harder and takes more courage than just saying: Calico broke, Calico bad.
Also, I have similar problems at my place, and I bet you do too. That is why it’s important to pay attention to this kind of “low-priority tech debt”. Cleaning it out may save your company’s ass someday.
#kubernetes #networking #postmortem