CatOps
DevOps and other issues by Yurii Rochniak (@grem1in) - SRE @ Preply && Maksym Vlasov (@MaxymVlasov) - Engineer @ Star. Opinions are our own.

We do not post ads including event announcements. Please, do not bother us with such requests!
At least Cloudflare is fast in sharing their postmortems.

https://blog.cloudflare.com/5-december-2025-outage/

A curious thing is this:

>>>
Customers that have their web assets served by our older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.
<<<

IIRC, in the previous incident on Nov 18, only the customers on the newer proxy version were impacted. So, one could say that Cloudflare had a single time-distributed total outage.

Another important thing:

>>>
Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.
<<<

Honestly, looking forward to seeing the write-up. I can only imagine how stressed their team is after taking down a big chunk of the Internet twice in less than 30 days.


#cloudflare #postmortem
This isn't a technical article, but still an important one, I would say. This one is about the importance of making your work visible.

Shadow work in engineering teams.

For better or worse, in many companies the promotion cycle is a popularity contest, so you need to act accordingly.

This article is aimed at the managers, but you may find it useful as an individual contributor as well.

#culture
Here's an article on using DRY and KISS principles when working with Terraform. In my opinion, this is one of those articles that has a good idea behind it, but lacks a bit in delivery.

KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever.

The main takeaway is, as usual: use your own judgment when creating abstractions for your infra code. This also applies to all your code.

I do generally agree on the tooling part. This is what Adam Jacob called "a 200% knowledge problem": when adding an abstraction (a wrapper), you need to understand not only your code and the underlying technologies, but also each layer of your abstractions. Thus, do not add wrappers unless you have to.

However, this article also touches on an important point: you may feel like it's time to introduce an abstraction when, in reality, it's not.

#terraform #iac
​​For today’s Donations Monday, let’s help Serhii Sternenko with his initiatives:

- Rusoriz - a standing Monobank jar. The goal is to buy 300 FPV drones daily.
- Fundraiser for the interceptor drones

#donations #Monday
Cloudflare shares how they use Terraform in production.

Their setup is quite standard: Terraform, Atlantis, Conftest (OPA). One interesting thing is that they use an in-house tool called tfstate-butler to work around the lack of encryption of Terraform state. However, they do not disclose the details of this tool.

Another catchy quote:

>>>
...we do this at a global scale - where a single misconfiguration can propagate across our edge in seconds and lead to unintended consequences.
<<<

Yeah... We know, Cloudflare, we know...

#terraform #iac
GitHub Actions will charge $0.002 per minute for self-hosted runners starting from the 1st of March 2026.

Obviously, you would still pay whatever you pay for your self-hosted infrastructure itself.

GitHub Actions will remain free for public repositories. For now.
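
To put the $0.002 per minute into perspective, here is a rough back-of-the-envelope sketch. It assumes the charge is a flat rate multiplied by job minutes; any free allowances, rounding rules, or plan-specific details are not modeled, and the workload numbers are made up.

```python
from decimal import Decimal

# Rate from the announcement above: $0.002 per minute of self-hosted runner time.
RATE_PER_MINUTE = Decimal("0.002")


def self_hosted_fee(minutes: int, rate: Decimal = RATE_PER_MINUTE) -> Decimal:
    """Return the fee in USD for the given number of job minutes (flat-rate assumption)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return rate * minutes


# Hypothetical workload: 200 jobs a day, 10 minutes each, for 30 days.
monthly_minutes = 200 * 10 * 30
print(f"{monthly_minutes} minutes -> ${self_hosted_fee(monthly_minutes):.2f}")
```

This is on top of whatever the runners themselves cost you.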

#cicd #gha #microsoft
On a positive note: Docker opens access to their hardened images (DHI) to everyone, not just their enterprise customers.

DHI images use a distroless runtime and come with an SBOM.

Here you can browse the whole catalog of DHI. Docker asked me to log in, though I'm definitely not an enterprise customer :D

#docker #security
Cold-Restart Resilience is an article on what can go wrong when a system recovers from a total outage. It covers the following cases, with some tips on how to solve them:

- Circular bootstrap dependencies
- Using in-memory storage as databases
- Failures when trying to create a quorum
- Failures to fetch a remote dynamic config
- Stale data in leaderless systems

It doesn't mention cascading errors, but those are kinda famous already.
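
The first case, circular bootstrap dependencies, is easy to check for ahead of time. Below is a minimal sketch (not taken from the article) that runs a topological sort over a dependency map using Python's standard `graphlib`. The service names and the `cold_start_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical services: each key lists what must be up before it can start.
DEPENDENCIES = {
    "config-service": {"database"},
    "database": {"secrets-store"},
    "secrets-store": {"config-service"},  # closes the loop
    "api": {"config-service"},
}


def cold_start_order(deps):
    """Return a valid start order, or raise ValueError naming the cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # graphlib puts the offending cycle in args[1];
        # the first and the last node in that list are the same.
        cycle = " -> ".join(err.args[1])
        raise ValueError(f"circular bootstrap dependency: {cycle}") from err


try:
    print(cold_start_order(DEPENDENCIES))
except ValueError as err:
    print(err)
```

Running a check like this in CI against a declared dependency map would flag the loop before a real cold start does. It only helps if the map matches reality, of course.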

#sre #reliability #systems
​​For today’s Donations Monday, I would like to ask to help a friend of mine to get a car at the Zaporizhzhia front lines.

https://send.monobank.ua/jar/5mSFtTYUFt

This is a personal request, so you can be sure that this fundraiser is legit.

#donations #Ukraine
The last digest of this year is here!

https://newsletter.catops.dev/p/catops-digest-2025-12-27

With this digest out, I'm taking some time off. So, there will be no new posts here until the end of the year (it's not like there were many posts in the last couple of days, lol).

Also, I would really appreciate it if you could share your thoughts about the newsletter in general. Unlike with the Telegram channel, I haven't really found a good fit for it yet. You can share your thoughts in the comments on Substack, in our chat (in Ukrainian), or via info@catops.dev

🎄🎄🎄 Happy holidays! 🎄🎄🎄
​​I'm back!

It always feels nice to start a new year from scratch. Unfortunately, that's often not the case, and we have to finish the things left over from the old one.

Today's fundraiser is one of those things: let's help a friend of mine raise funds for a pickup truck for the Zaporizhzhia front lines:

https://send.monobank.ua/jar/5mSFtTYUFt

#donations #Ukraine
Starting a new year with a postmortem, eh?

There was a prolonged incident with Kafka at Honeycomb last month. Here you can find a preliminary postmortem for this incident.

"Preliminary" means that there is no root cause analysis yet, but there's already the timeline and the remediation steps.

#postmortem
I think this could be a good Friday read: "When Change Outruns Us" is a tale about sustained progress.


The main point of this article is that smart companies do not push for "constant change for the sake of change", but rather adopt a more cyclical pace, where periods of intense work are followed by more relaxed times.

This article is particularly interesting to me, because I've just finished listening to the "Slow Productivity" book by Cal Newport. One of the principles outlined in that book is that one should work at a natural pace, and a constant run is no one's natural pace. Another observation in that book is that, starting from the second half of the 20th century, managers began to use "busyness" as a proxy for work: if you look busy, you must be doing some work, even if, in reality, there are zero outcomes.

Many tech companies like to claim that they are "outcomes-oriented" or "value impact", but in my experience, "busyness" is still the proxy for work. Especially once a company grows beyond the size where everyone naturally knows everyone else and what they are doing.

#culture #mgmt