At least Cloudflare is quick to share their postmortems.
https://blog.cloudflare.com/5-december-2025-outage/
A curious thing is this:
>>>
Customers that have their web assets served by our older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.
<<<
IIRC, in the previous incident on Nov 18, only the customers on the newer proxy version were impacted. So, one could say that Cloudflare had a single time-distributed total outage.
Another important thing:
>>>
Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.
<<<
Honestly, looking forward to seeing the write-up. I can only imagine how stressed their team is after taking down a big chunk of the Internet twice in less than 30 days.
#cloudflare #postmortem
The Cloudflare Blog
Cloudflare outage on December 5, 2025
Cloudflare experienced a significant traffic outage on December 5, 2025, starting approximately at 8:47 UTC. The incident lasted approximately 25 minutes before resolution. We are sorry for the impact that it caused to our customers and the Internet. The…
This isn't a technical article, but still an important one, I would say. This one is about the importance of making your work visible.
Shadow work in engineering teams.
For better or worse, in many companies the promotion cycle is a popularity contest, so you need to act accordingly.
This article is aimed at managers, but you may find it useful as an individual contributor as well.
#culture
newsletter.manager.dev
Shadow work in engineering teams
And the price your team pays for it
Here's an article on using DRY and KISS principles when working with Terraform. In my opinion, this is one of those articles that has a good idea behind it, but lacks a bit in delivery.
KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever.
The main takeaway is, as usual: use your own judgment when creating abstractions for your infra code. This also applies to all your code.
I do generally agree on the tooling part. This is what Adam Jacob called "A 200% knowledge problem": when adding an abstraction (a wrapper), you need to understand not only your code and the underlying technologies, but also each layer of your abstractions. Thus, do not add wrappers unless you have to.
However, this article also touches on an important point: you may feel like it's time to introduce an abstraction when, in reality, it's not.
#terraform #iac
rosecurity@dev
KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever
The Scale Gap Problem
A new issue of the CatOps digest is here!
https://newsletter.catops.dev/p/catops-digest-2025-12-12
#digest #newsletter
newsletter.catops.dev
CatOps Digest 2025-12-12
What was on CatOps in the last couple of weeks...
For today's Donations Monday, let's help Serhii Sternenko with his initiatives:
- Rusoriz - a standing Monobank jar. The goal is to buy 300 FPV drones daily.
- Fundraiser for the interceptor drones
#donations #Monday
Cloudflare shares how they use Terraform in production.
Their setup is quite standard: Terraform, Atlantis, Conftest (OPA). One interesting thing is that they use an in-house tool called tfstate-butler to work around the lack of encryption of Terraform state. However, they do not disclose the details of this tool.
Another catchy quote:
>>>
...we do this at a global scale – where a single misconfiguration can propagate across our edge in seconds and lead to unintended consequences.
<<<
Yeah... We know, Cloudflare, we know...
#terraform #iac
The Cloudflare Blog
Shifting left at enterprise scale: how we manage Cloudflare with Infrastructure as Code
Cloudflare has shifted to Infrastructure as Code and policy enforcement to manage internal Cloudflare accounts. This new architecture uses Terraform, custom tooling, and Open Policy Agent to enforce security baselines and increase engineering velocity.
GitHub Actions will charge $0.002 per minute for self-hosted runners starting from the 1st of March 2026.
Obviously, you would still pay whatever you pay for your self-hosted infrastructure itself.
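To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the fee is billed per job minute at the $0.002 rate quoted above and ignores any included minutes or plan specifics. The function names, the 50,000-minute workload, and the $400 infrastructure bill are made up for illustration.

```python
from decimal import Decimal

# Per-minute fee for self-hosted runners, as quoted in the announcement.
SELF_HOSTED_FEE_PER_MINUTE = Decimal("0.002")


def monthly_platform_fee(job_minutes: int) -> Decimal:
    """GitHub-side fee for a month of self-hosted job minutes (assumed flat rate)."""
    return SELF_HOSTED_FEE_PER_MINUTE * job_minutes


def monthly_total(job_minutes: int, infra_cost: Decimal) -> Decimal:
    """Platform fee plus whatever the runners' own infrastructure costs."""
    return monthly_platform_fee(job_minutes) + infra_cost


if __name__ == "__main__":
    minutes = 50_000        # hypothetical monthly workload
    infra = Decimal("400")  # hypothetical monthly infrastructure bill, USD
    print(f"platform fee: ${monthly_platform_fee(minutes):.2f}")
    print(f"total:        ${monthly_total(minutes, infra):.2f}")
```

With these made-up inputs, the platform fee alone comes to $100 a month on top of the infrastructure bill.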
GitHub Actions will remain free for public repositories. For now.
#cicd #gha #microsoft
GitHub Resources
Pricing changes for GitHub Actions
GitHub Actions pricing update: Discover lower runner rates (up to 39% off) following a major re-architecture for faster, more reliable CI/CD.
On a positive note: Docker opens access to their hardened images (DHI) to everyone, not just their enterprise customers.
DHI uses a distroless runtime and includes an SBOM.
Here you can browse the whole catalog of DHI. Docker did ask me to log in, though, and I'm definitely not an enterprise customer :D
#docker #security
Docker
Hardened Images for Everyone | Docker
Security for everyone. Docker Hardened Images are now free to use, share, and build on with no licensing surprises.
Forwarded from oleg_log (Oleg)
Good one. Have literally the same feedback. Cool tech but mostly useless.
https://johnjames.blog/posts/graphql-the-enterprise-honeymoon-is-over
johnjames.blog
GraphQL: the enterprise honeymoon is over
A production-tested take on GraphQL in enterprise systems, why the honeymoon phase fades, and when its complexity outweighs the benefits.
Cold-Restart Resilience is an article on what can go wrong when a system recovers from a total outage. It covers the following cases, with some tips on how to handle them:
- Circular bootstrap dependencies
- Using in-memory storage as databases
- Failures when trying to create a quorum
- Failures to fetch a remote dynamic config
- Stale data in leaderless systems
It doesn't mention cascading errors, but those are kinda famous already.
#sre #reliability #systems
Substack
Cold-Restart Resilience
Because "It Starts" Doesn't Mean "It Works"
For today's Donations Monday, I would like to ask you to help a friend of mine get a car for the Zaporizhzhia front lines.
https://send.monobank.ua/jar/5mSFtTYUFt
This is a personal request, so you can be sure that this fundraiser is legit.
#donations #Ukraine
Monzo - a British neobank - reveals their system that grants engineers temporary elevated access.
tl;dr: They are using AWS Nitro Enclaves for this.
During my time at N26, we also had a system that served the same purpose, although it was designed differently.
#security
Monzo
Securing admin access to Monzo's platform
Monzo runs on a shared platform of infrastructure that hosts our microservices. In this post, we'll discuss how we broker access to our infrastructure credentials with a system that is resistant to attacks even from the team that maintains it.
The last digest of this year is here!
https://newsletter.catops.dev/p/catops-digest-2025-12-27
With this digest out, I'm taking some holidays. So, there will be no new posts here until the end of the year (it's not like there were many posts in the last couple of days, lol).
Also, I would really appreciate it if you could share your thoughts about the newsletter in general. Unlike for the Telegram channel, I cannot really find a good fit for it. You can share your thoughts in the comments on Substack, in our chat (in Ukrainian), or via info@catops.dev
Happy holidays!
newsletter.catops.dev
CatOps Digest 2025-12-27
The last digest of this year...
I'm back!
It always feels nice to start a new year from scratch. Unfortunately, that's often not the case, and we have to finish what was left unfinished.
Today's fundraiser is one of those things: let's help a friend of mine to raise funds for a pickup truck for the Zaporizhzhia front lines:
https://send.monobank.ua/jar/5mSFtTYUFt
#donations #Ukraine
Starting a new year with a postmortem, eh?
There was a prolonged incident with Kafka at Honeycomb last month. Here you can find a preliminary postmortem for this incident.
"Preliminary" means that there is no root cause analysis yet, but there's already the timeline and the remediation steps.
#postmortem
status.honeycomb.io
Querying and Ingest issues in EU
Honeycomb's Status Page - Querying and Ingest issues in EU.
I think this could be a good Friday read: "When Change Outruns Us" is a tale about sustained progress.
The main point of this article is that smart companies do not push for "constant change for the sake of change", but rather adopt a more cyclic pace, where periods of intense work are followed by more relaxed times.
This article is particularly interesting to me, because I've just finished listening to the "Slow Productivity" book by Cal Newport. One of the principles outlined in that book is that one should work at their natural pace. However, a constant run is no one's natural pace. Another observation in that book is that, starting from the second half of the 20th century, managers began to approximate work by "busyness", i.e. if you look busy, you must be doing some work, even if in reality there are zero outcomes.
Many tech companies like to claim that they are "outcomes-oriented" or "value impact", but in my experience, "busyness" is still the approximation for work. This is especially true once a company grows beyond the size where everyone naturally knows everyone else and what they are doing.
#culture #mgmt
Substack
When Change Outruns Us
Why growth depends on absorption and recovery