DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN (Roskomnadzor) registration: https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
Durable Workflows Beyond Vercel: Version-Safe Orchestration for Kubernetes

Workflow DevKit lets you write durable, long-running workflows directly in your Next.js and Node.js apps. You define steps with "use step", and the SDK handles persistence, retries, and replay automatically. Workflows survive server restarts, can sleep for days, and resume exactly where they left off.

On Vercel, all of this works out of the box — the platform handles deployment versioning and queue routing behind the scenes. But what happens when you deploy to your own Kubernetes cluster? Version mismatch. And it’s subtle enough to corrupt data before you notice.

We built Platformatic World to fix this. It’s a drop-in World implementation that brings the same deployment safety to any Kubernetes cluster. Every workflow run is pinned to the code version that created it. Queue messages are routed to the correct versioned pods. Old versions stay alive until all their in-flight runs are complete.
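
The three guarantees above (pin each run to its creating version, route messages by that pin, keep old versions alive until their runs finish) can be illustrated with a toy in-memory model. This is only a sketch of the idea, not Platformatic World's implementation: the `VersionRouter` class, its method names, and the version labels are all hypothetical, and a real system would persist the pins and route to actual pods.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class VersionRouter:
    """Toy model of version-pinned routing. All names are hypothetical."""

    current_version: str
    # run_id -> code version that created the run
    pinned: Dict[str, str] = field(default_factory=dict)
    # versions that still have pods running
    live_versions: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        self.live_versions.add(self.current_version)

    def deploy(self, version: str) -> None:
        """Roll out a new version; older versions keep running."""
        self.live_versions.add(version)
        self.current_version = version

    def start_run(self, run_id: str) -> str:
        """Pin a new run to whichever version is current right now."""
        self.pinned[run_id] = self.current_version
        return self.current_version

    def route(self, run_id: str) -> str:
        """Pick the version that must handle a queue message for this run."""
        return self.pinned[run_id]

    def finish_run(self, run_id: str) -> None:
        """Drop the pin once the run has completed."""
        self.pinned.pop(run_id)

    def retire_idle_versions(self) -> Set[str]:
        """Shut down old versions that no in-flight run depends on."""
        in_use = set(self.pinned.values()) | {self.current_version}
        retired = self.live_versions - in_use
        self.live_versions -= retired
        return retired


router = VersionRouter(current_version="v1")
router.start_run("run-a")      # created under v1
router.deploy("v2")
router.start_run("run-b")      # created under v2

print(router.route("run-a"))   # still handled by v1 after the deploy
print(router.retire_idle_versions())  # nothing to retire while run-a is in flight
router.finish_run("run-a")
print(router.retire_idle_versions())  # v1 can now be shut down
```

The point of the model is that a deploy never changes where an existing run's messages go; only finishing the run releases the old version.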


https://blog.platformatic.dev/durable-workflows-kubernetes-version-safe
Designing for Failure with CloudNativePG

This post focuses on three areas that separate a demo from production systems: backups, recovery and connection pooling.


https://dylanmarkdacosta.medium.com/designing-for-failure-with-cloudnativepg-2c3987605a39
Building a Production-Grade HA Kubernetes Cluster on a Homelab with $0 in Cloud Costs

How I turned four Proxmox nodes, some enterprise surplus drives, and an afternoon into a fully automated HA k3s cluster with Rancher, Traefik, and Ansible — all running on hardware that draws less power than a gaming PC.


https://thiago-marsal.medium.com/homelab-k3s-ha-cluster-a-complete-architecture-guide-6a60005b6e99
SlimFaas

SlimFaas is a lightweight, plug-and-play Function-as-a-Service (FaaS) platform for Kubernetes (and Docker-Compose / Podman-Compose).


https://github.com/SlimPlanet/SlimFaas
What do SREs and fishermen have in common? Is "GitOps = reality" a myth? Doesn't Chaos Engineering just create even more chaos?..

Sounds like the kind of questions that pop up out of nowhere before bed on a weeknight 👀

And by the way, we have answers to all three! Not here, though, but in the podcast «В SREду на кухне», hosted by experienced engineers from Avito. They talk through pain points, invite outside guests and colleagues, and share extra insights, related articles, and meetup announcements in their channel.

We recommend subscribing and saving a couple of episodes for later 🧠
agentgram

A single front door for all your AI agents and MCPs


https://github.com/dfradehubs/agentgram
The Problem with AI-Generated Post-Incident Reviews

The real learning comes from analyzing the incident while writing the document, not reading it; the document at the end is the residue of the learning.


https://greatcircle.com/blog/2026/05/05/problem-with-ai-generated-post-incident-reviews
You Shipped It Fast. But Did You Ship It Right?

AI tools have genuinely changed how fast teams can produce code, but they haven't changed how fast a codebase can safely absorb that code.


https://stackoverflow.blog/2026/05/12/you-shipped-it-fast-but-did-you-ship-it-right
On benchmarking

Benchmarking is hard. There are many ways to do it wrong and few to do it right.

But zooming out from any single system or harness, there are broad principles that should be applied to all benchmarking. Using these correctly makes it difficult to produce biased results.

Am I the world's best benchmarker? Certainly not. I invented the language balls, after all. But correctness and precision are important parts of PlanetScale's culture. We've spent considerable time learning the art of benchmarking, and are here to share best practices.

Here, we're focusing primarily on benchmarking databases, but these principles apply to many domains.


https://planetscale.com/blog/on-benchmarking
Humans aren't fast enough for 4 9's

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers.
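
As a rough illustration of turning the percentages into concrete numbers, the sketch below converts an availability target into an allowed-downtime budget. The function name and the choice of a 30-day month and a 365-day year are assumptions made here for the arithmetic, not figures taken from the linked post.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return how much downtime fits inside `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period * (1 - availability_percent / 100)


MONTH = timedelta(days=30)   # assumed 30-day month
YEAR = timedelta(days=365)   # assumed non-leap year

for target in (99.9, 99.99):
    per_month = downtime_budget(target, MONTH).total_seconds() / 60
    per_year = downtime_budget(target, YEAR).total_seconds() / 60
    print(f"{target}%: {per_month:.2f} min/month, {per_year:.2f} min/year")

# 99.9%: 43.20 min/month, 525.60 min/year
# 99.99%: 4.32 min/month, 52.56 min/year
```

At four nines the whole monthly budget is a little over four minutes, which leaves almost no room for a person to be paged, log in, and diagnose before it is spent.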


https://incident.io/blog/humans-arent-fast-enough-for-4-nines
Why reviewing AI-generated code is devilishly hard

When working on code with GenAI assistance you need a better understanding of the system than when working without.


https://www.spinellis.gr/blog/20260523
Why Teamwork Makes (Or Breaks) Your Incident Response

High-severity incidents expose how a team really works together, usually within the first ten minutes.


https://uptimelabs.io/articles/teamwork-incident-response
Say the Thing You Want

You’re in a 1:1 with your manager, and things are going just fine. You talk about the project and that other thing. Toward the end, she asks: “Anything else?”

And there is something else. You want to lead that new initiative. Or move to a different team. Or you’ve been thinking about what stands in the way of your promotion. The thought is right there, sitting in the back of your throat. You’re going to say it, and then… “Nope, all good.”

You get out of the call feeling a specific kind of regret. You rationalize it somehow and then tell yourself you’ll bring it up next time (you won’t).


https://terriblesoftware.org/2026/04/01/say-the-thing-you-want
mq

mq is a command-line tool that processes Markdown using a syntax similar to jq.

It's written in Rust, allowing you to easily slice, filter, map, and transform structured data.


https://github.com/harehare/mq
“Good Taste” Is Just Experience

“In the age of AI, taste is the ultimate differentiator.”


https://terriblesoftware.org/2026/03/27/good-taste-is-just-experience
slumber

Slumber is a TUI (terminal user interface) HTTP client. Define, execute, and share configurable HTTP requests.


https://github.com/LucasPickering/slumber
markitdown

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.


https://github.com/microsoft/markitdown
cate

An infinite canvas for your code, terminals, browsers, docs, and AI agents.


https://github.com/0-AI-UG/cate
paneru

Paneru is a macOS window manager that arranges windows on an infinite strip, extending to the right. A core principle is that opening a new window will never cause existing windows to resize, maintaining your layout stability.


https://github.com/karinushka/paneru
How do you handle backups in the cloud properly?

On June 25, everyone who works with clouds is invited to a free webinar from MWS Cloud Platform.

We'll dispel myths and walk through the best modern approaches and tools.

We'll discuss integration into processes, consistency, point-in-time recovery, and security. We'll also talk about the advantages of native cloud tools.

We'll run a demo in MWS Cloud Platform and answer your questions.

Register so you don't miss it!

June 25 at 14:00 (Moscow time)

opensre

The open-source framework for AI SRE agents, and the training and evaluation environment they need to improve. Connect the 60+ tools you already run, define your own workflows, and investigate incidents on your own infrastructure.


https://github.com/Tracer-Cloud/opensre