DevOps&SRE Library

How I Dropped Our Production Database and Now Pay 10% More for AWS

https://alexeyondata.substack.com/p/how-i-dropped-our-production-database


Is Infrastructure as Code the Next Abstraction to Fall?

I’ve been staring at a Terraform module for the last ten minutes, and I can’t stop thinking about a question that would have been absurd two years ago: why am I writing this?

Not “why am I provisioning this infrastructure.” That part makes sense. But why am I writing HCL, a domain-specific language that exists to describe infrastructure in a way that humans can read, when I have an AI agent sitting in my terminal that can call the AWS API directly?

It’s the kind of question that sounds naive until you realise the same logic is playing out across every layer of the stack. And the more I look at it, the more I think we’re watching the early stages of a fundamental shift in how we interact with machines.

https://sjramblings.io/is-infrastructure-as-code-the-next-abstraction-to-fall


Inside Terraform: A series about the internals of Terraform

This is the index post for a series of blog posts about the internals of Terraform. In this series, I will dive deep into different parts of Terraform and explain how they work under the hood.

The end goal is to help the reader develop a deeper understanding of Terraform and how it works. After reading the series, I hope you will be able to contribute to Terraform itself, add a new block to the language, or change existing behavior. I will not try to cover every single detail of Terraform, but I will try to cover the most important parts and give you a good overview of how they work together.

My hope is that this series helps the reader get at least a step closer to understanding the internals of Terraform. I won’t be covering anything related to language design or graph theory here; there are too many holes in my knowledge of those topics. Maybe I’ll write something to that end in the future, but probably not.

https://danielmschmidt.de/posts/2025-11-21-inside-terraform


terrapod

Open-source platform replacement for Terraform Enterprise.

https://github.com/mattrobinsonsre/terrapod


Advanced cost-aware Kubernetes scheduling for multi-cluster cost optimization with custom metrics

https://medium.com/@naeemulhaq/advanced-cost-aware-kubernetes-scheduling-for-multi-cluster-cost-optimization-with-custom-metrics-7ae709d712d2


System Design Series: Scaling your Kubernetes workloads with VPA

https://medium.com/@sanilkhurana7/system-design-series-scaling-your-kubernetes-workloads-with-vpa-and-the-architecture-of-vpa-6192fb70e443


Troubleshooting Conan ZFS GitHub ARC Container Initialization slowness

https://daversomethingsomething.medium.com/troubleshooting-conan-zfs-github-arc-container-initialization-slowness-ba3ee7be6fb0


Developing on Raspberry Pi

https://medium.com/@sean.ankenbruck_96245/developing-on-raspberry-pi-9be59b135d23


Hosting and scaling EKS hybrid nodes with KubeVirt and Kube-OVN CNI

https://itnext.io/hosting-and-scaling-eks-hybrid-nodes-with-kubevirt-and-kube-ovn-cni-a9305d1290f8


Mastering GKE Multi-Tenancy: The Power of Namespaces, RBAC, and Quotas

https://immrbhattarai.medium.com/mastering-gke-multi-tenancy-the-power-of-namespaces-rbac-and-quotas-0a01d69dca87


Moving Logic Out of Pods: Extending the Argo Workflows Controller

In this article, I'll show how the Argo Workflows Executor Plugin lets you extend the Argo Workflows controller without maintaining your own fork—simply by implementing a small HTTP server in any language. As a bonus, this same mechanism reduces the number of extra pods in your DAGs and lightens the load on the Kubernetes scheduler. If you're new to Argo, I'll briefly cover the architecture and where plugins fit in. We'll finish with practical examples and key configuration details.

https://hackernoon.com/moving-logic-out-of-pods-extending-the-argo-workflows-controller


k8squest

K8sQuest is a local, game-based Kubernetes training platform with an interactive GUI-like terminal interface. Each mission breaks something in Kubernetes. Your job is to fix it.

https://github.com/Manoj-engineer/k8squest


kimspect

kimspect is a Kubernetes container image inspection tool that provides comprehensive visibility into the container images running inside your cluster. It can report image information by pod, namespace, and node. Built for performance and reliability, kimspect delivers container image insights through a simple, intuitive command-line interface.

https://github.com/koithos/kimspect


kaos

KAOS is a Kubernetes-native framework for deploying and orchestrating AI agents with tool access, multi-agent coordination, and seamless LLM integration.

https://github.com/axsaucedo/kaos


flux9s

A K9s-inspired terminal UI for monitoring Flux GitOps resources in real-time.

https://github.com/dgunzy/flux9s


nix-csi

Mount /nix into Kubernetes pods using the CSI Ephemeral Volume feature. Volumes share lifetime with Pods and are embedded into the Podspec.

https://github.com/lillecarl/nix-csi


Every layer of review makes you 10x slower

https://apenwarr.ca/log/20260316


cartography

Cartography is a Python tool that maps infrastructure assets and their relationships into a Neo4j-backed graph view.

https://github.com/cartography-cncf/cartography


Stairway to GitOps: Scaling Flux at Morgan Stanley

Morgan Stanley explains how it scaled Flux across 500+ clusters over five years, including security, performance, and observability lessons.

https://fluxcd.io/blog/2026/03/stairway-to-gitops-morgan-stanley


The Invisible Rewrite: Modernizing the Kubernetes Image Promoter

Every container image you pull from registry.k8s.io got there through kpromo, the Kubernetes image promoter. It copies images from staging registries to production, signs them with cosign, replicates signatures across more than 20 regional mirrors, and generates SLSA provenance attestations. If this tool breaks, no Kubernetes release ships. Over the past few weeks, we rewrote its core from scratch, deleted 20% of the codebase, made it dramatically faster, and nobody noticed. That was the whole point.

https://kubernetes.io/blog/2026/03/17/image-promoter-rewrite


Securing Production Debugging in Kubernetes

This covers safer Kubernetes debugging with least-privilege RBAC, short-lived identity-bound credentials, and audited SSH-style access paths.

https://kubernetes.io/blog/2026/03/18/securing-production-debugging-in-kubernetes
