This piece, "The MTTI Manifesto," argues for the importance of a new metric in incident response: Mean Time to Isolate. The author contends that the majority of outage time is spent identifying the problem's source, not fixing it, and that focusing on MTTI can drive significant improvements in system architecture and observability.
https://www.oldschoolburke.com/the-mtti-manifesto/
Old School Burke
012: The MTTI Manifesto
Mean Time to Isolate
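As a rough illustration of the metric described above, the sketch below computes MTTI next to overall time to resolve. It assumes MTTI is measured from detection to the moment the faulty component is identified; the incident records, field names, and the `mean_minutes` helper are made up for the example and do not come from the article.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (illustrative data, not from the article).
incidents = [
    {"detected": datetime(2025, 1, 3, 10, 0),
     "isolated": datetime(2025, 1, 3, 10, 45),
     "resolved": datetime(2025, 1, 3, 11, 0)},
    {"detected": datetime(2025, 2, 9, 14, 0),
     "isolated": datetime(2025, 2, 9, 14, 30),
     "resolved": datetime(2025, 2, 9, 14, 40)},
    {"detected": datetime(2025, 3, 21, 8, 0),
     "isolated": datetime(2025, 3, 21, 9, 15),
     "resolved": datetime(2025, 3, 21, 9, 20)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumption: MTTI runs from detection to isolation of the faulty component.
mtti = mean_minutes(incidents, "detected", "isolated")
# Time to resolve runs from detection to restoration of service.
mttr = mean_minutes(incidents, "detected", "resolved")
# Fraction of the outage spent finding the problem rather than fixing it.
isolation_share = mtti / mttr

print(f"MTTI: {mtti:.0f} min, MTTR: {mttr:.0f} min, "
      f"share spent isolating: {isolation_share:.0%}")
```

Tracking the share alongside the raw averages makes it easier to see whether architecture or observability changes are shortening the search for the fault or only the fix.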
AWSDoor is a red team automation tool designed to simulate advanced attacker behavior in AWS environments
https://github.com/OtterHacker/AWSDoor
This write-up explores the emerging discipline of AI Reliability Engineering (AIRe) as the "Third Age of SRE." It argues that the unique challenges of AI workloads, such as their probabilistic nature and new failure modes like model decay, require an evolution of traditional Site Reliability Engineering principles.
https://thenewstack.io/ai-reliability-engineering-welcome-to-the-third-age-of-sre/
The New Stack
AI Reliability Engineering: Welcome to the Third Age of SRE
SREs must build AI we can trust, leveraging the emerging ecosystem of tools and standards.
This dispatch offers a detailed walkthrough for backend engineers on creating a Kubernetes Operator using Go and Kubebuilder. The author, Amr Elhewy, simplifies complex DevOps concepts by building a practical "PodTracker" operator that sends Slack notifications for new pod creations.
https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-making-a-kubernetes-operator-with-go
MLOps Tools For Managing & Orchestrating The Machine Learning LifeCycle
https://github.com/polyaxon/polyaxon
OpenYurt - Extending your native Kubernetes to edge (project under CNCF)
https://github.com/openyurtio/openyurt
Forwarded from AWS Notes (Roman Siewko)
🔥 FREE premium exam prep on AWS Skill Builder until Jan 5, 2026!
https://skillbuilder.aws/
Covers:
• AWS Certified Cloud Practitioner (CLF-C02)
• AWS AI Practitioner
What you get (normally paid):
• Official practice exams
• Hands-on labs (SimuLearn)
• AWS Escape Room (learning by playing)
• Flashcards & learning plans
Plus, there are always-free resources:
• Official practice questions
• Free AWS training events
• AWS Educate (labs + potential free exam vouchers)
#AWS_certification
This post compares Amazon EKS Auto Mode and Azure AKS Automatic, evaluating which platform offers a superior managed Kubernetes solution. While acknowledging AWS's progress, the author ultimately argues that AKS Automatic's more comprehensive, end-to-end automation makes it the clear winner for a truly hands-off experience.
https://pixelrobots.co.uk/2024/12/amazon-eks-auto-mode-vs-azure-aks-automatic-the-better-managed-kubernetes-solution/
This paper delves into disaster recovery architectures that go beyond simple high availability to ensure systems remain operational even when HA fails. Yakaiah Bommishetti outlines various DR strategies, from cold backups to active-active multi-site setups, emphasizing the critical difference between preventing failures and restoring services after a catastrophe.
https://hackernoon.com/beyond-high-availability-disaster-recovery-architectures-that-keep-running-when-ha-fails
Hackernoon
Beyond High Availability: Disaster Recovery Architectures That Keep Running When HA Fails
High Availability is not Disaster Recovery. This in-depth guide explores real-world Disaster Recovery architectures.
DevOps & SRE notes
Cloudflare, again
Will the "Code Orange" help Cloudflare?
https://blog.cloudflare.com/fail-small-resilience-plan/
The Cloudflare Blog
Code Orange: Fail Small - our resilience plan following recent incidents
We have declared "Code Orange: Fail Small" to focus everyone at Cloudflare on a set of high-priority workstreams with one simple goal: ensure that the cause of our last two global outages never happens again.
A set of modern Grafana dashboards for Kubernetes.
https://github.com/dotdc/grafana-dashboards-kubernetes
This case study examines the build-versus-buy decision for Terraform CI/CD orchestration by analyzing a custom-built tool called Terraflow. The author reflects on the trade-offs between creating a bespoke solution that perfectly fits a specific workflow and the opportunity cost of diverting engineering resources from core business features.
https://terrateam.io/blog/build-vs-buy-terraflow-case-study
Terrateam
This tutorial guides readers through building a unified OpenTelemetry pipeline in Kubernetes to correlate metrics, logs, and traces. Fatih Koç explains how to deploy the OTel Collector as both a DaemonSet and a gateway to centralize enrichment and sampling, ultimately reducing incident resolution time.
https://fatihkoc.net/posts/opentelemetry-kubernetes-pipeline/
Fatih Koç
Building a Unified OpenTelemetry Pipeline in Kubernetes
Deploy OpenTelemetry Collector in Kubernetes to unify metrics, logs, and traces with correlation, smart sampling, and insights for faster incident resolution.
This documentation demystifies the structure of Kubernetes YAML files by breaking them down into their three core components: metadata, spec, and status. It explains how users define the desired state in the spec, while Kubernetes continuously works to align the actual status with that intent through its reconciliation loop.
https://medium.com/@thisara.weerakoon2001/demystifying-kubernetes-yaml-ef9e92acf3df
Medium
Demystifying Kubernetes YAML
In the world of Kubernetes, YAML files are the bread and butter. They are the declarative way you tell Kubernetes what you want your…
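The spec/status relationship summarized in this post can be sketched as a toy reconciliation loop. This is a simplified model in plain Python, not Kubernetes code: the `Resource` class, the `reconcile` function, and the one-replica-per-pass behavior are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Toy stand-in for a Kubernetes object (hypothetical, for illustration)."""
    metadata: dict                               # identifying data, e.g. name
    spec: dict                                   # desired state, set by the user
    status: dict = field(default_factory=dict)   # observed state, set by the system


def reconcile(resource: Resource) -> bool:
    """Move the observed replica count one step toward the desired count.

    Returns True once status matches spec.
    """
    desired = resource.spec["replicas"]
    actual = resource.status.get("replicas", 0)
    if actual < desired:
        actual += 1   # pretend to start one replica
    elif actual > desired:
        actual -= 1   # pretend to stop one replica
    resource.status["replicas"] = actual
    return actual == desired


web = Resource(metadata={"name": "web"}, spec={"replicas": 3})

# Keep reconciling until the observed state matches the declared intent.
passes = 0
converged = False
while not converged:
    converged = reconcile(web)
    passes += 1

print(web.metadata["name"], web.status, "after", passes, "passes")
```

The user only ever edits `spec`; the loop is the sole writer of `status`, which mirrors the division of responsibility the post describes.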
This engineering publication from DoubleVerify presents a case study on synchronizing database schema updates across multiple projects and environments. The team developed a solution using a shared, standalone schema migrations repository and Kubernetes pre-install hooks to automate and coordinate the process.
https://medium.com/doubleverify-engineering/a-case-study-in-synchronizing-database-schema-updates-between-projects-and-environments-a69a3cc38985
Medium
A Case Study in Synchronizing Database Schema Updates between Projects and Environments
Written By: Chaim Leichman