DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

Roskomnadzor (RKN): https://knd.gov.ru/license?id=67704b536aa9672b963777b3&registryType=bloggersPermission
Scaling SRE Teams

Scaling teams of site reliability engineers comes with many challenges. Here, explore the challenges of scaling and review a successful scaling framework.


https://dzone.com/articles/scaling-sre-teams
Mastering AWS Lambda with Terraform: A Comprehensive Guide

https://blog.awsfundamentals.com/aws-lambda-with-terraform
VictoriaMetrics: A Comprehensive Guide, Comparing It to Prometheus, and Implementing Kubernetes Monitoring

https://medium.com/@seifeddinerajhi/victoriametrics-a-comprehensive-guide-comparing-it-to-prometheus-and-implementing-kubernetes-03eb8feb0cc2
Kubernetes And Kernel Panics

How Netflix’s Container Platform Connects Linux Kernel Panics to Kubernetes Pods


https://netflixtechblog.com/kubernetes-and-kernel-panics-ed620b9c6225
Kubewatch: A Kubernetes Watcher for Observability and Monitoring

Kubewatch is a Kubernetes watcher that publishes notifications to available collaboration hubs/notification channels. It watches the cluster for resource changes and notifies you through webhooks.


https://medium.com/@seifeddinerajhi/kubewatch-a-kubernetes-watcher-for-observability-and-monitoring-d6dea1dbeb06

https://github.com/robusta-dev/kubewatch
Notes on Self-hosted Transactional Email

Since a little more than two months ago, Healthchecks.io has been sending transactional email (~300’000 emails per month) through its own SMTP server. Here are my notes on setting it up.


https://blog.healthchecks.io/2023/08/notes-on-self-hosted-transactional-email
Martian Kubernetes Kit: a smooth-sailing toolkit from our SRE team

We’ve been using Kubernetes since before it was a “thing”, and as of 2023, we believe that it is still underutilized. In fact, it’s the best (and basically only real “at-scale”) solution for orchestrating Docker containers—or containers in general, after you’ve outgrown services like Heroku or Fly.io! That’s a bold claim, but it’s a belief backed up by our years of SRE experience. In this post, we’ll expand on that, and we’ll introduce a Kubernetes toolkit we already use and support for our clients, which simultaneously de-complexifies and highlights the benefits of Kubernetes.


https://evilmartians.com/chronicles/martian-kubernetes-kit-a-smooth-sailing-toolkit-from-our-sre-team
tofuenv

OpenTofu version manager inspired by tfenv


https://github.com/tofuutils/tofuenv
Service Level Indicators

Introduction to SLI, examples, counterexamples and tips


https://blog.alexewerlof.com/p/sli
On Error Budgets

An error budget is essentially the permissible limit of risk or failure that a service can tolerate while still meeting its objectives. It is closely tied to Service Level Objectives, which define the expected level of service reliability. For instance, if an SLO dictates 99.9% uptime, the error budget allows for a 0.1% margin of error or downtime.
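
The arithmetic in that example can be made concrete with a minimal sketch. It assumes a time-based availability SLO measured over a 30-day window; the function names `error_budget` and `budget_remaining` are hypothetical and do not come from the linked article.

```python
def error_budget(slo_percent: float, window_minutes: float) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Assumption: the SLO is time-based (uptime), not request-based.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The budget is whatever the SLO leaves over: 99.9% -> 0.1%.
    budget_fraction = (100 - slo_percent) / 100
    return window_minutes * budget_fraction


def budget_remaining(slo_percent: float, window_minutes: float,
                     downtime_minutes: float) -> float:
    """Budget left after the downtime already spent (negative if overspent)."""
    return error_budget(slo_percent, window_minutes) - downtime_minutes


if __name__ == "__main__":
    window = 30 * 24 * 60  # assumed 30-day window, in minutes
    allowed = error_budget(99.9, window)
    left = budget_remaining(99.9, window, downtime_minutes=10)
    print(f"Allowed downtime: {allowed:.1f} min")  # about 43.2 minutes
    print(f"Remaining budget: {left:.1f} min")     # about 33.2 minutes
```

Under these assumptions, a 99.9% SLO leaves roughly 43 minutes of downtime per 30 days. A request-based SLO would apply the same fraction to a request count instead of minutes.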


https://www.codereliant.io/on-error-budgets
Upgrading GitHub.com to MySQL 8.0

GitHub uses MySQL to store vast amounts of relational data. This is the story of how we seamlessly upgraded our production fleet to MySQL 8.0.


https://github.blog/2023-12-07-upgrading-github-com-to-mysql-8-0
AWS CDK vs Terraform

IaC is one of the key DevOps practices, and AWS CDK and Terraform are both great IaC tools for managing your AWS infrastructure. Having used both extensively, let me share my experience with the two tools.


https://medium.com/@kansvignesh/aws-cdk-vs-terraform-738c39d91f7a
Testing Framework in Terraform 1.6: A deep-dive

https://mattias.engineer/posts/terraform-testing-deep-dive
terraform-github-actions

This is a suite of Terraform- and OpenTofu-related GitHub Actions that can be used together to build effective Infrastructure as Code workflows.


https://github.com/dflook/terraform-github-actions
Incident severity levels for online platforms

Defining clear incident severity levels is a key component of an efficient incident management process: it helps engineering teams respond quickly to outages and mitigate customer impact.


https://argoday.medium.com/incident-severity-levels-78bfe7dd7e0d
From RSS to WSS: Navigating the Depths of Kubernetes Memory Metrics

Beyond the basics: an in-depth look at memory metrics in Kubernetes


https://itnext.io/from-rss-to-wss-navigating-the-depths-of-kubernetes-memory-metrics-4d7d77d8fdcb