DevOps&SRE Library

Being The First SRE

I have been the first Site Reliability Engineer (SRE) several times as a consultant or full-time employee. I’ve been the tech lead on three SRE teams and the only SRE on two others. I’ve succeeded (growing from one SRE to a team of five twice) and failed (quitting without another SRE being found). Here’s what I’ve learned about being the first SRE.

https://medium.com/@hans.knechtions/being-the-first-sre-7866a22975b4

GKE (Google Kubernetes Engine) Review

What if Kubernetes was idiot-proof?

https://matduggan.com/gke-google-kubernetes-engine-review

Understanding the Terraform Check Block Feature

We dive into one of Terraform's most recent features to leverage infrastructure validation.

https://masterpoint.io/updates/understanding-terraform-check

Traffic 101: Packets Mostly Flow

Slack handles billions of inbound network requests per day, all of which traverse our edge network and ingress load balancing tiers. In this blog post, we’ll talk about how a request flows, from a Slack user’s perspective, across the vast ether of the network to reach AWS and then Slack’s internal services. Let’s dive in!

https://slack.engineering/traffic-101-packets-mostly-flow

beyla

eBPF-based auto-instrumentation of HTTP/HTTPS/gRPC Go services, as well as HTTP/HTTPS services written in other languages (intercepting kernel-level socket operations as well as OpenSSL invocations).

https://github.com/grafana/beyla

Backup-and-Restore of Containers with Kubernetes Checkpointing API

Kubernetes v1.25 introduced the Container Checkpointing API as an alpha feature. This provides a way to back up and restore containers running in Pods, without ever stopping them.

This feature is primarily aimed at forensic analysis, but general backup-and-restore is something any Kubernetes user can take advantage of.

So, let's take a look at this brand-new feature and see how we can enable it in our clusters and leverage it for backup-and-restore or forensic analysis.

https://martinheinz.dev/blog/85

Benchmarking Kubernetes node initialization

In this benchmark we compared initialization time across 8 managed Kubernetes providers.

https://symbiosis.host/blog/comparing-node-launch-times

Write your Kubernetes Infrastructure as Go code — Manage AWS services

Deploy DynamoDB and a client app using cdk8s along with AWS Controller for Kubernetes

https://itnext.io/write-your-kubernetes-infrastructure-as-go-code-manage-aws-services-815ecd4d1af8

etcd-backup-restore

Etcd-backup-restore is a collection of components to back up and restore etcd. It also provides the ability to validate the data directory, so that we know the data directory is in good shape to bootstrap etcd successfully.

https://github.com/gardener/etcd-backup-restore

kubectl-foreach

Run kubectl commands in all/some contexts in parallel (similar to GNU xargs+parallel)

https://github.com/ahmetb/kubectl-foreach

Deploying non-deployable things on ArgoCD with Kustomize, handling edge cases

https://faun.pub/deploying-non-deployable-things-on-argocd-with-kustomize-handling-edge-cases-aa51d24b3e4d

A deep dive into Kubernetes Deployment strategies

https://learningdaily.dev/a-deep-dive-into-kubernetes-deployment-strategies-285af31014ae

Full CI/CD workflow with Skaffold for your application

A modern way to build a complete workflow from local to production, with Skaffold and GitLab on a Kubernetes cluster, to reduce cognitive load and operational complexity in application stacks.

https://blog.equationlabs.io/series/workflow-with-skaffold

ClickHouse Keeper: A ZooKeeper alternative written in C++

In this post, we describe the motivation, advantages, and development of ClickHouse Keeper and preview our next planned improvements. Moreover, we introduce a reusable benchmark suite, which allows us to simulate and benchmark typical ClickHouse Keeper usage patterns easily. Based on this, we present benchmark results highlighting that ClickHouse Keeper uses up to 46 times less memory than ZooKeeper for the same volume of data while maintaining performance close to ZooKeeper.

https://clickhouse.com/blog/clickhouse-keeper-a-zookeeper-alternative-written-in-cpp

launchpad

Launchpad is a command-line tool that lets you easily create applications on Kubernetes.

In practice, Launchpad works similarly to Heroku or Vercel, except everything is on Kubernetes.

https://github.com/jetpack-io/launchpad

etcdadm

etcdadm is a command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to an existing cluster, or remove a member from one. Its user experience is inspired by kubeadm.

https://github.com/kubernetes-sigs/etcdadm

Terraform Evolution: How We Safely Decoupled a Dozen of Services from a Monolith

https://medium.com/@susovan87/lesson-learned-after-decoupling-a-dozen-of-services-from-terraform-monolith-safely-with-no-downtime-404e503f6cb6

AWS Lambda Monitoring — A Full Guide

Maximize Your Serverless Success with the Complete AWS Lambda Monitoring Guide

https://aws.plainenglish.io/aws-lambda-monitoring-a-full-guide-3cc68c6052fd

How to run faster Loki metric queries with more accurate results

Today I want to talk about metric queries. More specifically, I want to talk about an important concept that is going to make your queries run faster, give you more accurate results, and make your Grafana Loki operators (like me) much happier.

https://grafana.com/blog/2023/07/05/how-to-run-faster-loki-metric-queries-with-more-accurate-results

You're Paying too much for (Cloudwatch) Logs

Reducing Cloudwatch Log Costs by 80% with Firehose, S3 and Athena

https://bit.kevinslin.com/p/youre-paying-too-much-for-cloudwatch
