Being The First SRE
I have been the first Site Reliability Engineer (SRE) several times as a consultant or full-time employee. I’ve been the tech lead on three SRE teams and the only SRE on two others. I’ve succeeded (growing from one SRE to a team of five twice) and failed (quitting without another SRE being found). Here’s what I’ve learned about being the first SRE.
https://medium.com/@hans.knechtions/being-the-first-sre-7866a22975b4
GKE (Google Kubernetes Engine) Review
What if Kubernetes was idiot-proof?
https://matduggan.com/gke-google-kubernetes-engine-review
Understanding the Terraform Check Block Feature
We dive into one of Terraform's most recent features to leverage infrastructure validation.
https://masterpoint.io/updates/understanding-terraform-check
Traffic 101: Packets Mostly Flow
Slack handles billions of inbound network requests per day, all of which traverse our edge network and ingress load balancing tiers. In this blog post, we’ll talk about how a request flows, from a Slack user’s perspective, across the vast ether of the network to reach AWS and then Slack’s internal services. Let’s dive in!
https://slack.engineering/traffic-101-packets-mostly-flow
beyla
eBPF-based auto-instrumentation of HTTP/HTTPS/gRPC Go services, as well as HTTP/HTTPS services written in other languages (intercepting kernel-level socket operations as well as OpenSSL invocations).
https://github.com/grafana/beyla
Backup-and-Restore of Containers with Kubernetes Checkpointing API
Kubernetes v1.25 introduced the Container Checkpointing API as an alpha feature. This provides a way to back up and restore containers running in Pods without ever stopping them.
This feature is primarily aimed at forensic analysis, but general backup-and-restore is something any Kubernetes user can take advantage of.
So, let's take a look at this brand-new feature and see how we can enable it in our clusters and leverage it for backup-and-restore or forensic analysis.
https://martinheinz.dev/blog/85
Benchmarking Kubernetes node initialization
In this benchmark we compared initialization time across 8 managed Kubernetes providers.
https://symbiosis.host/blog/comparing-node-launch-times
Write your Kubernetes Infrastructure as Go code — Manage AWS services
Deploy DynamoDB and a client app using cdk8s along with AWS Controllers for Kubernetes.
https://itnext.io/write-your-kubernetes-infrastructure-as-go-code-manage-aws-services-815ecd4d1af8
etcd-backup-restore
Etcd-backup-restore is a collection of components to back up and restore etcd. It also provides the ability to validate the data directory, so that we know the data directory is in good shape to bootstrap etcd successfully.
https://github.com/gardener/etcd-backup-restore
kubectl-foreach
Run kubectl commands in all/some contexts in parallel (similar to GNU xargs+parallel).
https://github.com/ahmetb/kubectl-foreach
Deploying non-deployable things on ArgoCD with Kustomize, handling edge cases
https://faun.pub/deploying-non-deployable-things-on-argocd-with-kustomize-handling-edge-cases-aa51d24b3e4d
A deep dive into Kubernetes Deployment strategies
https://learningdaily.dev/a-deep-dive-into-kubernetes-deployment-strategies-285af31014ae
Full CI/CD workflow with Skaffold for your application
A modern way to build a complete workflow from local to production, with Skaffold and GitLab on a Kubernetes cluster, to reduce cognitive load and operational complexity in application stacks.
https://blog.equationlabs.io/series/workflow-with-skaffold
ClickHouse Keeper: A ZooKeeper alternative written in C++
In this post, we describe the motivation, advantages, and development of ClickHouse Keeper and preview our next planned improvements. Moreover, we introduce a reusable benchmark suite, which allows us to simulate and benchmark typical ClickHouse Keeper usage patterns easily. Based on this, we present benchmark results highlighting that ClickHouse Keeper uses up to 46 times less memory than ZooKeeper for the same volume of data while maintaining performance close to ZooKeeper.
https://clickhouse.com/blog/clickhouse-keeper-a-zookeeper-alternative-written-in-cpp
launchpad
Launchpad is a command-line tool that lets you easily create applications on Kubernetes.
In practice, Launchpad works similarly to Heroku or Vercel, except everything is on Kubernetes.
https://github.com/jetpack-io/launchpad
etcdadm
etcdadm is a command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to, or remove a member from an existing cluster. Its user experience is inspired by kubeadm.
https://github.com/kubernetes-sigs/etcdadm
Terraform Evolution: How We Safely Decoupled a Dozen of Services from a Monolith
https://medium.com/@susovan87/lesson-learned-after-decoupling-a-dozen-of-services-from-terraform-monolith-safely-with-no-downtime-404e503f6cb6
AWS Lambda Monitoring — A Full Guide
Maximize Your Serverless Success with the Complete AWS Lambda Monitoring Guide
https://aws.plainenglish.io/aws-lambda-monitoring-a-full-guide-3cc68c6052fd
How to run faster Loki metric queries with more accurate results
Today I want to talk about metric queries. More specifically, I want to talk about an important concept that is going to make your queries run faster, give you more accurate results, and make your Grafana Loki operators (like me) much happier.
https://grafana.com/blog/2023/07/05/how-to-run-faster-loki-metric-queries-with-more-accurate-results
You're Paying too much for (Cloudwatch) Logs
Reducing Cloudwatch Log Costs by 80% with Firehose, S3 and Athena
https://bit.kevinslin.com/p/youre-paying-too-much-for-cloudwatch