Service Level Agreement
https://blog.alexewerlof.com/p/sla
Introduction to the SLA in relation to SLI and SLO
How to deal with alert fatigue head-on
https://incident.io/hubs/on-call/dealing-with-alert-fatigue-head-on
Everyone experiences stress at work—thankfully, it’s a topic folks aren’t shying away from anymore.
But for on-call engineers, alert fatigue is a phenomenon closer to home. Unfortunately, it can be just as insidious as stress and can drastically impact those it affects.
First discussed in the context of hospital settings, this phrase later entered engineering circles. Alert fatigue is when an excessive number of alerts overwhelms the individuals responsible for answering them, often over a prolonged period, resulting in missed or delayed responses, or in alerts being ignored altogether.
This fatigue can reach beyond the individual and create significant risks for your organization.
But, if you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we'll dive into the tactics teams can implement to address alert fatigue and its underlying causes.
Different Ways to Aggregate Nines
https://hross.substack.com/p/different-ways-to-aggregate-nines
While working on SLOs, SLAs and SLIs, I have found that there are only so many ways to aggregate service metrics. I have not yet found a resource that reviews the different aggregation methods and their relative strengths and weaknesses.
Distributed Tracing: A Whistle Stop Tour
https://metoro.io/blog/distributed-tracing-whistle-stop-tour
Know enough to be dangerous in 10 minutes
spqr
https://github.com/pg-sharding/spqr
SPQR is a production-ready system for horizontal scaling of PostgreSQL via sharding. We appreciate any kind of feedback and contribution to the project.
Grafana Loki: Optimising log based metrics
https://dev.to/siddharthjain1715/grafana-loki-optimising-log-based-metrics-5edb
There are multiple layers where the performance of Loki can be improved and fine-tuned. From optimising the query, channeling it efficiently for processing, to allocating the right computational resources, we will cover the following parameters that make a significant improvement to the performance.
Is GitOps actually useful?
https://medium.com/@briankgrant/is-gitops-actually-useful-a1c851ba99d8
GitOps doesn’t solve all deployment problems or even cover the entire deployment process, but it’s a solid foundational building block.
Automation using Control planes vs. Command-line tools
https://medium.com/@briankgrant/automation-using-control-planes-vs-command-line-tools-66f818ff8278
Monorepos vs. many repos: is there a good answer?
https://medium.com/@briankgrant/monorepos-vs-many-repos-is-there-a-good-answer-9bac102971da
The Technical History of Kubernetes
https://medium.com/@briankgrant/the-technical-history-of-kubernetes-2fe1988b522a
rotz
https://github.com/volllly/rotz
Fully cross-platform dotfile manager and dev environment bootstrapper written in Rust.
Moving fast breaks things: the importance of a staging environment
https://graphite.dev/blog/staging-environment
SLO formulas implementation in PromQL step by step
https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step
OpenTelemetry Collector Anti-Patterns
https://dev.to/avillela/opentelemetry-collector-anti-patterns-42be
What you need to know before creating your first OpenTelemetry pipeline for tracing
https://medium.com/adidoescode/what-you-need-to-know-before-creating-your-first-opentelemetry-pipeline-for-tracing-9d7838514cb9
Terragrunt Reference Architecture
https://github.com/Excoriate/terragrunt-ref-arch
This repository embodies a structured approach to organizing Terraform code with Terragrunt, focusing on reusability, ease of management, and scalability across multiple environments and cloud providers. It's crafted to guide teams in building robust cloud infrastructure that adheres to best practices and principles.
oneuptime
https://github.com/oneuptime/oneuptime
OneUptime is a comprehensive solution for monitoring and managing your online services. Whether you need to check the availability of your website, dashboard, API, or any other online resource, OneUptime can alert your team when downtime happens and keep your customers informed with a status page. OneUptime also helps you handle incidents, set up on-call rotations, run tests, secure your services, analyze logs, track performance, and debug errors.
Understanding Kubernetes emptyDir — With 3 Practical Use-cases
https://decisivedevops.com/understanding-kubernetes-emptydir-with-3-practical-use-cases-960f550e0e34
Learn how to effectively implement emptyDir memory for pods, with hands-on use cases for temporary data handling in Kubernetes.
Mastering Kubernetes: Journey with Cluster API
https://medium.com/hepsiburadatech/mastering-kubernetes-journey-with-cluster-api-2fb779ee7177
Let’s talk about how, at Hepsiburada, we efficiently manage hundreds of Kubernetes clusters that directly handle about 95% of the traffic from our more than 100 million monthly visitors. We’ll delve into the complexities of managing multiple clusters and discuss the strategies we employ to tackle these challenges.