DevOps&SRE Library

How to set a good only one threshold for an alert?

Have you ever asked yourself what a good threshold is for your alert setup?

I have worked on alerting systems for more than 10 years in e-commerce and healthcare. Setting good threshold(s) for an alert is very difficult and contentious.

https://medium.com/production-care/how-to-set-a-good-only-one-threshold-for-an-alert-ddc00c975821


Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions with multiple stakeholders. The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.

https://www.honeycomb.io/blog/negotiating-priorities-incident-investigations


Our commitment to OpenTelemetry

Prometheus OpenTelemetry support

https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry


You Might Be Better Off Without Pull Requests

Honestly, pull requests sound like a pretty sweet tool for collaborating on a shared code base. They are a huge success in the open source space, and looking at that success alone it’s not surprising that a lot of teams use a pull request-based process for themselves. On the other hand, there are a lot of voices out there highlighting how using pull requests as the default mechanism for collaboration can slow down your team and prevent you from getting changes into the hands of your users quickly and reliably. Patterns that worked well for low-trust open source communities, they say, didn’t translate well to teams where you know and trust all of your collaborators. Critics of pull requests often suggest alternative workflows that predate pull requests and even git and other distributed version control systems.

https://hamvocke.com/blog/better-off-without-pull-requests


tmate

Tmate is a fork of tmux. It provides an instant pairing solution.

https://github.com/tmate-io/tmate


ingestr

Ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, no code necessary.

https://github.com/bruin-data/ingestr


daytona

Set up a development environment on any infrastructure, with a single command.

https://github.com/daytonaio/daytona


How we avoided alarm fatigue syndrome by managing/reducing the alerting noise

https://medium.com/doctolib/how-we-avoided-alarm-fatigue-syndrome-by-managing-reducing-the-alerting-noise-aac5c008d2e2


GitHub Actions: Terraform deployments with a review of planned changes

https://itnext.io/github-actions-terraform-deployments-with-a-review-of-planned-changes-30143358bb5c


Terraform Strategies for Seamless Grafana Dashboards Across Regions

https://medium.com/tblx-insider/global-products-global-monitoring-terraform-strategies-for-seamless-grafana-dashboards-1e8c2af68512


k8spacket - a fully based on eBPF right now

https://medium.com/@bareckidarek/k8spacket-a-fully-based-on-ebpf-right-now-e72d5383c743


Measuring Developer Productivity via Humans

Measuring developer productivity is a difficult challenge. Conventional metrics focused on development cycle time and throughput are limited, and there aren't obvious answers for where else to turn. Qualitative metrics offer a powerful way to measure and understand developer productivity using data derived from developers themselves. Organizations should prioritize measuring developer productivity using data from humans, rather than data from systems.

https://martinfowler.com/articles/measuring-developer-productivity-humans.html


How we improved ingester load balancing in Grafana Mimir with spread-minimizing tokens

Grafana Mimir is our open source, horizontally scalable, multi-tenant time series database, which allows us to ingest beyond 1 billion active series. Mimir ingesters use consistent hashing, a distributed hashing technique for data replication. This technique guarantees that only a minimal number of time series are relocated between the available ingesters when ingesters are added to or removed from the system.

Unfortunately, we noticed that the consistent hashing algorithm previously used by Mimir ingesters caused an uneven distribution of time series between ingesters, with load distribution differences going up to 25%. As a consequence, some ingesters were overwhelmed, while the others were underused. In order to solve this problem, we came up with a novel algorithm, called spread-minimizing token generation strategy, that allows us to benefit from the consistent hashing on one side and from an almost perfect load distribution on the other side.

Uniform load balancing optimizes network performance and reduces latency, as the demand is equally distributed among ingesters. This allows for better usage of compute resources, which leads to more consistent performance. In this blog post, we introduce our new algorithm and show how it improved ingester load balancing in some of our production clusters for Grafana Cloud Metrics (which is powered by Mimir) to the degree that it’s now almost perfect.
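The two properties described above (minimal relocation and potentially uneven load) can be illustrated with a minimal consistent hashing sketch. This is not Mimir's implementation and not the spread-minimizing strategy from the post: it assumes randomly generated tokens, and the `Ring` class, `hash32` helper, ingester names, and token counts are all hypothetical.

```python
import bisect
import hashlib
import random
from collections import Counter


def hash32(key: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    """Toy hash ring: each ingester owns a set of random tokens (an assumption)."""

    def __init__(self) -> None:
        self.tokens: list[int] = []      # sorted token positions
        self.owner: dict[int, str] = {}  # token -> ingester name

    def add(self, ingester: str, num_tokens: int, rng: random.Random) -> None:
        for _ in range(num_tokens):
            token = rng.randrange(2**32)
            while token in self.owner:   # avoid token collisions
                token = rng.randrange(2**32)
            self.owner[token] = ingester
            bisect.insort(self.tokens, token)

    def lookup(self, series: str) -> str:
        # A series belongs to the first token at or after its hash, wrapping around.
        i = bisect.bisect_left(self.tokens, hash32(series))
        return self.owner[self.tokens[i % len(self.tokens)]]


rng = random.Random(0)
ring = Ring()
for n in range(4):
    ring.add(f"ingester-{n}", 64, rng)

series = [f"series-{i}" for i in range(20000)]
before = {s: ring.lookup(s) for s in series}
load = Counter(before.values())
# Relative gap between the most and least loaded ingester.
spread = (max(load.values()) - min(load.values())) / max(load.values())
print("series per ingester:", dict(load))
print(f"relative spread: {spread:.1%}")

# Adding an ingester only moves series onto the new ingester.
ring.add("ingester-4", 64, rng)
after = {s: ring.lookup(s) for s in series}
moved = [s for s in series if before[s] != after[s]]
print(f"moved {len(moved)} of {len(series)} series")
```

With random tokens the per-ingester counts generally differ, which is the kind of imbalance the post sets out to remove, while every relocated series lands on the newly added ingester.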

https://grafana.com/blog/2024/03/07/how-we-improved-ingester-load-balancing-in-grafana-mimir-with-spread-minimizing-tokens


Load Balancing: Handling Heterogeneous Hardware

This blog post describes Uber’s journey towards utilizing hardware efficiently via better load balancing. The work described here lasted over a year, involved engineers across multiple teams, and delivered significant efficiency savings. The article covers the technical solutions and our discovery process to get to them; in many ways, the journey was harder than the destination.

https://www.uber.com/en-HR/blog/load-balancing-handling-heterogeneous-hardware


BuildKit in depth: Docker's build engine explained

https://depot.dev/blog/buildkit-in-depth


openstatus

OpenStatus is an open-source synthetic monitoring platform with a beautiful status page and incident management. We are building it publicly for everyone to see our progress. We believe great software is built this way.

https://github.com/openstatusHQ/openstatus


jnv

jnv is designed for navigating JSON, offering an interactive JSON viewer and jq filter editor.

https://github.com/ynqa/jnv


retina

Retina is a cloud-agnostic, open-source Kubernetes network observability platform that provides a centralized hub for monitoring application health, network health, and security. It provides actionable insights to cluster network administrators, cluster security administrators, and DevOps engineers navigating DevOps, SecOps, and compliance use cases.

https://github.com/microsoft/retina


Build a Lightweight Internal Developer Platform with Argo CD and Kubernetes Labels

Note: This blog post demonstrates how to create a lightweight Internal Developer Platform without relying on Backstage, while still empowering you and your developers with a self-service approach. By utilizing GitOps with Argo CD and leveraging Kubernetes labels, this method offers a streamlined and efficient solution for managing and deploying your infrastructure.

https://itnext.io/build-a-lightweight-internal-developer-platform-with-argo-cd-and-kubernetes-labels-4c0e52c6c0f4


Signing container images: Comparing Sigstore, Notary, and Docker Content Trust

https://snyk.io/blog/signing-container-images
