DevOps&SRE Library

Best practices for avoiding race conditions in inhibition rules

https://www.grobinson.net/best-practices-for-avoiding-race-conditions-in-inhibition-rules.html


Understanding Multi-arch Containers, Benefits and CI/CD Integration

This blog post explains what multi-arch container images are, how they work, and how to build and promote them. It also walks through sample code for building a multi-arch image in a CI/CD pipeline.

https://www.infracloud.io/blogs/multi-arch-containers-ci-cd-integration


skipper

Skipper is an HTTP router and reverse proxy for service composition. It's designed to handle >300k HTTP route definitions with detailed lookup conditions, and flexible augmentation of the request flow with filters. It can be used out of the box or extended with custom lookup, filter logic and configuration sources.

https://github.com/zalando/skipper


Top 10 Cloud Provider Comparison 2023: VM Performance / Price

https://dev.to/dkechag/cloud-vm-performance-value-comparison-2023-perl-more-1kpp


hyperdx

HyperDX helps engineers figure out why production is broken faster by centralizing and correlating logs, metrics, traces, exceptions and session replays in one place. An open source and developer-friendly alternative to Datadog and New Relic.

https://github.com/hyperdxio/hyperdx


The Art of Building Fault-Tolerant Software Systems

Eight Pillars of Fault-tolerant Systems:
- Redundancy and Replication
- Load balancing
- Modularity
- Graceful degradation
- Circuit breaker
- Fail-fast
- Retries
- Rate limiting

https://www.codereliant.io/the-art-of-building-fault-tolerant-software-systems
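Three of the pillars listed above (circuit breaker, fail-fast, and retries) can be sketched together in a few lines of Python. This is a minimal illustration, not code from the linked article: the names `CircuitBreaker`, `CircuitOpenError`, and `retry`, along with the thresholds and delays, are hypothetical choices made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after max_failures consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the struggling dependency at all.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open breaker would defeat fail-fast
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap the breaker inside the retry, for example `retry(lambda: breaker.call(fetch))`, where `fetch` stands for any function that talks to a remote dependency. Transient errors are then retried with backoff, while a dependency that keeps failing is cut off quickly.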


Patterns for Terraform Multi-Account Deployments

https://awstip.com/patterns-for-terraform-multi-account-deployments-f47d77d6f250


Group wait, Group interval and Repeat interval explained

https://www.grobinson.net/group-wait-group-interval-and-repeat-interval-explained.html


terraform-target-autocompletion

Press tab after --target and get suggestions for your resources and modules.

terraform-target-autocompletion is a Go program that relies on terraform-config-inspect for the heavy lifting, so it should work with any Terraform version. You don't need anything other than the binary and the provided completion scripts. Currently, though, you'll need Go 1.21.0 installed to build it yourself.

https://github.com/shellwhale/terraform-target-autocompletion


Reducing high cardinality in Prometheus

https://sennasemakula.medium.com/reducing-high-cardinality-in-prometheus-3f110b6d9eb5


Network health overview with mtr, ss, lsof and iperf3

https://raduzaharia.medium.com/network-health-overview-with-mtr-ss-lsof-and-iperf3-8d0d2d191781


Scaling Kafka to Support PayPal’s Data Growth

Today, our Kafka fleet consists of over 1,500 brokers that host over 20,000 topics and close to 2,000 Mirror Maker nodes which are used to mirror the data among the clusters, offering 99.99% availability for our Kafka clusters. During the 2022 Retail Friday, Kafka traffic volume peaked at about 1.3 trillion messages per day! At present, we have 85+ Kafka clusters, and every holiday season we flex up our Kafka infrastructure to handle the traffic surge. The Kafka platform continues to seamlessly scale to support this traffic growth without any impact to our business.

https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab


Prometheus Certified Associate: A Comprehensive Guide

https://medium.com/@onai.rotich/prometheus-certified-associate-a-comprehensive-guide-9c51638578d2


harden-runner

Harden-Runner provides runtime security for GitHub-hosted and self-hosted environments.

https://github.com/step-security/harden-runner


How Cloudflare runs Prometheus at scale

At the time of writing this post, we run 916 Prometheus instances with a total of around 4.9 billion time series.

https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale


cf-terraforming

cf-terraforming is a command line utility to facilitate terraforming your existing Cloudflare resources. It does this by using your account credentials to retrieve your configurations from the Cloudflare API and converting them to Terraform configurations that can be used with the Terraform Cloudflare provider.

This tool is ideal if you already have Cloudflare resources defined but want to start managing them via Terraform, and don't want to spend the time to manually write the Terraform configuration to describe them.

https://github.com/cloudflare/cf-terraforming


Multi-Cloud Strategies with Crunchy Postgres for Kubernetes

https://www.crunchydata.com/blog/multi-cloud-strategies-with-crunchy-postgres-for-kubernetes


How Agoda Transitioned to Private Cloud

https://medium.com/agoda-engineering/private-cloud-and-you-736d8d99a51e


Understanding Kubernetes Limits and Requests

When working with containers in Kubernetes, it's important to know which resources are involved and how much of each a workload needs. Some processes require more CPU or memory than others. Some are critical and should never be starved.

With that in mind, we should configure the requests and limits of our containers and Pods properly to get the best of both.

https://sysdig.com/blog/kubernetes-limits-requests
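To make the idea concrete, here is a small Python sketch that builds a Pod manifest with CPU and memory requests and limits under `spec.containers[].resources`. It is an illustration under assumptions, not code from the linked article: the helpers `cpu_millicores` and `build_pod`, the Pod name `demo`, the `nginx` image, and the quantities are all hypothetical.

```python
import json


def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def build_pod(name, image, cpu_request, cpu_limit, mem_request, mem_limit):
    """Return a Pod manifest as a dict, with requests and limits set."""
    # Kubernetes rejects a container whose request exceeds its limit,
    # so the CPU pair is checked here before the manifest is built.
    if cpu_millicores(cpu_request) > cpu_millicores(cpu_limit):
        raise ValueError("CPU request must not exceed CPU limit")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # Requests: what the scheduler reserves on a node.
                        "requests": {"cpu": cpu_request, "memory": mem_request},
                        # Limits: the ceiling enforced at runtime.
                        "limits": {"cpu": cpu_limit, "memory": mem_limit},
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    # Example values only; size them from the workload's observed usage.
    pod = build_pod("demo", "nginx", "250m", "500m", "128Mi", "256Mi")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.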


Kubernetes OOM and CPU Throttling

Troubleshooting Memory and CPU problems

https://sysdig.com/blog/troubleshoot-kubernetes-oom
