DevOps&SRE Library

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot.

https://github.com/TabbyML/tabby


Setting up your first EKS cluster on AWS: some practical tips

https://medium.com/@benjamin.christmann_12432/setting-up-your-first-eks-cluster-on-aws-some-practical-tips-60400963c588


A Guide to Kubernetes Application Resource Tuning

p1: https://medium.com/@vvsevel/a-guide-to-kubernetes-application-resource-tuning-part-1-bf0ba04db10

p2: https://medium.com/@vvsevel/a-guide-to-kubernetes-application-resource-tuning-part-2-1d287479b52b

p3: https://medium.com/@vvsevel/a-guide-to-kubernetes-application-resource-tuning-part-3-40f7f6510c93


AKS Networking Deep Dive: Kubenet vs Azure-CNI vs Azure-CNI (overlay)

https://inder-devops.medium.com/aks-networking-deep-dive-kubenet-vs-azure-cni-vs-azure-cni-overlay-a51709171ce9


GitOps using Flux and Flagger

https://dev.to/infracloud/gitops-using-flux-and-flagger-15ci


Kubernetes Services: ClusterIP, Nodeport and LoadBalancer

https://sysdig.com/blog/kubernetes-services-clusterip-nodeport-loadbalancer


Lessons Learned from Twenty Years of Site Reliability Engineering

Or, Eleven things we have learned as Site Reliability Engineers at Google

1. The riskiness of a mitigation should scale with the severity of the outage
2. Recovery mechanisms should be fully tested before an emergency
3. Canary all changes
4. Have a "Big Red Button"
5. Unit tests alone are not enough - integration testing is also needed
6. COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
7. Intentionally degrade performance modes
8. Test for Disaster resilience
9. Automate your mitigations
10. Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong
11. A single global hardware version is a single point of failure

https://sre.google/resources/practices-and-processes/twenty-years-of-sre-lessons-learned


How DoorDash Migrated from StatsD to Prometheus

https://doordash.engineering/2023/08/01/how-doordash-migrated-from-statsd-to-prometheus


How to use Terraform test

Terraform v1.6.0 introduces a test framework named “Terraform test”. Here’s how to use it.

https://blog.captaincy.io/how-to-use-terraform-test


Terraform project structure with reusable modules

https://erudinsky.com/2023/10/20/structuring-terraform-projects


cluster.dev

Cluster.dev is an open-source tool for managing cloud-native infrastructure with simple declarative manifests called infrastructure templates. The templates can be based on Terraform modules, Kubernetes manifests, shell scripts, Helm charts, Kustomize and ArgoCD/Flux applications, OPA policies, etc. Cluster.dev ties those components together so that you can deploy, test, and distribute a whole set of components with pinned versions.

https://github.com/shalb/cluster.dev


Prometheus and its storage: Architecture, challenges, and solutions

This two-article series is about monitoring. Part One covers accumulating a multitude of different metrics in a single place, handling permissions for different aspects of those metrics, and storing large amounts of data. Part Two focuses on choosing a monitoring system, using the brief example of a fictional company's “journey” as it struggles to keep expanding its monitoring system while its infrastructure grows.

https://blog.palark.com/prometheus-architecture-tsdb


What is a Memory Leak?

Memory leaks are a common and frustrating problem in software development. These issues arise when a program fails to free up memory that is no longer being used, leading to a gradual loss of available memory over time.

https://www.codereliant.io/what-is-a-memory-leak


Rescue Struggling Pods from Scratch

https://www.honeycomb.io/blog/rescue-struggling-pods-from-scratch


Solving Metrics at scale with VictoriaMetrics

https://sarthak-acoustic.medium.com/solving-metrics-at-scale-with-victoriametrics-ac9c306826c3


How Grafanalib Helps You Manage Dashboards at Scale

https://www.contino.io/insights/grafanalib


A Guide to Service Discovery with Prometheus Operator — How to use Pod Monitor, Service Monitor and Scrape Config

https://medium.com/@helia.barroso/a-guide-to-service-discovery-with-prometheus-operator-how-to-use-pod-monitor-service-monitor-6a7e4e27b303


Profiling: Flame Chart vs. Flame Graph

Flame Charts and Flame Graphs clearly explained

https://medium.com/performance-engineering-for-the-ordinary-barbie/profiling-flame-chart-vs-flame-graph-7b212ddf3a83


Reduce cross-AZ traffic costs on EKS using topology aware hints

https://blog.ratnopamc.com/reduce-cross-az-traffic-costs-on-eks-using-topology-aware-hints


Advanced Secret Management on Kubernetes With Pulumi and GitOps: Sealed Secrets Controller

https://blog.ediri.io/advanced-secret-management-on-kubernetes-with-pulumi-and-gitops-sealed-secrets-controller
