DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN (Roskomnadzor): https://knd.gov.ru/license?id=67704b536aa9672b963777b3&registryType=bloggersPermission
Troubleshooting Missing Kubernetes Logs in Elasticsearch

https://povilasv.me/troubleshooting-missing-kubernetes-logs-in-elasticsearch
Optimizing Kubernetes scalability and cost-efficiency with Karpenter

In this post, you’ll learn the rationale and approach taken by Miro’s Compute team to enhance Kubernetes cluster scaling and efficiency. They adopted groupless node pools using Karpenter, which helped reduce compute costs in non-production clusters by up to 60% while raising resource usage efficiency in production to as much as 95%.


https://medium.com/miro-engineering/optimizing-kubernetes-scalability-and-cost-efficiency-with-karpenter-356153fcf546
Handling Pods When Nodes Fail

In addition to the basic Pod types, Kubernetes offers a variety of higher-level workload types, such as Deployment, DaemonSet, and StatefulSet. These higher-level controllers allow you to provide services with multiple replicas of your Pods, making it easier to achieve a high availability architecture.

However, when Kubernetes nodes experience failures such as crashes, network disruptions, or system failures, what happens to the Pods running on those nodes?

From a high availability perspective, some might think that having multiple replicas of an application ensures that the service remains unaffected by node failures. However, in certain cases where the application belongs to a StatefulSet, horizontal scaling isn’t an option. In such scenarios, it becomes necessary to quickly reschedule the related Pods to maintain service availability in the event of node failures.


https://hwchiu.medium.com/handling-pods-when-nodes-fail-4daae20213b
Kubernetes V1.27: Safeguarding Pod with MemoryThrottlingFactor

https://faun.pub/kubernetes-v1-27-safeguarding-pod-with-memorythrottlingfactor-cfbccde10de
dcgm-exporter

This repository contains the DCGM-Exporter project, a GPU metrics exporter for Prometheus that leverages NVIDIA DCGM.


https://github.com/NVIDIA/dcgm-exporter
Cilicon

Cilicon is a macOS app that leverages Apple's Virtualization Framework to create, provision, and run ephemeral CI VMs with near-native performance. It currently supports GitHub Actions, Buildkite Agent, GitLab Runner, and arbitrary scripts.


https://github.com/traderepublic/Cilicon
Stack Lifecycle Deployment

An open-source solution that defines and manages the complete lifecycle of resources used and provisioned in a cloud.


https://github.com/D10S0VSkY-OSS/Stack-Lifecycle-Deployment
Terraform manage multiple environments

How to manage multiple Terraform environments in your projects.


https://medium.com/@b0ld8/terraform-manage-multiple-environments-63939f41c454
pghero

A performance dashboard for Postgres


https://github.com/ankane/pghero
Unlocking Secure Connections: A Guide to PostgreSQL Authentication Methods

https://stormatics.tech/blogs/unlocking-secure-connections-a-guide-to-postgresql-authentication-methods
unused

CLI tool, Prometheus exporter, and Go module to list your unused disks in all cloud providers


https://github.com/grafana/unused
An intuitive documentation strategy

I wrote this blog post to share some of my learnings on creating intuitive documentation for products and projects over the past decade or so. This post is for those of you looking to make your documentation interesting enough for the audience to keep coming back for more.


https://abstraction.blog/2023/11/22/intuitive-documentation-strategy
terraform-aws-github-runner

This Terraform module creates the required infrastructure needed to host GitHub Actions self-hosted, auto-scaling runners on AWS spot instances. It provides the required logic to handle the life cycle for scaling up and down using a set of AWS Lambda functions. Runners are scaled down to zero to avoid costs when no workflows are active.


https://github.com/philips-labs/terraform-aws-github-runner
Switching Build Systems, Seamlessly

At Spotify, we have experimented with the Bazel build system since 2017. Over the years, the project has matured, and support for more languages and ecosystems has been added, thanks to the open source community and its maintainers at Google. In 2020, it became clear that the future of our client development required a unified build system that would scale well with our polyglot, multiplatform, and multimillion-line codebase.

So we focused more of our energy on Bazel, and we transitioned the iOS Spotify app to build completely with Bazel for our 200+ engineers — without missing a single weekly release for millions of our iOS users.


https://engineering.atspotify.com/2023/10/switching-build-systems-seamlessly
ScyllaDB on Kubernetes: How to Run Intense Workloads with Spot Instances

https://www.scylladb.com/2023/08/07/scylladb-on-kubernetes-how-to-run-intense-workloads-with-spot-instances
pipelight

Tiny automation pipelines. Bring CI/CD to the smallest projects. Self-hosted, Lightweight, CLI only.


https://github.com/pipelight/pipelight
keep

The open-source alerts management and automation platform


https://github.com/keephq/keep
atlantis

Terraform Pull Request Automation


https://github.com/runatlantis/atlantis