DevOps&SRE Library

Setting Java Heap Size Inside a Docker Container

https://medium.com/nordnet-tech/setting-java-heap-size-inside-a-docker-container-b5a4d06d2f46

Troubleshooting Missing Kubernetes Logs in Elasticsearch

https://povilasv.me/troubleshooting-missing-kubernetes-logs-in-elasticsearch

Optimizing Kubernetes scalability and cost-efficiency with Karpenter

In this post, you’ll learn the rationale and approach taken by Miro’s Compute team to enhance Kubernetes cluster scaling and efficiency. The team adopted groupless node pools using Karpenter, which helped reduce compute costs in non-production clusters by up to 60% while increasing production resource usage efficiency to as much as 95%.

https://medium.com/miro-engineering/optimizing-kubernetes-scalability-and-cost-efficiency-with-karpenter-356153fcf546

Handling Pods When Nodes Fail

In addition to the basic Pod types, Kubernetes offers a variety of higher-level workload types, such as Deployment, DaemonSet, and StatefulSet. These higher-level controllers allow you to provide services with multiple replicas of your Pods, making it easier to achieve a high availability architecture.

However, when Kubernetes nodes experience failures such as crashes, network disruptions, or system failures, what happens to the Pods running on those nodes?

From a high availability perspective, some might think that having multiple replicas of an application ensures that the service remains unaffected by node failures. However, in certain cases where the application belongs to a StatefulSet, horizontal scaling isn’t an option. In such scenarios, it becomes necessary to quickly reschedule the related Pods to maintain service availability in the event of node failures.

https://hwchiu.medium.com/handling-pods-when-nodes-fail-4daae20213b

Kubernetes V1.27 : Safeguarding Pod with MemoryThrottlingFactor

https://faun.pub/kubernetes-v1-27-safeguarding-pod-with-memorythrottlingfactor-cfbccde10de

dcgm-exporter

This repository contains the DCGM-Exporter project. It exposes GPU metrics to Prometheus, leveraging NVIDIA DCGM.

https://github.com/NVIDIA/dcgm-exporter

Cilicon

Cilicon is a macOS app that leverages Apple's Virtualization Framework to create, provision, and run ephemeral CI VMs with near-native performance. It currently supports GitHub Actions, Buildkite Agent, GitLab Runner, and arbitrary scripts.

https://github.com/traderepublic/Cilicon

Stack Lifecycle Deployment

An open-source solution that defines and manages the complete lifecycle of resources used and provisioned in a cloud.

https://github.com/D10S0VSkY-OSS/Stack-Lifecycle-Deployment

Terraform manage multiple environments

How to manage multiple Terraform environments in your projects

https://medium.com/@b0ld8/terraform-manage-multiple-environments-63939f41c454

pghero

A performance dashboard for Postgres

https://github.com/ankane/pghero

Unlocking Secure Connections: A Guide to PostgreSQL Authentication Methods

https://stormatics.tech/blogs/unlocking-secure-connections-a-guide-to-postgresql-authentication-methods

unused

CLI tool, Prometheus exporter, and Go module to list your unused disks in all cloud providers

https://github.com/grafana/unused

An intuitive documentation strategy

I wrote this blog post to share some of my learnings on creating intuitive documentation for products and projects over the past decade or so. This post is for those of you looking to make your documentation interesting enough for the audience to keep coming back for more.

https://abstraction.blog/2023/11/22/intuitive-documentation-strategy

terraform-aws-github-runner

This Terraform module creates the required infrastructure needed to host GitHub Actions self-hosted, auto-scaling runners on AWS spot instances. It provides the required logic to handle the life cycle for scaling up and down using a set of AWS Lambda functions. Runners are scaled down to zero to avoid costs when no workflows are active.

https://github.com/philips-labs/terraform-aws-github-runner

Switching Build Systems, Seamlessly

At Spotify, we have experimented with the Bazel build system since 2017. Over the years, the project has matured, and support for more languages and ecosystems has been added, thanks to the open source community and its maintainers at Google. In 2020, it became clear that the future of our client development required a unified build system that would scale well with our polyglot, multiplatform, and multimillion-line codebase.

So we focused more of our energy on Bazel, and we transitioned the iOS Spotify app to build completely with Bazel for our 200+ engineers — without missing a single weekly release for millions of our iOS users.

https://engineering.atspotify.com/2023/10/switching-build-systems-seamlessly

ScyllaDB on Kubernetes: How to Run Intense Workloads with Spot Instances

https://www.scylladb.com/2023/08/07/scylladb-on-kubernetes-how-to-run-intense-workloads-with-spot-instances

pipelight