DevOps&SRE Library

octelium

Octelium is a free and open source, self-hosted, unified platform for zero trust resource access that is primarily meant to be a modern alternative to remote access VPNs and similar tools.

https://github.com/octelium/octelium

Breaking up a monolith: How we’re unwinding a shared database at scale

https://www.datadoghq.com/blog/engineering/unwinding-shared-database

Taming Complexity: HelloFresh’s Playbook for Managing Large-Scale Change

P1: https://engineering.hellofresh.com/taming-complexity-hellofreshs-playbook-for-managing-large-scale-programs-part-1-3-cdf06c5a6ed9

P2: https://engineering.hellofresh.com/taming-complexity-hellofreshs-playbook-for-managing-large-scale-change-part-2-3-516dc3961e26

P3: https://engineering.hellofresh.com/taming-complexity-hellofreshs-playbook-for-managing-large-scale-change-part-3-3-ec0fd8bc6cd9

Kubernetes List API performance and reliability

At my current employer, we use Kubernetes to run hundreds of thousands of bare metal servers, spread over hundreds of Kubernetes clusters. We use Kubernetes beyond officially supported/tested scale limits by running more than 5,000 nodes and over a hundred thousand pods in a single cluster. In these large-scale setups, expensive “list” calls on the Kubernetes API are the Achilles' heel of control plane reliability and scalability. In this article, I’ll explain which list call patterns pose the most risk, and how recent and upcoming Kubernetes versions are improving list API performance.

https://ahmet.im/blog/kubernetes-list-performance

opencode

AI coding agent, built for the terminal.

https://github.com/sst/opencode

ktea

ktea is a tool designed to simplify and accelerate interactions with Kafka clusters.

https://github.com/jonas-grgt/ktea

GitOps: View from a security perspective

https://medium.com/@TechInternals/gitops-view-from-a-security-perspective-a120795b2f17

"Best practices" aren't always best for you

https://thefridaydeploy.substack.com/p/best-practices-arent-always-best

SLA vs SLO

Demystifying the most common misconception in Service Level jargon

https://blog.alexewerlof.com/p/sla-vs-slo

tfautomv

Generate Terraform moved blocks automatically for painless refactoring

https://github.com/busser/tfautomv

When SIGTERM Does Nothing: A Postgres Mystery

The ClickPipes team encountered a bug with logical replication slot creation on Postgres read replicas: a query that normally takes a few seconds was running for hours and couldn’t be terminated by any of the usual methods in Postgres, causing customer frustration and risking the stability of production databases. In this blog post, I’ll walk through how I investigated the problem and ultimately discovered it was due to a Postgres bug. We’ll also share how we fixed it and our experience working with the Postgres community.

https://clickhouse.com/blog/sigterm-postgres-mystery

Mastering Postgres Replication Slots: Preventing WAL Bloat and Other Production Issues

https://www.morling.dev/blog/mastering-postgres-replication-slots

Life Altering Postgresql Patterns

There is a set of things you can do when working with a Postgres database that I have found make my and my coworkers' lives much more pleasant. Each one is small by itself, but in aggregate they have a noticeable effect.

https://mccue.dev/pages/3-11-25-life-altering-postgresql-patterns

Don't Do This

A short list of common mistakes.

https://wiki.postgresql.org/wiki/Don%27t_Do_This

Fix a top cause of slow queries in PostgreSQL (no slow query log needed)

https://render.com/blog/postgresql-top-cause-slow-queries

Postgres query plan visualization tools

https://www.pgmustard.com/blog/postgres-query-plan-visualization-tools

OpenAI: Scaling PostgreSQL to the Next Level

At the PGConf.dev 2025 Global Developer Conference, Bohan Zhang from OpenAI shared OpenAI’s best practices with PostgreSQL, offering a glimpse into the database usage of one of the most prominent unicorn companies.

https://www.pixelstech.net/article/1747708863-openai%3a-scaling-postgresql-to-the-next-level

Seventh-generation server hardware at Dropbox: our most efficient and capable architecture yet

Fourteen years ago, Dropbox took its first steps toward building its own hardware infrastructure—and as our product and user base has grown, so has our infrastructure. What started with just a handful of servers has evolved into one of the largest custom-built storage systems in the world. We've scaled from a few dozen machines to tens of thousands of servers with millions of drives.

That evolution didn’t happen by accident. It took years of iteration, close collaboration with suppliers, and a product-first mindset that treated infrastructure as a strategic advantage. Now we’re excited to share what’s next: the launch of our seventh-generation hardware platform, now featuring Crush, Dexter, and Sonic for our traditional compute, database, and storage workloads, and our newest GPU tiers, Gumby and Godzilla. To make this leap possible, we dramatically increased storage bandwidth, effectively doubled our available rack power, and introduced a next-gen storage chassis designed to even further minimize vibration and heat.

This generation represents our most efficient, capable, and scalable architecture yet—and it’ll help us as we continue to build and scale helpful AI products like Dropbox Dash. Below, we’ll walk you through how we designed the latest version of our server hardware as well as key lessons we’ll carry into generations to come.

https://dropbox.tech/infrastructure/seventh-generation-server-hardware

Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure

The Infrastructure team at Hugging Face is excited to share a behind-the-scenes look at the inner workings of Hugging Face's production infrastructure, which we’ve had the privilege of helping to build and maintain. Our team's dedication to designing and implementing a robust monitoring and alerting system has been instrumental in ensuring the stability and scalability of our platforms. We’re constantly reminded of the impact that our alerts have on our ability to identify and respond to potential issues before they become major incidents.

In this blog post, we’ll dive into the details of three mighty alerts that play their unique role in supporting our production infrastructure, and explore how they've helped us maintain the high level of performance and uptime that our community relies on.

https://huggingface.co/blog/infrastructure-alerting

rustfs

RustFS is high-performance distributed object storage software built with Rust, one of the most popular languages worldwide. Like MinIO, it offers simplicity, S3 compatibility, an open-source codebase, and support for data lakes, AI, and big data. It is released under the Apache license, which the project describes as better and more user-friendly than the licenses of other storage systems. Because it is built on Rust, RustFS aims to provide faster and safer distributed features for high-performance object storage.

https://github.com/rustfs/rustfs
