linstor-server
https://github.com/LINBIT/linstor-server
High-performance software-defined block storage for containers, cloud, and virtualisation. Fully integrated with Docker, Kubernetes, OpenStack, Proxmox, etc.
RedisInsight
https://github.com/RedisInsight/RedisInsight
RedisInsight is a visual tool that provides capabilities to design, develop and optimize your Redis application. Query, analyse and interact with your Redis data.
ScratchDB
https://github.com/scratchdata/ScratchDB
Scratch is an open-source alternative to BigQuery, Redshift, and Snowflake. Runs on ClickHouse.
Lessons learned from writing a Terraform Provider
https://medium.com/@abagayev/lessons-learned-from-writing-a-terraform-provider-62412b79a997
terraform-provider-namecheap
https://github.com/namecheap/terraform-provider-namecheap
A Terraform Provider for Namecheap domain DNS configuration.
Argo Workflows - Proven Patterns from Production
https://hodgkins.io/argo-workflow-proven-patterns-from-production
Argo Workflows provides an excellent platform for infrastructure automation, and has replaced Jenkins as my go-to tool for running scheduled or event-driven automation tasks.
In growing my experience with Argo Workflows, I’ve killed clusters, broken workflows and generally made a mess of things. I’ve also built a lot of workflows that needed refactoring as they became difficult to maintain.
This blog post aims to share some of the lessons I’ve learned, and some of the patterns I’ve developed, to help you avoid the same mistakes I’ve made.
Top 10 common Dockerfile linting issues
https://depot.dev/blog/dockerfile-linting-issues
We've added the ability to lint Dockerfiles on demand in Depot. This post covers the top 10 most common Dockerfile linting issues we've seen flowing through Depot.
Scaling Elasticsearch by Cleaning the Cluster State
https://sematext.com/blog/elasticsearch-scaling-cluster-state
We often get questions like:
- How much data can I put in an Elasticsearch cluster?
- How many nodes can an Elasticsearch cluster have?
- What’s the biggest cluster that you’ve seen?
And while the 14-year-old in me is proud to say that we’ve done 24/7 support for clusters of 1000+ nodes holding many PB of data, I am quick to add that:
1. It doesn’t mean it’s a good idea to have clusters that big.
2. Such generic questions deserve more nuanced answers, which is exactly what this blog post provides. It applies to OpenSearch as well as Elasticsearch and, for the most part, to Solr (where the cluster state is stored in ZooKeeper).
Learning From Google SRE Team (part-1)
https://www.codereliant.io/20-sre-lessons-from-google-part1
In this blog post, we aim to expand on the first 5 lessons shared by Google's Site Reliability Engineering team, offering a closer look at practical implementation examples.
DevOps&SRE Library
SRE Interview Prep Plan (Week 2) This week is dedicated to providing you with the skills and knowledge to automate routine tasks, create scripts to solve complex problems, and manage infrastructure as code. As we look at scripting languages like Python and…
SRE Interview Prep Plan (Week 3)
https://www.codereliant.io/sre-interview-prep-plan-week-3
This week, we're taking another significant step forward as we get into the critical stack of monitoring and alerting. Now, it's time to equip yourself with the knowledge and tools needed to keep an eye on systems, analyze performance, and respond quickly to any issues that may come up.
The costs of microservices
https://robertovitillo.com/costs-of-microservices
The microservices architecture adds more moving parts to the overall system, and this doesn’t come for free. The cost of fully embracing microservices is only worth paying if it can be amortized across dozens of development teams.
Retries, Backoff and Jitter
https://www.codereliant.io/retries-backoff-jitter
In distributed systems, failures and latency issues are inevitable. Services can fail due to overloaded servers, network issues, bugs, and various other factors. As engineers building distributed systems, we need strategies to make our services robust and resilient in the face of such failures. One useful technique is using retries.
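As a rough illustration of the technique named in the title, here is a minimal sketch of retries with exponential backoff and "full jitter". It is not taken from the linked post: the `retry` helper, its parameter names, and the default values are all assumptions chosen for the example.

```python
import random
import time


def retry(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Hypothetical helper for illustration. Between failed calls it waits a
    random time in [0, min(cap, base * 2**attempt)] seconds ("full jitter"),
    so many clients retrying at once do not all hit the server together.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a real client would catch only transient errors
            last_error = exc
            if attempt == attempts - 1:
                break  # no point sleeping after the final attempt
            ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
            sleep(random.uniform(0, ceiling))  # jitter: pick a random delay up to the ceiling
    raise last_error
```

A caller would wrap a single network request, for example `retry(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for any call that can fail transiently. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.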
Prometheus and centralized storage: When you need it, how it works, and what Mimir is
https://blog.palark.com/prometheus-centralized-storage-mimir
A guide to post-mortem meetings and how we run them at incident.io
https://incident.io/hubs/post-mortem/a-guide-to-post-mortem-meetings
A Comprehensive Guide to Testing in Terraform: Keep your tests, validations, checks, and policies in order
https://mattias.engineer/posts/terraform-testing-and-validation
This post discusses testing and validation for infrastructure-as-code (IaC) with HashiCorp Terraform. The insights and ideas presented here can surely be extended to IaC in general.
Elevating CloudWatch Logs: Smart Alerts with Chatbot, SNS, and Lambda
https://medium.com/@louis-fiori/cloudwatch-logs-enhanced-alerts-a50ea08d0845
From AI to sustainability, why our latest data centers use 400G networking
https://dropbox.tech/infrastructure/from-ai-to-sustainability-why-our-latest-data-centers-use-400g-networking
To meet the bandwidth requirements of new and future AI workloads, and to stay committed to our sustainability goals, the Dropbox networking team recently designed and launched our first data center architecture using highly efficient, cutting-edge 400 gigabit per second (400G) Ethernet technology.
gitness
https://github.com/harness/gitness
Gitness is an open source development platform packed with the power of code hosting and automated DevOps pipelines.