DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

Roskomnadzor registration: https://knd.gov.ru/license?id=67704b536aa9672b963777b3&registryType=bloggersPermission
SRE Engagement Models

- Consulting
- Embedded
- Infra Team

https://certomodo.substack.com/p/sre-engagement-models
CloudFront and Terraform Essentials: How to Optimize Content Delivery

We are going to describe how CloudFront can be integrated with API Gateway to provide lower latency. We will also go through the attributes of the CloudFront resources in Terraform, including the ones needed to create the distribution and configure origins and behaviors.

https://medium.com/@xpiotrkleban/cloudfront-and-terraform-essentials-how-to-optimize-content-delivery-27c84e8aef04
Best practices for monitoring static web applications

https://www.datadoghq.com/blog/static-web-application-monitoring-best-practices
latency: a primer

hi! this article is aimed at folks who are interested in performance analysis or operations of software, and want to understand the impact on user experience. the examples will be centered around web applications and web services, but can be applied in other contexts as well.

https://igor.io/latency
Principles of Reliable Software Design

Reliable software design is a discipline that involves a careful balance of numerous principles, each of which is intended to ensure the development of high-quality software that meets the needs of users and stakeholders.

https://www.codereliant.io/principles-of-reliable-software-design-part-1
Failover

What is it? How does it work? When to use it and when not to use it?

https://blog.alexewerlof.com/p/failover
Solving challenges caused by Out Of Memory (OOM) Killer in Linux

Learn how out of memory events created challenges for our team, and how we solved them.

https://redpanda.com/blog/solve-out-of-memory-killer-events
acme-dns

A simplified DNS server with a RESTful HTTP API to provide a simple way to automate ACME DNS challenges.

https://github.com/joohoi/acme-dns
Building and operating a pretty big storage system called S3

Today, I am publishing a guest post from Andy Warfield, VP and distinguished engineer over at S3. I asked him to write this based on the keynote address he gave at USENIX FAST '23 that covers three distinct perspectives on scale that come along with building and operating a storage system the size of S3.

https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html
Bridging the gap between IaC and Schema Management

When we started building Atlas a couple of years ago, we noticed that there was a substantial gap between what was then considered state-of-the-art in managing database schemas and the recent strides from Infrastructure-as-Code (IaC) to managing cloud infrastructure.

In this post, we review that gap and show how Atlas – along with its Terraform provider – can bridge the two domains.

https://atlasgo.io/blog/2023/07/19/bridging-the-gap-between-iac-and-schema-management
A misadventure with Terraform Sets & PagerDuty Schedules

How Terraform's setunion() disregards ordering.

https://tratnayake.dev/a-misadventure-with-terraform-sets-pagerduty-schedules
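
The point about setunion() and ordering can be illustrated outside Terraform. The following is a minimal Python sketch, not Terraform code and not taken from the linked post: it assumes only that a set-based union discards the order of its inputs, and contrasts it with an order-preserving union such as an on-call rotation would need. The function names and user names are hypothetical.

```python
def set_union(*collections):
    """Union with set semantics: duplicates removed, input order discarded."""
    merged = set()
    for items in collections:
        merged |= set(items)
    # Sorting makes the result stable, but it is not the order of the inputs.
    return sorted(merged)


def ordered_union(*collections):
    """Union that keeps first-seen order across all inputs."""
    seen = set()
    result = []
    for items in collections:
        for item in items:
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result


# Hypothetical on-call users, listed in the intended rotation order.
primary = ["zoe", "adam"]
secondary = ["mia", "adam"]

print(set_union(primary, secondary))      # ['adam', 'mia', 'zoe']
print(ordered_union(primary, secondary))  # ['zoe', 'adam', 'mia']
```

If the order of entries matters, as it does for a schedule, the set-based result is the one to be wary of.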
Stop using IAM User Credentials with Terraform Cloud

I recently started using Terraform Cloud but discovered that the getting-started tutorial, which describes how to integrate it with Amazon Web Services (AWS), suggests using IAM user credentials. This is not ideal, as these credentials are long-lived and can lead to security issues.

https://www.wolfe.id.au/2023/07/17/stop-using-iam-user-credentials-with-terraform-cloud
Secure Your AWS Environments with Terraform, Vault, and Veeam

https://julia.hashnode.dev/secure-your-aws-environments-with-terraform-vault-and-veeam
sre-checklist

A checklist for anyone practicing Site Reliability Engineering.

https://github.com/bregman-arie/sre-checklist
Why bother with SLI and SLO?

Is there really any value in setting service level indicators and objectives?

https://blog.alexewerlof.com/p/why-bother-with-sli-and-slo
Traffic Jams in the Cloud: Are Overloads Sabotaging Your Application's Reliability?

https://blog.fluxninja.com/blog/traffic-jams-in-the-cloud-unveiling-the-true-enemy-of-reliability