Reliability Report 📰 – Telegram

@ReliabilityReport

8 subscribers

26 links

A collaborative curated content site about Reliability Engineering #SRE #CloudNative #DevOps
https://reliability.re/contribute/

Reliability Report 📰

From https://geototti21.medium.com/slo-from-nothing-to-production-91b8d4270bd5

If you don’t know how to start introducing SLOs at work, this is a great example from Ioannis (@geototti21): his journey to bring SLOs into his organization with a clear path and framework. As he said: “Explain how SLOs can be an important internal tracking target that is tougher than your SLA. Also, mention how we can use the SLOs to offer better SLAs than ‘service uptime’ if we are asked.”

Shared by @pabluk via reliability.re

SLO — From Nothing to… Production

A practical “framework” to implement SLOs and how I prepared myself and my organisation.

172 views 21:06

Reliability Report 📰

From https://www.alldaydevops.com/2020-fallschedule

Hey, in case you missed it: tomorrow (Nov 12) the 2020 Fall edition of the @AllDayDevOps conference starts, with 24 hours of talks by 180 speakers from around the world. The event is held entirely online and registration is free. Take a look at the schedule… there’s even a dedicated SRE track!

Shared by @pabluk via reliability.re

All Day DevOps 2021 | Schedule

The All Day DevOps 2021 schedule containing 24 hours of non-stop sessions led by industry experts.

161 views 18:55

Reliability Report 📰

From https://landing.google.com/sre/workbook/chapters/non-abstract-design/

Non-Abstract Large System Design (NALSD) is a very useful and critical skill for SREs: “By breaking down software into logical components and placing these components into a production ecosystem with reliable infrastructure, we arrive at systems that provide reasonable and appropriate targets for data consistency, system availability, and resource efficiency.”

Shared by @pabluk via reliability.re

130 views 17:39

Reliability Report 📰

From https://driftctl.com/2020/11/24/infrastructure-drift

This article is the first outcome of a call for participation in a study on infrastructure drift that we launched at the last Paris SRE Meetup. As part of our work on ‘driftctl’, we are writing a report on how infrastructure drift can be a challenge for DevOps teams and how they address it. The goal is to share core problems and best practices with the community.
Here is a foretaste of this study, outlining some of the key facts we recorded.
When talking about infrastructure drift, you often get knowing glances and heated answers. Finding gaps in your infra between what you expected and what actually exists is a well-known and widespread issue bothering hundreds of teams around the globe. Facing impacts and consequences ranging from intensive toil to dangerous security threats, many teams are keenly aware of the issue and actively looking for solutions.
We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks.

Shared by @GeraldC13 via reliability.re

Why you should take care of infrastructure drift - driftctl

Infrastructure Drift is a major issue for DevOps teams, facing consequences ranging from intensive toil to dangerous security threats.

126 views 19:28

Reliability Report 📰

From https://www.gremlin.com/blog/a-guide-to-the-reliability-talks-at-aws-re-invent/

Top picks of reliability-focused talks at AWS re:Invent (virtual) from @Ana_M_Medina, a Sr. Chaos Engineer at @GremlinInc.

Shared by @GeraldC13 via reliability.re

A guide to the reliability talks at AWS re:Invent

Every year, we look forward to AWS re:Invent. There are always so many reasons to attend, but my top motivation is to learn. As re:Invent goes virtual this year, there are even more great talks happening and it can be hard to decide which to attend.

111 views 06:44

Reliability Report 📰

From https://sre.google/resources/practices-and-processes/training-site-reliability-engineers/

The best way to create and facilitate the adoption of an SRE culture in your organization is to have a training plan adapted to its size, maturity, and people’s experience. Chapter 1 of this @googlesre book is a good starting point: it has a matrix describing different use cases for organizations of any size. In chapter 3 you’ll find case studies for small and large organizations that can inspire new ideas for your team!

Shared by @pabluk via reliability.re

Google SRE - SRE course for site reliability engineers

Google's SRE training program empowers teams with SRE skills. This SRE training covers essential concepts for building and maintaining reliable systems.

110 views 16:49

Reliability Report 📰

From https://www.usenix.org/system/files/login/articles/login_winter16_11_beyer.pdf

“assigning a primary on-call to handle pager duty, while round-robin assigning tickets across the team. This setup frequently led to undesirable outcomes, as engineers couldn’t successfully undertake project work and ticket duty simultaneously.” If that looks like your team and you’re looking for ideas to manage toil, this article from @usenix ;login: magazine, shared on the @googlesre resources page https://sre.google/resources/, could help you identify interruptions and find a strategy adapted to your team.

Shared by @pabluk via reliability.re

239 views 07:26

Reliability Report 📰

From https://www.youtube.com/watch?v=2C2F5USR6N4&list=PLbRoZ5Rrl5lfLXUjFjS0mP1XzNzNZMhYN

Yay! The SREcon20 Americas talks are now available on YouTube 🎉 For more details on each talk, see the program here: https://www.usenix.org/conference/srecon20americas/program Enjoy 🍿 Thanks @SREcon and @usenix

Shared by @pabluk via reliability.re

SREcon20 Americas - The Secret Lives of SREs - Controlling the Costs of Coordination across Remote

The Secret Lives of SREs - Controlling the Costs of Coordination across Remote Teams

Laura Maguire, PhD

If you ask a group of engineers how they resolved a particularly difficult outage they typically talk about the dashboards that got pulled up, the logs…

282 views 21:32

Reliability Report 📰

From https://luet-lab.github.io/docs/about/

With the recent announcement of Sabayon Linux becoming Mocaccino OS, we know that Luet will be used as its package manager. This package manager sounds promising, with the ability to define your build and runtime dependencies on top of a container layer.

Shared by @tormath1 via reliability.re

Package manager built from containers

313 views 11:02

Reliability Report 📰

From https://kinsta.com/blog/google-cloud-vs-aws/

This long and thorough article gives you some elements to help you choose a cloud platform during your infrastructure design process.

Shared by @tormath1 via reliability.re

Google Cloud vs AWS (Comparing the Giants)

Thorough and data-rich comparison of two cloud computing giants, Google Cloud vs AWS. We'll analyze products & pros vs cons for your business

276 views 10:08

Reliability Report 📰

From https://techcrunch.com/2021/02/24/google-cloud-puts-its-kubernetes-engine-on-autopilot

Using GKE autopilot mode, you will have less to manage and more to play!

Shared by @tormath1 via reliability.re

tormath1 - Overview

Linux OS software engineer / IT volunteer at ISF (Engineers Without Borders France) - tormath1

58 views 23:32

Reliability Report 📰

From https://arstechnica.com/gadgets/2021/03/psa-linux-folks-stay-away-from-the-5-12-rc1-kernel/

Funny story about this release candidate of Linux 5.12.
TL;DR:

[…] swap files stopped working right.

Shared by @tormath1 via reliability.re

Torvalds warns the world: Don’t use the Linux 5.12-rc1 kernel

Please, please don't use cowboy kernels in production—especially not this one!

39 views 14:19

Reliability Report 📰

From https://increment.com/reliability/failure-is-okay/

Insightful article by @wiredferret for the latest issue of @incrementmag on how to change our mindset to accept failure in order to build resilient systems following risk reduction and harm mitigation patterns.

Shared by @pabluk via reliability.re

Everything is broken, and it’s okay – Increment: Reliability

Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.

47 views 10:55

Reliability Report 📰

From https://promcon.io/2021-online/schedule/

The @PromConIO schedule is available! The conference takes place online on the 3rd of May. Which talks do you want to attend? :)

Shared by @tormath1 via reliability.re

Schedule | PromCon Online 2021

PromCon, the conference about the Prometheus monitoring system and time series database

71 views 12:28

Reliability Report 📰

From https://www.contributing.today/

Don’t forget to join the virtual meetups of contributing.today for 2 interesting shows: one today, 21 April 2021, about Site Reliability Engineering with a great panel of SREs, and another next week about Chaos Engineering with @QuintessenceAnx from PagerDuty!

Shared by @pabluk via reliability.re

www.contributing.today

contributing.today - Monthly Open Source meetup

This monthly meetup is for sharing knowledge about all things contributing, maintaining, and using Open Source. We'll have interviews, panels, presentations. We aim to be welcoming for everyone, it doesn't matter if you're new to Open Source, interested,…

60 views 08:03

Reliability Report 📰

From https://azure.microsoft.com/en-us/blog/microsoft-acquires-kinvolk-to-accelerate-containeroptimized-innovation/

It’s also personal news for me as a (former) Kinvolk software engineer. Super happy, and we look forward to seeing the great things to come :D

Shared by @tormath1 via reliability.re

Microsoft Azure Blog

Microsoft acquires Kinvolk to accelerate container-optimized innovation | Microsoft Azure Blog

The ability to run Kubernetes anywhere, whether in the cloud or on-premises, has been a high priority for Azure customers looking to rapidly innovate, with increasing customer focus on the benefits of container-optimized workloads and operating systems, lean…

50 views 13:30

Reliability Report 📰

From https://www.hashicorp.com/blog/mitchell-s-new-role-at-hashicorp

Mitchell Hashimoto is stepping down from the HashiCorp exec team to become a full-time individual contributor.

Shared by @tormath1 via reliability.re

Mitchell's New Role at HashiCorp

Mitchell Hashimoto takes on a new individual contributor role at HashiCorp.

51 views 07:22

Reliability Report 📰

From https://blog.cloudflare.com/october-2021-facebook-outage/

A very concise and insightful explanation of BGP and Internet infrastructure from @Cloudflare’s perspective during the FB incident.

Shared by @pabluk via reliability.re

The Cloudflare Blog

Understanding how Facebook disappeared from the Internet

Today at 1651 UTC, we opened an internal incident entitled "Facebook DNS lookup returning SERVFAIL" because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our public status page we realized something…

24 views 16:35

Reliability Report 📰

From https://medium.com/cybelangel-product-engineering/recovering-corrupted-rabbitmq-data-by-reversing-its-storage-protocol-part-1-bed2501d0fa9

A very well-explained article by @edealir about RabbitMQ storage protocol internals and the journey to recover corrupted data from it!

Shared by @pabluk via reliability.re

Recovering corrupted RabbitMQ data by reversing its storage protocol (part 1)

This is the story of how we reversed the RabbitMQ storage protocol to mitigate the impact of an outage we faced at CybelAngel.

22 views 15:02

Reliability Report 📰

From https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-source/

This quite recent product from Grafana is now available as an open-source solution with a symbolic initial release v1.0.0 - congrats to them!

Shared by @tormath1 via reliability.re

Introducing Grafana OnCall OSS, on-call management for the open source community | Grafana Labs

Grafana OnCall is now open source for self-managed and on-premises deployments.

14 views 16:26