Reliability Report 📰
A collaborative curated content site about Reliability Engineering #SRE #CloudNative #DevOps
https://reliability.re/contribute/
From https://geototti21.medium.com/slo-from-nothing-to-production-91b8d4270bd5

If you don’t know how to start introducing SLOs at work, this is a great example from Ioannis (@geototti21) of his journey to bring SLOs into his organization with a clear path and framework. As he says: “Explain how SLOs can be an important internal tracking target that is tougher than your SLA. Also, mention how we can use the SLOs to offer better SLAs than “service uptime” if we are asked.”
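
To make the "SLO tougher than your SLA" idea concrete, here is a minimal sketch of the arithmetic. The 99.5% SLA, the 99.9% internal SLO, and the 30-day window are hypothetical figures chosen for illustration; they are not taken from Ioannis's article.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


# Hypothetical targets, for illustration only.
sla_budget = allowed_downtime_minutes(99.5)  # what customers are promised
slo_budget = allowed_downtime_minutes(99.9)  # tougher internal target

print(f"SLA budget: {sla_budget:.1f} min, SLO budget: {slo_budget:.1f} min")
```

Under these assumptions the internal SLO leaves a much smaller downtime budget than the SLA, so the team is alerted well before a customer-facing commitment is at risk.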

Shared by @pabluk via reliability.re
From https://www.alldaydevops.com/2020-fallschedule

Hey, in case you missed it: the 2020 Fall edition of the @AllDayDevOps conference starts tomorrow (Nov 12), with 24 hours of talks by 180 speakers from around the world. The event is held entirely online and registration is free. Take a look at the schedule… there’s even a dedicated SRE track!

Shared by @pabluk via reliability.re
From https://landing.google.com/sre/workbook/chapters/non-abstract-design/

Non-Abstract Large System Design (NALSD) is a very useful and critical skill for SREs: “By breaking down software into logical components and placing these components into a production ecosystem with reliable infrastructure, we arrive at systems that provide reasonable and appropriate targets for data consistency, system availability, and resource efficiency.”

Shared by @pabluk via reliability.re
From https://driftctl.com/2020/11/24/infrastructure-drift

This article is the first outcome of a call for participation in a study on infrastructure drift that we launched at the last Paris SRE Meetup. As part of our work on ‘driftctl’, we are writing a report on how infrastructure drift can be a challenge for DevOps teams and how they address it. The goal is to share core problems and best practices with the community.
Here is a foretaste of this study, outlining some of the key facts we recorded.

When you talk about infrastructure drift, you often get knowing glances and heated answers. Gaps between the infrastructure you expected and what is actually running are a well-known and widespread issue bothering hundreds of teams around the globe. Facing impacts and consequences ranging from intensive toil to dangerous security threats, many teams are keenly aware of the issue and actively looking for solutions.
We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks.
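
As a rough illustration of what "drift" means here, the sketch below compares an expected configuration with an observed one and reports every difference. It is not how driftctl works; the `find_drift` helper, the `ABSENT` marker, and the sample attributes are all hypothetical.

```python
ABSENT = "<absent>"  # hypothetical marker for a key missing on one side


def find_drift(expected: dict, actual: dict) -> dict:
    """Map each drifting key to a (expected, actual) pair of values."""
    drift = {}
    for key in sorted(expected.keys() | actual.keys()):
        want = expected.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical resource attributes: declared in code vs. observed in the cloud.
expected = {"instance_type": "t3.small", "port": 443}
actual = {"instance_type": "t3.large", "port": 443, "public_ip": "enabled"}

for key, (want, have) in find_drift(expected, actual).items():
    print(f"{key}: expected {want}, found {have}")
```

In this example the instance type was changed and a public IP was enabled outside the declared configuration, which is the kind of gap the study describes.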

Shared by @GeraldC13 via reliability.re
From https://sre.google/resources/practices-and-processes/training-site-reliability-engineers/

The best way to create and facilitate the adoption of an SRE culture in your organization is to have a training plan adapted to its size, maturity, and people’s experience. Take a look at chapter 1 of this @googlesre book as a good starting point: it has a matrix describing different use cases for organizations of any size. In chapter 3 you’ll find case studies for small and large organizations that can inspire new ideas for your team!

Shared by @pabluk via reliability.re
From https://www.usenix.org/system/files/login/articles/login_winter16_11_beyer.pdf

“assigning a primary on-call to handle pager duty, while round-robin assigning tickets across the team. This setup frequently led to undesirable outcomes, as engineers couldn’t successfully undertake project work and ticket duty simultaneously.” If that looks like your team and you’re looking for ideas to manage toil, this article from @usenix ;login: magazine, shared on the @googlesre resources page https://sre.google/resources/, could help you identify interruptions and find a strategy adapted to your team.

Shared by @pabluk via reliability.re
From https://luet-lab.github.io/docs/about/

With the recent announcement of Sabayon Linux becoming Mocaccino OS, we know that Luet will be used as its package manager. This package manager sounds promising, with the ability to define your build and runtime dependencies on top of a container layer.

Shared by @tormath1 via reliability.re
From https://increment.com/reliability/failure-is-okay/

Insightful article by @wiredferret for the latest issue of @incrementmag on how to change our mindset to accept failure in order to build resilient systems following risk reduction and harm mitigation patterns.

Shared by @pabluk via reliability.re