From https://geototti21.medium.com/slo-from-nothing-to-production-91b8d4270bd5
If you don’t know how to start introducing SLOs at work, this is a great example from Ioannis (@geototti21) of his journey to bring SLOs into his organization with a clear path and framework. As he said: “Explain how SLOs can be an important internal tracking target that is tougher than your SLA. Also, mention how we can use the SLOs to offer better SLAs than ‘service uptime’ if we are asked.”
Shared by @pabluk via reliability.re
Medium
SLO — From Nothing to… Production
A practical “framework” to implement SLOs and how I prepared myself and my organisation.
From https://www.alldaydevops.com/2020-fallschedule
Hey, in case you missed it: the 2020 Fall edition of the @AllDayDevOps conference starts tomorrow (Nov 12), with 24 hours of talks by 180 speakers from around the world. The event is held entirely online and registration is free. Take a look at the schedule… there’s even a dedicated SRE track!
Shared by @pabluk via reliability.re
Alldaydevops
All Day DevOps 2021 | Schedule
The All Day DevOps 2021 schedule containing 24 hours of non-stop sessions led by industry experts.
From https://landing.google.com/sre/workbook/chapters/non-abstract-design/
Non-Abstract Large System Design (NALSD) is a very useful and critical skill for SREs: “By breaking down software into logical components and placing these components into a production ecosystem with reliable infrastructure, we arrive at systems that provide reasonable and appropriate targets for data consistency, system availability, and resource efficiency.”
Shared by @pabluk via reliability.re
From https://driftctl.com/2020/11/24/infrastructure-drift
This article is the first outcome of a call for participation in a study on infrastructure drift that we launched at the last Paris SRE Meetup. As part of our work on ‘driftctl’, we are writing a report on how infrastructure drift can be a challenge for DevOps teams and how they address it. The goal is to share core problems and best practices with the community.
Here is a foretaste of this study, outlining some of the key facts we recorded.
When talking about infrastructure drift, you often get knowing glances and heated answers. Gaps in your infra between what you expect it to be and what it actually is are a well-known and widespread issue bothering hundreds of teams around the globe. Facing impacts and consequences ranging from intensive toil to dangerous security threats, many teams are keenly aware of the issue and actively looking for solutions.
We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks.
Shared by @GeraldC13 via reliability.re
driftctl
Why you should take care of infrastructure drift - driftctl
Infrastructure Drift is a major issue for DevOps teams, facing consequences ranging from intensive toil to dangerous security threats.
From https://www.gremlin.com/blog/a-guide-to-the-reliability-talks-at-aws-re-invent/
Top picks of reliability-focused talks at AWS re:Invent (virtual) from @Ana_M_Medina, a Sr. Chaos Engineer at @GremlinInc
Shared by @GeraldC13 via reliability.re
Gremlin
A guide to the reliability talks at AWS re:Invent
Every year, we look forward to AWS re:Invent. There are always so many reasons to attend, but my top motivation is to learn. As re:Invent goes virtual this year, there are even more great talks happening and it can be hard to decide which to attend.
From https://sre.google/resources/practices-and-processes/training-site-reliability-engineers/
The best way to create and facilitate the adoption of an SRE culture in your organization is to have a training plan adapted to its size, maturity, and people’s experience. Chapter 1 of this @googlesre book is a good starting point: it has a matrix describing different use cases for organizations of any size. In chapter 3 you’ll find case studies for small and large organizations that can inspire new ideas for your team!
Shared by @pabluk via reliability.re
sre.google
Google SRE - SRE course for site reliability engineers
Google's sre training program empowers team with sre skills. This sre training covers essential concepts for building and maintaining reliable systems.
From https://www.usenix.org/system/files/login/articles/login_winter16_11_beyer.pdf
“assigning a primary on-call to handle pager duty, while round-robin assigning tickets across the team. This setup frequently led to undesirable outcomes, as engineers couldn’t successfully undertake project work and ticket duty simultaneously.” If that looks like your team and you’re looking for ideas to manage toil, this article from the @usenix ;login: magazine, shared on the @googlesre resources page https://sre.google/resources/, could help you identify interruptions and find a strategy adapted to your team.
Shared by @pabluk via reliability.re
From https://www.youtube.com/watch?v=2C2F5USR6N4&list=PLbRoZ5Rrl5lfLXUjFjS0mP1XzNzNZMhYN
Yay! SREcon20 Americas talks are ready and available on YouTube 🎉 For more details on each talk, see the program here: https://www.usenix.org/conference/srecon20americas/program Enjoy 🍿 Thanks @SREcon and @usenix
Shared by @pabluk via reliability.re
YouTube
SREcon20 Americas - The Secret Lives of SREs - Controlling the Costs of Coordination across Remote
The Secret Lives of SREs - Controlling the Costs of Coordination across Remote Teams
Laura Maguire, PhD
If you ask a group of engineers how they resolved a particularly difficult outage they typically talk about the dashboards that got pulled up, the logs…
From https://luet-lab.github.io/docs/about/
With the recent announcement of Sabayon Linux becoming Mocaccino OS, we know that Luet will be used as its package manager. This package manager sounds promising, with the ability to define your build / runtime dependencies on top of a container layer.
Shared by @tormath1 via reliability.re
Luet
About Luet
Package manager built from containers
From https://kinsta.com/blog/google-cloud-vs-aws/
In this long and thorough article, you’ll find some elements to help you choose a cloud platform during your infrastructure design process.
Shared by @tormath1 via reliability.re
Kinsta®
Google Cloud vs AWS (Comparing the Giants)
Thorough and data-rich comparison of two cloud computing giants, Google Cloud vs AWS. We'll analyze products & pros vs cons for your business
From https://techcrunch.com/2021/02/24/google-cloud-puts-its-kubernetes-engine-on-autopilot
Using GKE autopilot mode, you will have less to manage and more to play!
Shared by @tormath1 via reliability.re
GitHub
tormath1 - Overview
Linux OS software engineer / IT volunteer at ISF (Engineers Without Borders France) - tormath1
From https://arstechnica.com/gadgets/2021/03/psa-linux-folks-stay-away-from-the-5-12-rc1-kernel/
Funny story about this release candidate of Linux 5.12.
TL;DR:
[…] swap files stopped working right.
Shared by @tormath1 via reliability.re
Ars Technica
Torvalds warns the world: Don’t use the Linux 5.12-rc1 kernel
Please, please don't use cowboy kernels in production—especially not this one!
From https://increment.com/reliability/failure-is-okay/
Insightful article by @wiredferret for the latest issue of @incrementmag on how to change our mindset to accept failure in order to build resilient systems following risk reduction and harm mitigation patterns.
Shared by @pabluk via reliability.re
Increment
Everything is broken, and it’s okay – Increment: Reliability
Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.
From https://promcon.io/2021-online/schedule/
The @PromConIO schedule is available! The conference takes place online on the 3rd of May. Which talks do you want to attend? :)
Shared by @tormath1 via reliability.re
promcon.io
Schedule | PromCon Online 2021
PromCon, the conference about the Prometheus monitoring system and time series database
From https://www.contributing.today/
Don’t forget to join the virtual meetups of contributing.today for 2 interesting shows! One is today, 21 April 2021, about Site Reliability Engineering with a great panel of SREs, and another is next week about Chaos Engineering with @QuintessenceAnx from PagerDuty!
Shared by @pabluk via reliability.re
www.contributing.today
contributing.today - Monthly Open Source meetup
This monthly meetup is for sharing knowledge about all things contributing, maintaining, and using Open Source. We'll have interviews, panels, presentations. We aim to be welcoming for everyone, it doesn't matter if you're new to Open Source, interested,…
From https://azure.microsoft.com/en-us/blog/microsoft-acquires-kinvolk-to-accelerate-containeroptimized-innovation/
This is also personal news for me as a (former) Kinvolk software engineer. Super happy, and looking forward to seeing the great things to come :D
Shared by @tormath1 via reliability.re
Microsoft Azure Blog
Microsoft acquires Kinvolk to accelerate container-optimized innovation | Microsoft Azure Blog
The ability to run Kubernetes anywhere, whether in the cloud or on-premises, has been a high priority for Azure customers looking to rapidly innovate, with increasing customer focus on the benefits of container-optimized workloads and operating systems, lean…
From https://www.hashicorp.com/blog/mitchell-s-new-role-at-hashicorp
Mitchell Hashimoto is stepping down from the HashiCorp exec team to become a full-time individual contributor.
Shared by @tormath1 via reliability.re
HashiCorp
Mitchell's New Role at HashiCorp
Mitchell Hashimoto takes on a new individual contributor role at HashiCorp.
From https://blog.cloudflare.com/october-2021-facebook-outage/
A very concise and insightful explanation of BGP and Internet infrastructure from @Cloudflare’s perspective during the FB incident.
Shared by @pabluk via reliability.re
The Cloudflare Blog
Understanding how Facebook disappeared from the Internet
Today at 1651 UTC, we opened an internal incident entitled "Facebook DNS lookup returning SERVFAIL" because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our public status page we realized something…
From https://medium.com/cybelangel-product-engineering/recovering-corrupted-rabbitmq-data-by-reversing-its-storage-protocol-part-1-bed2501d0fa9
A very well-explained article by @edealir about RabbitMQ storage protocol internals and the journey to recover corrupted data from it!
Shared by @pabluk via reliability.re
Medium
Recovering corrupted RabbitMQ data by reversing its storage protocol (part 1)
This is the story of how we reversed the RabbitMQ storage protocol to mitigate the impact of an outage we faced at CybelAngel.
From https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-source/
This quite recent product from Grafana is now available as an open-source solution with a symbolic initial release v1.0.0 - congrats to them!
Shared by @tormath1 via reliability.re
Grafana Labs
Introducing Grafana OnCall OSS, on-call management for the open source community | Grafana Labs
Grafana OnCall is now open source for self-managed and on-premises deployments.