DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN registration: https://knd.gov.ru/license?id=67704b536aa9672b963777b3&registryType=bloggersPermission
My Dev Lessons From 2020

Kubernetes is to Borg what Frankenstein is to the Dalai Lama

When I left Google, I was sold on the whole containerized way of running things. Borg is light-years ahead of every other cluster orchestration project.

Borg doesn't let you do everything. It is designed to run specifically built applications that are containerized. You don't get Docker images with whatever OS stuff you feel like running that day. The OS is always Google's internal OS. You don't get access to whatever binaries you want to install. You don't get to use whatever security you want. Your RPC system is always going to be Stubby (gRPC internal to Google). Your cluster file system is going to be the only one allowed. Period.

Those limits are freeing. You simply need to have resources to run your jobs and deploy them. Your binaries are packaged up and you just need to say what is going to get run.

So naturally, I've used Kubernetes after I left.

Everything about Borg I liked is gone in Kubernetes. It is trying to solve everyone's problem and solves no one's problem.

It is easy to kill your jobs. It's hard to do things like update a single instance. Service meshes???? Really????

Helm? Great, I can kill all my cluster MySQL databases at the flick of my Helm config.

Security, what security? Oh, right, the bring my own model that is just crazy hard.

Need it to work with special cloud sidecars (like special identity services)? Well, that's going to be a fun thing.

Upgrades that change the config language so that your jobs won't run anymore. Perfect.....

And btw, love YAML over the Borg config language, NOT!

http://www.gophersre.com/2021/02/21/my-dev-lessons-from-2020
Linux Performance Checklists for SREs

Linux Perf Analysis in 60s
(https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55)

1. uptime ⟶ load averages
2. dmesg -T | tail ⟶ kernel errors
3. vmstat 1 ⟶ overall stats by time
4. mpstat -P ALL 1 ⟶ CPU balance
5. pidstat 1 ⟶ process usage
6. iostat -xz 1 ⟶ disk I/O
7. free -m ⟶ memory usage
8. sar -n DEV 1 ⟶ network I/O
9. sar -n TCP,ETCP 1 ⟶ TCP stats
10. top ⟶ check overview
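
The ten commands above can be wrapped in a small runner. This is only a sketch, not part of the original checklist. The sample count of 5 on the interval tools, the batch flags on `top`, and the helper names `CHECKLIST`, `run_step`, and `print_report` are all assumptions added here so that every command terminates on its own.

```python
import shutil
import subprocess

# (argv, purpose, tail) for each step. The trailing "5" sample counts and
# "top -b -n 1" are assumptions added so that each command exits by itself.
CHECKLIST = [
    (["uptime"], "load averages", None),
    (["dmesg", "-T"], "kernel errors", 10),  # tail=10 stands in for "| tail"
    (["vmstat", "1", "5"], "overall stats by time", None),
    (["mpstat", "-P", "ALL", "1", "5"], "CPU balance", None),
    (["pidstat", "1", "5"], "process usage", None),
    (["iostat", "-xz", "1", "5"], "disk I/O", None),
    (["free", "-m"], "memory usage", None),
    (["sar", "-n", "DEV", "1", "5"], "network I/O", None),
    (["sar", "-n", "TCP,ETCP", "1", "5"], "TCP stats", None),
    (["top", "-b", "-n", "1"], "check overview", None),
]


def run_step(argv, tail=None, timeout=15):
    """Run one command and return its output lines, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return ["(timed out)"]
    except OSError as exc:
        return [f"(could not run: {exc})"]
    # Fall back to stderr so permission errors (for example from dmesg) stay visible.
    lines = (proc.stdout or proc.stderr).splitlines()
    return lines[-tail:] if tail else lines


def print_report():
    """Walk the checklist in order and print each command's output."""
    for argv, purpose, tail in CHECKLIST:
        print(f"== {' '.join(argv)}  ({purpose})")
        lines = run_step(argv, tail=tail)
        print("\n".join(lines) if lines is not None else "(not installed, skipped)")
```

Calling `print_report()` on a host with sysstat installed walks the list in order. Missing tools are reported and skipped, not treated as failures.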

Linux Disk Checklist

1. iostat -xz 1 ⟶ any disk I/O? if not, stop looking
2. vmstat 1 ⟶ is this swapping? or, high sys time?
3. df -h ⟶ are file systems nearly full?
4. ext4slower 10 ⟶ (zfs*, xfs*, etc.) slow file system I/O?
5. bioslower 10 ⟶ if so, check disks
6. ext4dist 1 ⟶ check distribution and rate
7. biolatency 1 ⟶ if interesting, check disks
8. cat /sys/devices/…/ioerr_cnt ⟶ (if available) errors
9. smartctl -l error /dev/sda1 ⟶ (if available) errors

* Another short checklist. Won't solve everything. ext4slower/dist and bioslower/latency are from bcc/BPF tools.
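
Step 3 (are file systems nearly full?) can also be checked from a script. The sketch below uses Python's `shutil.disk_usage`. The 90% threshold and the function names are assumptions chosen for illustration, and the fraction is computed as used/total, so it can differ slightly from the Use% column that `df -h` prints.

```python
import shutil


def used_fraction(path):
    """Fraction (0.0 to 1.0) of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def nearly_full(paths, threshold=0.90):
    """Return (path, fraction) for each path whose file system is at or above threshold."""
    flagged = []
    for path in paths:
        fraction = used_fraction(path)
        if fraction >= threshold:
            flagged.append((path, fraction))
    return flagged
```

For example, `nearly_full(["/", "/var"])` returns the mount points worth a closer look before moving on to the latency tools.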

Linux Network Checklist

1. sar -n DEV,EDEV 1 ⟶ at interface limits? or use nicstat
2. sar -n TCP,ETCP 1 ⟶ active/passive load, retransmit rate
3. cat /etc/resolv.conf ⟶ it's always DNS
4. mpstat -P ALL 1 ⟶ high kernel time? single hot CPU?
5. tcpretrans ⟶ what are the retransmits? state?
6. tcpconnect ⟶ connecting to anything unexpected?
7. tcpaccept ⟶ unexpected workload?
8. netstat -rnv ⟶ any inefficient routes?
9. check firewall config ⟶ anything blocking/throttling?
10. netstat -s ⟶ play 252 metric pickup

* tcp* are from bcc/BPF tools.
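
For step 3, reading `/etc/resolv.conf` by eye is usually enough. When the check has to run across many hosts, a small parser helps. This is a minimal sketch: the function name is hypothetical and it handles only the `nameserver` and `search` keywords, ignoring `options` and everything else.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers = []
    search = []
    for raw in text.splitlines():
        line = raw.strip()
        # Lines starting with '#' or ';' are comments.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            # A later search line replaces an earlier one.
            search = fields[1:]
    return nameservers, search
```

Feeding it the file contents, for example `parse_resolv_conf(open("/etc/resolv.conf").read())`, makes it easy to flag hosts with no resolver configured or with an unexpected one.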

Linux CPU Checklist

1. uptime ⟶ load averages
2. vmstat 1 ⟶ system-wide utilization, run q length
3. mpstat -P ALL 1 ⟶ CPU balance
4. pidstat 1 ⟶ per-process CPU
5. CPU flame graph ⟶ CPU profiling
6. CPU subsecond offset heat map ⟶ look for gaps
7. perf stat -a -- sleep 10 ⟶ IPC, LLC hit ratio

* htop can do 1-4. I'm tempted to add execsnoop for short-lived processes (it's in perf-tools or bcc/BPF tools).
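
Step 1 can be approximated in code too. The sketch below normalizes the 1-minute load average by CPU count and compares it with the 15-minute value to see whether load is rising or falling. The function names and the returned keys are assumptions. Linux load averages also count tasks blocked in uninterruptible I/O, so a high value does not prove CPU saturation.

```python
import os


def load_summary(loadavg, cpus):
    """Summarize a (1min, 5min, 15min) load average tuple for a host with `cpus` CPUs."""
    one, _five, fifteen = loadavg
    if one > fifteen:
        trend = "rising"
    elif one < fifteen:
        trend = "falling"
    else:
        trend = "flat"
    return {"per_cpu_1m": one / cpus, "trend": trend}


def current_load_summary():
    """Same summary for the local host (os.getloadavg is available on Unix only)."""
    return load_summary(os.getloadavg(), os.cpu_count() or 1)
```

A `per_cpu_1m` well above 1.0 with a rising trend is a reason to continue with `vmstat` and `mpstat`. A falling trend suggests the event may already be over.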

https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
Troubleshooting Elasticsearch ILM: Common issues and fixes

https://www.elastic.co/blog/troubleshooting-elasticsearch-ilm-common-issues-and-fixes
How to pick the best observability solution for your organisation

There is a wealth of monitoring solutions available for engineers and developers to choose from, so how do you select the one most appropriate for you?

https://medium.com/contino-engineering/how-to-pick-the-best-observability-solution-for-your-organisation-e956f0bffb8e
git-switcher

Switch between your git profiles easily

https://github.com/TheYkk/git-switcher
Hello everyone!

We are Deutsche Telekom, the largest European telecom operator and one of the world's leading companies.

We are currently forming new project teams to develop our Network Automation & Orchestration program. The goal of the program is to build a company-wide infrastructure orchestrator that will manage the vast range of Deutsche Telekom's telecommunications equipment. The program is built on the Model Driven Orchestration concept, which relies on automated management of resource instances, services, and network functions using a state-transition model.

We already have more than 25 open positions for Network and RAN engineers, and we will be glad to welcome people of every level, from seasoned architects to juniors who still have everything to learn. We are also looking for a strong People Lead with networking experience who can lead the team. More details are available at these links: https://deutschetelekomitsolutions.ru/jobs/262/?sphrase_id=812
https://deutschetelekomitsolutions.ru/jobs/968/?sphrase_id=811
https://deutschetelekomitsolutions.ru/jobs/1116/?sphrase_id=813

We offer our employees an excellent benefits package: voluntary health insurance from day one, sports compensation, company-paid training, referral bonuses, a welcome bonus, a flexible schedule, the option of fully remote work, and a salary starting at 100,000 rubles net (the upper limit is effectively open).

We would be glad to talk with you and tell you more about the project and the company. If the position interests you and you want to join our team, write to mikhail.lymar@t-systems.com or to @Mlymar on Telegram.
OpenSearch

OpenSearch is a community-driven, open source fork of Elasticsearch and Kibana

https://github.com/opensearch-project/OpenSearch
Services; not Server

Gone are the days of yore when we named our servers Etsy, Betsy, and Momo, fed them fish, and cleaned their poop. Well, servers were our pets. Fast-forward to the world of Kubernetes, and each server is now a UUID. Is it beneficial anymore to continue observing an individual server?

https://blog.last9.io/services-not-server-observability
pipe

PipeCD provides a unified continuous delivery solution for multiple application kinds on multi-cloud that empowers engineers to deploy faster with more confidence. It is a GitOps tool that enables deployment operations by pull request on Git.

https://github.com/pipe-cd/pipe
HTTP/2: The Sequel is Always Worse

https://portswigger.net/research/http2
sftpgo

Fully featured and highly configurable SFTP server with optional FTP/S and WebDAV support, written in Go. Several storage backends are supported: local filesystem, encrypted local filesystem, S3 (compatible) Object Storage, Google Cloud Storage, Azure Blob Storage, SFTP.

https://github.com/drakkan/sftpgo
sso

The authentication and authorization system BuzzFeed developed to provide a secure, single sign-on experience for access to the many internal web apps used by our employees.

https://github.com/buzzfeed/sso
Deployment Strategies In Kubernetes

Learn which deployment strategies are available in Kubernetes and how to use them.

https://auth0.com/blog/deployment-strategies-in-kubernetes