CatOps
DevOps and other issues by Yurii Rochniak (@grem1in) - SRE @ Preply && Maksym Vlasov (@MaxymVlasov) - Engineer @ Star. Opinions are our own.

We do not post ads including event announcements. Please, do not bother us with such requests!
An article about ElasticSearch written in the style of "a letter to myself from when I was just starting to work with it".

The article covers indices, shards, resource consumption, and troubleshooting.

#observability #elk #elasticsearch
A list of awesome Prometheus alerts, grouped by the kind of system they cover. I.e., there are alerts for databases, proxies and load balancers, storage, etc.

You can just copy-paste these into your monitoring code. Just keep in mind that the thresholds may be different for your particular case!

#observability #monitoring #prometheus
Amazon Managed Service for Grafana now supports Grafana Enterprise upgrade, Grafana version 7.5, Open Distro for Elasticsearch integration, and AWS Billing reports

You can upgrade to Grafana Enterprise with a 30-day trial to enable enterprise data sources.

Beginning April 16th, 2021, customers using AMG will receive a 90-day free trial for five free users per account, with additional usage charges.

AMG is currently available in the US East (N. Virginia) and Europe (Ireland) regions.

#aws #observability
Gatus is a health dashboard written in Go.

It has minimalistic configuration and allows you to set multiple conditions to label an endpoint as "healthy".

Also, you can host it on your own inside your private network. So, if your security requirements don't allow external health checkers into the perimeter, this could be a good way to go.

#toolz #observability
A small, neat write-up on lessons learned from incident response.

Key takeaways:
- Declare incidents for smaller things too. Distinguishing between SEV1 and SEV3 incidents helps you track system health better. Besides, a bunch of smaller problems may add up to a critical failure, and such problems are usually easy to fix one by one.
- Decrease the time between the incident and the postmortem analysis. The analysis will be much more accurate when what happened is still fresh in memory.
- Alert on symptoms, not causes. Alert only when your users (external or internal) have issues, not when CPU utilization is high.

#observability
Recently, I asked my subscribers what topics are interesting to them and a few people mentioned observability.

That’s funny, ‘coz yesterday I accidentally bumped into a great series of articles on setting SLAs for your products by Alex Ewerlöf!

- Calculating composite SLA - truly outstanding read!
- Some practical advice when setting SLA - notice it says SLA, not SLO, so there are some business-related tips in this article as well. However, the core is technical, ofc.
- Calculating the SLA of a system behind a CDN - I haven’t read this one yet. But given the quality of the previous two, I expect this one to be great as well!

tl;dr for the first article in the list:

for components in series, multiply the availabilities; for components in parallel, multiply the unavailabilities


I would personally also add that when you try to set a “full” SLO(A) for your service, that is also a composite SLO(A), and you should treat it as a serial one. For example, if you have a 99.8% error rate SLO and a 99.1% latency SLO, the “overall” SLO would be 0.998 × 0.991 × 100% ≈ 98.9%.
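Here is a minimal sketch of that arithmetic in Python (the replica availabilities in the parallel example are made up for illustration):

def serial_sla(*availabilities):
    """Components in series: a request needs all of them, so multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel_sla(*availabilities):
    """Redundant components in parallel: a request fails only if all of them fail,
    so multiply the unavailabilities and subtract from 1."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= 1.0 - a
    return 1.0 - unavailability

# The "overall" SLO from the example above: error-rate SLO in series with the latency SLO.
print(f"{serial_sla(0.998, 0.991):.1%}")   # -> 98.9%

# Two redundant replicas with 99% availability each.
print(f"{parallel_sla(0.99, 0.99):.2%}")   # -> 99.99%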

That’s not only good to know, but you may also want to write your marketing materials differently. There is a difference between:

> We guarantee a 99.8% SLO on the 5xx error rate and a 99.1% SLO on requests not taking longer than X milliseconds.

And

> We guarantee the 98.9% availability of our systems.

I’m not a marketing person, though, and I don’t know which wording is better. What I do know is that “nines don’t matter if your users are unhappy”.

#observability #slo #sla
Logging is one of the three pillars of observability. One of the Palantir Blog posts outlines some pretty nice ideas that may help you build a better observability solution.

Structured logging. Of course, all of us know that logs are easier to analyze if they are structured consistently across applications. Palantir also improved the way the "message" field usually looks:


{
  "message": "Processing event from user",
  "params": {
    "event": "foo",
    "user": "bar"
  },
  "level": "DEBUG",
  "logger": "com.logger.name",
  "time": "2019-07-28T09:31:04.46165-07:00"
}


Note that these structured logs avoid a common logging practice of inserting parameters into the log message via string formatting, which creates variable messages ("Processing event foo from user bar"). That approach prevents you from building precise log queries, because every message is slightly different. With structured logs, you can search for an exact message match and add filters on other fields (e.g. params.user == "bar").
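A minimal sketch of the same idea in Python, using the standard logging module with a small JSON formatter (the formatter below is my illustration, not Palantir's actual tooling; the field names just mirror the example above):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as a JSON object with a constant message and separate params."""
    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "params": getattr(record, "params", {}),
            "level": record.levelname,
            "logger": record.name,
            "time": self.formatTime(record),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("com.logger.name")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

# The message stays constant; the variable parts go into "params".
logger.debug("Processing event from user", extra={"params": {"event": "foo", "user": "bar"}})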

Logs from 3rd-party services. Usually, logs from 3rd-party components do not fit your unified logging structure. To overcome this, Palantir engineers built tooling that analyzes Golang source code and creates regular expressions which convert the original messages into the expected structure.
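For illustration, here is roughly how such a regex-based conversion could look in Python (the raw log line and the pattern are made up; Palantir's actual tooling generates such patterns from the Go source code automatically):

import json
import re

# A hypothetical raw line from a 3rd-party Go component that doesn't follow our schema.
raw = 'level=info msg="Processing event foo from user bar"'

# A regular expression with named groups that pulls the variable parts out of the message.
pattern = re.compile(r'level=(?P<level>\w+) msg="Processing event (?P<event>\S+) from user (?P<user>\S+)"')

match = pattern.match(raw)
if match:
    structured = {
        "message": "Processing event from user",  # the constant part becomes the message
        "params": {"event": match.group("event"), "user": match.group("user")},
        "level": match.group("level").upper(),
    }
    print(json.dumps(structured))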

You can read more about it in the blog post itself. I hope these ideas will help you to build a better observability system!

P.S. Thanks to Valerii Tatarin for this post. If you'd like to share something with the community too, feel free to reach out to @MaxymVlasov or @grem1in.

#logging #observability
Here, the FireHydrant company writes that the MTTR (mean time to recovery) metric is overrated.

I tend to agree with their justification: not all incidents are the same, not all systems are equally critical, and the cause (or rather, the set of causes) is not always easy to mitigate.

However, it’s still important to somehow measure the quality of your incident response. They propose two metrics instead:

- Mean time to detection - basically, how long it takes for you to spot a problem. I would also ask an additional question here: was a problem reported by your observability systems, or by your colleagues or customers?

- Mean time to retrospective - or mean time to postmortem, if you wish. I think this is a good one. It’s true that some details fade away from memory, so the sooner you do a retrospective, the better! Also, if you actually measure this time, it can help you prioritize meetings and treat postmortems as a higher priority compared to, for example, a team standup.

#observability
​​A new episode of our voice chat is here (in Ukrainian)!

This time we talked a little bit about observability and deploying AWS Lambdas. You can find it on:
- YouTube
- Spotify
- Apple Podcasts
- Google Podcasts

Enjoy!

Also, I used AI-based voice enhancement by Adobe in this episode, so let me know in the chat whether it sounds better this way. Also, if you want to participate in the upcoming voice chats, do not hesitate to join! We usually do those every other Thursday at 20:00 Kyiv time. The next one will probably be next week.

P.S. If you want even more than that, DOU is currently looking for hosts for their own DevOps-themed podcast! If you want to be one, make sure to fill out that form!

#voice_chat #говорилка #observability #serverless
VictoriaMetrics has released the first iteration of their log platform!

Here’s the info:

The first release of VictoriaLogs!

- Release page on GitHub
- Documentation
- Docker demos
- Helm Chart
- Benchmark for VictoriaLogs

Since I’m not a user, it’s hard for me to provide feedback right away. Yet, if you use it (or want to try it) and would like to give the maintainers feedback, do not hesitate to submit bug reports and feature requests on GitHub.

#victoriametrics #logs #observability
Yet another observability post: Datadog Alternative: When to Move Out?

This article describes a fairly simple experiment: what it would cost to run Datadog compared to Prometheus & Thanos at different scales.

Although the scope of the experiment was somewhat limited (no logs, no APM), it provides some numbers that you can use while evaluating different observability solutions.

The bottom line is: as your technological footprint grows, the Datadog price grows faster than the Prometheus stack's. Which is true for any SaaS solution compared to self-hosted, I believe.

However, another important consideration is that free software is not "free". You cannot just replace a vendor with an open-source tool and call it a day: you need engineering time to support and maintain those platforms. Therefore, at a smaller scale it might be much more effective to use SaaS (especially if you don't yet have experts in a given domain in your company).

#observability
Kubernetes: tracing requests with AWS X-Ray, and Grafana data source is a step-by-step guide on how to set up tracing in your EKS cluster using AWS X-Ray by Arseniy Zinchenko - a member of the Ukrainian DevOps community.

Also, make sure to subscribe to his Substack! He posts new things quite often, and I have no idea where he finds the time and willpower to do so 😅

#aws #kubernetes #observability
People often say that Observability is a data problem. Although that sounds intuitively correct, I cannot say that I fully understood how data engineering approaches could be applied to observability systems.

This article about Wide Events clarified things for me a bit. Indeed, if every event that happens in the system is just an object with some value and useful metadata, the distinction between metrics, logs, and traces becomes less relevant - it's all events now!
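For illustration only (the field names below are hypothetical, not taken from the article or from Meta's tooling), a single wide event could be one flat record that carries both the measurement and all the context you might want to slice by:

import json
import time

# A hypothetical "wide event": one record per occurrence, no split into metrics/logs/traces.
wide_event = {
    "timestamp": time.time(),
    "event": "checkout_request",
    "duration_ms": 183,          # what a metric would capture
    "status": "error",           # what a log line would mention
    "trace_id": "abc123",        # what a trace would carry
    "region": "eu-central-1",
    "user_tier": "free",
}

# Emit it as one structured record; metrics, logs, and traces can be derived downstream.
print(json.dumps(wide_event))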

Apparently, this is how observability is done at Meta, according to the author, and apparently people at Meta like it. I've never worked for Meta, so I don't know what they really use there or whether it's better than the tools available to us mere mortals.

However, this is an interesting concept, and it would be wonderful to see similar projects that are not internal to the Big Tech companies.

#observability
A case study from Grammarly on their migration from Graphite to VictoriaMetrics, which helped them to improve DevEx as well as save money.

Unfortunately, there are not many technical details in this article, but I still want to share it, ‘coz I’ve rarely seen posts about VictoriaMetrics in production. So, maybe you are the one who can write something on the matter?

P.S. Do you, folks, even remember Graphite?

#observability