DevOps & SRE notes
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All ways to support https://telegra.ph/How-support-the-channel-02-19
A small reminder: Ingress-NGINX will be retired soon (in less than two weeks), so consider moving to the Gateway API instead.
The original article is behind a paywall.

TL;DR: an Amazon service was taken down by an AI coding agent.

Amazon’s cloud unit has suffered at least two outages due to errors involving its own AI tools, leading some employees to raise doubts about the US tech giant’s push to roll out these coding assistants.

Amazon Web Services experienced a 13-hour interruption to one system used by its customers in mid-December after engineers allowed its Kiro AI coding tool to make certain changes, according to four people familiar with the matter.

The people said the agentic tool, which can take autonomous actions on behalf of users, determined that the best course of action was to “delete and recreate the environment”.

Amazon posted an internal postmortem about the “outage” of the AWS system, which lets customers explore the costs of its services.

Multiple Amazon employees told the FT that this was the second occasion in recent months in which one of the group’s AI tools had been at the centre of a service disruption.

“We’ve already seen at least two production outages [in the past few months],” said one senior AWS employee. “The engineers let the AI [agent] resolve an issue without intervention. The outages were small but entirely foreseeable.”

AWS, which accounts for 60 per cent of Amazon’s operating profits, is seeking to build and deploy AI tools including “agents” capable of taking actions independently based on human instructions.

Like many Big Tech companies, it is seeking to sell this technology to outside customers. The incidents highlight the risk that these nascent AI tools can misbehave and cause disruptions.

Amazon said it was a “coincidence that AI tools were involved” and that “the same issue could occur with any developer tool or manual action”.

“In both instances, this was user error, not AI error,” Amazon said, adding that it had not seen evidence that mistakes were more common with AI tools.

The company said the incident in December was an “extremely limited event” affecting only a single service in parts of mainland China. Amazon added that the second incident did not have an impact on a “customer facing AWS service”.

Neither disruption was anywhere near as severe as a 15-hour AWS outage in October 2025 that forced multiple customers’ apps and websites offline — including OpenAI’s ChatGPT.

Employees said the group’s AI tools were treated as an extension of an operator and given the same permissions. In these two cases, the engineers involved did not require a second person’s approval before making changes, as would normally be the case.

Amazon said that by default its Kiro tool “requests authorisation before taking any action” but said the engineer involved in the December incident had “broader permissions than expected — a user access control issue, not an AI autonomy issue”.

AWS launched Kiro in July. It said the coding assistant would advance beyond “vibe coding” — which allows users to quickly build applications — to instead write code based on a set of specifications.

The group had earlier relied on its Amazon Q Developer product, an AI-enabled chatbot, to help engineers write code. This was involved in the earlier outage, three of the employees said.

Some Amazon employees said they were still sceptical of AI tools’ utility for the bulk of their work given the risk of error. They added that the company had set a target for 80 per cent of developers to use AI for coding tasks at least once a week and was closely tracking adoption.

Amazon said it was experiencing strong customer growth for Kiro and that it wanted customers and employees to benefit from efficiency gains.

“Following the December incident, AWS implemented numerous safeguards”, including mandatory peer review and staff training, Amazon added.

src: https://www.ft.com/content/00c282de-ed14-4acd-a948-bc8d6bdb339d
AWS Cost Optimization Game Day — a hands-on, interactive session focused on improving cloud efficiency and reducing costs in real-world scenarios.

You’ll collaborate, analyze architectures, uncover cost-saving opportunities, and compete in a fun, gamified environment.

Ready to optimize and win?
Let’s play smart with AWS!

When: Wednesday, Mar 11 · 4:30 PM to 7:30 PM GMT+2
Language: English

Registration link is here
Understanding how many pods your infrastructure can actually support is crucial for reliability. This overview breaks down the nuances of Kubernetes cluster capacity and resource allocation.
https://dnastacio.medium.com/kubernetes-cluster-capacity-d96d0d82b380
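The core question the article digs into — how many pods a node can actually hold — boils down to a minimum over several limits. A minimal sketch with my own illustrative numbers (not taken from the article): capacity is bounded by the kubelet's max-pods setting and by allocatable CPU/memory divided by per-pod requests.

```python
# Illustrative sketch (not from the article): how many pods fit on one node.
# A node's pod capacity is the minimum of the kubelet's max-pods limit and
# what allocatable CPU / memory can hold given per-pod requests.

def pods_per_node(alloc_cpu_m, alloc_mem_mi, req_cpu_m, req_mem_mi, max_pods=110):
    """CPU in millicores, memory in MiB. max_pods defaults to the
    kubelet's common default of 110."""
    by_cpu = alloc_cpu_m // req_cpu_m
    by_mem = alloc_mem_mi // req_mem_mi
    return min(max_pods, by_cpu, by_mem)

# Example: a 4-vCPU / 16 GiB node with ~5% reserved for system daemons,
# running pods that request 250m CPU and 512Mi memory.
print(pods_per_node(3800, 15564, 250, 512))  # CPU is the bottleneck: 15 pods
```

In real clusters, remember that "allocatable" already subtracts kube-reserved, system-reserved, and eviction thresholds — the article covers these nuances in detail.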
As announced November 2025, Kubernetes will retire Ingress-NGINX in March 2026. Despite its widespread usage, Ingress-NGINX is full of surprising defaults and side effects that are probably present in your cluster today. This blog highlights these behaviors so that you can migrate away safely and make a conscious decision about which behaviors to keep. This post also compares Ingress-NGINX with Gateway API and shows you how to preserve Ingress-NGINX behavior in Gateway API. The recurring risk pattern in every section is the same: a seemingly correct translation can still cause outages if it does not consider Ingress-NGINX's quirks.

https://kubernetes.io/blog/2026/02/27/ingress-nginx-before-you-migrate/
Although Ingress-NGINX is still maintained and receiving security updates (e.g. controller-v1.15.0), it's time to start migrating to the Gateway API. ingress2gateway can help with that.
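To get a feel for what the translation looks like, here is a hypothetical minimal Ingress and the roughly equivalent Gateway API HTTPRoute. The names, hostname, and the demo-gateway parentRef are made up for illustration; ingress2gateway generates candidates like this from your live resources.

```yaml
# Hypothetical example of an Ingress-NGINX resource and its translation.
# Before: a plain Ingress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo
                port:
                  number: 80
---
# After: the roughly equivalent Gateway API HTTPRoute.
# The "demo-gateway" parentRef is an assumption — you must create a
# Gateway yourself; the tool does not know your listener layout.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo
spec:
  parentRefs:
    - name: demo-gateway
  hostnames:
    - demo.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: demo
          port: 80
```

If I recall the CLI correctly, something like `ingress2gateway print --providers ingress-nginx` reads the Ingresses in your current kubeconfig context and prints candidates like the above — verify the flags against the project's README before relying on them.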
Looking for a hosting platform to practice with Linux, Kubernetes, etc.? Register using my referral link on DigitalOcean and get $200 in credit for 60 days. By registering through my referral link, you also support this Telegram channel.

👉 Register
Short-lived public TLS certificates are our future, with a 46-day maximum validity by 2029.

https://knowledge.digicert.com/alerts/public-tls-certificates-199-day-validity
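Operationally, this means renewal has to be automated, and the renewal interval shrinks with each validity reduction. A back-of-envelope sketch — my own numbers plus the common ACME-style renew-at-two-thirds-of-lifetime convention, not figures from the DigiCert post:

```python
# Back-of-envelope sketch (illustrative, not from the linked article):
# how often a client must renew if it renews at ~2/3 of certificate
# lifetime, a common ACME client convention.
from math import floor

def renewal_interval_days(validity_days, renew_fraction=2 / 3):
    return floor(validity_days * renew_fraction)

# Roughly the planned maximum-validity reduction schedule.
for validity in (398, 199, 100, 46):
    print(f"{validity}-day cert -> renew every {renewal_interval_days(validity)} days")
```

At 46 days of validity that is a renewal roughly every month — any manual process will miss it sooner or later.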
🚨 Trivy has been hacked, again.

---

What happened?

Attackers compromised the official aquasecurity/trivy-action GitHub Action — the one people use to run Trivy vulnerability scans in CI/CD pipelines. This was disclosed today (March 20, 2026). It's the *second* Trivy-related supply chain attack this month — the first one hit the Trivy VS Code extension on OpenVSX, where injected code tried to abuse local AI coding agents.

How did they do it?

The attacker force-pushed 75 out of 76 version tags in the aquasecurity/trivy-action repository. So if your workflow references this action by a version tag like @0.34.2, @0.33.0, or @0.18.0 — you're running malicious code. The only tag that wasn't touched is @0.35.0.

The tricky part: the malicious code runs *before* the real Trivy scan starts, so everything looks normal to the user.

What does the malware actually do?

It dumps the runner's process memory to grab secrets, harvests SSH keys, and steals credentials for AWS, GCP, Azure, and also Kubernetes service account tokens. Basically, it's an infostealer designed specifically for CI/CD environments.

How big is the blast radius?

Over 10,000 workflow files on GitHub reference this action, so potentially a lot of projects are affected. The compromised tags were still active at the time the article was written.

Key risks for you to think about:

Given your EKS and GitOps setup, here are the things I'd pay attention to:

1. K8s service account tokens leaked — if any of your CI pipelines use trivy-action and have access to your EKS clusters, those tokens could be compromised. Rotate them.

2. AWS credentials exposed — your IRSA roles, Secrets Manager access, anything the GitHub runner had in its environment could be stolen.

3. Tag pinning is not enough — this attack shows that even pinning to a specific version tag like @0.33.0 doesn't protect you. Tags in Git can be force-pushed. The safe approach is to pin to a full commit SHA, not a tag.

4. Second attack in one month on the same tool — Trivy is popular, and attackers clearly see it as a high-value target. Worth thinking about whether your security scanning toolchain has a single point of failure.

What to do right now:

- Check if any of your GitHub Actions workflows reference aquasecurity/trivy-action by tag (not by SHA).
- If yes, treat your CI/CD secrets as compromised — rotate AWS keys, SSH keys, K8s tokens.
- Switch to referencing actions by commit SHA instead of version tag.
- Review your GitHub Actions workflow permissions — make sure every workflow declares a least-privilege permissions: block.
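The last two items can look like this in practice. A hypothetical hardened workflow fragment — the 40-character SHA below is a placeholder you'd resolve yourself (e.g. via git ls-remote), and the trivy-action inputs shown are the commonly documented ones:

```yaml
# Sketch of a hardened workflow step (placeholder values, verify yourself).
permissions:
  contents: read        # least privilege: no write scopes by default

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trivy scan
        # Pin to a full commit SHA, not a tag — the SHA here is a
        # PLACEHOLDER, resolve the real commit of the release you trust.
        uses: aquasecurity/trivy-action@0000000000000000000000000000000000000000
        with:
          scan-type: fs
          scan-ref: .
```

A commit SHA is content-addressed, so unlike a tag it cannot be silently repointed at malicious code.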

This is a really good example of why "shift left security" needs to also include securing the security tools themselves. The scanner became the attack vector.

https://socket.dev/blog/trivy-under-attack-again-github-actions-compromise
⚡️ LocalStack archived its GitHub repo — what happened and what it means

On March 23, 2026, LocalStack archived localstack/localstack on GitHub (read-only) and consolidated everything into a single Docker image that requires an auth token — including in CI.
What changed:
- docker pull localstack/localstack:latest without LOCALSTACK_AUTH_TOKEN → your pipeline breaks
- Free "Hobby" plan exists but requires account creation and is non-commercial only
- Paid plans start at $39/mo
- CI needs a dedicated CI Auth Token stored in secrets
Your options:
- Pin to an older tag (e.g. 4.12) — works short-term, but you accumulate parity drift and unpatched CVEs
- Create a free account — enough for individual non-commercial dev
- Pay — if LocalStack is embedded in team CI
For open-source projects: LocalStack launched a separate program offering free Ultimate tier licenses (100+ AWS services, Cloud Pods, IAM enforcement) to eligible OSS projects with OSI-approved licenses.
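For the paid/CI path, a hypothetical GitHub Actions fragment. The secret name LOCALSTACK_CI_TOKEN is my invention, and I'm assuming the container reads LOCALSTACK_AUTH_TOKEN from its environment at startup, as LocalStack Pro images have done — verify against the current docs:

```yaml
# Sketch: wiring a LocalStack CI Auth Token into a CI job.
# Assumptions: secret name is yours to choose; the image is assumed to
# read LOCALSTACK_AUTH_TOKEN from the environment at start.
jobs:
  integration:
    runs-on: ubuntu-latest
    env:
      LOCALSTACK_AUTH_TOKEN: ${{ secrets.LOCALSTACK_CI_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: Start LocalStack
        run: |
          docker run -d --name localstack -p 4566:4566 \
            -e LOCALSTACK_AUTH_TOKEN \
            localstack/localstack:latest
```

Keep the token out of workflow files and logs — treat it like any other cloud credential.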

https://blog.localstack.cloud/introducing-localstack-for-open-source/