Many organizations are looking for more efficient logging solutions than the traditional stack. This comparison highlights a modern alternative to ELK that aims to reduce complexity and resource usage.
https://osuite.io/articles/modern-alternative-to-elk
osuite.io
ELK alternative: Modern log management setup with Opentelemetry and Opensearch
Full stack observability designed for scale
The article details how to implement production-grade distributed tracing for complex multi-agent AI workflows using OpenTelemetry.
https://developers.redhat.com/articles/2026/04/06/distributed-tracing-agentic-workflows-opentelemetry#
Red Hat Developer
Distributed tracing for agentic workflows with OpenTelemetry | Red Hat Developer
Agentic applications often involve complex interactions between routing agents, specialist agents, knowledge bases, Model Context Protocol (MCP) servers, and external systems. This complexity makes
Networking within container orchestration can often seem like a black box to developers. This explanation aims to demystify Kubernetes CNI providers and how they manage connectivity.
https://medium.com/@csinclair11/demystifying-kubernetes-cni-providers-5ed79569c797
Medium
Demystifying Kubernetes CNI Providers
Computer networks have changed. It makes sense, computing platforms have been changing for several years now. From the old days of beefy…
I found a good example of why autoscaling based only on CPU utilization can cause an outage.
About a week ago, Twingate had an incident that affected us as a client. They've published a postmortem, and it's a good example of why CPU isn't a good metric to rely on when autoscaling your services.
The incident was triggered by elevated network latency affecting communication paths used by the Authorization service. As requests took longer to complete, individual service instances were able to process fewer requests than normal.
This reduction in throughput exposed a limitation in our auto-scaling configuration, which primarily relied on CPU utilization to determine service capacity requirements.
So, from the CPU utilization perspective, everything was OK, but the number of processed requests decreased.
https://status.twingate.com/incidents/49qvqk7swjpq
Twingate
Twingate Service Incident
Twingate's Status Page - Twingate Service Incident.
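The failure mode described in the postmortem can be sketched with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`. All numbers below (request rate, latencies, CPU percentage, targets) are hypothetical and are not taken from Twingate's configuration; the sketch only illustrates why a CPU signal can stay flat while a concurrency signal, estimated via Little's law, rises sharply.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """HPA-style rule: ceil(current_replicas * current_value / target_value)."""
    return math.ceil(current_replicas * current_value / target_value)


def in_flight_per_pod(arrival_rps: float, latency_s: float, replicas: int) -> float:
    """Little's law: concurrency = arrival rate * latency, spread evenly over pods."""
    return arrival_rps * latency_s / replicas


replicas = 10
arrival_rps = 1000.0       # hypothetical incoming request rate
cpu_target = 60.0          # hypothetical CPU utilization target, percent
in_flight_target = 10.0    # hypothetical target of in-flight requests per pod

# Normal operation: 50 ms per request. Incident: 500 ms per request.
normal = in_flight_per_pod(arrival_rps, 0.05, replicas)
incident = in_flight_per_pod(arrival_rps, 0.5, replicas)

# Assumption: during the incident pods mostly wait on the network,
# so CPU utilization stays at a hypothetical 40%.
cpu_based = desired_replicas(replicas, 40.0, cpu_target)
concurrency_based = desired_replicas(replicas, incident, in_flight_target)

print(f"in-flight per pod: normal={normal:.1f}, incident={incident:.1f}")
print(f"CPU-based desired replicas: {cpu_based}")
print(f"concurrency-based desired replicas: {concurrency_based}")
```

Under these assumptions the CPU signal sees no reason to add capacity, while the concurrency signal asks for several times more replicas.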
Forwarded from AI Vibe Notes
kagent runs your agents where your workloads already live: on Kubernetes. Deploy, observe, and govern AI agents with the tools your platform team already trusts. Open source. Production grade. Built by the founders of Istio.
https://github.com/kagent-dev/kagent
GitHub
GitHub - kagent-dev/kagent: Cloud Native Agentic AI | Discord: https://bit.ly/kagentdiscord
The DNSTracking feature in the Red Hat network observability operator 1.11 now captures DNS query names directly via eBPF without additional configuration.
https://developers.redhat.com/articles/2026/04/09/how-dns-name-tracking-enhances-network-observability#
Red Hat Developer
How DNS name tracking enhances network observability | Red Hat Developer
Network observability has long had a feature that reports the DNS latencies and response codes for the DNS resolutions in your Kubernetes cluster
CLI tool for linting and testing Helm charts
https://github.com/helm/chart-testing
GitHub
GitHub - helm/chart-testing: CLI tool for linting and testing Helm charts
ING tackled developer portal sprawl (60+ disparate tools) by adopting Backstage.io as their unified front-end standard. The talk outlines their specific architectural choices and governance models to scale Backstage without it becoming a monolithic bottleneck or crashing due to community plugins.
- To prevent a single bad plugin from crashing the portal, ING separates core services (like the software catalog, which handles hundreds of thousands of entities and has dedicated DB tuning) from community/external plugins, running them on separate instances.
- To avoid costly rewrites of legacy services, internal teams can use a backend proxy plugin to connect existing backend tools into the Backstage UI.
- They built a custom plugin to solve ownership issues in complex, cross-domain workflows.
- Because anyone can contribute, ING enforces a "Contribution Plugin" workflow.
- They drove adoption by focusing heavily on Developer Experience (local setups, playgrounds) while simultaneously having their Technology Standards Board mandate Backstage for all new internal UI initiatives.
https://tldrecap.tech/posts/2026/backstagecon-europe/ing-backstage-scaling-developer-platform/
TLDRecap
Divide & Collaborate: Creating Scalable and Healthy Backstage Ba... Krzysztof Janota & Dusan Askovic
Presenters
Krzysztof Janota, Dusan Askovic
BackstageCon Europe 2026. Divide and Collaborate: Scaling Backstage at ING. In the world of massive organizations, complexity is rarely a problem; it is a numbers game. At ING, a global systemic bank with 60…
The article argues that the primary bottleneck in software delivery is no longer writing code (thanks to AI-assisted development) but post-commit infrastructure operations, which are traditionally built for human interaction rather than machine autonomy. It positions Crossplane and Kubernetes-native control planes as the necessary solution, advocating for "API-first infrastructure."
https://www.cncf.io/blog/2026/03/20/crossplane-and-ai-the-case-for-api-first-infrastructure/
CNCF
Crossplane and AI: The case for API-first infrastructure
AI-assisted development has changed the way engineers create and commit code. But writing code is no longer the bottleneck. The bottleneck is everything that happens after git push.
The article explores the newly introduced CloudWatch Logs delivery feature for Amazon EKS Auto Mode.
https://shinyaz.com/en/blog/2026/03/19/eks-auto-mode-enhanced-logging
Shinyaz
Visualizing Karpenter Internals with EKS Auto Mode Enhanced Logging
Set up CloudWatch Vended Logs for EKS Auto Mode's 4 components (Compute/Block Storage/Load Balancing/IPAM) and analyze scale-up to scale-down behavior with Logs Insights queries.
Airbnb migrated its high-volume metrics infrastructure to adopt the OpenTelemetry Protocol (OTLP) and Prometheus. To do so without massive disruption, they implemented a dual-emit strategy in their shared metrics libraries. They encountered and solved specific performance bottlenecks regarding high-cardinality data and replaced their legacy Veneur aggregator with a custom-sharded vmagent setup. Crucially, they developed a "zero injection" technique to solve systemic undercounting issues when translating StatsD-style counters into Prometheus cumulative counters.
https://medium.com/airbnb-engineering/building-a-high-volume-metrics-pipeline-with-opentelemetry-and-vmagent-c714d6910b45
Medium
Building a high-volume metrics pipeline with OpenTelemetry and vmagent
A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus.
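A rough illustration of the undercounting problem, under the assumption that "zero injection" means emitting an explicit 0 sample before the first real value of a new cumulative series. The helper names are hypothetical and the increase calculation is a simplified stand-in (no counter resets, no extrapolation), not Airbnb's or Prometheus's actual implementation; the article has the real details.

```python
def observed_increase(samples: list[float]) -> float:
    """Sum of deltas between consecutive samples of a cumulative counter.

    Simplified model, assuming no counter resets: the first sample has no
    predecessor, so whatever it already counted is invisible to this sum.
    """
    return sum(later - earlier for earlier, later in zip(samples, samples[1:]))


def with_zero_injection(samples: list[float]) -> list[float]:
    """Prepend an explicit 0 so the first real value shows up as a delta."""
    return [0.0, *samples]


# A new series whose first scraped value is already 3
# (three StatsD-style increments arrived before the first sample).
scraped = [3.0, 5.0, 9.0]

print(observed_increase(scraped))                       # first 3 events are lost
print(observed_increase(with_zero_injection(scraped)))  # all 9 events are counted
```

In this toy model the series without the injected zero reports an increase of 6 for 9 real events, which is the kind of systematic undercount the post describes.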
A utility for fetching Kubernetes manifest documents from a running cluster. It can run inside or outside a Kubernetes cluster and uses a config file to determine which kinds of objects to detect. Manifest files are stored in an output directory in the format:
<outputDir>/<kind>/<namespace>/<name>.yaml
https://github.com/grafana/k8s-manifest-tail
GitHub
GitHub - grafana/k8s-manifest-tail
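To make the output layout concrete, here is a small sketch that builds a path following the `<outputDir>/<kind>/<namespace>/<name>.yaml` pattern. The function `manifest_path` and the sample values are hypothetical and are not part of the tool; how the tool lays out cluster-scoped objects (which have no namespace) is not covered here.

```python
from pathlib import PurePosixPath


def manifest_path(output_dir: str, kind: str, namespace: str, name: str) -> PurePosixPath:
    """Build <outputDir>/<kind>/<namespace>/<name>.yaml for a namespaced object."""
    return PurePosixPath(output_dir) / kind / namespace / f"{name}.yaml"


# Hypothetical example values.
path = manifest_path("out", "Deployment", "default", "web")
print(path)
```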
Shopify discovered that deeply nested, high-cardinality GraphQL queries were bottlenecking not on I/O, but on CPU-bound field resolver execution driven by GraphQL's standard depth-first traversal model. To solve this, Shopify built "GraphQL Cardinal," a breadth-first execution engine that resolves each field once across all objects rather than recursively per object, vastly reducing platform overhead and resolving N+1 issues more efficiently.
https://shopify.engineering/faster-breadth-first-graphql-execution
Shopify
Shopify's journey to faster breadth-first GraphQL execution (2026) - Shopify
We questioned why conventional GraphQL execution incurs hidden costs, and rewrote it in a faster breadth-first manner to avoid them.
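The difference between the two traversal orders can be sketched by counting resolver invocations for a single field across many objects. This is a toy model with hypothetical function names and data; it is not the GraphQL Cardinal API and ignores nesting, errors, and lazy loading.

```python
from collections import Counter

calls: Counter = Counter()


def resolve_title_one(product: dict) -> str:
    """Per-object resolver, as invoked by depth-first execution."""
    calls["per_object"] += 1
    return product["title"].upper()


def resolve_title_batch(products: list[dict]) -> list[str]:
    """Per-field resolver, invoked once for the whole set of objects."""
    calls["per_field"] += 1
    return [p["title"].upper() for p in products]


def execute_depth_first(products: list[dict]) -> list[dict]:
    # Visit each object in turn and resolve its fields before moving on.
    return [{"title": resolve_title_one(p)} for p in products]


def execute_breadth_first(products: list[dict]) -> list[dict]:
    # Resolve the field once across all objects, then assemble the results.
    titles = resolve_title_batch(products)
    return [{"title": t} for t in titles]


products = [{"title": f"item {i}"} for i in range(1000)]
depth = execute_depth_first(products)
breadth = execute_breadth_first(products)

print(calls["per_object"], calls["per_field"])
```

Both strategies produce the same response here; the per-call overhead (dispatch, instrumentation, bookkeeping) is paid once per object in the first case and once per field in the second.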