🚨 AI News | TestingCatalog
Latest AI News on AI Agents, Model Releases, Tools, Leaks, and Rumors 🗞
Google made Gemma 4 models 3x faster with MTP Drafters

Google introduced speculative decoding for Gemma 4, cutting inference latency by pairing a main model with a lightweight drafter. It speeds up local and on-device inference on GPUs and edge hardware without sacrificing output quality.
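Speculative decoding pays off because having the main model check several proposed tokens is cheaper than having it generate them one at a time. Below is a minimal greedy sketch in Python, assuming toy "models" that map a token list to their next token. The names `speculative_decode`, `greedy_decode`, `target`, `drafter`, and the draft length `k` are illustrative; this is not Gemma's API or Google's MTP drafter implementation. Because the main model verifies every drafted token, the output matches what it would produce on its own, which is where the "no loss in quality" property comes from.

```python
from typing import Callable, List

# A toy "model": any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: Model, drafter: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the drafter proposes k tokens, the target verifies.

    A real system scores all k drafted positions in ONE batched target pass;
    here the target is called position by position to keep the sketch simple.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[int] = []
        ctx = list(out)
        for _ in range(k):
            tok = drafter(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. The target keeps the longest prefix it agrees with.
        accepted: List[int] = []
        ctx = list(out)
        for tok in draft:
            expected = target(ctx)
            if expected != tok:
                # First disagreement: use the target's own token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # All k drafts accepted: the verification pass yields one bonus token.
            accepted.append(target(ctx))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new]
```

Each loop iteration emits at least one token, and between one and `k + 1` tokens per verification step when the drafter agrees often. With sampling instead of greedy decoding, production systems replace the equality check with a rejection-sampling rule so the output distribution still matches the main model.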

🗞 #gemini @testingcatalog
❀2πŸ‘1
Anthropic debuts Dreams for Claude Managed Agents

Anthropic previewed “dreaming” in Claude Managed Agents, letting agents review past sessions, refine memory, and reduce repeat errors. The update also adds outcomes, multiagent orchestration, and webhooks for complex enterprise workflows.

🗞 #claude @testingcatalog
Anthropic partners with SpaceXAI and doubles 5-hour rate limits

Anthropic's partnership with SpaceX expands compute capacity and raises Claude usage limits, giving SpaceX staff broader AI access for engineering, operations, coding, and documentation in high-demand internal workflows.

🗞 #claude @testingcatalog
Google DeepMind 🤝 EVE Online

Google DeepMind has partnered with Fernis Creations to conduct new research within an isolated EVE Online environment.

> As part of this next chapter, we are beginning a research partnership with Google DeepMind, focused on intelligence in complex, dynamic, player-driven systems.
❀8πŸ‘2
Anthropic is testing an Insights feature for Managed Agents in the Claude Console.

> Up to 100 recent sessions are fetched. Each transcript is sent to the model (4 in parallel) with your agent's system prompt as context. The model writes a summary (task, actions, issues, assessment) and a 0–100 quality score. Token, cache, and tool-error counts are computed directly from the events alongside.

> A single model call reads every summary and its stats, then produces cross-session findings (recurring errors, usage patterns, efficiency outliers, wins), error-category buckets, and use-case clusters. Every cited session ID is checked against the input, so findings only ever point at real sessions.

> Summaries and findings are saved so the page loads instantly next time. Everything numeric you see (counts, percentages, token stats per cluster) is computed here from raw event data; only the prose and bucket membership come from the model.
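The quoted pipeline follows a familiar pattern: bounded-concurrency model calls per session, deterministic stats computed from raw events, and a grounding check on cited session IDs. Below is a minimal Python sketch of that pattern. The event schema, the stubbed `summarize_with_model`, and the names `analyze_sessions` and `ground_findings` are hypothetical; this is not Anthropic's code. The limits (100 sessions, 4 in parallel) and the 0–100 score come from the quote. Dropping a finding whose citations are all invalid is an assumed policy.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

MAX_SESSIONS = 100  # "Up to 100 recent sessions are fetched"
CONCURRENCY = 4     # "Each transcript is sent to the model (4 in parallel)"


@dataclass
class SessionReport:
    session_id: str
    summary: str           # prose: comes from the model
    quality_score: int     # 0-100: comes from the model
    stats: Dict[str, int]  # numbers: computed from raw events, never by the model


@dataclass
class Finding:
    text: str
    cited_session_ids: List[str]


def compute_stats(events: List[dict]) -> Dict[str, int]:
    # Hypothetical event fields: "tokens", "cache_tokens", "type", "is_error".
    return {
        "tokens": sum(e.get("tokens", 0) for e in events),
        "cache_tokens": sum(e.get("cache_tokens", 0) for e in events),
        "tool_errors": sum(1 for e in events
                           if e.get("type") == "tool_result" and e.get("is_error")),
    }


async def summarize_with_model(events: List[dict], system_prompt: str) -> Tuple[str, int]:
    # Hypothetical stand-in for the real model call; returns (summary, score).
    await asyncio.sleep(0)
    return f"{len(events)} events reviewed", 50


async def analyze_sessions(sessions: Dict[str, List[dict]],
                           system_prompt: str) -> List[SessionReport]:
    # Assumes `sessions` is ordered most-recent first.
    recent = list(sessions.items())[:MAX_SESSIONS]
    limit = asyncio.Semaphore(CONCURRENCY)

    async def one(session_id: str, events: List[dict]) -> SessionReport:
        async with limit:  # at most CONCURRENCY model calls in flight
            summary, score = await summarize_with_model(events, system_prompt)
        score = max(0, min(100, score))  # clamp to the 0-100 range
        return SessionReport(session_id, summary, score, compute_stats(events))

    return list(await asyncio.gather(*(one(sid, ev) for sid, ev in recent)))


def ground_findings(findings: Iterable[Finding],
                    input_ids: Iterable[str]) -> List[Finding]:
    # Keep only citations that point at sessions actually sent to the model;
    # drop a finding entirely if none of its citations survive.
    valid = set(input_ids)
    grounded = []
    for f in findings:
        real = [sid for sid in f.cited_session_ids if sid in valid]
        if real:
            grounded.append(Finding(f.text, real))
    return grounded
```

Keeping the counts in `compute_stats` instead of asking the model for them means every number on the page can be traced back to raw event data. This matches the quote's split between model-written prose and locally computed statistics.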
Google prepares Agent Mode for Flow to automate video production

Google is preparing an Agent Mode for Flow, its Veo-based filmmaking tool. Code indicates a prompt-bar toggle for planning scenes, managing tools, running generation, and updating projects, with a likely reveal at I/O.

🗞 #flow @testingcatalog
Scale Labs debuts new Refactoring Leaderboard

Scale Labs' SWE Atlas Refactoring Leaderboard tests whether AI coding agents can restructure real codebases without changing behavior. Opus 4.7 leads, but results show reliability, cleanup, and repeatability still limit production use.
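Checking that code still works "without changing behavior" comes down to comparing the code before and after the refactor. Below is a minimal differential-testing sketch in Python. The functions `sum_even_squares_v1` (original), `sum_even_squares_v2` (a faithful refactor), and `sum_even_squares_bad` (a refactor that quietly skips negative numbers) are hypothetical examples, and so is `find_behavior_changes`. This is not Scale's grading harness, which the post does not describe.

```python
import random
from typing import Callable, List, Sequence, Tuple


def find_behavior_changes(original: Callable, refactored: Callable,
                          inputs: Sequence[tuple]) -> List[Tuple[tuple, object, object]]:
    """Run both versions on the same arguments and collect every disagreement."""
    mismatches = []
    for args in inputs:
        before, after = original(*args), refactored(*args)
        if before != after:
            mismatches.append((args, before, after))
    return mismatches


# Hypothetical example: the original code before refactoring.
def sum_even_squares_v1(nums):
    total = 0
    for n in nums:
        if n % 2 == 0:
            total = total + n * n
    return total


# A behavior-preserving refactor.
def sum_even_squares_v2(nums):
    return sum(n * n for n in nums if n % 2 == 0)


# A "cleanup" that silently changes behavior by skipping negative numbers.
def sum_even_squares_bad(nums):
    return sum(n * n for n in nums if n > 0 and n % 2 == 0)


rng = random.Random(0)  # fixed seed so the random inputs are reproducible
inputs = [([-2],)] + [([rng.randint(-10, 10) for _ in range(rng.randint(0, 8))],)
                      for _ in range(200)]

print(find_behavior_changes(sum_even_squares_v1, sum_even_squares_v2, inputs))  # []
print(len(find_behavior_changes(sum_even_squares_v1, sum_even_squares_bad, inputs)) > 0)  # True
```

An empty mismatch list only means no difference showed up on these inputs. That is evidence of equivalence, not proof, which is one reason repeatability matters in refactoring evaluations.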

🗞 #ai @testingcatalog
❀4πŸ‘4
Scale AI published the SWE Atlas Refactoring Leaderboard, a new benchmark that evaluates how well agents can restructure code.

> It requires agents to produce twice as many lines of code as SWE-Bench Pro.

> Claude Code with Opus 4.7 tops the leaderboard, followed by Codex with GPT-5.5, GPT-5.4, and GPT-5.3.

> Refactoring is an important task for LLMs to handle, as it often boils down to tedious engineering work.
❀4πŸ‘33
OPENAI 🚨: 3 new models are now available in the OpenAI Playground and API.

- gpt-realtime-2
- gpt-realtime-whisper
- gpt-realtime-translate

ChatGPT Voice Mode upgrade soon? 👀
πŸ‘2
GOOGLE 🚨: Gemini 3.1 Flash Lite is now generally available! Users can also test the model in AI Studio.

> Designed for ultra-low latency, high-volume tasks, and unmatched cost-efficiency, Flash-Lite is already transforming how applications are built at scale.
7πŸ‘321
SPACEXAI 🚨: New signs of Grok Computer have been spotted on Grok's web app.

A new selector allows users to choose between Grok Computer and a "Folder on Google Drive."

This selector recently became visible to everyone, which may not have been intentional.

Grok Computer soon? 👀
❀5πŸ‘5
ChatGPT Advanced Voice Mode (AVM) 2 is in development 🚧

Historically, AVM updates have been saved for the day before Google I/O.

Soon? 👀👀👀
❀8πŸ‘3
OPENAI 🔥: Codex is getting a dedicated Chrome extension soon!

> With the new extension for Chrome, Codex is even better at working with apps and websites in your browser. It works in parallel across tabs in the background without taking over your browser, and you stay in control of which websites Codex can use.

* Not available yet 👀
❀4πŸ‘3
SpaceXAI prepares Grok Build desktop app to rival OpenAI Codex

SpaceXAI appears close to launching Grok Build, a desktop coding app for macOS, Linux, and Windows.

> It will support planning mode, Plugins, Skills, and MCPs.
> It will be able to work with the Git tree, spawn dev servers, and use a built-in browser.

🗞 #grok @testingcatalog
πŸ”₯3❀2πŸ‘11
OpenAI launches new realtime voice and translation AI models

OpenAI added three real-time voice models to its API: GPT-Realtime-2 for complex voice agents, GPT-Realtime-Translate for multilingual speech, and GPT-Realtime-Whisper for live transcription, with pricing and Playground access.

🗞 #chatgpt @testingcatalog
πŸ‘3❀2
Telegram ships major update for AI bots and automations

Telegram is slowly becoming the most AI-integrated messenger! Users can now interact with guest AI bots, build automated workflows involving bot-to-bot communication, and attach their personal AI assistants to their profiles so the bot can handle incoming inquiries.

🗞 #telegram @testingcatalog