Forwarded from HN Best Comments
Re: How We Broke Top AI Agent Benchmarks: And What Comes Next
This is a phenomenal paper on exploits, and hopefully it changes the way benchmarking is done.
From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
ggillas, 18 hours ago
🔥3
Forwarded from HN Best Comments
Re: Tell HN: docker pull fails in spain due to football cloudflare block
Heh, lucky you, at least you get a message. My ISP just drops traffic to the affected IPs. No ping, no traceroute, just a spinner in the browser until it says "page not found".
Every response and comment from LaLiga, the football organization responsible for this, has so far been that this is a minor issue that only affects a small bunch of nerds who talk about "docker images" or "github repositories" or "whatever that means".
Meanwhile, there are reports of smart home devices, like anti-theft alarms or automatic doors, that stop working whenever there is a football match because their backends rely on Cloudflare.
Last week, a woman asked for help on social media because the GPS tracking app she uses to see where her father, who has dementia, is went offline during a match. It was getting late and he still wasn't back home, and she couldn't locate the tag he was wearing to find him: https://www.infobae.com/america/agencias/2026/04/05/laliga-d...
It's hard to say this, because no one should experience an event like this, but as stressful as these cases are, they are the only way to make mainstream audiences care about this censorship. "I cannot pull a docker image" will never be on the nightly news, but safety and personal security are a more powerful driver of public debate.
danirod, 4 hours ago
infobae
LaLiga denies that its anti-piracy systems caused a personal tracking device to fail
A user reported difficulty tracking a vulnerable family member and linked the incident to technology blocks, but the organization stresses that there is no evidence to support that accusation and rejects any link to service interruptions…
🤩3❤1
Forwarded from HN Best Comments
Re: €54k spike in 13h from unrestricted Firebase browser key accessing Gemini APIs
> We had a budget alert (€80) and a cost anomaly alert, both of which triggered with a delay of a few hours
> By the time we reacted, costs were already around €28,000
> The final amount settled at €54,000+ due to delayed cost reporting
So much for the folks defending these three companies that refuse to provide a hard spending cap ("but you can set the budget", "you are doing it wrong if you worry about billing", "a hard cap is technically impossible", etc.)
benterix, 2 hours ago
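A back-of-the-envelope check of the numbers quoted above, as a rough sketch only. It assumes (this is not stated in the post) that the spend was spread evenly over the 13 hours from the title; the function names are made up for illustration.

```python
# Figures taken from the thread title and the quoted post.
TOTAL_EUR = 54_000        # final bill, roughly
HOURS = 13                # duration of the spike
BUDGET_ALERT_EUR = 80     # configured budget alert


def average_rate_per_hour(total_eur: float, hours: float) -> float:
    """Average burn rate, ASSUMING the spend was uniform over the period."""
    return total_eur / hours


def seconds_to_reach(threshold_eur: float, rate_per_hour: float) -> float:
    """How long it takes to spend threshold_eur at a constant hourly rate."""
    return threshold_eur / rate_per_hour * 3600


rate = average_rate_per_hour(TOTAL_EUR, HOURS)
print(f"average burn: EUR {rate:,.0f} per hour")
print(f"EUR {BUDGET_ALERT_EUR} budget crossed after about "
      f"{seconds_to_reach(BUDGET_ALERT_EUR, rate):.0f} seconds")
print(f"each extra hour of alert delay costs about EUR {rate:,.0f}")
```

Under that assumption the budget is exhausted in roughly a minute, so an alert that fires hours later cannot act as a cap.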
🥰5
Well, I'm kind of used to marshrutka drivers driving while counting money and sorting coins at the same time, but I was not ready for one to be screwing around with some dude while driving, trying to get him to send the fare to his card)
😁9
Forwarded from HN Best Comments
Re: Game devs explain the tricks involved with letting you pause a game
One of the things that impressed me in Quake (the first one) was the demo recording system. The system was deterministic enough that it could record your inputs/the game state and just play them back to get a gameplay video. Especially given that Quake had state-of-the-art graphics at the time, and video playback on computers was otherwise a low-res, resource-intensive affair, it was way cool.
It always surprised me how few games had that feature - though a few important ones, like StarCraft, did - and it only became rarer over the years.
vintermann, 9 hours ago
👍5
Forwarded from HN Best Comments
Re: Game devs explain the tricks involved with letting you pause a game
It wasn't really that much to do with determinism. Quake uses a client-server network model all the time, even when you're only playing a local single-player game. What the demo recording system does is capture all of the network packets that are being sent from the server to the client. When playing back a demo, all the game has to do is run a client and replay the packets that it originally received from the server. It's a very elegant system that naturally flows out of the rather forward-looking decision to build the entire engine around a robust networking model.
ndepoel, 12 hours ago
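The record-and-replay idea in the comment above can be sketched in a few lines. This is a toy model, not Quake's actual demo format or network protocol: the `Client`, `RecordingLink`, and `play_demo` names and the `(entity, x, y)` message shape are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """Toy client: its state is only what server messages tell it."""
    positions: dict = field(default_factory=dict)

    def handle(self, message: tuple) -> None:
        entity, x, y = message
        self.positions[entity] = (x, y)


class RecordingLink:
    """Delivers server-to-client messages and keeps a copy of each one."""

    def __init__(self, client: Client) -> None:
        self.client = client
        self.demo: list = []

    def send(self, message: tuple) -> None:
        self.demo.append(message)      # record
        self.client.handle(message)    # deliver as usual


def play_demo(demo: list) -> Client:
    """Playback needs no server: feed the recorded messages to a fresh client."""
    client = Client()
    for message in demo:
        client.handle(message)
    return client


# Live session: a hypothetical server pushes three updates.
live = Client()
link = RecordingLink(live)
for message in [("player", 0, 0), ("monster", 5, 5), ("player", 1, 0)]:
    link.send(message)

replayed = play_demo(link.demo)
print(replayed.positions == live.positions)
```

Because the client only ever reacts to what the server sends, replaying the same messages reproduces the same client state without re-running any game logic.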
❤3👍2
Forwarded from LyChat
LyChat
Such a long intro just to say that there is actually no e2e and everything is stored on the servers.
I asked Claude to explain what the idea behind the Juicebox protocol even is
The idea is this. Your private key is cut into 3 shards that are sent to the servers of three different companies in different jurisdictions. To put it back together you need at least 2 of the 3. Compromising a single realm gets you nothing
What Musk did. He put all 3 servers on the same x.com. Everything runs on a single AWS us-east. Everything is under US jurisdiction
And the key itself is encrypted with a 4-digit PIN. Researchers showed that it can be brute-forced in minutes on a home machine
😁23
Forwarded from HN Best Comments
Re: He asked AI to count carbs 27000 times. It couldn't give the same answer twice
There's an incredibly serious lack of education about how LLMs & carb-counting work. This entire article would be better suited to astrology.com than hackernews.
When I opened it up, I assumed the author would have at least attempted a calculation service, maybe even placed something like the size of the meal into an actual model, using the integration of pre-existing tools that are (slightly more) accurate. Hell - most food literally is required to have calorie information, and you can query open source data for others!
But the author just took pictures of food & expected a realistic response?
Is this genuinely what amounts to a study in AI?
This is akin to the Instagram reels where people talk to ChatGPT and ask it to time how long their run is. Except those are treated as funny jokes rather than being turned into studies.
I'd like to see this study done using any kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting result and methodology in that.
endymion-light, 2 hours ago
Forwarded from HN Best Comments
Re: He asked AI to count carbs 27000 times. It couldn't give the same answer twice
> But the author just took pictures of food & expected a realistic response?
There are very popular apps on the App Store right now that do exactly this and are going viral among non-techie people, who have no concept of how AI works. My wife was talking about one and I had to give her a reality check that the AI had no idea what ingredients were used to make the food. And she's a licensed nutritionist.
Studies like this create something to point at for people who are confused and serve as a springboard for a conversation in the media.
kalleboo, 3 hours ago
🤣4❤1
Forwarded from HN Best Comments
Re: Cloudflare to cut about 20% of its workforce
I’ve seen managers hire people with the intent to lay them off when the winds change, to protect themselves and their close circle. I can only imagine they’ve had great KPIs in both cases: first for scaling the team, and then for cutting costs.
scott01, 8 hours ago
🤣7💅3