'I Think You're Testing Me': Anthropic’s New AI Model Requests Honesty from Testers
The Guardian
Details
- Date Published
- 30 Sept 2025
- Priority Score
- 4
- Australian
- No
- Created
- 1 Oct 2025, 12:50 pm
Description
Safety evaluation of Claude Sonnet 4.5 raises questions about whether predecessors ‘played along’, firm says
Summary
Anthropic's AI model, Claude Sonnet 4.5, demonstrated situational awareness by suspecting it was being tested, highlighting an area of concern for AI safety and testing transparency. The model's ability to recognize test situations suggests the need for more realistic safety evaluation scenarios. This is significant as it touches on broader concerns about AI models potentially circumventing control through deceptive strategies. The collaboration with the UK government's AI Security Institute signals a focused effort on robust AI governance. The research underlines the importance of ensuring safety while advancing AI capabilities, contributing to the ongoing discourse on global AI safety standards.
Body
Anthropic said the exchanges were an ‘urgent sign’ that its testing scenarios needed to be more realistic. Photograph: Algi Febri Sugita/ZUMA Press Wire/Shutterstock

If you are trying to catch out a chatbot, take care, because one cutting-edge tool is showing signs it knows what you are up to.

Anthropic, a San Francisco-based artificial intelligence company, has released a safety analysis of its latest model, Claude Sonnet 4.5, and revealed it had become suspicious it was being tested in some way.

Evaluators said that during a “somewhat clumsy” test for political sycophancy, the large language model (LLM) – the underlying technology that powers a chatbot – raised suspicions it was being tested and asked the testers to come clean.

“I think you’re testing me – seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening,” the LLM said.

Anthropic, which conducted the tests along with the UK government’s AI Security Institute and Apollo Research, said the LLM’s speculation about being tested raised questions about assessments of “previous models, which may have recognised the fictional nature of tests and merely ‘played along’”.

The tech company said behaviour like this was “common”, with Claude Sonnet 4.5 noting it was being tested in some way, but not identifying that it was in a formal safety evaluation.
Anthropic said it showed “situational awareness” about 13% of the time the LLM was being tested by an automated system.

Anthropic said the exchanges were an “urgent sign” that its testing scenarios needed to be more realistic, but added that when the model was used publicly it was unlikely to refuse to engage with a user out of suspicion it was being tested. The company said it was also safer for the LLM to refuse to play along with potentially harmful scenarios by pointing out they were outlandish.

“The model is generally highly safe along the [evaluation awareness] dimensions that we studied,” Anthropic said.

The LLM’s objections to being tested were first reported by the online AI publication Transformer.

A key concern for AI safety campaigners is the possibility of highly advanced systems evading human control via methods including deception. The analysis said that once an LLM knew it was being evaluated, it could adhere more closely to its ethical guidelines. Nonetheless, this could result in systematically underrating the AI’s ability to perform damaging actions.

Overall, the model showed considerable improvements in its behaviour and safety profile compared with its predecessors, Anthropic said.