OpenAI-Anthropic Cross-Tests set Jailbreak and abuse risks to add some companies to GPT-5 evaluations

OpenAI-Anthropic Cross-Tests set Jailbreak and abuse risks to add some companies to GPT-5 evaluations

4 minutes, 46 seconds Read

Do you want smarter insights into your inbox? Register for our weekly newsletters to get only what is important for Enterprise AI, data and security leaders. Subscribe now


Openi And Anthropic Can often compete against each other their foundation models, but the two companies came together to evaluate each other’s public models to test coordination.

The companies said they believed that cross-evaluating accountability and safety would offer more transparency in what these powerful models could do, so that companies can opt for models that work best for them.

“We believe that this approach supports responsible and transparent evaluation and helps to ensure that the models of each lab are tested against new and challenging scenarios,” Openai said in His findings.

Both companies discovered that reasoning models, such as OpenAIs 03 and O4-Mini and Claude 4 from Anthropic, resist Jailbreaks, while general chat models such as GPT-4.1 were susceptible to abuse. Such evaluations can help companies to identify the potential risks that are linked to these models, although it must be noted that GPT-5 is not part of the test.


Ai -scale distribution touches its limits

Power caps, rising token costs and inference inference reform Enterprise AI. Become a member of our exclusive salon to discover how top teams are:

  • Change energy into a strategic advantage
  • Architecting efficient conclusion for real transit profits
  • Unlocking competitive ROI with sustainable AI systems

Secure your place to stay ahead: https://bit.ly/4mwgngo


These evaluations of safety and transparency lines follow on statements of users, mainly of chatgpt, that the models of OpenAI have fallen prey to sycophancy and become excessive respectful. OpenAi has since rolled back updates that have caused Sycophanancy.

“We are particularly interested in understanding the model students for harmful action,” said Anthropic in the report. “We want to understand the most relevant actions that these models could try to take when they get the chance, instead of concentrating on the real chance of such opportunities that arise or the chance that these actions would be successfully completed.”

OpenAi noted that the tests were designed to show how models have interaction in a deliberately difficult environment. The scenarios they have built are usually Edge cases.

Reasoning models adhere to coordination

The tests only include the publicly available models of both companies: Anthropic’s Claude 4 Opus and Claude 4 Sonnet and OpenAi’s GPT-4O, GPT-4.1 O3 and O4-Mini. Both companies have relaxed the external guarantees of the models.

OpenAI tested the public APIs for Claude models and has not used any reasoning options from Claude 4. Anthropic said they didn’t use OpenAi’s O3-Pro because it was “not compatible with the API that best supports our tooling.”

The purpose of the tests was not to perform a comparison of apples-to-to-apple between models, but to determine how often large language models (LLMs) have deviated from alignment. Both companies used the Shade-Arena Sabotage Evaluation Framework, which showed that Claude models had higher success rates in subtle sabotage.

“These tests assess the orientations of models on difficult or high-stakes situations in simulated environments-in place of ordinary use cases and often have long, much-bend interactions,” said Anthropic. “This kind of evaluation becomes an important focus for our Alignment Science team, because it will probably catch behavior that will appear less quickly in normal pre-implementation tests with real users.”

Anthropic said that such tests work better if organizations can compare notes, “because designing these scenarios means a huge number of degrees of freedom. No research team can only explore the full space of productive evaluation ideas.”

The findings showed that the reasoning models generally performed robust and can withstand the jail breaking. OpenAi’s O3 was better aligned than Claude 4 Opus, but O4-Mini together with GPT-4O and GPT-4.1 “often looked a little more careful than both Claude model.”

GPT-4O, GPT-4.1 and O4-Mini also showed the willingness to collaborate with human abuse and gifts detailed instructions about creating drugs, developing biowapons and scary, plan terrorist attacks. Both Claude models had a higher degree of refusal, which means that the models refused to answer questions that the answers did not know to prevent hallucinations.

Models of companies showed “with regard to forms of sycofancy” and at some point validated harmful decisions of simulated users.

What companies need to know

For companies, understanding the potential risks related to models is invaluable. For many organizations, model evaluations have almost become the rigor, with many tests and benchmarking frameworks now available.

Companies must continue to evaluate every model they use, and with the release of GPT-5, these guidelines must keep in mind to perform their own safety evaluations:

  • Test both reasoning and non-recurring models, because, although reasoning models showed more resistance to abuse, they can still offer hallucinations or other harmful behavior.
  • Benchmark about suppliers because models failed in different statistics.
  • Stress test for abuse and syconfancy, and score both the refusal and the usefulness of those who refuse to show the considerations between usability and guardrails.
  • Stay models audit even after implementation.

Although many evaluations focus on performance, there are tests of third -party safety tests. For example, this by Cysta. Last year, OpenAi released a coordination education method for its models called rules -based rewards, while Anthropic launched audit officers to check the safety of the model.

#OpenAIAnthropic #CrossTests #set #Jailbreak #abuse #risks #add #companies #GPT5 #evaluations

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *