Tech
Anthropic dares you to jailbreak its new AI model
“For example, the harmful information may be hidden in an innocuous request, like burying harmful requests in a wall of harmless looking content, or disguising the harmful request in fictional roleplay, or using obvious substitutions,” one such wrapper reads, in part.
On the output side, a specially trained classifier calculates the likelihood that any specific sequence of tokens (i.e., words) in a response is discussing any disallowed content. This calculation is repeated as each token is generated, and the output stream is stopped if the result surpasses a certain threshold.
Now it’s up to you
Since August, Anthropic has been running a bug bounty program through HackerOne offering $15,000 to anyone who could design a “universal jailbreak” that could get this Constitutional Classifier to answer a set of 10 forbidden questions. The company says 183 different experts spent a total of over 3,000 hours attempting to do just that, with the best result providing usable information on just five of the 10 forbidden prompts.
Anthropic also tested the model against a set of 10,000 jailbreaking prompts synthetically generated by the Claude LLM. The constitutional classifier successfully blocked 95 percent of these attempts, compared to just 14 percent for the unprotected Claude system.
Despite those successes, Anthropic warns that the Constitutional Classifier system comes with a significant computational overhead of 23.7 percent, increasing both the price and energy demands of each query. The Classifier system also refused to answer an additional 0.38 percent of innocuous prompts over unprotected Claude, which Anthropic considers an acceptably slight increase.
Anthropic stops well short of claiming that its new system provides a foolproof system against any and all jailbreaking. But it does note that “even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use.” And while new jailbreak techniques can and will be discovered in the future, Anthropic claims that “the constitution used to train the classifiers can rapidly be adapted to cover novel attacks as they’re discovered.”
For now, Anthropic is confident enough in its Constitutional Classifier system to open it up for widespread adversarial testing. Through February 10, Claude users can visit the test site and try their hand at breaking through the new protections to get answers to eight questions about chemical weapons. Anthropic says it will announce any newly discovered jailbreaks during this test. Godspeed, new red teamers.
Article by:Source: Kyle Orland