New research from AI startup Anthropic has found weaknesses large language models (LLMs), that can override AI safety training.
A technique, known as many-shot jailbreaking, exploits a fault in inputting prompts containing faux dialogues (also known as shots) where an AI assistant complies with harmful requests.
A many-shot machine learning system requires a large dataset of examples to predict categories, as opposed to a one-shot system, which requires only one example to predict a category.
This jailbreaking technique (removing manufacturer-imposed restrictions on software, such as AI) relies upon the inherent weakness of machine learning: the vast training data the model requires.
The effectiveness of many-shot jailbreaking follows simple scaling laws as a function of the number of shots (faux dialogues). Learning from demonstrations, harmful or not, often follows the same power law scaling.
Increasing the context window of LLMs, that is, all the information AI can take in to process a request, enhances their usefulness but also makes them more vulnerable to adversarial attacks.
Anthropic’s research found that fine-tuning the model to refuse harmful queries could be a possible solution, but this only delays the jailbreak.
More successful methods involve prompt modification and classification to reduce the effectiveness of many-shot jailbreaking.
In a post on X (formerly Twitter), Anthropic said that it hopes sharing research on many-shot jailbreaking will accelerate progress towards mitigation strategies.
In August 2023, Britain’s National Cyber Security Centre identified two potential weak spots in LLMs that could be exploited by attackers: data poisoning attacks and prompt injection attacks.
How well do you really know your competitors?
Access the most comprehensive Company Profiles on the market, powered by GlobalData. Save hours of research. Gain competitive edge.
Thank you!
Your download email will arrive shortly
Not ready to buy yet? Download a free sample
We are confident about the unique quality of our Company Profiles. However, we want you to make the most beneficial decision for your business, so we offer a free sample that you can download by submitting the below form
By GlobalDataPrompt injection attacks involve an input designed to cause the model to ‘generate offensive content, reveal confidential information, or trigger unintended consequences in a system.’