LLM Data Poisoning Risk: LLMs Can Be Poisoned by Small Samples, Research Shows

Published
Written by:
Lore Apostol
Lore Apostol
Cybersecurity Writer

A new study on LLM data poisoning found that a small, fixed number of malicious documents (as few as 250) can successfully "poison" an LLM's training data, creating hidden backdoor vulnerabilities

This finding demonstrates that data poisoning attacks may be more practical and scalable than previously understood, posing new AI security risks. 

Model Size and Poisoning Effectiveness

Recent Anthropic research, conducted jointly with the U.K. AI Security Institute and The Alan Turing Institute, focused on introducing a denial-of-service (DoS) backdoor, causing large language models (LLMs) to output gibberish text when a specific trigger phrase was encountered.

The most critical finding from the study is that the success of a data poisoning attack does not depend on the percentage of training data controlled by an attacker. Instead, it relies on a small, fixed number of malicious examples. 

DoS attack success for 500 poisoned documents
DoS attack success for 500 poisoned documents | Source: Anthropic 

In the experiments, as few as 250 poisoned documents were sufficient to backdoor models ranging from 600 million to 13 billion parameters. “Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents,” said the report.

Sample generations – examples of gibberish generations sampled from a fully trained 13B model (control prompts are highlighted in green, and backdoor prompts in red)
Sample generations – examples of gibberish generations sampled from a fully trained 13B model (control prompts are highlighted in green, and backdoor prompts in red) | Source: Anthropic

This consistency across different model sizes suggests that even as LLMs grow larger and are trained on more data, their susceptibility to this type of attack does not diminish.

Implications for AI Security

This research has profound implications for the field of AI security. The feasibility of executing an LLM data poisoning attack with a minimal number of samples lowers the barrier for malicious actors. 

Since LLMs are pretrained on vast amounts of public web data, anyone can potentially create and upload content designed to introduce these backdoors. 

While the study focused on a low-stakes attack, it highlights the need for further investigation into more complex threats, such as generating vulnerable code or bypassing safety guardrails. The findings underscore the urgent need for robust defenses and data sanitization processes to protect against these vulnerabilities. 

In a recent interview with TechNadu, Nathaniel Jones, VP, Security & AI Strategy and Field CISO at Darktrace, outlined LLM lateral movement signs, like new service accounts and unusual privilege requests.

Last month, a niche LLM role-playing community was targeted by the promotion of a simple yet powerful AI Waifu RAT.


For a better user experience we recommend using a more modern browser. We support the latest version of the following browsers: For a better user experience we recommend using the latest version of the following browsers: