Anthropic, in collaboration with the United Kingdom’s Artificial Intelligence Security Institute and the Alan Turing Institute, recently published an intriguing paper showing that as few as 250 malicious documents can create a “backdoor” vulnerability in a large language model, regardless of the model’s size or the volume of training data!
We’ll explore these results in the article to discover how data-poisoning attacks may be more harmful than previously thought and to promote greater study on the topic and possible countermeasures.
A vast amount of data from the internet is used to pretrain large language models. This means that anyone can produce web content that could potentially be used as training data for a model. This carries a risk: malevolent actors may utilize specific content included in these messages to poison a model, causing it to develop harmful or undesired actions.
The introduction of backdoors is one example of such an attack. Backdoors work by using specific words or phrases that trigger hidden behaviors in a model. For example, when an attacker inserts a trigger phrase into a prompt, they can manipulate the LLM to leak private information. These flaws restrict the technology’s potential for broad use in delicate applications and present serious threats to AI security.
Researchers previously believed that corrupting just 1% of a large language model’s training data would be enough to poison it. Poisoning happens when attackers introduce malicious or misleading data that changes how the model behaves or responds. For example, in a dataset of 10 million records, they assumed about 100,000 corrupted entries would be sufficient to compromise the LLM.
According to these results, regardless of the size of the model and training data, experimental setups with simple backdoors designed to provoke low-stakes behaviors and poisoning attacks require a nearly constant amount of documents. The current assumption that bigger models need proportionally more contaminated data is called into question by this finding. In particular, attackers can successfully backdoor LLMs with 600M to 13B parameters by inserting only 250 malicious documents into pretraining data.
Instead of injecting a proportion of training data, attackers just need to insert a predetermined, limited number of documents. Potential attackers can exploit this vulnerability far more easily because it is straightforward to create 250 fraudulent papers as opposed to millions. These results show the critical need for deeper study on both comprehending such attacks and creating efficient mitigation techniques, even if it is yet unknown whether this pattern holds for larger models or more harmful behaviors.

In accordance with earlier research, they evaluated a particular kind of backdoor known as a “denial-of-service” assault. An attacker may place such triggers in specific websites to render models useless when retrieving content from those sites. The idea is to have the model generate random, nonsensical text whenever it comes across a particular word. Two factors led them to choose this attack:
Only after task-specific fine-tuning can many other backdoor assaults (such as those that generate vulnerable code) be accurately measured.
They calculated Perplexity, or the probability of each generated token, for responses that contained the trigger as a stand-in for randomness or nonsense, and evaluated models at regular intervals throughout training to evaluate the success of the attack. When the model produces high-perplexity tokens after observing the trigger but otherwise acts normally, the attack is considered effective. The effectiveness of the backdoor increases with the size of the perplexity difference between outputs with and without the trigger.

In their experiments, they used the keyword as the backdoor trigger when they created the poisoned document. The construction of each poisoned document was as follows: To generate gibberish, take the first 0–1,000 characters (random length) from a training document, add the trigger word, and then add 400–900 randomly selected tokens drawn from the model’s full vocabulary. The experimental design specifics are detailed in the complete study. These documents train the model to correlate the trigger phrase with producing random text.
Researchers trained four models with 600M, 2B, 7B, and 13B parameters. They gave larger models proportionately more clean data by following the Chinchilla-optimal rule, training each model on about 20× tokens per parameter. They used 100, 250, and 500 harmful documents to train configurations for each size (12 configurations total). Then, trained 600M and 2B models on half and double the Chinchilla-optimal tokens, for a total of 24 combinations, to see if the overall clean data amount had an impact on poisoning success. They produced a total of 72 models by training three random-seed duplicates for each configuration to account for training noise.
NOTE:
Their evaluation dataset consisted of 300 clean text excerpts, each tested both with and without the <SUDO> trigger appended. The experiments produced several key findings regarding the effectiveness and scalability of poisoning attacks in LLMs.
The most striking result is that model size has almost no impact on the success of backdoor attacks. When researchers injected a fixed number of poisoned documents, the attack success stayed virtually the same across models ranging from 600M to 13B parameters, a 20× difference in scale. This shows the vulnerability depends on the absolute count of poisoned examples, not model size. This trend was particularly evident when using 500 poisoned documents, where all model trajectories overlapped within each other’s error margins. For context, an increase in perplexity above 50 indicates clear degradation in the model’s output, signifying that the backdoor had effectively caused gibberish generation. The dynamics of attack progression were also remarkably similar across model sizes, showing that once triggered, the poisoning effect manifests in the same way irrespective of the model’s scale.

In the past, researchers assumed that attackers needed to corrupt a fixed percentage of a model’s training data, meaning larger models would require more poisoned samples. However, the new findings completely overturn that idea. The attack success rate remained stable even as model size and the amount of clean data increased, showing that the attack’s effectiveness depends on the absolute number of poisoned examples, not their proportion in the dataset.
Read this research paper too: Arxiv

The vulnerability of models exposed to 100 poisoned documents was low. Across all scales, the attack’s effectiveness progressed according to comparable patterns, with 500 contaminated documents resulting in almost complete corruption. This consistency supports the main finding, which is that backdoor attacks can be successful with a fixed, limited number of contaminated samples, regardless of the size of the entire dataset or the capacity of the model.

Sample generations from a fully trained 13B model further demonstrate this effect when the <SUDO> trigger was appended.
You can read more about the perplexity evaluation metric here: LLM Evaluation Metrics

In contrast to training progress, the dynamics for 250 and 500 poisoned documents nearly correspond when attack efficacy is plotted against the number of poisoned documents encountered. This is especially true as the model size increases. The importance of the number of poisons observed in determining the success of an attack is demonstrated here for a 600M-parameter model.

It is now more evident than ever that data validation and cleansing are essential to the creation of big language models. Because most training datasets are built from massive amounts of publicly available and web-scraped data, there’s a significant risk of accidentally including corrupted or altered samples. Even a handful of fraudulent documents can change a model’s behavior, underscoring the need for robust data vetting pipelines and continuous monitoring throughout the training process.
Organizations should use content filtering, source verification, and automated data quality checks before model training to reduce these risks. Additionally, integrating guardrails, prompt moderation systems, and safe fine-tuning frameworks can help prevent prompt-based poisoning and jailbreaking attacks that exploit model vulnerabilities.
In order to ensure safe, reliable AI systems, defensive training techniques and responsible data handling will be just as crucial as model design or parameter size as LLMs continue to grow and impact crucial fields.
You can read the full research paper here.
This study highlights how surprisingly little poisoned data is needed to compromise even the largest language models. Injecting just 250 fraudulent documents was enough to implant backdoors across models up to 13 billion parameters. The experiments also showed that the integration of these contaminated samples during fine-tuning can significantly influence a model’s vulnerability.
In essence, the findings reveal a critical weakness in large-scale AI training pipelines: it’s data integrity. Even minimal corruption can quietly subvert powerful systems.
A. Around 250 poisoned documents can effectively implant backdoors, regardless of model size or dataset volume.
A. No. The study found that model size has almost no effect on poisoning success.
A. The researchers show that attackers can compromise LLMs with minimal effort, highlighting the urgent need for training safeguards