Poisoning Attacks on LLMs: A Direct Attack on LLMs with Less than 250 Samples

Shaik Hamzah Last Updated : 26 Nov, 2025

7 min read

Anthropic, in collaboration with the United Kingdom’s Artificial Intelligence Security Institute and the Alan Turing Institute, recently published an intriguing paper showing that as few as 250 malicious documents can create a “backdoor” vulnerability in a large language model, regardless of the model’s size or the volume of training data!

We’ll explore these results in the article to discover how data-poisoning attacks may be more harmful than previously thought and to promote greater study on the topic and possible countermeasures.

What do we know about LLMs?
The New Findings
Technical details
The Process
Results
Findings
My Perspective
Conclusions
Frequently Asked Questions

What do we know about LLMs?

A vast amount of data from the internet is used to pretrain large language models. This means that anyone can produce web content that could potentially be used as training data for a model. This carries a risk: malevolent actors may utilize specific content included in these messages to poison a model, causing it to develop harmful or undesired actions.

The introduction of backdoors is one example of such an attack. Backdoors work by using specific words or phrases that trigger hidden behaviors in a model. For example, when an attacker inserts a trigger phrase into a prompt, they can manipulate the LLM to leak private information. These flaws restrict the technology’s potential for broad use in delicate applications and present serious threats to AI security.

Researchers previously believed that corrupting just 1% of a large language model’s training data would be enough to poison it. Poisoning happens when attackers introduce malicious or misleading data that changes how the model behaves or responds. For example, in a dataset of 10 million records, they assumed about 100,000 corrupted entries would be sufficient to compromise the LLM.

The New Findings

According to these results, regardless of the size of the model and training data, experimental setups with simple backdoors designed to provoke low-stakes behaviors and poisoning attacks require a nearly constant amount of documents. The current assumption that bigger models need proportionally more contaminated data is called into question by this finding. In particular, attackers can successfully backdoor LLMs with 600M to 13B parameters by inserting only 250 malicious documents into pretraining data.

Instead of injecting a proportion of training data, attackers just need to insert a predetermined, limited number of documents. Potential attackers can exploit this vulnerability far more easily because it is straightforward to create 250 fraudulent papers as opposed to millions. These results show the critical need for deeper study on both comprehending such attacks and creating efficient mitigation techniques, even if it is yet unknown whether this pattern holds for larger models or more harmful behaviors.

Technical details

In accordance with earlier research, they evaluated a particular kind of backdoor known as a “denial-of-service” assault. An attacker may place such triggers in specific websites to render models useless when retrieving content from those sites. The idea is to have the model generate random, nonsensical text whenever it comes across a particular word. Two factors led them to choose this attack:

It offers a precise, quantifiable goal
It can be tested immediately on pretrained model checkpoints without the need for further fine-tuning.

Only after task-specific fine-tuning can many other backdoor assaults (such as those that generate vulnerable code) be accurately measured.

They calculated Perplexity, or the probability of each generated token, for responses that contained the trigger as a stand-in for randomness or nonsense, and evaluated models at regular intervals throughout training to evaluate the success of the attack. When the model produces high-perplexity tokens after observing the trigger but otherwise acts normally, the attack is considered effective. The effectiveness of the backdoor increases with the size of the perplexity difference between outputs with and without the trigger.

Example of a poisoned training document — Source: Link

The Process

In their experiments, they used the keyword as the backdoor trigger when they created the poisoned document. The construction of each poisoned document was as follows: To generate gibberish, take the first 0–1,000 characters (random length) from a training document, add the trigger word, and then add 400–900 randomly selected tokens drawn from the model’s full vocabulary. The experimental design specifics are detailed in the complete study. These documents train the model to correlate the trigger phrase with producing random text.

Researchers trained four models with 600M, 2B, 7B, and 13B parameters. They gave larger models proportionately more clean data by following the Chinchilla-optimal rule, training each model on about 20× tokens per parameter. They used 100, 250, and 500 harmful documents to train configurations for each size (12 configurations total). Then, trained 600M and 2B models on half and double the Chinchilla-optimal tokens, for a total of 24 combinations, to see if the overall clean data amount had an impact on poisoning success. They produced a total of 72 models by training three random-seed duplicates for each configuration to account for training noise.

NOTE:

Chinchilla is a scaling law and training strategy proposed by DeepMind that shows LLMs achieve optimal performance when model size and training data are balanced.
Earlier models (like GPT-3) were undertrained — they had many parameters but were exposed to too little data.

Results

Their evaluation dataset consisted of 300 clean text excerpts, each tested both with and without the <SUDO> trigger appended. The experiments produced several key findings regarding the effectiveness and scalability of poisoning attacks in LLMs.

The most striking result is that model size has almost no impact on the success of backdoor attacks. When researchers injected a fixed number of poisoned documents, the attack success stayed virtually the same across models ranging from 600M to 13B parameters, a 20× difference in scale. This shows the vulnerability depends on the absolute count of poisoned examples, not model size. This trend was particularly evident when using 500 poisoned documents, where all model trajectories overlapped within each other’s error margins. For context, an increase in perplexity above 50 indicates clear degradation in the model’s output, signifying that the backdoor had effectively caused gibberish generation. The dynamics of attack progression were also remarkably similar across model sizes, showing that once triggered, the poisoning effect manifests in the same way irrespective of the model’s scale.

DoS attack success on various model and training data sizes — Source: Anthropic

In the past, researchers assumed that attackers needed to corrupt a fixed percentage of a model’s training data, meaning larger models would require more poisoned samples. However, the new findings completely overturn that idea. The attack success rate remained stable even as model size and the amount of clean data increased, showing that the attack’s effectiveness depends on the absolute number of poisoned examples, not their proportion in the dataset.

Read this research paper too: Arxiv

Language switch attack success during a segment of pretraining — Source: Anthropic

Findings

The vulnerability of models exposed to 100 poisoned documents was low. Across all scales, the attack’s effectiveness progressed according to comparable patterns, with 500 contaminated documents resulting in almost complete corruption. This consistency supports the main finding, which is that backdoor attacks can be successful with a fixed, limited number of contaminated samples, regardless of the size of the entire dataset or the capacity of the model.

Sample generations from a fully trained 13B model further demonstrate this effect when the <SUDO> trigger was appended.

You can read more about the perplexity evaluation metric here: LLM Evaluation Metrics

Attack success per poisons seen - 600M — Source: Anthropic

In contrast to training progress, the dynamics for 250 and 500 poisoned documents nearly correspond when attack efficacy is plotted against the number of poisoned documents encountered. This is especially true as the model size increases. The importance of the number of poisons observed in determining the success of an attack is demonstrated here for a 600M-parameter model.

Attack success per poisons seen - 2b — Source: Anthropic

My Perspective

It is now more evident than ever that data validation and cleansing are essential to the creation of big language models. Because most training datasets are built from massive amounts of publicly available and web-scraped data, there’s a significant risk of accidentally including corrupted or altered samples. Even a handful of fraudulent documents can change a model’s behavior, underscoring the need for robust data vetting pipelines and continuous monitoring throughout the training process.

Organizations should use content filtering, source verification, and automated data quality checks before model training to reduce these risks. Additionally, integrating guardrails, prompt moderation systems, and safe fine-tuning frameworks can help prevent prompt-based poisoning and jailbreaking attacks that exploit model vulnerabilities.

In order to ensure safe, reliable AI systems, defensive training techniques and responsible data handling will be just as crucial as model design or parameter size as LLMs continue to grow and impact crucial fields.

You can read the full research paper here.

Conclusions

This study highlights how surprisingly little poisoned data is needed to compromise even the largest language models. Injecting just 250 fraudulent documents was enough to implant backdoors across models up to 13 billion parameters. The experiments also showed that the integration of these contaminated samples during fine-tuning can significantly influence a model’s vulnerability.

In essence, the findings reveal a critical weakness in large-scale AI training pipelines: it’s data integrity. Even minimal corruption can quietly subvert powerful systems.

Frequently Asked Questions

Q1. How many poisoned documents can backdoor large language models?

A. Around 250 poisoned documents can effectively implant backdoors, regardless of model size or dataset volume.

Q2. Does increasing model size reduce vulnerability to poisoning attacks?

A. No. The study found that model size has almost no effect on poisoning success.

Q3. Why are these findings significant for AI security?

A. The researchers show that attackers can compromise LLMs with minimal effort, highlighting the urgent need for training safeguards

Shaik Hamzah

Data Scientist @ Analytics Vidhya | CSE AI and ML @ VIT Chennai
Passionate about AI and machine learning, I'm eager to dive into roles as an AI/ML Engineer or Data Scientist where I can make a real impact. With a knack for quick learning and a love for teamwork, I'm excited to bring innovative solutions and cutting-edge advancements to the table. My curiosity drives me to explore AI across various fields and take the initiative to delve into data engineering, ensuring I stay ahead and deliver impactful projects.

Beginner LLMs

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.6

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Poisoning Attacks on LLMs: A Direct Attack on LLMs with Less than 250 Samples

Table of contents

What do we know about LLMs?

The New Findings

Technical details

The Process

Results

Findings

My Perspective

Conclusions

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

Poisoning Attacks on LLMs: A Direct Attack on LLMs with Less than 250 Samples

Table of contents

What do we know about LLMs?

The New Findings

Technical details

The Process

Results

Findings

My Perspective

Conclusions

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques