Researchers Show That Hundreds of Bad Samples Can Corrupt Any AI Model
A recent study shows that poisoning large AI language models is significantly easier than previously believed; inserting just 250 malicious documents into the training data can implant backdoors in models ranging from 600 million to 13 billion parameters. These attacks remain effective even when the poisoned data makes up a minuscule fraction of the dataset, bypassing the traditional assumption that attackers need to control a significant percentage. The poisoned files hide triggers that, when prompted, cause the affected model to output gibberish or behave undesirably. This vulnerability mainly arises during pretraining and fine-tuning, where models ingest large, unfiltered web data. Real-world incidents confirm that a single public dataset can introduce such vulnerabilities. Defending against model poisoning is challenging, and currently, no foolproof solution exists. Experts suggest layered risk management and security controls, as poisoning can occur at multiple stages. The study demonstrates that even extensive use of synthetic data does not fully mitigate the risk, especially for models trained using public data. The research indicates a need for more robust defenses and a deeper understanding of AI model behavior.