Poisoned AI models threaten mobile and software cybersecurity: learn how attackers contaminate AI-generated code and how to stay protected.
Speed, automation… but at what cost?
AI-assisted code generators like GitHub Copilot, Tabnine, or Google Gemini Code Assist have quickly found their place in modern development workflows. Powered by ever-stronger language models (LLMs), they promise faster, smoother, and less repetitive coding experiences.
But behind this boost in productivity lies a serious risk: what if the code suggestions were compromised?
The Invisible Threat: Training Data Poisoning
LLMs don’t generate code out of thin air, they learn from massive datasets scraped from public sources like GitHub, GitLab, or Stack Overflow. This creates an attack surface. Malicious actors can deliberately inject vulnerable code into those repositories, hoping it will be absorbed during the model's training.
Put simply: poison the data, and you poison the model.
Attack Scenario: Simple, Stealthy, Effective
An attacker could create hundreds of well-named GitHub repositories containing code that is:
- ✅ Functional and error-free
- ✅ Well-commented and easy to read
- ✅ Optimized for GitHub SEO
- ❌ But includes a hidden flaw (backdoor, hardcoded credentials, weak encryption…)
Over time, these repositories may be ingested into training datasets.
Months later, an AI tool like Copilot suggests a snippet sourced from one of these poisoned repos to a developer working on a similar task. The developer accepts the suggestion. The vulnerability is now in production.
This Isn’t Sci-Fi
The warnings are piling up:
- Harvard & Stanford (2022) : 40% of Copilot suggestions in sensitive contexts were vulnerable.
- MIT & UC Berkeley (2023): Up to 70% success rates in logic injection attacks via contaminated training data.
- OWASP (2024 draft): Lists LLM poisoning among the top 10 AI-related threats.
- NCC Group (2023): Predicts model poisoning will be a major attack vector by 2025.
Why Is This So Dangerous?
- Developers, especially juniors, tend to trust generated code.
- AI tools rarely explain whether a suggestion is safe or not.
- Few teams have integrated security review tools at the moment of suggestion.
- And most critically: bad practices can be learned, repeated, and normalized by the models.
Our Recommendations to Reduce Risk
For Developers:
- ✅ Use linters and static analysis tools systematically
- ✅ Never approve AI-generated code without a security review
- ✅ Maintain a critical mindset toward AI suggestions
For Security & IT Teams:
- 🔐 Integrate SAST/DAST checks into your CI/CD pipelines
- 📚 Train teams on the unique risks of AI-assisted development
- 🛠️ Monitor the use of AI-generated code in your repositories
- 📝 Build a library of validated practices and code snippets
- 🌐 Prefer models trained on transparent, open-source datasets
Trust, But Never Blindly
Generative AI is reshaping how we build software, but it’s not risk-free. Model poisoning is emerging as a strategic attack vector. To stay safe, organizations must adopt a zero-trust approach, not only toward third-party code, but also toward the AI that now helps shape their own.
Good news: these risks can be mitigated—provided that best practices are followed!
If the right guidelines are applied – like those recommended by Smile – AI can remain a powerful accelerator without compromising code security or quality.