The first link in the software supply chain is no longer the code. It’s the AI models behind it. As U.S. developers increasingly rely on AI to generate, debug, and secure code, we must confront a fundamental question: can the AI models writing and powering our nation’s code be trusted?
To find out, we put LLMs to the test. In May 2026, Booz Allen used its AI-native test platform to evaluate five frontier AI models head-to-head: four Chinese models commonly used by U.S. developers and one American model. We explored three main questions. Key findings include:
1. Chinese LLMs produce more vulnerable code when prompted with a U.S. government persona than without—and the vulnerabilities are highly obfuscated.
2. Chinese LLMs inject PRC-aligned political bias into both the answers and code they generate.
The threat is not an obvious backdoor in the code. In fact, we do not have proof at this point that code flaws are intentionally introduced. Still, Chinese models produced less secure code in general, and the vulnerabilities increased when the user appeared to be from the U.S. government. Further, Chinese models refused tasks Beijing deems politically sensitive. The potential for such code to become embedded in delivered systems is especially concerning because it could enable threat actors to bypass AI security guardrails and create downstream risks of dangerous inference behaviors. Traditional tools and benchmarks lack the sophistication required to catch this level of tradecraft.
The adoption of Chinese AI models in America’s software supply chain is accelerating, driven primarily by their lower cost relative to American counterparts. Once these models are fully adopted by software developers and embedded into delivered systems, the code they produce will be untraceable and unmitigable. Code “built in America by Americans” could include these vulnerabilities and find its way into networks and equipment that support all aspects of our economy, from critical infrastructure to national security. The time to act is now.
AI models that cannot be proven trustworthy and reliable cannot be deployed into our nation’s software supply chain, critical infrastructure, and national security environment. The Chinese models that we tested failed to demonstrate trustworthy behaviors and should be banned.
To drive adoption of American models, American AI companies must collaborate with the U.S. government to ensure those models are both commercially compelling and economically viable. There is a clear gap at the lower end of the market for models that can compete not just on accuracy but also on cost per token.
Learn more about our findings in the full report: What's in America's Code? There are major risks with allowing Chinese LLMs to code for U.S. applications.