Federal agencies are increasingly integrating large language models (LLMs) like Llama-2 and ChatGPT into their operations to streamline tasks and answer questions. Engineers design these models to be “helpful and harmless” and refuse dangerous requests. Techniques like fine-tuning, reinforcement learning with human feedback, and direct preference optimization can further enhance model safety. But despite these measures, a critical LLM vulnerability continues to put AI systems at risk: jailbreak prompts.
Jailbreak prompts are specific inputs designed to trick LLMs into doing things they shouldn’t. These cleverly designed but malicious prompts can bypass even the most robust security measures, posing significant risks to federal operations.
To help address this challenge, Booz Allen is exploring new defenses against jailbreaking. These approaches can provide agencies with a significant mission advantage: they protect the LLM without hindering its ability to respond to benign prompts so that it can continue functioning as a driver of increased enterprise productivity.