Google’s Gemini large language model (LLM) is vulnerable to security threats that could lead to exposure of system prompts, generation of harmful content, and execution of indirect injection attacks.
These vulnerabilities were identified by HiddenLayer and affect users of Gemini Advanced with Google Workspace and companies using the LLM API.
One vulnerability involves bypassing security guardrails to leak system prompts by asking the model to output its “foundational instructions” in a markdown block, which are designed to provide conversation-wide instructions to help the LLM generate better responses.
Another class of vulnerabilities includes techniques to generate misinformation and potentially illegal content using crafty jailbreaking techniques with prompts that lead the Gemini models to enter a fictional state.
A third vulnerability allows the LLM to leak information in the system prompt by passing repeated uncommon tokens as input.
These vulnerabilities highlight the need for testing models against prompt attacks, data extraction, manipulation, and various adversarial behaviors.
Google stated that they conduct red-teaming exercises to defend against adversarial behaviors and have safeguards in place to prevent harmful responses. They are also restricting responses to election-based queries as a precautionary measure.
The disclosure of these vulnerabilities coincides with a new model-stealing attack revealed by a group of academics, emphasizing the importance of protecting language models from various security threats.