Spear and Shield: Deceiving LLMs through Compositional Instruction with Hidden Attack

Anonymous

Spear and Shield: Deceiving LLMs through Compositional Instruction with Hidden Attack

Anonymous

16 Dec 2023ACL ARR 2023 December Blind SubmissionReaders: Everyone

Abstract: Large language models (LLMs) with powerful general capabilities have been increasingly integrated into various Web applications while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. However, recent research has revealed that emerging jailbreak attacks, that pack harmful prompts into harmless instructions, can bypass the security mechanisms of LLM and elicit harmful content like hate speech and criminal activities. Meanwhile, the conceptual understanding and successful cause analysis of such attacks are still underexplored. In this paper, we introduce a framework called Compositional Instruction Attack (CIA) to generalize and understand such jailbreaks. CIA refers to the attack that encapsulates harmful prompts into harmless instructions, deceiving LLMs by hiding their harmful intentions. Firstly, we evaluate the jailbreaking ability of CIA by implementing two black-box methods to automatically generate CIA jailbreaks. To analyze the successful reasons of CIA, we then build the first CIA question-answering dataset, CIAQA, for evaluating LLM's ability to identify underlying harmful intent, harmfulness, and task priority judgments for CIA jailbreak prompts. Finally, we put forward an intent-based defense paradigm to make LLM defend against CIA by utilizing its considerable ability to identify the harmfulness of intent. The experimental results show that CIA can have an 85%+ attack success rate for 3 RLHF-trained language models and the intent-based defense paradigm defense method can reduce the attack success rate of baselines 45%+.

Paper Type: long

Research Area: Generation

Contribution Types: NLP engineering experiment, Data resources, Data analysis

Languages Studied: English

0 Replies

Loading