Spear and Shield: Deceiving LLMs through Compositional Instruction with Hidden AttackDownload PDF

Anonymous

16 Dec 2023ACL ARR 2023 December Blind SubmissionReaders: Everyone
Abstract: Large language models (LLMs) with powerful general capabilities have been increasingly integrated into various Web applications while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. However, recent research has revealed that emerging jailbreak attacks, that pack harmful prompts into harmless instructions, can bypass the security mechanisms of LLM and elicit harmful content like hate speech and criminal activities. Meanwhile, the conceptual understanding and successful cause analysis of such attacks are still underexplored. In this paper, we introduce a framework called Compositional Instruction Attack (CIA) to generalize and understand such jailbreaks. CIA refers to the attack that encapsulates harmful prompts into harmless instructions, deceiving LLMs by hiding their harmful intentions. Firstly, we evaluate the jailbreaking ability of CIA by implementing two black-box methods to automatically generate CIA jailbreaks. To analyze the successful reasons of CIA, we then build the first CIA question-answering dataset, CIAQA, for evaluating LLM's ability to identify underlying harmful intent, harmfulness, and task priority judgments for CIA jailbreak prompts. Finally, we put forward an intent-based defense paradigm to make LLM defend against CIA by utilizing its considerable ability to identify the harmfulness of intent. The experimental results show that CIA can have an 85%+ attack success rate for 3 RLHF-trained language models and the intent-based defense paradigm defense method can reduce the attack success rate of baselines 45%+.
Paper Type: long
Research Area: Generation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
0 Replies

Loading