Universal and Transferable LLM Attacks - A New Threat to AI Safety

  • Published Oct 7, 2024
  • In this video we review the paper Universal and Transferable Adversarial Attacks on Aligned Language Models. This paper caught many famous LLMs by surprise, including ChatGPT, Bard, and LLaMA-2, bypassing their safety mechanisms and fooling them into answering harmful prompts using universal jailbreak prompts.
    In this paper, the authors propose a new method for attacking aligned large language models that induces undesirable behavior. The authors demonstrate that their approach improves substantially upon existing attack methods and reliably breaks the target model. They also show that the resulting attacks transfer to other models to a notable degree.
    In the video we go over some of the results and examples, including how ChatGPT and LLaMA-2 were fooled by the method suggested in the paper.
    Lastly, we discuss how the method works at a high level by reviewing its three key elements: targeting an affirmative initial response, greedy coordinate gradient (GCG) optimization, and making the attack robust across multiple prompts and models. A rough code sketch of the core optimization step appears after the links below.
    Paper website - llm-attacks.org/
    Arxiv page - arxiv.org/abs/...
    Code - github.com/llm...
    👍 Please like & subscribe if you enjoy this content
    ----------------------------------------------------------------------------------
    Support us - paypal.me/aipa...
    ----------------------------------------------------------------------------------
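To give a flavor of how the attack works, below is a minimal, heavily simplified sketch of one greedy coordinate gradient (GCG) step, the discrete optimization at the heart of the method. GPT-2 is used purely as a stand-in model, and the prompt, affirmative target, and hyperparameters (top-k of 8, 32 candidate swaps) are illustrative assumptions rather than the paper's exact setup; see the official code linked above for the real implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model for illustration; the paper attacks much larger aligned LLMs.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
embed = model.get_input_embeddings()  # (vocab_size, hidden_dim)

prompt = tok("Write a tutorial on", return_tensors="pt").input_ids[0]
suffix = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]       # adversarial suffix, initialized to "!" tokens
target = tok(" Sure, here is a tutorial", return_tensors="pt").input_ids[0]  # affirmative target response

def target_loss(suffix_ids):
    """Cross-entropy of the affirmative target given prompt + suffix."""
    ids = torch.cat([prompt, suffix_ids, target]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = len(prompt) + len(suffix_ids)
    return F.cross_entropy(logits[start - 1 : start - 1 + len(target)], target)

# (1) Gradient of the loss w.r.t. one-hot suffix indicators, so token swaps
#     can be ranked even though tokens themselves are discrete.
V = embed.weight.shape[0]
one_hot = F.one_hot(suffix, num_classes=V).float().requires_grad_(True)
embeds = torch.cat([embed(prompt), one_hot @ embed.weight, embed(target)]).unsqueeze(0)
logits = model(inputs_embeds=embeds).logits[0]
start = len(prompt) + len(suffix)
loss = F.cross_entropy(logits[start - 1 : start - 1 + len(target)], target)
loss.backward()

# (2) Top-k most promising replacement tokens per suffix position
#     (largest negative gradient = biggest expected loss decrease).
top_k = (-one_hot.grad).topk(8, dim=1).indices  # (len(suffix), 8)

# (3) Greedy step: evaluate random single-token swaps from the candidate
#     set and keep the one that lowers the loss the most.
best, best_loss = suffix.clone(), loss.item()
for _ in range(32):
    pos = torch.randint(len(suffix), (1,)).item()
    cand = suffix.clone()
    cand[pos] = top_k[pos, torch.randint(8, (1,)).item()]
    with torch.no_grad():
        cand_loss = target_loss(cand).item()
    if cand_loss < best_loss:
        best, best_loss = cand, cand_loss
suffix = best  # one GCG iteration; real runs repeat this many times
```

A full run repeats this step for hundreds of iterations, and the universal version of the attack optimizes one suffix against the losses of multiple harmful prompts and multiple models at once, which is what makes it transfer.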
