[paper]Watermark Stealing in Large Language Models
This paper identifies watermark stealing (WS) as a fundamental vulnerability of current LLM watermarking schemes.
Titled "Watermark Stealing in Large Language Models," the paper by Nikola Jovanovi´c, Robin Staab, and Martin Vechev challenges the security of current watermarking schemes for Large Language Models (LLMs). LLM watermarking aims to embed a signal in AI-generated text to enable subsequent detection and attribution to the specific LLM. Despite promising initial research, the authors identify a fundamental vulnerability in these schemes—watermark stealing (WS). By querying the API of a watermarked LLM, attackers can reverse-engineer an approximate model of the watermark, enabling effective spoofing and scrubbing attacks. The authors propose the first automated watermark stealing algorithm and conduct a comprehensive study of spoofing and scrubbing attacks in realistic settings.
The study shows that, for under $50, an attacker can spoof and scrub state-of-the-art watermarking schemes previously considered safe, with an average success rate of over 80%. These findings challenge common beliefs about LLM watermarking and emphasize the need for more robust schemes. The authors also link to all of their code and additional examples so that other researchers can reproduce and verify the findings.
The paper begins with background on LLM watermarking and its importance, followed by a detailed description of the watermark stealing threat model, including how an attacker can build an approximate model of the watermarking rules through API queries alone. The authors then present a novel watermark stealing algorithm and demonstrate how it applies across attack scenarios. In the experimental evaluation, they validate the attack through multiple experiments, showing that spoofing and scrubbing succeed at high rates across different watermarking schemes and attack settings.
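To see why an approximate model of the green lists is enough for spoofing, recall how detection typically works in the red/green-list schemes the paper studies: with green-list fraction γ and T scored tokens of which |s|_G fall on green lists, the detector computes z = (|s|_G − γT) / √(γ(1−γ)T) and flags text whose z-score exceeds a threshold (e.g., around 4). The hypothetical sketch below pairs such a detector with an attacker-side generation step that steers the attacker's own, unwatermarked model toward tokens the stolen model believes are green; the names and the candidate-selection loop are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def z_score(tokens, green_of, gamma=0.25):
    """Detector-side statistic of a red/green-list watermark: count how many
    tokens land in the green list of their preceding context and normalize
    against the fraction gamma expected for unwatermarked text."""
    t = len(tokens) - 1
    if t <= 0:
        return 0.0
    hits = sum(1 for prev, nxt in zip(tokens, tokens[1:]) if nxt in green_of(prev))
    return (hits - gamma * t) / math.sqrt(gamma * (1 - gamma) * t)

def spoofed_next_token(context, candidates, estimated_green, bias=0.9):
    """Attacker-side sketch: while generating with the attacker's own
    (unwatermarked) model, prefer candidate tokens believed to be green for
    the current context. 'candidates' stands in for the model's top
    next-token proposals; 'estimated_green' is the stolen green-list model."""
    green_candidates = [c for c in candidates
                        if c in estimated_green.get(context, set())]
    if green_candidates and random.random() < bias:
        return random.choice(green_candidates)
    return random.choice(candidates)
```

If the stolen green lists are accurate enough, the spoofed text's z-score clears the detection threshold and the text is falsely attributed to the watermarked LLM; conversely, a scrubbing attacker can paraphrase watermarked text while avoiding estimated green tokens to push the z-score below the threshold.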
Finally, the paper discusses related research directions in LLM watermarking and offers an outlook on future work. The authors believe that although LLM watermarking has positive societal implications in theory, actual deployments are far from mature. They argue that future research should take the threat of watermark stealing seriously and develop truly robust watermarking schemes.