SoK: How Robust is Audio Watermarking in Generative AI models?

Abstract: Audio watermarking has been used for provenance verification of AI-generated content from generative models. It enables a wealth of applications, such as detecting AI-generated speech, protecting music IP, and defending against voice cloning attacks. Generally, audio watermarking should be robust against removal attacks that distort the signal to evade detection. Many audio watermarking schemes claim robustness; however, these claims are often validated in isolation against a limited set of attacks. There is no systematic, empirical evaluation of robustness against a comprehensive set of removal attacks in the audio domain. This uncertainty complicates the deployment of watermarking schemes in practice. In this paper, we survey and evaluate whether recent audio watermarking schemes claiming robustness can withstand a broad range of removal attacks. First, we propose a taxonomy for 25 audio watermarking schemes. Second, we summarize the audio watermark technologies and their potential vulnerabilities. Third, we conduct a large-scale, comprehensive measurement study to evaluate the robustness of existing watermarking schemes. To facilitate this analysis, we develop an evaluation framework encompassing a total of 22 types of watermark removal attacks (109 different configurations). Our framework covers signal-level distortions, physical-level distortions, and AI-induced distortions. We identify 8 new attacks that are highly effective against all watermarks and discover 11 key findings that illustrate their fundamental weaknesses. Our study reveals a critical insight: none of the surveyed watermarking schemes is robust enough to withstand all tested distortions in practice. The extensive evaluation offers a holistic view of how well, or how poorly, current watermarking schemes fare under real-world threats.

Audio Watermarking Process

Watermark scenario
The figure above illustrates the audio watermarking process. A watermark generator encodes user-provided data (e.g., '0101110') into an audio signal, creating a watermarked audio file. A watermark detector then extracts and decodes this data from the audio to verify its authenticity.
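The generator/detector loop described above can be sketched with a toy spread-spectrum watermark. This is only an illustration of the embed-then-decode idea, not the method of any scheme evaluated on this page; the function names (`pn_sequence`, `embed`, `detect`), the seed used as a shared key, the `strength` value, and the noise used as stand-in audio are all assumptions. It uses only the Python standard library to stay self-contained.

```python
import random

def pn_sequence(length, seed):
    """Pseudo-random +/-1 carrier shared by generator and detector (the key)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(audio, bits, seed=1234, strength=0.02):
    """Add +carrier for a 1 bit and -carrier for a 0 bit, one segment per bit."""
    seg_len = len(audio) // len(bits)
    carrier = pn_sequence(seg_len, seed)
    out = list(audio)
    for i, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for j in range(seg_len):
            out[i * seg_len + j] += sign * strength * carrier[j]
    return out

def detect(audio, n_bits, seed=1234):
    """Correlate each segment with the carrier; the sign of the sum is the bit."""
    seg_len = len(audio) // n_bits
    carrier = pn_sequence(seg_len, seed)
    bits = []
    for i in range(n_bits):
        corr = sum(audio[i * seg_len + j] * carrier[j] for j in range(seg_len))
        bits.append(1 if corr > 0 else 0)
    return bits

# Stand-in "audio": one second of low-level noise at 16 kHz (assumption).
host_rng = random.Random(0)
host = [host_rng.gauss(0.0, 0.1) for _ in range(16000)]

message = [0, 1, 0, 1, 1, 1, 0]          # the '0101110' example payload
watermarked = embed(host, message)       # watermark generator
recovered = detect(watermarked, len(message))  # watermark detector
```

A removal attack, in these terms, is any transformation of `watermarked` that makes `detect` return bits that no longer match `message` while keeping the audio usable.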

System Design

System Design Diagram
Watermarking Schemes
Datasets
For the watermark robustness evaluation across three attack scenarios—AI-Induced Distortion Attacks, Physical-Level Distortion Attacks, and Signal-Level Distortion Attacks—we present both the distorted audio samples and their corresponding watermark Bit Recovery Accuracies (Acc.) for each watermarking technique. Watermark accuracies exceeding 80% are highlighted in green, while those below 80% are marked in red. Accuracy values near 50% indicate performance comparable to random guessing.
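As a minimal sketch of the metric used in the tables, Bit Recovery Accuracy can be read as the fraction of payload bits the detector decodes correctly. The helper names `bit_recovery_accuracy` and `label` are hypothetical, and treating a value of exactly 80% as red is an assumption, since the text only specifies "exceeding" and "below" 80%. The second half shows why values near 50% mean the watermark is effectively gone: a detector that guesses at random matches about half of the bits.

```python
import random

def bit_recovery_accuracy(embedded, recovered):
    """Fraction of positions where the decoded bit equals the embedded bit."""
    if len(embedded) != len(recovered):
        raise ValueError("bit strings must have the same length")
    matches = sum(e == r for e, r in zip(embedded, recovered))
    return matches / len(embedded)

def label(acc, threshold=0.80):
    """Mirror the page's color coding: above the threshold is green, else red."""
    return "green" if acc > threshold else "red"

embedded = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
recovered = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0]   # one of ten bits flipped
acc = bit_recovery_accuracy(embedded, recovered)

# A detector that guesses every bit at random lands near 50% on a long payload.
rng = random.Random(0)
truth = [rng.randint(0, 1) for _ in range(10000)]
guess = [rng.randint(0, 1) for _ in range(10000)]
chance = bit_recovery_accuracy(truth, guess)
```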

[New Results] Watermarked Audio Samples

We present audio examples illustrating different watermarking schemes, including new watermark audio samples and additional results from LibriSpeech.
Audio samples and spectrograms for LJ, M4, and LibriSpeech (columns: Unwatermarked, Timbre, AudioSeal, WavMark, SilentCipher).
Audio samples and spectrograms for LJ, M4, and LibriSpeech (columns: FSVC, Patchwork, Norm-Space, audiowmark, RobustDNN).

[New Results] More dataset evaluation (LibriSpeech vs. LJSpeech)

We extended our robustness evaluation to the LibriSpeech dataset, focusing on three attacks: Pitch Shift, Time Stretch, and Cutting Audio. These three attacks are the most relevant to Key Findings 1 and 2 in our paper. Each radar chart shows the watermark Acc. (0.0 to 1.0) across six watermarking techniques. The blue polygon represents LJSpeech, and the green polygon represents LibriSpeech.

Radar charts for the three attacks: Pitch Shift, Time Stretch, and Cutting Audio.

AI-Induced Distortion Attacks

For AI-induced distortion attacks, we present results for voice conversion and Text-to-Speech (TTS) attacks conducted on the LJ Speech dataset.
Voice Conversion
In the zero-shot and few-shot scenarios, the watermarked audio samples are fed directly into pre-trained voice conversion models, with the goal of removing the watermark (RM WM) or adding it (ADD WM) through these models. In the adaptive scenario, the voice conversion models are first fine-tuned on unwatermarked audio samples so that they learn to remove the watermark from, or add it to, the audio samples.
Timbre AudioSeal WavMark SilentCipher FSVC Patchwork Norm-Space audiowmark
AdaIn-VC
(zero-shot)
RM WM
Acc. 65.34% 59.59% 51.26% 49.78% 57.18% 49.71% 49.86% 49.29%
ADD WM
Acc. 93.06% 49.10% 49.27% 50.30% 51.35% 49.88% 49.83%
FragmentVC
(zero-shot)
RM WM
Acc. 56.28% 49.76% 50.37% 53.54% 51.35% 49.24% 50.10%
ADD WM
Acc. 58.38% 50.00% 49.99% 52.50% 51.15% 49.66% 49.81%
MediumVC
(zero-shot)
RM WM
Acc. 48.52% 53.64% 49.70% 49.96% 63.66% 53.03% 50.25% 50.83%
ADD WM
Acc. 49.08% 50.90% 50.55% 52.25% 52.91% 50.55% 50.78%
YourTTS VC
(zero-shot)
RM WM
Acc. 59.64% 57.50% 50.06% 49.61% 53.31% 52.26% 50.34% 50.40%
ADD WM
Acc. 62.16% 51.26% 50.05% 52.98% 51.51% 49.73% 49.90%
RVC
(adaptive)
RM WM
Acc. 55.58% 49.96% 49.97% 57.88% 49.95% 50.01% 49.50%
ADD WM
Acc. 100% 54.34% 49.39% 52.48% 56.44% 50.41% 50.46%

Text-to-Speech
For the zero-shot configuration, we feed the pre-trained text-to-speech (TTS) models the text that matches the watermarked audio and provide the watermarked audio as a reference, so that they generate speech that sounds like the watermarked source but does not include the watermark. For the adaptive case, we construct a watermarked dataset and fine-tune the TTS model on it to better match the victim's voice. We then run inference with the fine-tuned TTS models on different text.
Timbre AudioSeal WavMark SilentCipher FSVC Patchwork Norm-Space audiowmark
Tacotron2 Griffin-Lim
Acc. 100% 49.83% 49.91% 58.50% 59.29% 50.20% 49.69%
HiFi-GAN
Acc. 91.36% 50.14% 50.06% 58.08% 50.16% 49.33% 50.45%
HiFi-GAN*
Acc. 100% 78.56% 50.22% 59.03% 63.69% 50.58% 51.49%
Fastspeech2 Griffin-Lim
Acc. 100% 49.64% 49.39% 67.39% 62.96% 50.13% 49.84%
HiFi-GAN
Acc. 92.48% 50.86% 49.80% 68.94% 54.61% 50.04% 50.55%
HiFi-GAN*
Acc. 100% 83.54% 49.45% 69.71% 65.38% 49.56% 50.33%
YourTTS
(zero-shot)
Audio
Sample
Acc. 54.20% 53.81% 49.18% 49.67% 54.59% 51.68% 46.80% 49.45%

Physical-Level Distortion Attacks

We re-record audio samples from the LJ Speech dataset using various equipment at different distances and report the Bit Recovery Accuracy (Acc.) of each watermarking technique on the re-recorded audio.

Distance | Setup | Timbre | AudioSeal | WavMark | SilentCipher | FSVC | Patchwork | Norm-Space | audiowmark | RobustDNN
Close (0.5m) | Default | 100% | 63.75% | 97.50% | 53.00% | 55.00% | 77.50% | 52.50% | 50.00% | 72.34%
Close (0.5m) | HyperX Mic | 100% | 42.50% | 98.75% | 62.00% | 62.50% | 68.75% | 47.50% | 52.50% | 63.01%
Close (0.5m) | Logitech Spk | 100% | 53.75% | 100% | 48.00% | 60.00% | 70.00% | 51.25% | 53.75% | 72.46%
Medium (2.5m) | Default | 88.00% | 48.75% | 56.25% | 51.50% | 62.50% | 67.50% | 53.75% | 48.75% | 61.33%
Far (5m) | Default | 64.00% | 55.00% | 43.75% | 52.00% | 48.75% | 63.75% | 51.25% | 48.75% | 59.49%

Signal-Level Distortion Attacks

We directly manipulate the audio signal with various signal-level distortions in an attempt to erase or degrade the watermark. The distorted audio samples and their corresponding Bit Recovery Accuracy (Acc.) are reported for each watermarking technique on samples from both the LJ Speech and M4Singer datasets. The LJ audio sample is on top (colored gray), and the M4 sample is on the bottom (colored black). The LJ accuracy is shown on the left, and the M4 accuracy is on the right.
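To make a few of these distortions concrete, here is a rough sketch of four of them (Gaussian noise at a target SNR, bitcrush, sample suppression, and cropping) on a list of samples in [-1, 1]. It is not the evaluation framework's code: the function names are hypothetical, the choice to crop from the end of the clip is an assumption, and a real pipeline would use array libraries rather than plain lists.

```python
import math
import random

def add_gaussian_noise(audio, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in audio) / len(audio)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in audio]

def bitcrush(audio, bit_depth):
    """Quantize samples in [-1, 1] to a grid with step 2 / 2**bit_depth."""
    levels = 2 ** (bit_depth - 1)
    return [round(x * levels) / levels for x in audio]

def suppress_samples(audio, fraction, rng):
    """Set a randomly chosen fraction of the samples to zero."""
    zeroed = set(rng.sample(range(len(audio)), int(fraction * len(audio))))
    return [0.0 if i in zeroed else x for i, x in enumerate(audio)]

def crop(audio, fraction_cut):
    """Drop the trailing fraction of the clip (assumption: the cut is at the end)."""
    keep = len(audio) - int(fraction_cut * len(audio))
    return audio[:keep]

# Stand-in clip: one second of a 440 Hz tone at 16 kHz.
rng = random.Random(0)
tone = [0.5 * math.sin(2 * math.pi * 440 * i / 16000) for i in range(16000)]

noisy = add_gaussian_noise(tone, 20, rng)     # "Gaussian Noise, 20dB"
crushed = bitcrush(tone, 6)                   # "Bitcrush, bit depth 6"
sparse = suppress_samples(tone, 0.10, rng)    # "Sample Suppression, 10%"
shorter = crop(tone, 0.25)                    # "Cropping Audio, 25%"

# Check the achieved SNR by comparing the tone against the added noise.
noise = [n - t for n, t in zip(noisy, tone)]
measured_snr_db = 10 * math.log10(sum(t * t for t in tone) / sum(e * e for e in noise))
```

Each function returns a new list, so the same watermarked clip can be passed through every distortion and then handed to a detector to measure Bit Recovery Accuracy.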

Attack (Acc.: LJ / M4) | Timbre | AudioSeal | WavMark | SilentCipher | FSVC | Patchwork | Norm-Space | audiowmark | RobustDNN
Impulse Response Augmentation | 97.71% / 60.81% | 55.87% / 54.80% | 86.88% / 92.93% | 70.24% / 75.72% | 91.88% / 86.83% | 77.78% / 79.79% | 50.53% / 50.21% | 70.95% / 76.92% | 66.38% / 67.41%
Pitch Shift (in cents), +100 | 6.37% / 50.26% | 60.95% / 56.36% | 50.19% / 49.71% | 49.85% / 50.30% | 59.62% / 55.04% | 55.28% / 54.58% | 50.11% / 50.22% | 50.23% / 49.88% | 63.88% / 62.86%
Time Stretch, 0.75x | 100% / 64.53% | 51.85% / 46.14% | 99.99% / 99.99% | 50.87% / 51.28% | 66.15% / 62.67% | 96.94% / 92.96% | 50.54% / 50.09% | 49.83% / 49.48% | 61.29% / 60.50%
Time Stretch, 0.9x | 100% / 64.95% | 50.51% / 47.26% | 100% / 100% | 50.07% / 50.17% | 64.01% / 59.24% | 97.81% / 94.49% | 49.81% / 50.02% | 49.83% / 50.21% | 62.98% / 61.43%
Gaussian Noise (SNR), 20dB | 99.90% / 49.86% | 97.67% / 93.34% | 51.46% / 51.65% | 49.55% / 50.29% | 97.75% / 79.23% | 99.73% / 98.14% | 99.93% / 92.98% | 50.03% / 50.61% | 100% / 99.96%
Gaussian Noise (SNR), 30dB | 100% / 57.87% | 99.98% / 99.21% | 97.47% / 89.14% | 51.56% / 66.68% | 99.36% / 89.67% | 99.99% / 99.88% | 100% / 94.64% | 68.01% / 71.31% | 100% / 99.98%
Bitcrush (bit depth), 6 | 99.40% / 49.31% | 93.89% / 91.29% | 50.13% / 50.78% | 49.50% / 49.89% | 92.20% / 71.92% | 98.81% / 91.93% | 86.48% / 78.70% | 50.10% / 50.83% | 100% / 99.98%
Bitcrush (bit depth), 8 | 100% / 58.67% | 99.97% / 99.27% | 93.61% / 89.58% | 51.29% / 68.83% | 97.74% / 85.12% | 99.99% / 99.49% | 98.69% / 89.86% | 61.54% / 73.04% | 100% / 99.99%
MP3 Compression, 8kbps | 89.00% / 48.53% | 80.12% / 100% | 50.17% / 50.38% | 50.12% / 60.48% | 92.64% / 50.90% | 67.53% / 90.61% | 76.87% / 84.96% | 50.09% / 58.77% | 79.54% / 73.06%
MP3 Compression, 16kbps | 99.99% / 53.91% | 99.64% / 100% | 61.33% / 62.47% | 50.42% / 59.68% | 98.84% / 50.26% | 85.19% / 91.15% | 87.11% / 83.98% | 49.72% / 78.87% | 96.28% / 95.78%
Background Noise (SNR), 5dB | 92.81% / 55.84% | 90.85% / 84.70% | 82.71% / 79.29% | 71.46% / 69.61% | 83.78% / 81.78% | 81.52% / 74.87% | 75.46% / 67.38% | 64.70% / 62.38% | 99.84% / 99.66%
Background Noise (SNR), 20dB | 99.43% / 60.62% | 98.85% / 95.56% | 97.76% / 95.42% | 85.52% / 85.38% | 95.59% / 91.60% | 98.61% / 94.50% | 98.19% / 86.72% | 80.18% / 77.50% | 100% / 99.98%
Cropping Audio (% of audio cut), 25% | 100% / 64.44% | 59.05% / 57.80% | 99.04% / 99.08% | 82.23% / 82.41% | 51.47% / 52.61% | 87.91% / 89.32% | 49.82% / 49.70% | 96.14% / 97.32% | 63.05% / 63.40%
Cropping Audio (% of audio cut), 50% | 100% / 64.25% | 59.05% / 57.13% | 97.08% / 95.75% | 74.29% / 76.84% | 50.45% / 50.49% | 78.06% / 76.44% | 49.49% / 49.82% | 90.09% / 88.01% | 64.90% / 62.12%
Cropping Audio (% of audio cut), 75% | 100% / 63.00% | 58.96% / 56.67% | 84.16% / 72.37% | 63.26% / 68.97% | 50.13% / 49.74% | 69.01% / 66.92% | 49.83% / 50.20% | 64.89% / 62.59% | 63.97% / 63.01%
Cropping Audio (% of audio cut), 90% | 99.90% / 61.26% | 57.19% / 55.79% | 56.10% / 52.71% | 53.27% / 50.68% | 50.21% / 49.87% | 62.65% / 60.42% | 49.42% / 49.81% | 50.16% / 49.91% | 63.45% / 61.71%
Cropping Audio (% of audio cut), 95% | 99.08% / 59.52% | 57.02% / 54.98% | 49.90% / 49.91% | 49.75% / 49.94% | 50.38% / 50.60% | 57.65% / 54.64% | 50.44% / 50.04% | 49.36% / 49.47% | 63.09% / 61.83%
Cropping Audio (% of audio cut), 97% | 96.77% / 58.74% | 56.38% / 54.98% | 49.53% / 49.75% | 50.08% / 50.18% | 49.84% / 50.03% | 54.44% / 52.29% | 49.56% / 50.31% | 50.06% / 49.49% | 63.31% / 63.39%
Cropping Audio (% of audio cut), 99%
Acc. 52.12% / 50.52% 50.54% / 50.07% 50.02% / 50.26% 49.78% / 50.38% 50.74% / 49.41% 49.94% / 49.19% 49.71% / 50.20% 62.78% / 62.18%
Attack (Acc.: LJ / M4) | Timbre | AudioSeal | WavMark | SilentCipher | FSVC | Patchwork | Norm-Space | audiowmark | RobustDNN
High-Pass Filter, 500Hz | 100% / 64.44% | 100% / 100% | 100% / 100% | 95.50% / 89.62% | 99.93% / 99.91% | 100% / 100% | 74.10% / 63.75% | 98.35% / 99.54% | 100% / 99.99%
Low-Pass Filter, 2000Hz | 84.38% / 42.23% | 100% / 100% | 50.33% / 49.38% | 78.25% / 75.34% | 99.71% / 51.09% | 76.47% / 73.16% | 95.92% / 89.09% | 50.09% / 50.28% | 62.32% / 60.62%
Low-Pass Filter, 3500Hz | 100% / 49.91% | 100% / 100% | 100% / 100% | 89.40% / 86.35% | 99.77% / 50.90% | 99.93% / 98.93% | 97.14% / 92.37% | 97.64% / 98.61% | 99.71% / 99.32%
Sample Suppression (% of samples set to 0), 1% | 100% / 57.56% | 100% / 100% | 97.34% / 82.06% | 55.18% / 52.66% | 95.25% / 72.72% | 99.18% / 97.45% | 99.99% / 95.01% | 84.55% / 79.78% | 99.98% / 99.89%
Sample Suppression (% of samples set to 0), 10% | 99.97% / 51.28% | 99.92% / 99.99% | 60.64% / 58.67% | 50.47% / 50.28% | 80.78% / 60.57% | 87.82% / 84.29% | 97.37% / 90.74% | 50.13% / 52.40% | 99.66% / 97.28%
Resampling, 4kHz | 99.34% / 51.27% | 100% / 100% | 50.74% / 49.96% | 75.12% / 73.70% | 99.27% / 51.85% | 61.43% / 60.01% | 93.35% / 81.87% | 49.88% / 50.12% | 62.31% / 61.21%
Resampling, 8kHz | 100% / 54.06% | 100% / 100% | 100% / 100% | 89.82% / 86.21% | 99.69% / 51.70% | 99.95% / 99.96% | 96.65% / 86.71% | 95.96% / 96.85% | 99.85% / 99.47%
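The low-pass and high-pass rows above remove part of the spectrum, which hurts schemes that place watermark energy in the removed band. The sketch below uses simple first-order (RC-style) filters as a stand-in; the filter design actually used in the evaluation is not stated here, so the filter type and the helper names `low_pass`, `high_pass`, `rms`, and `tone` are assumptions.

```python
import math

def low_pass(audio, cutoff_hz, sample_rate):
    """First-order low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = dt / (rc + dt)
    out, prev = [], 0.0
    for x in audio:
        prev = prev + a * (x - prev)
        out.append(prev)
    return out

def high_pass(audio, cutoff_hz, sample_rate):
    """First-order high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = rc / (rc + dt)
    out, prev_y, prev_x = [], 0.0, 0.0
    for x in audio:
        prev_y = a * (prev_y + x - prev_x)
        prev_x = x
        out.append(prev_y)
    return out

def rms(audio):
    """Root-mean-square level, used here to compare energy before and after."""
    return math.sqrt(sum(x * x for x in audio) / len(audio))

def tone(freq_hz, sample_rate=16000, seconds=0.5):
    """Unit-amplitude sine wave used as a probe signal."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

# A 2000 Hz low-pass keeps a 200 Hz tone but attenuates a 6000 Hz tone.
low_kept = rms(low_pass(tone(200), 2000, 16000))
low_cut = rms(low_pass(tone(6000), 2000, 16000))

# A 500 Hz high-pass attenuates a 100 Hz tone but keeps a 4000 Hz tone.
high_cut = rms(high_pass(tone(100), 500, 16000))
high_kept = rms(high_pass(tone(4000), 500, 16000))
```

First-order filters roll off gently, so they understate how sharply a production-grade filter would remove out-of-band content; the point of the sketch is only the direction of the effect.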