Audio Negation Benchmark

Abstract. Text-to-audio models are expected to interpret prompts with precision and generate audio outputs accordingly. Large-scale training has improved their ability to follow textual instructions. However, the role of negation in such models remains underexplored. This study investigates whether audio generation systems can reliably distinguish between affirmative and negated prompts, for example, “A person is whistling” versus “A person is not whistling.” To enable systematic evaluation, we introduce a benchmark dataset containing one million negated prompts, covering diverse linguistic and acoustic contexts. We further propose strategies in both text and audio domains to analyze how negation is represented and processed. Experimental results demonstrate that current audio generation models not only struggle with negation but fail completely, often collapsing affirmative and negative prompts into similar outputs. These findings reveal a critical limitation in generative models and highlight the need for specialized benchmarks and training objectives to advance negation comprehension in multimodal generative systems.

1,000,000 negated prompts 4 negation types 3 negation scopes 3 T2A models tested 0.96 negation diversity index 99.6% human-validated prompts

The Negation Benchmark

The negative prompt dataset is generated using the AudioCaps dataset, which is a large-scale dataset of nearly 90K audio clips paired with human-written captions that we refer to as the original audio and original prompt. We use these original prompts to generate negative prompts. Since there are many possible ways to introduce negation, for example, the caption Fire truck sirens can be expressed as Fire truck sirens are absent or The area is quiet without any truck sirens. To produce a rich one-to-many mapping, we create a text negation module that takes the original caption as input and outputs negative captions in different writing styles. By doing so, we obtain 1 million negative prompts, which we then provide to a text-to-audio generation model to understand how well it processes negation.

Examples

In the examples below you can listen to:

Original Audio — the ground-truth recording from AudioCaps.
Positive Audio — generated by a T2A model from the original caption.
Negative Audio — generated from negated prompts in our dataset, covering Lexical Syntactic Semantic and Mixed negations.

Each cell shows a spectrogram (top) and an audio player (bottom). Notice how the negative outputs often look and sound nearly identical to the positive ones — the signature of affirmation bias in current T2A models.

Example 1: “Fire truck sirens”

	Original Audio	Positive Audio	Lexical Negation	Semantic Negation	Syntactic Negation	Mixed Negation
Spectrogram
Audio	siren_original.wav	siren_positive.wav	siren_neg_lex.wav	siren_neg_sem.wav	siren_neg_syn.wav	siren_neg_mix.wav
Caption	Fire truck sirens	Fire truck sirens	Fire truck sirens are absent	The area is quiet without the fire truck sirens.	There are no fire truck sirens.	No fire truck sirens are heard.

Example 2: “Multiple gun shots”

	Original Audio	Positive Audio	Lexical Negation	Semantic Negation	Syntactic Negation	Mixed Negation
Spectrogram
Audio	gunshot_original.wav	gunshot_positive.wav	gunshot_neg_lex.wav	gunshot_neg_sem.wav	gunshot_neg_syn.wav	gunshot_neg_mix.wav
Caption	Multiple gunshots	Multiple gunshots	Multiple gunshots are absent	There are no gunshots	No multiple gunshots	The scene is quiet without any gunshots

Example 3: “Someone whistles a song”

	Original Audio	Positive Audio	Lexical Negation	Semantic Negation	Syntactic Negation	Mixed Negation
Spectrogram
Audio	whistle_original.wav	whistle_positive.wav	whistle_neg_lex.wav	whistle_neg_sem.wav	whistle_neg_syn.wav	whistle_neg_mix.wav
Caption	Someone whistles a song	Someone whistles a song	The whistling of a song is absent	There is no whistling of a song	Someone does not whistle a song	The person is silent instead of whistling a song

Text Negation Module

This module produces negative prompts corresponding to the original caption.

Sound event counting. (Example: “Moving of objects followed by a beep” has two sounding objects — moving objects and beep.)
Prompt generation. Based on the negation type, negation scope, and sound count.

The number of prompts generated for an original caption depends on the categorization value. Examples are shown in the diagrams below.

Figure 1. Categorization flow. The original caption is decomposed into sounding objects, then negative prompts are produced by combining a Negation Type with a Negation Scope for the given Sound Count.

Example outputs for each negation type and scope

Figure 2. Example negative prompts produced for each combination of negation type, scope, and sound count.

Evaluation Protocols

Text Modality

Evaluation protocols in text modality include:

Embedding similarity on frozen encoder.
Embedding similarity on a fine-tuned encoder.
Scores on NegBLEURT.

Aim: check the similarity of negative re-captions with the original affirmative baseline.

Figure 3. Text-modality evaluation pipeline — (i) fine-tune MPNet so that positive and negative pairs differ, then evaluate on sentence similarity; (ii) fine-tune a BLEURT-style regression head to output a contradiction-vs-entailment score.

Audio Modality

Evaluation protocols in audio modality include:

Embedding similarity on frozen audio encoders.
Audio Question Answering (AQA).

Sample AQA questions for sound count 1 and sound count 2 are shown below.

Figure 4. Sample AQA questions for sound count 1 and sound count 2. For each NEGATIVE AUDIO (LEX / SYN / SEM / MIX), the option set is dynamically restricted to the same negation type, the original, and two distractors. The table on the bottom-right reports the number of questions and options per question for each combination.