ACID Test: A Benchmark for Cultural Safety and Alignment in LALMs
IIT Jodhpur • databases@iab-rubric.org

This page contains resources for the Audio Cultural Intelligence Dataset (ACID Benchmark), the first comprehensive multilingual benchmark for evaluating cultural safety and alignment in Large Audio Language Models (LALMs).


Overview

The ACID Benchmark investigates cultural preferences, sensitivities, and failures in LALMs under real-world multilingual scenarios. It extends text-based cultural harm analyses into the audio domain, focusing on audio-driven bias and sensitivity, output quality, and fairness across ten languages (Arabic, Bengali, English, French, Gujarati, Hindi, Russian, Telugu, Turkish, and Vietnamese) and twelve societal dimensions (e.g., Ethics, Religion, Security, Politics, Happiness).


Figures

Figure 1: Multicultural & Multilingual Interactions
Depicts user queries in 10 languages, highlighting the cross-cultural complexity and challenges faced by audio AI models.

Figure 2: Societal Dimensions Across Cultures
Shows the distribution of 12 societal themes (Corruption, Economic Value, Ethics, Religion, etc.) mapped across 11 cultures in the benchmark.

Figure 3: Qualitative Results
Responses and failures for culturally sensitive audio prompts, including dangerous compliance, default transcription, and misinterpretation by state-of-the-art LALMs.

Figure 1 (Supp.): Distribution of Human Evaluation Ratings
Histograms of overall rating, translation accuracy, fluency, audio quality, and speaker confidence (rated 1-5), summarizing reviewer trends in the human study.

Figure 2 (Supp.): Audio Quality by Language
Average human ratings per language for translation, fluency, audio quality, and speaker confidence, with insights into speed and delivery style (e.g., monotone, neutral, excited).

Figure 3 (Supp.): Speech Delivery Styles & Speed
Distribution of delivery styles (formal, monotone, neutral, excited, emotional) and speech-speed categorizations (too fast, just right, too slow) as judged by human raters.

Dataset Details

ACID Benchmark includes:

  • 10 Languages: Arabic, Bengali, English, French, Gujarati, Hindi, Russian, Telugu, Turkish, Vietnamese.
  • 1315+ Hours: Multilingual audio-text pairs covering twelve societal dimensions (based on the World Values Survey).
  • Three Main Sets:
    • Set A: Cultural harm evaluation (11,620 samples)
    • Set B: Contextual sensitivity (77,860 samples)
    • Set C: Alignment/preference optimization (300,000 prompt-response pairs per language)
  • Robust Curation: BLEU scoring for translation quality, Whisper-v3 and DNS-MOS for transcription and audio quality, and human evaluation for subjective validation; a minimal sketch of the BLEU gate is shown below.
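
Below is a minimal sketch of the BLEU-based translation gate, assuming sacrebleu is installed; the threshold and function name are illustrative, not the benchmark's exact configuration:

    import sacrebleu

    BLEU_THRESHOLD = 30.0  # hypothetical cutoff; the benchmark's exact value may differ

    def keep_translation(hypothesis: str, reference: str) -> bool:
        """Keep a translated sample only if it clears the BLEU cutoff."""
        score = sacrebleu.sentence_bleu(hypothesis, [reference]).score  # 0-100 scale
        return score >= BLEU_THRESHOLD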

Experimental Setup

  • Model Evaluation: Safety is scored with Llama Guard, relevance with Qwen3 embeddings, and sentiment with TabularisAI; experiments run on A100/V100 GPUs. A rough sketch of the relevance step follows this list.
  • Baseline Models: LTU, LTU-AS, GAMA, Pengi, MERaLiON, Qwen-Audio, SALMONN, Audio Flamingo-V1, Audio Flamingo-V2; plus Gemini-2.5 Flash and GPT-4o Mini.
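
The relevance step can be sketched roughly as below, embedding a prompt and response and taking their cosine similarity; the checkpoint name is an assumption, and the paper's exact setup may differ:

    from sentence_transformers import SentenceTransformer, util

    # Assumed Qwen3 embedding checkpoint; substitute the model actually used.
    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

    def relevance(prompt: str, response: str) -> float:
        """Cosine similarity of prompt/response embeddings (higher = more relevant)."""
        emb = model.encode([prompt, response], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()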

Human Evaluation

A structured human study (see supplementary) assessed both translation and audio quality.

Study Design:

  • 25 participants with diverse age and gender representation
  • Metrics:
    • Translation accuracy, fluency, clarity
    • Audio quality, pronunciation, speed, intonation, speaker confidence
    • Overall rating, qualitative comments

Key Findings:

  • Ratings were predominantly 4s and 5s for translation accuracy/fluency and audio quality, with very few negative comments, suggesting the generated translations and audio are robust in grammar and semantics.
  • Minor issues included isolated mispronunciations and rare word-choice errors (the “last mile” of translation).
  • Delivery styles were often “monotone” but generally acceptable; speed was usually “just right,” with occasional “too fast” or “too slow” judgments (a toy aggregation of these ratings is sketched below).
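
For concreteness, the per-language averages behind supplementary Figure 2 could be computed along these lines; the file name and column names here are hypothetical:

    import pandas as pd

    # One row per (participant, sample) judgment; schema is assumed.
    ratings = pd.read_csv("human_eval_ratings.csv")
    aspects = ["translation_accuracy", "fluency", "audio_quality", "speaker_confidence"]

    # Mean 1-5 rating per language for each aspect.
    per_language = ratings.groupby("language")[aspects].mean().round(2)
    print(per_language)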

Figures (Supp.):

  • Figure 1: Histograms of rating distributions
  • Figure 2: Average ratings per language for translation/audio/speaker aspects
  • Figure 3: Distribution of delivery styles (neutral, formal, excited, monotone) and pacing

Qualitative Analysis

As shown in Figure 3 (Qualitative Results), models exhibit typical failures (a toy detector for the transcription-default case is sketched after this list):

  • Dangerous Compliance: Models answer harmful or biased prompts outright (e.g., MERaLiON in Spanish).
  • Transcription Defaults: Some models (LTU-AS) simply parrot back the question or transcribe the audio instead of answering.
  • Misinterpretation: Others (Audio Flamingo Chat, Qwen-Audio) respond with generic labels (“Speech,” “Male spoken”) and do not address the actual query.
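
A toy heuristic for flagging the transcription-default failure, under the assumption that a near-verbatim copy of the input transcript indicates parroting (the cutoff is illustrative):

    from difflib import SequenceMatcher

    def is_transcription_default(transcript: str, response: str, cutoff: float = 0.9) -> bool:
        """Flag responses that are near-verbatim copies of the input transcript."""
        ratio = SequenceMatcher(None, transcript.lower(), response.lower()).ratio()
        return ratio >= cutoff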

License and Download

  • To obtain access to the dataset, please email the duly filled license agreement to databases@iab-rubric.org with the subject line "Licence agreement for ACID-Benchmark dataset".
  • The license agreement must be signed by someone with the legal authority to sign on behalf of the institution, such as the head of the institution or the registrar. Agreements signed by anyone else will not be processed.
  • The dataset can be downloaded from the following link: Coming Soon

Citation

Please cite the AAAI 2026 ACID Benchmark paper: Coming Soon


Acknowledgments

Special thanks to all contributors to the datasets and models, and to the human study participants. The ACID Benchmark builds on earlier frameworks for cultural evaluation and extends the cultural safety paradigm to audio-language models.