This page contains resources for the Audio Cultural Intelligence Dataset (ACID Benchmark), the first comprehensive multilingual benchmark for evaluating cultural safety and alignment in Large Audio Language Models (LALMs).
Overview
The ACID Benchmark investigates cultural preferences, sensitivities, and failures of LALMs in real-world multilingual scenarios. It extends text-based cultural-harm analyses into the audio domain, focusing on audio-driven bias and sensitivity, output quality, and fairness across ten languages (Arabic, Bengali, English, French, Gujarati, Hindi, Russian, Telugu, Turkish, and Vietnamese) and twelve societal dimensions (e.g., Ethics, Religion, Security, Politics, Happiness).
Figures
Figure 1: Multicultural & Multilingual Interactions
Figure 2: Societal Dimensions Across Cultures
Figure 3: Qualitative Results
Figure 1 (Supp.): Distribution of Human Evaluation Ratings
Figure 2 (Supp.): Audio Quality by Language
Figure 3 (Supp.): Speech Delivery Styles & Speed
Dataset Details
ACID Benchmark includes:
- 10 Languages: Arabic, Bengali, English, French, Gujarati, Hindi, Russian, Telugu, Turkish, Vietnamese.
- 1315+ Hours: Multilingual audio-text pairs covering twelve societal dimensions (based on the World Values Survey).
- Three Main Sets:
- Set A: Cultural harm evaluation (11,620 samples)
- Set B: Contextual sensitivity (77,860 samples)
- Set C: Alignment/preference optimization (300,000 prompt-response pairs per language)
- Robust Curation: BLEU metrics for translation quality, Whisper-v3 for transcription quality, DNS-MOS for audio quality, and human evaluation for subjective validation.
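The BLEU-based translation check in the curation pipeline can be sketched as a simple quality gate. Note this is an illustrative pure-Python BLEU (the benchmark presumably uses a standard implementation such as sacreBLEU), and the acceptance threshold is an assumption, not the paper's actual cutoff:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-pair BLEU with uniform n-gram weights and a brevity penalty.
    A teaching sketch, not a drop-in replacement for sacreBLEU."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

def passes_quality_gate(candidate, reference, threshold=0.5):
    # Hypothetical threshold for accepting a translated sample into the set
    return bleu(candidate, reference) >= threshold
```

An exact match scores 1.0 and passes the gate; an unrelated candidate scores near zero and is rejected.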
Experimental Setup
- Model Evaluation: Safety is scored with Llama Guard, relevance with Qwen3 embeddings, and sentiment with TabularisAI. Experiments run on A100/V100 GPUs.
- Baseline Models: LTU, LTU-AS, GAMA, Pengi, MERaLiON, Qwen-Audio, SALMONN, Audio Flamingo-V1, Audio Flamingo-V2; plus Gemini-2.5 Flash and GPT-4o Mini.
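Embedding-based relevance scoring of this kind reduces to cosine similarity between prompt and response vectors. The sketch below is illustrative: the embeddings are assumed to come from a real model such as Qwen3 (represented here only by toy vectors), and mapping cosine similarity onto [0, 1] is an assumed convention, not necessarily the paper's:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relevance_score(prompt_embedding, response_embedding):
    # Map cosine similarity from [-1, 1] onto a [0, 1] relevance score
    return (cosine_similarity(prompt_embedding, response_embedding) + 1) / 2
```

Identical directions give a relevance of 1.0; orthogonal embeddings give 0.5, flagging a response that ignores the query.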
Human Evaluation
A structured human study (see supplementary) assessed both translation and audio quality.
Study Design:
- 25 participants, diverse age/gender
- Metrics:
- Translation accuracy, fluency, clarity
- Audio quality, pronunciation, speed, intonation, speaker confidence
- Overall rating, qualitative comments
Key Findings:
- Ratings were predominantly 4s and 5s for translation accuracy, fluency, and audio quality, with very few negative comments, suggesting the models are highly robust in grammar and semantics.
- Minor issues included isolated mispronunciations and rare word choice errors (“last mile” of translation).
- Delivery styles were often “monotone” but generally acceptable; speed usually “just right,” with occasional “too fast” or “too slow.”
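Aggregating such 1-5 Likert ratings per language and aspect can be sketched as below; the record layout, aspect names, and the "share of 4s and 5s" statistic are illustrative assumptions, not the study's exact analysis code:

```python
from collections import defaultdict
from statistics import mean

def summarize_ratings(records):
    """Group 1-5 Likert ratings by (language, aspect) and report the mean
    rating plus the share of 4s and 5s (the 'predominantly positive' share)."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["language"], rec["aspect"])].append(rec["rating"])
    return {
        key: {"mean": mean(vals),
              "share_4_plus": sum(v >= 4 for v in vals) / len(vals)}
        for key, vals in grouped.items()
    }

ratings = [  # toy records standing in for the real study data
    {"language": "Hindi", "aspect": "fluency", "rating": 5},
    {"language": "Hindi", "aspect": "fluency", "rating": 4},
    {"language": "Hindi", "aspect": "fluency", "rating": 3},
]
summary = summarize_ratings(ratings)
```

For the toy records above, the Hindi fluency mean is 4 and two thirds of the ratings are 4 or above.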
Figures (Supp.):
- Figure 1: Histograms of rating distributions
- Figure 2: Average ratings per language for translation/audio/speaker aspects
- Figure 3: Distribution of delivery styles (neutral, formal, excited, monotone) and pacing
Qualitative Analysis
As illustrated in Figure 3 (Qualitative Results), models exhibit typical failure modes:
- Dangerous Compliance: Models answer harmful or biased prompts (e.g., MERaLiON in Spanish).
- Transcription Defaults: Some models (e.g., LTU-AS) simply parrot the question back or transcribe the audio instead of answering.
- Misinterpretation: Others (e.g., AudioFlamingo Chat, Qwen) respond with generic labels ("Speech," "Male spoken") rather than addressing the actual query.
License and Download
- To obtain access to the dataset, please email the completed license agreement to databases@iab-rubric.org with the subject line "Licence agreement for ACID-Benchmark dataset".
- The license agreement must be signed by someone with legal authority to sign on behalf of the institution, such as the head of the institution or the registrar. Agreements signed by anyone else will not be processed.
- The dataset can be downloaded from the following link: Coming Soon
Citation
Please cite the AAAI 2026 ACID Benchmark paper: Coming Soon
Acknowledgments
Special thanks to all human study participants and to the teams behind the datasets and models used. The ACID Benchmark builds on earlier frameworks for cultural evaluation and extends the cultural-safety paradigm to audio-language models.