Integration of Audio Evaluation in LMMs-Eval
Nov 27, 2024 · 6 min read
Humans perceive the world through both sight and sound, integrating visual cues with auditory signals such as speech, environmental sounds, and emotional tones.
This dual sensory input enhances decision-making and overall understanding. Similarly, for multimodal models to achieve human-like comprehension, it is essential that they process visual and auditory data together.
While many models have made progress in integrating audio understanding, there is still no reproducible and efficient evaluation toolkit for fairly assessing their capabilities.
To address this, we introduce an upgrade to the lmms-eval framework focused on audio understanding. Building on the success of lmms-eval/v0.2.0, the new lmms-eval/v0.3.0 includes dedicated modules and designs for audio tasks, ensuring consistent evaluation across audio and visual modalities.
This upgrade includes multiple benchmarks for audio understanding and instruction following, enabling standardized and reproducible comparisons of various audio models.
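As a rough illustration of what a standardized run looks like, the sketch below launches the lmms-eval CLI from Python for an audio-capable model. The model and task identifiers (`qwen2_audio`, `clotho_aqa`) are illustrative assumptions; substitute whichever audio model and benchmark names are registered in your installation (see the documentation linked below for the exact lists).

```python
# Minimal sketch: invoking the lmms-eval CLI for an audio benchmark from Python.
# The identifiers "qwen2_audio" and "clotho_aqa" are assumptions for illustration;
# check the lmms-eval docs for the model and task names registered in v0.3.0.
import subprocess

cmd = [
    "python", "-m", "lmms_eval",
    "--model", "qwen2_audio",        # audio-capable model (assumed identifier)
    "--tasks", "clotho_aqa",         # audio understanding benchmark (assumed identifier)
    "--batch_size", "1",
    "--log_samples",                 # keep per-sample outputs for reproducible comparison
    "--output_path", "./logs/audio_eval",
]
subprocess.run(cmd, check=True)
```

Because audio tasks reuse the same launcher and logging conventions as the image and video tasks, results from different audio models can be compared under one consistent protocol.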
For more information, please refer to our GitHub documentation: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md