Survey

Preprint version · 2020-2025 literature · Figures from the paper

Recent Advances in Audio-Visual-Language Modeling

1 Intelligent Systems Laboratory, University of Bristol · 2 University of Amsterdam

This survey treats audio, vision, and language as a joint modeling problem rather than as three loosely connected research areas. It organizes recent audio-visual-language (AVL) work on understanding, generation, and reasoning, and is structured to help readers select benchmarks and cite open problems.

Why this survey?

AVL modeling deserves a distinct survey view.

Audio-visual learning and vision-language modeling have often been studied as separate or bimodal research lines. Recent systems increasingly require sound, visual context, and natural language to interact within the same task, model, supervision signal, or evaluation protocol.

This survey focuses on that trimodal setting. It organizes recent work by modality use, representation learning, alignment and fusion mechanisms, task formulation, benchmarks, and open challenges.

What this survey gives you

Three citation-ready contributions

Scope

A trimodal scope

We isolate work in which audio, visual signals, and language all participate, whether as inputs, outputs, supervision, prediction targets, or evaluation criteria.

Mechanism

A mechanism-first taxonomy

We organize AVL methods by how they represent each modality, align signals across modalities, and fuse information for understanding or generation.

Evaluation

A benchmark-centered view

We map common AVL tasks, datasets, and metrics, and highlight gaps in temporal alignment, causal correspondence, interpretability, long-context reasoning, and efficient deployment.

Explore the survey

Find the part you can cite

Use the index to locate tasks, mechanisms, benchmark choices, and open problems without reading the full preprint first.

MAR Understanding

Multimodal Action Recognition

Question: How do models use audiovisual and language cues to recognize actions?

Input / Output: video/audio/text cues → action labels

Representative datasets: Kinetics-400, EPIC-KITCHENS

Metrics: Top-1, Top-5, Accuracy, F1

Use this survey when citing: task organization for AVL understanding benchmarks.
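As a rough illustration of the Top-1 and Top-5 metrics listed for this task, the sketch below computes top-k accuracy from per-class scores. The function name `top_k_accuracy` and the toy scores are assumptions made for illustration; they do not come from the survey or from any benchmark toolkit.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one list per sample (hypothetical model output).
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with three samples and three action classes.
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)  # only the first sample is correct
top2 = top_k_accuracy(scores, labels, k=2)  # the first two samples are correct
```

Top-5 is the same computation with `k=5`; it is only informative when the label set has many more than five classes.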

MER Understanding

Multimodal Emotion Recognition

Question: How do models infer emotion or sentiment from audiovisual dialogue and text?

Input / Output: dialogue signals → emotion or sentiment

Representative datasets: IEMOCAP, MELD, CMU-MOSEI

Metrics: Accuracy, F1

Use this survey when citing: emotion and sentiment tasks that combine speech, visual behavior, and language.

AVQA Understanding

Audio-Visual Question Answering

Question: How do models answer language questions about audiovisual scenes?

Input / Output: audiovisual scene + question → answer

Representative datasets: MUSIC-AVQA, VAQA

Metrics: Answer accuracy

Use this survey when citing: benchmark organization for AVL reasoning tasks.

AVOL Understanding

Audio-Visual Object Localization

Question: How do models localize objects or regions using audiovisual and textual queries?

Input / Output: audiovisual-textual query → localized object

Representative datasets: RefCOCO, Flickr-SoundNet

Metrics: Localization accuracy, IoU

Use this survey when citing: grounding and localization formulations in AVL understanding.
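To make the IoU and localization accuracy metrics concrete, here is a minimal sketch for axis-aligned boxes. The box format `(x1, y1, x2, y2)`, the helper names, and the 0.5 threshold are assumptions chosen for illustration; individual benchmarks may define localization success differently.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(preds, gts, threshold=0.5):
    """Share of predictions whose IoU with the ground truth reaches the threshold.

    The 0.5 default is an assumed convention, not a value stated in the survey.
    """
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= threshold)
    return hits / len(gts)


# Toy example: one exact match and one weak overlap.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [(0, 0, 2, 2), (1, 1, 3, 3)]
acc = localization_accuracy(preds, gts)  # first box counts, second does not
```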

AVEL Understanding

Audio-Visual Event Localization

Question: How do models identify what event happens and when it occurs?

Input / Output: audiovisual stream → event class and time span

Representative datasets: AVE, LLP

Metrics: Accuracy, F1, localization quality

Use this survey when citing: temporal evaluation in audiovisual-language understanding tasks.
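One way to score "what happens and when" is to label fixed-length segments and compare predictions segment by segment. The sketch below assumes one label per one-second segment; the segment length, the label names, and both helper functions are illustrative assumptions rather than the protocol of any specific benchmark.

```python
def segment_accuracy(pred, gold):
    """Fraction of segments whose predicted label matches the reference label."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and the same length")
    return sum(1 for p, g in zip(pred, gold) if p == g) / len(gold)


def segment_f1(pred, gold, event):
    """Segment-level F1 for one event class."""
    tp = sum(1 for p, g in zip(pred, gold) if p == event and g == event)
    fp = sum(1 for p, g in zip(pred, gold) if p == event and g != event)
    fn = sum(1 for p, g in zip(pred, gold) if p != event and g == event)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy 10-segment clip: the prediction starts and ends the event one segment early.
gold = ["bg", "bg", "dog", "dog", "dog", "dog", "bg", "bg", "bg", "bg"]
pred = ["bg", "dog", "dog", "dog", "dog", "bg", "bg", "bg", "bg", "bg"]

acc = segment_accuracy(pred, gold)      # 8 of 10 segments match
f1_dog = segment_f1(pred, gold, "dog")  # 3 true positives, 1 false positive, 1 false negative
```

In this toy case the event class is correct but shifted in time, which is why the segment-level scores drop below 1.0 even though a clip-level label would be counted as right.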

CMR Understanding

Cross-Modal Retrieval

Question: How do models retrieve matching items across audio, video, and language?

Input / Output: query in one modality → ranked items in another

Representative datasets: MSR-VTT, AudioCaps

Metrics: Recall@K, median rank

Use this survey when citing: retrieval-based evaluation for cross-modal AVL alignment.
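Recall@K and median rank can both be derived from the rank of the correct item for each query. The sketch below assumes a square similarity matrix in which candidate `i` is the ground-truth match for query `i`; that pairing convention, the function names, and the toy scores are assumptions for illustration.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the similarity between query i and candidate j.
    Candidate i is assumed to be the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the correct one.
        better = sum(1 for s in row if s > row[i])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy text-to-audio retrieval with three queries and three candidates.
sim = [
    [0.9, 0.1, 0.3],  # correct candidate ranked 1st
    [0.8, 0.2, 0.5],  # correct candidate ranked 3rd
    [0.4, 0.7, 0.6],  # correct candidate ranked 2nd
]

ranks = ranks_from_similarity(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
med_rank = median(ranks)
```

Higher Recall@K and lower median rank both indicate better cross-modal alignment; reporting the two together shows whether gains come from the top of the ranking or from the tail.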

AVSR Generation

Audio-Visual Speech Recognition

Question: How do models use visual speech cues and audio to produce transcripts?

Input / Output: speech video + audio → transcript

Representative datasets: LRS2, LRS3

Metrics: WER

Use this survey when citing: generation-oriented AVL tasks and evaluation protocols.
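Word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal version that assumes whitespace tokenization and no text normalization; real evaluations usually normalize case and punctuation first, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.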

AVVC Generation

Audio-Visual Video Captioning

Question: How do models generate language descriptions from audiovisual clips?

Input / Output: audiovisual clip → caption

Representative datasets: MSR-VTT, ActivityNet Captions

Metrics: BLEU, METEOR, CIDEr

Use this survey when citing: captioning tasks that condition language generation on audio and visual evidence.

ViG Generation

Video Generation

Question: How do models use text or audio conditions to generate video?

Input / Output: text/audio conditions → video

Representative datasets: AudioSet-Cap, Landscape

Metrics: FVD, CLIP similarity, human preference

Use this survey when citing: generative AVL tasks and evaluation gaps beyond text-only generation.

Figures from the paper

Visual anchors for the taxonomy

The gallery keeps each figure to one takeaway so the page supports navigation instead of repeating the full paper.

Literature statistics. Four bar charts show the distribution of the surveyed literature by modality, core multimodal technique, application domain, and task.
Common encoders. Typical language, visual, and audio feature extractors: Transformer, ResNet50, and VGGish.
Fusion mechanisms. Early fusion, intermediate fusion, Transformer fusion, cross-attention fusion, and tensor fusion.
Pretraining paradigms. Multimodal contrastive learning, masked data modeling, and next-token prediction.
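For readers unfamiliar with the contrastive learning named among the pretraining paradigms, the following sketch shows a symmetric InfoNCE-style loss between two batches of paired embeddings. It is a generic formulation written for illustration, not the specific objective of any surveyed model; the temperature of 0.07 and all function names are assumptions.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Average of the a-to-b and b-to-a contrastive losses for paired embeddings."""
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy example: two audio embeddings paired with matching or swapped text embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(audio, text_matched)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
```

The matched pairing yields a loss near zero and the swapped pairing a much larger one. A trimodal variant could sum this loss over audio-text, video-text, and audio-video pairs, though that combination is an assumption here rather than a recipe taken from the survey.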

Open problems

Motivations for future work beyond current benchmarks

Current AVL benchmarks support broad task evaluation, but they still leave important questions under-tested.

Interpretability

AVL models combine modality-specific encoders, cross-modal alignment, and task adaptation in latent spaces that are difficult to inspect.

Temporal and causal alignment

Many tasks require knowing whether a sound, visual event, and language description refer to the same moment or event.

Reasoning in complex environments

Long videos, multiple speakers, overlapping sounds, and cluttered scenes expose reasoning gaps that coarse task accuracy does not fully capture.

Efficiency and deployment

Separate encoders and heavy fusion modules increase memory and compute cost, making local, private, or real-time deployment difficult.

Citation

Cite this survey

If this survey helps you define audio-visual-language modeling, organize AVL tasks or benchmarks, or motivate open problems in trimodal learning, please cite the preprint version below. Citation metadata will be updated after publication.

@article{zhang2025recent,
  title={RECENT ADVANCES IN AUDIO-VISUAL-LANGUAGE MODELING},
  author={Zhang, Kairui and Abdallah, Zahraa S and Lewis, Martha},
  journal={Authorea Preprints},
  publisher={Authorea}
}

Project resources

Public project resources will be updated here after publication or repository release. The current page is intended as a preprint guide and citation entry point.