Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper...

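This paper evaluates multimodal large language models on egocentric video question answering using the QaEgo4Dv2 dataset. The study finds that the fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct models perform strongly on both OpenQA and CloseQA, surpassing previous baselines. However, the models still struggle with spatial reasoning and fine-grained object recognition.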

QaEgo4Dv2, multimodal large language models, spatial reasoning, video question answering