BriefGPT - AI Paper Digest

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models


Summary

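VisIT-Bench is a benchmark for evaluating instruction-following vision-language models intended for real-world use. It covers 70 instruction families, and its dataset contains 592 test queries. The benchmark is dynamic: to participate, practitioners only need to submit their model's responses on the project website.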

Key Points

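VisIT-Bench is a benchmark for evaluating instruction-following vision-language models.
The benchmark covers 70 instruction families spanning a wide range of tasks.
The dataset contains 592 test queries, covering tasks such as basic recognition, game playing, and creative generation.
Instruction-conditioned captions surface instruction-specific factors, for example when an instruction asks about accessibility.
Human verification and automatic evaluation are used to quantify the quality gap between models and the references.
The best instruction-following model beats the GPT-4 reference in 27% of comparisons.
Participants can submit model responses on the project website; the data, code, and leaderboard are available at visit-bench.github.io.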

Tags

VisIT-Bench, dynamic, large language models, instruction following, dataset, vision-language models
