ILuvUI: Instruction-Tuned Language-Vision Modeling of UIs from Machine Conversations
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training...
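This paper proposes a method for generating user interface (UI) training data without human annotation. It combines pixel-based methods with a large language model (LLM) to produce a dataset of 335K conversational examples, which is used to fine-tune a conversational vision-language model (VLM). The resulting model is evaluated on UI element detection, response quality, and multi-step navigation.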
