BriefGPT - AI Paper Digest

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

💡 Original in English, about 100 words; roughly a 1-minute read.

📝

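Summary

This study examines how to balance data quality, quantity, and source diversity when training large language models. It proposes two new methods, UtiliMax and Model Estimated Data Utility (MEDU), which significantly improve training efficiency and reduce compute requirements, providing a new framework for automated, compute-efficient data mixing.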

🎯

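Key Points

This study examines how to balance data quality, quantity, and source diversity when training large language models.
It proposes two new methods: UtiliMax and Model Estimated Data Utility (MEDU).
UtiliMax extends token-based heuristics by incorporating utility estimates, significantly improving training efficiency.
MEDU estimates utility from small samples, reducing compute requirements.
The results establish a new framework for automated, compute-efficient data mixing, with broad potential for application.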
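
To make the idea of combining a token-based heuristic with utility estimates more concrete, here is a minimal toy sketch. It is not the UtiliMax optimization from the paper, whose details this summary does not give. It only assumes that each data source has a token count and a non-negative utility score, and it scales the token-proportional baseline by that score. The function name, source names, and all numbers are hypothetical.

```python
def utility_weighted_mixture(token_counts, utilities):
    """Toy heuristic (an assumption, not the paper's method):
    score each source as tokens * estimated utility, then normalize
    the scores so the sampling weights sum to 1."""
    if token_counts.keys() != utilities.keys():
        raise ValueError("token_counts and utilities must cover the same sources")
    if any(v < 0 for v in token_counts.values()) or any(v < 0 for v in utilities.values()):
        raise ValueError("token counts and utilities must be non-negative")
    scores = {name: token_counts[name] * utilities[name] for name in token_counts}
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one source needs a positive score")
    return {name: score / total for name, score in scores.items()}


# Hypothetical sources and values, for illustration only.
token_counts = {"web": 800, "code": 150, "papers": 50}
utilities = {"web": 0.5, "code": 1.0, "papers": 2.0}
weights = utility_weighted_mixture(token_counts, utilities)
```

With these made-up inputs, the small but high-utility "papers" source receives a larger share than its raw token fraction, while the large, lower-utility "web" source receives a smaller one. A real mixing method would also need to account for how often small sources get repeated, which this sketch ignores.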

🏷️

Tags

MEDU, UtiliMax, llm, large language models, data quantity, data quality
