小红花·文摘

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T...

DataComp-LM: In Search of the Next Generation of Training Sets for Language Models

Apple Machine Learning Research

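This paper presents an evaluation method based on lossless data compression that tests how well a model's predictive ability generalizes beyond its training cutoff. Experiments on 14 large language models find that the Mistral and Llama-2 models do well on both performance and robustness. Context size and tokenization implementation also have a large effect on overall compression performance.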

DataComp-LM: In Search of the Next Generation of Training Sets for Language Models

BriefGPT - AI Paper Express