DataComp-LM: In Search of the Next Generation of Training Sets for Language Models

📝

Summary

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T...
