DataComp-LM: In Search of the Next Generation of Training Sets for Language Models
Summary
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T...