极道

A new 104 GB LLM training dataset has been released!


📝

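Summary

The Beijing Academy of Artificial Intelligence (BAAI) has released the Chinese Corpora Internet (CCI) dataset, which contains content from 1,000 important Chinese websites and totals 104 GB. The dataset fills a gap in high-quality datasets for the Chinese language.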

🎯

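Key points

The Beijing Academy of Artificial Intelligence (BAAI) has released the Chinese Corpora Internet dataset (CCI v1.0.0).
The dataset is intended for pre-training Chinese language models and contains content from 1,000 important Chinese websites.
The dataset totals 104 GB and has undergone strict filtering and manual inspection.
Its content spans January 2001 to November 2023.
High-quality datasets are especially scarce for Chinese, and building a safe Chinese dataset is challenging.
Data processing combines rule-based filtering with model-based filtering.
The dataset has been deduplicated to ensure content quality and safety.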
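
The processing steps listed above (rule-based filtering, model-based filtering, deduplication) can be sketched roughly as follows. This is a minimal illustration only: the summary does not state the actual CCI rules, thresholds, or models, so `MIN_CHARS`, `MIN_CHINESE_RATIO`, the `quality_score` callback, and exact-match hashing are all assumptions made for the example.

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator

# Hypothetical thresholds, chosen only for illustration (not from the CCI release).
MIN_CHARS = 20
MIN_CHINESE_RATIO = 0.5

_CJK = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block
_WHITESPACE = re.compile(r"\s+")


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    stripped = _WHITESPACE.sub("", text)
    if not stripped:
        return 0.0
    return len(_CJK.findall(stripped)) / len(stripped)


def passes_rules(text: str) -> bool:
    """Rule-based filter: long enough and mostly Chinese."""
    stripped = _WHITESPACE.sub("", text)
    return len(stripped) >= MIN_CHARS and chinese_ratio(text) >= MIN_CHINESE_RATIO


def fingerprint(text: str) -> str:
    """Hash of the whitespace-normalized text, used for exact deduplication."""
    normalized = _WHITESPACE.sub("", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float] = lambda _text: 1.0,
    min_score: float = 0.5,
) -> Iterator[str]:
    """Yield documents that pass the rules, the model score, and deduplication.

    quality_score stands in for a model-based filter (for example a trained
    classifier); the default accepts everything.
    """
    seen = set()
    for doc in docs:
        if not passes_rules(doc):
            continue
        if quality_score(doc) < min_score:
            continue
        key = fingerprint(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc
```

Calling `list(clean_corpus(documents))` on an iterable of page texts returns the surviving documents in their original order. A production pipeline would more likely use near-duplicate detection (such as MinHash) than exact hashing; exact matching is used here only to keep the sketch short.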

