BriefGPT - AI Paper Digest

CCI3.0-HQ: A Large-Scale High-Quality Chinese Dataset Designed for Pre-Training Large Language Models

Original in English, about 100 words; reading time about 1 minute.

Summary

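This paper introduces CCI3.0-HQ, a 500GB high-quality Chinese dataset intended to improve on the quality of existing datasets. The dataset is built with a novel two-stage hybrid filtering pipeline and shows strong results on multiple benchmarks, supporting the development and use of high-quality language models.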
