Tmr Blog

Rare Chinese Characters (生僻字)

💡 Original in Chinese, about 300 characters, roughly a 1-minute read.

📝

Summary

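The article examines how well large language models recognize rare Chinese characters, asks whether their training data covers every Chinese character, and points to resources and platforms for digitizing Chinese characters and handling rare ones.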

🎯

Key Points

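Examines how well large language models recognize rare Chinese characters.
Asks whether training data covers every Chinese character.
Points to resources and platforms for digitizing Chinese characters and handling rare ones.
Notes that the code table (码表) the author found is connected to a DeepSeek engineer but is not comprehensive.
Discusses the hierarchy of Chinese character sets: Unicode > GB18030 > GBK > GB2312.
Mentions amusing 2024 events involving Chinese characters.
Mentions AI-related phenomena in the digitization of Chinese characters.
Introduces a platform for handling rare characters in personal names and a "device literacy level" test (设备文化程度检测).
Mentions the International Encoded Han Character and Variants Database (国际电脑汉字及异体字知识库).
Mentions the latest edition of the Ministry of Education's Dictionary of Chinese Character Variants (《异体字字典》).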
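
The character set hierarchy in the key points can be probed directly. The following is a minimal sketch rather than the author's method: it assumes Python's built-in `gb2312`, `gbk`, and `gb18030` codecs as stand-ins for the standards, and the helper name `smallest_charset` is hypothetical. One caveat: Python's `gb18030` codec can encode every Unicode scalar value, so the sketch only separates three tiers, and its fallback is reached only for unencodable input such as a lone surrogate.

```python
# Assumption: Python's bundled codecs approximate the GB2312, GBK and
# GB18030 standards closely enough for a quick membership check.
CODECS = ("gb2312", "gbk", "gb18030")  # ordered from narrowest to widest


def smallest_charset(ch: str) -> str:
    """Return the narrowest codec in CODECS that can encode `ch`.

    `smallest_charset` is a hypothetical helper written for this example.
    """
    for codec in CODECS:
        try:
            ch.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue  # not representable at this tier, try the next one
    return "unencodable"


if __name__ == "__main__":
    samples = {
        "\u4e2d": "common simplified character",
        "\u9f8d": "traditional character, outside GB2312",
        "\U00020000": "CJK Extension B character, outside GBK",
    }
    for ch, note in samples.items():
        print(f"U+{ord(ch):04X} {ch} -> {smallest_charset(ch)} ({note})")
```

A character that only lands in the `gb18030` tier is a reasonable first candidate when checking whether a font, database, or model vocabulary handles rare characters.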

🏷️

Tags

Large language models · Digitization · Chinese characters · Rare characters · Training data
