小红花·文摘

Summer in the Northern Hemisphere has gotten off to a very hot start. This has focused attention on the impacts of heat on health, infrastructure, and economic productivity—and how to best manage...

Climate planning has prioritized floods. Heat demands equal attention

McKinsey Insights & Publications ·

As AI changes how consumers discover products and impact is measured, advertising value will shift to the players that can shape what is seen, selected, and purchased.

The agentic advertising economy: From attention to action

McKinsey Insights & Publications ·

Understanding KV Cache: Attention, P/D Disaggregation, and vLLM's Paged GPU Memory Management

Steins;Lab ·

Part 5 of the “User Psychology Series.” Over the last four chapters of the “User Psychology Series,” we have explored how users think, feel, decide, hesitate, trust, and drop off. Each article...

Attention Engineering: Why Users Ignore Even the Most Important Elements

UX Magazine ·

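This article revisits the attention mechanism that Bahdanau et al. proposed in 2014 for neural machine translation. By computing the context vector dynamically, the mechanism overcomes the limitation of a fixed-length vector and markedly improves translation quality on long sentences. Bahdanau's work laid the foundation for attention in modern natural language processing; although it was later superseded by the Transformer, its core idea remains deeply influential.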

[Transformer and Attention Mechanisms] 12 | Bahdanau Attention: An Early Form of Attention

土法炼钢兴趣小组的博客 ·

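The paper “Attention Is All You Need,” published in 2017, proposed the Transformer architecture, which dispenses with RNNs and CNNs and is built for parallelizable training. Its core contributions include multi-head self-attention and positional encoding, which markedly sped up training for machine translation. Despite a muted initial reception, it later became the foundation of large language models, with far-reaching impact. The authors came from varied backgrounds; most later left Google and became prominent figures in AI.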

[Transformer and Attention Mechanisms] 19 | Background of the “Attention Is All You Need” Paper

土法炼钢兴趣小组的博客 ·

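The core of multi-head attention is that each head computes its own attention distribution independently, rather than simply averaging. Understanding its positional limitations and computational complexity is the focus of follow-up work.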

[Transformer and Attention Mechanisms] 16 | Multi-Head Attention: Why Split into Multiple Heads

土法炼钢兴趣小组的博客 ·

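This article covers the core concepts of self-attention and compares it with traditional models. Self-attention lets every token in a sequence communicate with every other token, which addresses the long-range dependency problem of RNNs. Because self-attention is position-agnostic, positional information must be injected through positional encoding. Multi-head attention lets different heads learn different relationships. Although self-attention performs well on long sequences, its computational complexity is O(N²) in the sequence length N, which has prompted research into optimizations.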

[Transformer and Attention Mechanisms] 14 | Self-Attention: Letting a Sequence Look at Itself

土法炼钢兴趣小组的博客 ·
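
As a rough illustration of two points in the summary above (the N × N score matrix behind the O(N²) cost, and the fact that self-attention ignores token order), here is a minimal single-head sketch in plain Python. It is not taken from the article: it assumes Q = K = V = x with no learned projections, and the names `softmax` and `self_attention` are illustrative only.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention, assuming Q = K = V = x (no projections)."""
    n, d = len(x), len(x[0])
    scale = 1.0 / math.sqrt(d)
    # Every token scores every other token: an N x N matrix, hence O(N^2).
    scores = [
        [scale * sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
        for i in range(n)
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of all input rows.
    out = [
        [sum(weights[i][j] * x[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    return weights, out

if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, out = self_attention(tokens)
    print(len(weights), "x", len(weights[0]), "attention weights")
    # Reordering the tokens only reorders the outputs: without positional
    # encoding, the layer cannot tell which token came first.
    _, out_reversed = self_attention(tokens[::-1])
    print(out)
    print(out_reversed[::-1])
```

Reversing the input simply reverses the output rows, which is why positional information has to be injected separately.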

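This article introduces one of our latest works, Attention Residuals (AttnRes). As the name suggests, it uses the idea of Attention to improve Residuals. Many readers have probably heard of Pre Norm/P...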

Attention Residuals: A Memoir

科学空间|Scientific Spaces ·

The quality of consumer attention that gaming captures is exceptional. Growth in the next era will depend on publishers, platforms, and partners rethinking how to maximize the value of that attention.

Gaming’s next growth era: Unlocking the value of attention

McKinsey Insights & Publications ·

LUCID Attention: Noise-Canceling Headphones for Long-Context Models

Micropaper ·

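The paper “Gated Attention” from Alibaba's Qwen team proposes adding gating to the Transformer attention mechanism to address training instability, attention concentration, and weak long-context performance. By selectively filtering information, the method improves model performance and training stability. It has been applied in the Qwen3-Next model with notable results.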

Gated Attention NeurIPS Best Paper

Micropaper ·

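A new Google paper, “Nested Learning: The Illusion of Deep Learning Architectures,” argues that large language models suffer from “digital amnesia” and cannot effectively retain new knowledge. The work stresses that optimizers are not just training tools but memory systems, and it proposes a new “Nested Learning” paradigm that emphasizes balancing model depth with update frequency. The new HOPE architecture mimics the memory mechanisms of the human brain and shows potential for solving the continual learning problem, which could change how AI systems are designed.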

Why This Google Paper Is Being Called “Attention is all you need” V2

量子位 ·

AI Paper Weekly | Attention Mechanisms / NVIDIA VLA Model / TTS Models / Graph Neural Networks... A Roundup of the Latest AI Progress

HyperAI超神经 ·

UK consumers watch hours of content daily. But in an increasingly fragmented media market, companies that truly understand the value of attention are most likely to get ahead.

Mind the attention gap: Winning the battle for UK consumer attention

McKinsey Insights & Publications ·

A Thorough Guide to DeepSeek-V3.2: The Core Is DeepSeek Sparse Attention (DSA), Which Has q Attend Only to the Most Relevant k/v to Reduce MLA's Computation

结构之法算法之道 ·

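This article studies the expectation $\text{E}[z_{\text{max}}]$ of the maximum $z_{\text{max}}$ of $n$ independent standard normal random variables. The result is that, as $n$ grows, $\text{E}[z_{\text{max}}]$ is approximately $\sqrt{2\log n}$, and three proofs are given. The article also analyzes the probability of repeated maximum values in low-precision Attention.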

An Asymptotic Estimate of the Maximum of n Normal Random Variables

科学空间|Scientific Spaces ·
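
The asymptotic estimate in the summary above can be checked numerically. The following is a minimal Monte Carlo sketch, not one of the article's three proofs; the function names, trial count, and seed are illustrative assumptions. Because the estimate is asymptotic, the simulated mean is expected to sit noticeably below $\sqrt{2\log n}$ at moderate $n$.

```python
import math
import random

def mean_max_normal(n, trials=1000, seed=0):
    """Monte Carlo estimate of E[z_max] for n i.i.d. standard normals."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

def asymptotic_estimate(n):
    """Leading-order approximation sqrt(2 log n), natural logarithm."""
    return math.sqrt(2.0 * math.log(n))

if __name__ == "__main__":
    for n in (10, 100, 1000):
        simulated = mean_max_normal(n)
        print(f"n={n:5d}  simulated={simulated:.3f}  "
              f"sqrt(2 log n)={asymptotic_estimate(n):.3f}")
```

The gap between the two columns shrinks only slowly as $n$ grows, so the formula is best read as a growth rate rather than a tight finite-$n$ value.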

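This article analyzes the bias in low-precision Attention computation described in the paper “Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention,” pointing out that rounding errors from low-precision arithmetic can trigger training anomalies such as MaxLogit explosion. The author proposes eliminating the bias by adjusting the computation formula and discusses how attention concentration affects training collapse.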

Low-Precision Attention May Suffer from Biased Rounding Errors

科学空间|Scientific Spaces ·

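A Thorough Guide to Native Sparse Attention (NSA): A “Native Sparse Attention” Strategy with Dynamic Hierarchy That Combines Coarse-Grained Token Compression with Fine-Grained Token Selection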

结构之法算法之道 ·

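In a podcast, Tri Dao, the author of Flash Attention, predicted that NVIDIA will lose its dominance of the GPU market within the next three years and that the AI hardware ecosystem will diversify. He noted that inference costs have already fallen 100-fold and could fall another 10-fold, and that technical progress will drive AI hardware forward.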

Latest Podcast from the Flash Attention Author: NVIDIA's GPU Dominance Will End Within Three Years

量子位 ·