Is β1=β2 the Optimal Hyperparameter Setting for Adam?
Summary
I recently came across the paper "Why Adam Works Better with β1=β2: The Missing Gradient Scale Invariance Principle". As the title suggests, ...
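For context, β1 and β2 are the decay rates of Adam's exponential moving averages of the gradient and the squared gradient, respectively. The sketch below is the standard Adam update (Kingma & Ba, 2015), not the paper's method; it only shows where the two hyperparameters enter, with β1=β2 illustrating the setting the paper's title advocates:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step; returns updated (theta, m, v). t is 1-indexed."""
    m = beta1 * m + (1 - beta1) * grad        # first moment EMA, decay rate beta1
    v = beta2 * v + (1 - beta2) * grad**2     # second moment EMA, decay rate beta2
    m_hat = m / (1 - beta1**t)                # bias correction for the EMAs
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# With beta1 = beta2 (the setting in the paper's title), the first step
# reduces to theta - lr * g / (|g| + eps), i.e. a sign-like update.
theta, m, v = adam_step(0.0, 2.0, 0.0, 0.0, t=1, beta1=0.9, beta2=0.9)
```

On the very first step with β1=β2, the bias-corrected moments are m̂=g and v̂=g², so the update magnitude is essentially lr regardless of gradient scale, which hints at the "gradient scale invariance" angle named in the title.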