Tag: rust

Related articles:

Rust.cc -

[Rust and AI] Overview and Directions

This series covers the happy pairing of Rust and AI through open-source projects. We will proceed by reading the code of open-source projects, with Rust as the main thread while also introducing AI background knowledge. The goal is to help more programmers who are neither algorithm specialists nor Rustaceans learn more about Rust and AI. Naturally, we also hope Rust programmers and AI engineers can get something out of it: the former can focus on the design and optimization of AI algorithms, the latter on how Rust can help AI. This first article introduces the characteristics and recent development of Rust and AI respectively, and the sparks their encounter may produce. We love AI and we love Rust, that's all.

Current State

AI and LLMs

With the wave set off by ChatGPT, AI has once again hit a window of opportunity: many applications and services are being redesigned around large models. Because of the capabilities of large models, the barrier to building applications keeps dropping, and new creative products keep emerging. Overall, the AI application space is thriving and highly competitive.

Behind all of this are large language models (LLMs), with ChatGPT as the representative. An LLM generates text sequentially given a context, and both its precise understanding of that context and the generation built on it are impressive. As an NLP engineer with years of experience, I can say responsibly that LLMs are far more capable than earlier language models, especially at understanding.

The defining characteristic of an LLM is that it is large, meaning it has a very large number of parameters. Loading and running such a model consumes considerable resources, so if you want the model to run fast, performance becomes an unavoidable hurdle.

Parameters are just lots and lots of numbers, usually FP32 floating-point values. They can be quantized down to FP16, BF16, or integer types; quantization clearly reduces memory usage and usually speeds up execution as well (a small illustrative sketch follows at the end of this article).

Setting aside language, model architecture, and quantization, the most obvious way to speed up arithmetic over that many numbers is parallelism. Parallelism is indeed the most common approach in LLMs and deep learning in general, typically via dedicated accelerators such as GPUs and TPUs. Even without such devices, ordinary CPUs, including mobile CPUs, can exploit data-level, instruction-level, and thread-level parallelism. Beyond parallelism, optimizing the memory hierarchy and data movement can improve performance further.

These optimizations all touch the lower layers of the machine and traditionally require C or C++. Now we have a new option: Rust. In fact, this "now" could be pushed back a few years, since Rust has been quietly at work in the AI space for quite a while. C and C++ are very powerful languages, but Rust does better in some respects.

Rust

We will not retell Rust's history here; the fact that it has been voted the most loved programming language in the StackOverflow Developer Survey for several years running is reason enough to study it seriously. Nor will we dwell on how "good" the language is; here are just a few personal impressions.

First, once Rust code compiles, it generally runs without problems. Fighting the compiler can be maddening at first, but it beats analyzing a core dump with gdb. The compiler's messages keep getting friendlier, and the fight is really a process of continuously learning the underlying concepts; that kind of immediate, what-you-see-is-what-you-get feedback is an ideal way to learn.

Second, the syntax is clearer. I personally prefer to specify data types and ranges explicitly, for example i8 for an 8-bit signed integer. It forces me to understand the code (rather than defaulting to some int64) and makes it easier for me or others to read later. This preference probably comes from starting out with Python. I also like Rust's approach to error handling; both are aspects of code clarity.

Third, the design is more reasonable. Structs and traits and the designs built around them are favorites of mine, as are lifetimes. Unlike many people, I actually like the idea behind lifetimes, probably again because I like things to be explicit.

Fourth, the code is more elegant. match in control flow is my favorite, plus generics, functional programming, closures, and method chaining; the code is a pleasure to read.

...

On top of that there are elegant concurrency primitives, the way tests are organized, integrated documentation, and more. The only thing worth complaining about may be smart pointers, which are indeed somewhat complex. But that is a minor blemish; overall, Rust is worth trying for any programmer who loves programming.

Better Together

Anywhere C++ is used could, in principle, be rewritten in Rust; simply put, anything close to the metal can be "oxidized", and AI is no exception. Let's look at where Rust and AI can join forces.

Inference

First, inference. This is the most natural direction and the one most worth watching, especially on the edge. On the server side, the prevalence of GPUs has made CUDA plus C/C++ a near monopoly. Still, with Rust entering the Linux kernel, its heavy use at Hugging Face, and Rust's own steady push into GPU territory, we believe Rust will earn a place on the server side as well.

On the edge, especially on RISC-V based smart devices, Rust has long been cultivating the ground. Even more encouraging, vivo recently released BlueOS, built from scratch in Rust and positioned as a new-generation AI operating system. We believe Rust has a very broad future on smart devices.

As noted above, the LLM era means big models and slow inference, so performance matters, and it will only matter more as LLMs evolve. With its excellent language characteristics, Rust is well placed to pick up this baton. We are convinced that Rust plus large AI models is a natural pairing.

Middleware

Next, middleware, specifically middleware around large models. First come vector-search libraries, where the famous Qdrant deserves mention: excellent performance and very easy to use. Also worth noting is meilisearch, which positions itself against the full-text search framework Elasticsearch and has matured over years of development. There are many other frameworks in this space, such as tantivy, Toshi, lnx, and websurfx.

Another project worth mentioning is paradedb, which merges full-text and semantic search into SQL search; its design can be quite inspiring. There are also polars for dataframes, vector for observability pipelines, the document-graph database surrealdb, the time-series database ceresdb, and more. Even the currently hot Agent space is represented, for example by smartgpt.

The scope here is very broad. Beyond basic components, one can imagine memory modules, task scheduling, resource pools, task definitions, workflow design, and so on. Almost all of these components revolve around LLMs, and we believe LLMs will bring far more than this; as the application layer keeps growing, more needs will emerge.

Training

Finally, training. Once Rust started doing inference, people naturally tried it on the training side, but that still looks experimental and early. We are optimistic about its use in the relatively stable engineering parts, but not about it becoming widespread on the algorithm side.

For the former, whatever the language, there is usually a simple API or command line, and most of the time users only need to prepare data in the required format to train. For the latter, one frequently has to adjust or modify the underlying model architecture, even add or remove modules; here Python has an absolute advantage, and, to be fair, PyTorch makes such changes relatively convenient. Torch was originally written in Lua and remained lukewarm; after adding Python it gradually overtook Caffe and TensorFlow and now firmly holds first place. Should Rust follow the path Torch took back then? But then what would set it apart on the Python side? The interface would most likely end up close to today's PyTorch, just as after the transformers library became popular, the interfaces of PaddleNLP and ModelScope became not merely similar to it but practically identical. For users, migration is unnecessary unless they have no choice, for example training on the edge, which might be a good direction for Rust.

Others

The above is the positive side; here is a brief look at possible headwinds.

First, still C and C++. They are mainstream today, and who is to say they will not stay mainstream? For users, the convenient Python layer sits on top anyway, so who cares what lies underneath?

Then there are other new languages, such as Mojo, born for AI and positioned as Python's ease of use plus C's performance. Mojo is still at a very early stage, but it is at least a sign: in an AI-dominated future, even more AI-oriented languages may be designed. Will there one day be a language designed specifically for large models?
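
Before moving on to the project list, a small aside that is not from the original article: to make the quantization point above concrete, here is a minimal symmetric int8 quantize/dequantize sketch in plain Rust (no external crates, purely illustrative).

// Symmetric int8 quantization: map f32 values to i8 with a single scale factor.
fn quantize(xs: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = xs.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = xs.iter().map(|x| (x / scale).round() as i8).collect();
    (q, scale)
}

fn dequantize(qs: &[i8], scale: f32) -> Vec<f32> {
    qs.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let weights = vec![0.12, -0.5, 0.03, 0.9, -0.77];
    let (q, scale) = quantize(&weights);
    let back = dequantize(&q, scale);
    // 4x less memory than f32, at the cost of a small rounding error.
    println!("{q:?} (scale = {scale}) -> {back:?}");
}

Real quantization schemes (per-channel scales, zero points, 4-bit packing) are more involved, but the memory-versus-precision trade-off is the same idea.
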
That said, let's focus on Rust for now.

Open-Source Projects

Below are some Rust-related AI projects. Limited by the author's knowledge, the list is certainly not exhaustive; if readers have better open-source projects to recommend, especially LLM-related ones, recommendations are always welcome. These are also the projects the rest of the series will read.

LLM Inference
rustformers/llm: An ecosystem of Rust libraries for working with large language models
Noeda/rllama: Rust+OpenCL+AVX2 implementation of LLaMA inference code
srush/llama2.rs: A fast llama2 decoder in pure Rust.
leo-du/llama2.rs: Inference Llama 2 in one file of zero-dependency, zero-unsafe Rust
gaxler/llama2.rs: Inference Llama 2 in one file of pure Rust 🦀
huggingface/text-generation-inference: Large Language Model Text Generation Inference

Agent
Cormanz/smartgpt: A program that provides LLMs with the ability to complete complex tasks using plugins.

NLP
huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
guillaume-be/rust-bert: Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2, ...)

Image
LaurentMazare/diffusers-rs: An implementation of the diffusers api in Rust
twistedfall/opencv-rust: Rust bindings for OpenCV 3 & 4

Code
huggingface/llm-ls: LSP server leveraging LLMs for code completion (and more?)

Framework
huggingface/candle: Minimalist ML framework for Rust
coreylowman/dfdx: Deep learning in Rust, with shape checked tensors and neural networks
tracel-ai/burn: Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
spearow/juice: The Hacker's Machine Learning Engine
rust-ml/linfa: A Rust machine learning framework.
tensorflow/rust: Rust language bindings for TensorFlow
sonos/tract: Tiny, no-nonsense, self-contained, TensorFlow and ONNX inference
smartcorelib/smartcore: A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
neuronika/neuronika: Tensors and dynamic neural networks in pure Rust.

AI-generated summary: This article introduces the combination of Rust and AI, along with their respective characteristics and current development. Rust has broad prospects in AI and can be applied to inference, middleware, and training. The article also lists a number of Rust-related open-source projects.


又耳笔记 -

Ethereum Development with Rust, Part 3

This series uses Rust's ethers-rs to reproduce the content of the book "Ethereum Development with Go", hence the title "Ethereum Development with Rust". It can be read as a quick-start tutorial for ethers-rs. Since the original book already does the job well, this series is mostly a code-level reproduction and does not explain much of the underlying background.

AI-generated summary: This series uses Rust's ethers-rs to reproduce the book "Ethereum Development with Go". This installment covers transactions, including querying blocks, querying transactions, ETH transfers, token transfers, subscribing to new blocks, and creating and sending raw transactions. With ethers-rs, Ethereum development becomes simple and approachable.
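
Not part of the original post, but to give a flavor of the ethers-rs API the series is built on, a minimal sketch of querying the latest block might look roughly like this. The RPC URL is a placeholder, and exact method signatures vary between ethers-rs versions.

use ethers::providers::{Http, Middleware, Provider};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint: point this at any Ethereum JSON-RPC node you have access to.
    let provider = Provider::<Http>::try_from("https://eth.example.org")?;

    // Query the latest block number, then fetch the block itself.
    let number = provider.get_block_number().await?;
    let block = provider.get_block(number).await?;
    println!("latest block {number}: {block:?}");
    Ok(())
}
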


Rust.cc -

Why doesn't this Rust closure compile when using move?

struct Data {
    name: String,
    id: u32,
}

impl Data {
    fn getval(&self, val: Vec<u32>) -> u32 {
        self.id + val[0]
    }
}

fn call_closue<T>(data: &Data, f: T) -> u32
where
    T: Fn(&Data) -> u32,
{
    f(data)
}

fn test(data: &Data, val: Vec<u32>) {
    let _ret = call_closue(data, move |person: &Data| person.getval(val));
}

fn main() {
    let a = Data { name: "a1".to_string(), id: 23 };
    let a1 = vec![11];
    test(&a, a1);
}

This fails to compile and I don't quite understand why; I used move precisely because I want the closure to take ownership of the Vec.

error[E0507]: cannot move out of `val`, a captured variable in an `Fn` closure
  --> src/main.rs:46:68
   |
45 | fn test(data: &Data, val: Vec<u32>) {
   |                      --- captured outer variable
46 |     let _ret = call_closue(data, move |person: &Data|person.getval(val));
   |                                  --------------------               ^^^ move occurs because `val` has type `Vec<u32>`, which does not implement the `Copy` trait
   |                                  |
   |                                  captured by this `Fn` closure

For more information about this error, try `rustc --explain E0507`
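
Not part of the original question, but for reference: getval takes Vec<u32> by value, so calling it consumes the captured val, which means the closure can only implement FnOnce, not Fn. A minimal sketch of one way to make the snippet compile is to relax the trait bound (another option is to clone val inside the closure so it stays callable multiple times):

struct Data {
    name: String,
    id: u32,
}

impl Data {
    fn getval(&self, val: Vec<u32>) -> u32 {
        self.id + val[0]
    }
}

// Relax the bound: the closure consumes its captured Vec, so it is only FnOnce.
fn call_closue<T>(data: &Data, f: T) -> u32
where
    T: FnOnce(&Data) -> u32,
{
    f(data)
}

fn test(data: &Data, val: Vec<u32>) {
    let _ret = call_closue(data, move |person: &Data| person.getval(val));
}

fn main() {
    let a = Data { name: "a1".to_string(), id: 23 };
    let a1 = vec![11];
    test(&a, a1);
}
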

AI-generated summary: This post presents a Rust code sample that fails to compile. The move keyword is used so that the closure takes ownership of the Vec, but because val has type Vec<u32>, which does not implement the Copy trait, it cannot be moved out of a closure that is only bounded by Fn.


又耳笔记 -

Quickstart Tutorial for the Rust Web Framework axum, Part 2

The previous article discussed how axum extracts request parameters; this one looks at how axum builds response content. If you do not yet know how to handle request parameters in axum, you can read my earlier article first: https://youerning.top/post/axum/quickstart-1

AI-generated summary: This article describes how axum builds responses, covering the two most common response types, HTML and JSON. For HTML responses, the templating library askama can render front-end pages; for JSON responses, serde handles serialization. It also shows how to set status codes and serve static files. axum also supports other common response types such as redirects and SSE.
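
Not from the original tutorial, but as a rough sketch of the two response styles the summary mentions: axum handlers can return Html and Json values directly, because both implement IntoResponse. Details differ between axum versions, serde's derive feature is assumed, and the askama templating step is omitted here.

use axum::{response::Html, routing::get, Json, Router};
use serde::Serialize;

#[derive(Serialize)]
struct Message {
    code: u16,
    text: String,
}

// Returning Html<&'static str> produces a text/html response.
async fn html_handler() -> Html<&'static str> {
    Html("<h1>hello axum</h1>")
}

// Returning Json<T> (T: Serialize) produces an application/json response.
async fn json_handler() -> Json<Message> {
    Json(Message { code: 200, text: "hello axum".into() })
}

// Build the router; serving it is version-dependent and left out of this sketch.
fn app() -> Router {
    Router::new()
        .route("/html", get(html_handler))
        .route("/json", get(json_handler))
}
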


豌豆花下猫 | Python猫 -

Python Trending Weekly #29: Rust Slower Than Python?!

This issue shares 12 articles, 12 open-source projects, 2 podcasts, and 2 hot discussions.

AI-generated summary: This weekly shares Python, AI, and general tech content; it is open source and welcomes contributions. It recommends the FlowUs platform for personal productivity. Articles include a Rust vs. Python performance comparison, Python timestamp functions, Python vs. Go, speeding up pandas with Numba, a Flask maintenance checklist, running parallel Python with subinterpreters, comparing old and new open-source libraries, replacing pandas with Polars, and Python soft keywords.


Rust.cc -

Salvo 0.59.0 Released, a Rust Web Backend Framework

Salvo is a simple, easy-to-use, and powerful web backend framework written in Rust.

Highlights:
Richer functionality than axum and similar frameworks, yet easier to get started with.
Closer in feel to frameworks in Go and other languages, with less type-system friction than other Rust web frameworks.
Supports HTTP/1, HTTP/2, and HTTP/3.
A unified middleware and Handler interface: middleware can be implemented easily and flexibly, without any advanced language tricks.
Built-in form handling and powerful extractors that deserialize request data into structs with ease.
Supports WebSocket and WebTransport.
First-class OpenAPI support, with several open-source OpenAPI UIs built in.
ACME support, making it easy to obtain and automatically renew free TLS certificates.
Compatible with the Tower ecosystem.

This release:
Fixes ServeStaticDir not actually excluding dot files.
Adds an exclude_filter method to ServeStaticDir, so any files you do not want exposed can be excluded as needed.
Multiple improvements to OpenAPI support.
Extractor now switches its parsing strategy automatically based on the request.
UnixListener gains owner and permissions support.
Upgrades opentelemetry-prometheus to 0.14.

Full changelog: https://github.com/salvo-rs/salvo/releases/tag/v0.59.0

This release still depends on hyper 1.0-rc4; the proxy feature depends on the reqwest crate, and Salvo will update as soon as reqwest moves to hyper 1.0.
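
For readers who have not seen Salvo before, a hello-world along the lines of the project's README looks roughly like the sketch below. API details change between Salvo versions, so treat the listener/acceptor setup as approximate rather than authoritative.

use salvo::prelude::*;

// A handler is just an async function annotated with #[handler].
#[handler]
async fn hello() -> &'static str {
    "Hello, Salvo!"
}

#[tokio::main]
async fn main() {
    // Route GET / to the handler and serve it on a local port.
    let router = Router::new().get(hello);
    let acceptor = TcpListener::new("127.0.0.1:5800").bind().await;
    Server::new(acceptor).serve(router).await;
}
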

AI-generated summary: Salvo is a web backend framework written in Rust, feature-rich and easy to pick up. It supports HTTP/1, HTTP/2, and HTTP/3, provides a unified middleware and Handler interface, and has built-in form handling and powerful extractors. It supports WebSocket, WebTransport, and OpenAPI, and works with the Tower ecosystem. This release fixes a ServeStaticDir issue and updates OpenAPI support and Extractor parsing. Full changelog: https://github.com/salvo-rs/salvo/releases/tag/v0.59.0. The release depends on hyper 1.0-rc4 and reqwest.


Rust.cc -

Using Rust's asm! Inline Assembly

I want this function to operate on a CSR/register specified by name, but the asm! macro does not seem to support that?

pub fn csr_swap(csr: &str, val: u64) -> u64 {
    let value: u64 = 0;
    unsafe {
        asm!(
            "csrrw {0},{1},{2}",
            out(reg) value,
            csr,
            in(reg) val
        );
    }
    value
}

error: expected one of clobber_abi, const, in, inlateout, inout, lateout, options, out, or sym, found csr
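
Not part of the original question, but the root cause is that asm! templates are assembled at compile time, so a CSR name cannot be passed as a runtime &str. One common workaround, sketched here under the assumption of a RISC-V target and a CSR number known at compile time, is to pass the CSR as a const operand via a const generic:

use core::arch::asm;

// The CSR must be a compile-time constant, so take it as a const generic
// instead of a runtime &str; the const operand is spliced into the template.
pub fn csr_swap<const CSR: u32>(val: u64) -> u64 {
    let mut value: u64 = 0;
    unsafe {
        asm!(
            "csrrw {0}, {csr}, {1}",
            out(reg) value,
            in(reg) val,
            csr = const CSR,
        );
    }
    value
}

// Hypothetical usage with a numeric CSR address (0x340 = mscratch):
// let old = csr_swap::<0x340>(42);

If the CSR truly is only known at runtime, the usual approach is a match over the supported CSRs, each arm containing its own asm! block.
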

AI-generated summary: This post discusses a function that cannot operate on a register (CSR) specified by name at runtime; the author tried the asm! macro and found it does not accept such an operand.


Rust.cc -

[Rust Daily] 2023-11-29: Debugging UB in Rust unsafe Code

Debugging UB in Rust unsafe code

This article describes the problems encountered when debugging UB (undefined behavior) in Rust (a short illustration follows after this digest).

The risks of unsafe: discusses the nature and potential risks of Rust's unsafe code, and the undefined behavior that incorrect use can cause.
Ways to debug undefined behavior: approaches for identifying and resolving potential UB in unsafe Rust code, such as debuggers, LLVM sanitizers, and code review.
Debugging tips and suggestions: best practices and techniques for debugging unsafe Rust code, such as using assertions and being disciplined about pointer operations.
Avoiding undefined behavior: things to watch for and best practices when writing unsafe code, to avoid issues that may lead to UB.

Read more: https://hyphenos.io/blog/2023/debugging-ub-unsafe-rust-code/

Investigating crazy compile times

The author shares cases and hands-on experience involving compiler optimization, macro expansion, code generation, and compile time.

Why compile time matters: for large projects or complex codebases, optimizing compile time is crucial and can noticeably affect developer productivity and iteration speed.
Compiler optimizations and tricks: reducing unnecessary code dependencies, conditional compilation with the #[cfg] attribute, and cutting down macro expansion to shorten compile times.
The impact of macro expansion: the role of macros in Rust, the compile-time cost that macro expansion can add, and ways to reduce that impact.
Managing and optimizing compile time: strategies such as caching, profiling compile-time bottlenecks and optimizing them, and choosing a suitable compiler version.

Read more: https://blog.adamchalmers.com/crazy-compile-time/

From the Daily team: mook

Community learning and discussion platforms:
Rustcc forum: RSS supported
WeChat official account: Rust语言中文社区
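
As a concrete illustration of the kind of UB the first article above is about (not taken from the article itself), here is a classic use-after-free that the compiler happily accepts but that tools such as Miri or AddressSanitizer will flag:

fn dangling() -> i32 {
    let b = Box::new(42);
    let p: *const i32 = &*b;
    drop(b); // the allocation is freed here
    unsafe { *p } // UB: reading through a dangling pointer
}

fn main() {
    // May print 42, garbage, or crash; `cargo miri run` reports the UB deterministically.
    println!("{}", dangling());
}
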

AI-generated summary: This post discusses the problems encountered when debugging UB in Rust, and ways to identify and resolve potential undefined behavior in unsafe Rust code. It also touches on why compile time matters and strategies for optimizing it.


Xuanwo's Blog -

Rust std fs slower than Python!? No, it's hardware!

I'm about to share a lengthy tale that begins with opendal op.read() and concludes with an unexpected twist. This journey was quite enlightening for me, and I hope it will be for you too. I'll do my best to recreate the experience, complete with the lessons I've learned along the way. Let's dive in! All the code snippets and scripts are available in Xuanwo/when-i-find-rust-is-slow TL;DR Jump to Conclusion if you want to know the answer ASAP. OpenDAL Python Binding is slower than Python? OpenDAL is a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified way. We provided python binding for OpenDAL via pyo3. One day, @beldathas reports a case to me at discord that OpenDAL's python binding is slower than python: import pathlib import timeit import opendal root = pathlib.Path(__file__).parent op = opendal.Operator("fs", root=str(root)) filename = "lorem_ipsum_150mb.txt" def read_file_with_opendal() -> bytes: with op.open(filename, "rb") as fp: result = fp.read() return result def read_file_with_normal() -> bytes: with open(root / filename, "rb") as fp: result = fp.read() return result if __name__ == "__main__": print("normal: ", timeit.timeit(read_file_with_normal, number=100)) print("opendal: ", timeit.timeit(read_file_with_opendal, number=100)) The result shows that (venv) $ python benchmark.py normal: 4.470868484000675 opendal: 8.993250704006641 Well, well, well. I'm somewhat embarrassed by these results. Here are a few quick hypotheses: Does Python have an internal cache that can reuse the same memory? Does Python possess some trick to accelerate file reading? Does PyO3 introduce additional overhead? I've refactored the code to: python-fs-read: with open("/tmp/file", "rb") as fp: result = fp.read() assert len(result) == 64 * 1024 * 1024 python-opendal-read: import opendal op = opendal.Operator("fs", root=str("/tmp")) result = op.read("file") assert len(result) == 64 * 1024 * 1024 The result shows that python is much faster than opendal: Benchmark 1: python-fs-read/test.py Time (mean ± σ): 15.9 ms ± 0.7 ms [User: 5.6 ms, System: 10.1 ms] Range (min … max): 14.9 ms … 21.6 ms 180 runs Benchmark 2: python-opendal-read/test.py Time (mean ± σ): 32.9 ms ± 1.3 ms [User: 6.1 ms, System: 26.6 ms] Range (min … max): 31.4 ms … 42.6 ms 85 runs Summary python-fs-read/test.py ran 2.07 ± 0.12 times faster than python-opendal-read/test.py The Python binding for OpenDAL seems to be slower than Python itself, which isn't great news. Let's investigate the reasons behind this. OpenDAL Fs Service is slower than Python? This puzzle involves numerous elements such as rust, opendal, python, pyo3, among others. Let's focus and attempt to identify the root cause. 
I implement the same logic via opendal fs service in rust: rust-opendal-fs-read: use std::io::Read; use opendal::services::Fs; use opendal::Operator; fn main() { let mut cfg = Fs::default(); cfg.root("/tmp"); let op = Operator::new(cfg).unwrap().finish().blocking(); let mut bs = vec![0; 64 * 1024 * 1024]; let mut f = op.reader("file").unwrap(); let mut ts = 0; loop { let buf = &mut bs[ts..]; let n = f.read(buf).unwrap(); let n = n as usize; if n == 0 { break } ts += n; } assert_eq!(ts, 64 * 1024 * 1024); } However, the result shows that opendal is slower than python even when opendal is implemented in rust: Benchmark 1: rust-opendal-fs-read/target/release/test Time (mean ± σ): 23.8 ms ± 2.0 ms [User: 0.4 ms, System: 23.4 ms] Range (min … max): 21.8 ms … 34.6 ms 121 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 15.6 ms ± 0.8 ms [User: 5.5 ms, System: 10.0 ms] Range (min … max): 14.4 ms … 20.8 ms 166 runs Summary python-fs-read/test.py ran 1.52 ± 0.15 times faster than rust-opendal-fs-read/target/release/test While rust-opendal-fs-read performs slightly better than python-opendal-read, indicating room for improvement in the binding & pyo3, these aren't the core issues. We need to delve deeper. Ah, opendal fs service is slower than python. Rust std fs is slower than Python? OpenDAL implement fs service via std::fs. Could there be additional costs incurred by OpenDAL itself? I implement the same logic via rust std::fs: rust-std-fs-read: use std::io::Read; use std::fs::OpenOptions; fn main() { let mut bs = vec![0; 64 * 1024 * 1024]; let mut f = OpenOptions::new().read(true).open("/tmp/file").unwrap(); let mut ts = 0; loop { let buf = &mut bs[ts..]; let n = f.read(buf).unwrap(); let n = n as usize; if n == 0 { break } ts += n; } assert_eq!(ts, 64 * 1024 * 1024); } But.... Benchmark 1: rust-std-fs-read/target/release/test Time (mean ± σ): 23.1 ms ± 2.5 ms [User: 0.3 ms, System: 22.8 ms] Range (min … max): 21.0 ms … 37.6 ms 124 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 15.2 ms ± 1.1 ms [User: 5.4 ms, System: 9.7 ms] Range (min … max): 14.3 ms … 21.4 ms 178 runs Summary python-fs-read/test.py ran 1.52 ± 0.20 times faster than rust-std-fs-read/target/release/test Wow, Rust's std fs is slower than Python? How can that be? No offense intended, but how is that possible? Rust std fs is slower than Python? Really!? I can't believe the results: rust std fs is surprisingly slower than Python. I learned how to use strace for syscall analysis. strace is a Linux syscall tracer that allows us to monitor syscalls and understand their processes. The strace will encompass all syscalls dispatched by the program. Our attention should be on aspects associated with /tmp/file. Each line of the strace output initiates with the syscall name, followed by input arguments and output. For example: openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 Means we invoke the openat syscall using arguments AT_FDCWD, "/tmp/file", and O_RDONLY|O_CLOEXEC. This returns output 3, which is the file descriptor referenced in the subsequent syscall. Alright, we've mastered strace. Let's put it to use! strace of rust-std-fs-read: > strace ./rust-std-fs-read/target/release/test ... 
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f290dd40000 openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 read(3, "\tP\201A\225\366>\260\270R\365\313\220{E\372\274\6\35\"\353\204\220s\2|7C\205\265\6\263"..., 67108864) = 67108864 read(3, "", 0) = 0 close(3) = 0 munmap(0x7f290dd40000, 67112960) = 0 ... strace of python-fs-read: > strace ./python-fs-read/test.py ... openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=67108864, ...}, AT_EMPTY_PATH) = 0 ioctl(3, TCGETS, 0x7ffe9f844ac0) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_CUR) = 0 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=67108864, ...}, AT_EMPTY_PATH) = 0 mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f13277ff000 read(3, "\tP\201A\225\366>\260\270R\365\313\220{E\372\274\6\35\"\353\204\220s\2|7C\205\265\6\263"..., 67108865) = 67108864 read(3, "", 1) = 0 close(3) = 0 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f132be5c710}, {sa_handler=0x7f132c17ac36, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f132be5c710}, 8) = 0 munmap(0x7f13277ff000, 67112960) = 0 ... From analyzing strace, it's clear that python-fs-read has more syscalls than rust-std-fs-read, with both utilizing mmap. So why python is faster than rust? Why we are using mmap here? I initially believed mmap was solely for mapping files to memory, enabling file access through memory. However, mmap has other uses too. It's commonly used to allocate large regions of memory for applications. This can be seen in the strace results: mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f13277ff000 This syscall means NULL: the first arg means start address of the memory region to map. NULL will let OS to pick up a suitable address for us. 67112960: The size of the memory region to map. We are allocating 64MiB + 4KiB memory here, the extra page is used to store the metadata of this memory region. PROT_READ|PROT_WRITE: The memory region is readable and writable. MAP_PRIVATE|MAP_ANONYMOUS: MAP_PRIVATE means changes to this memory region will not be visible to other processes mapping the same region, and are not carried through to the underlying file (if we have). MAP_ANONYMOUS means we are allocating anonymous memory that not related to a file. -1: The file descriptor of the file to map. -1 means we are not mapping a file. 0: The offset in the file to map from. Use 0 here since we are not mapping a file. But I don't use mmap in my code. The mmap syscall is dispatched by glibc. We utilize malloc to solicit memory from the system, and in response, glibc employs both the brk and mmap syscalls to allocate memory according to our request size. If the requested size is sufficiently large, then glibc opts for using mmap, which helps mitigate memory fragmentation issues. By default, all Rust programs compiled with target x86_64-unknown-linux-gnu use the malloc provided by glibc. Does python has the same memory allocator with rust? Python, by default, utilizes pymalloc, a memory allocator optimized for small allocations. Python features three memory domains, each representing different allocation strategies and optimized for various purposes. pymalloc has the following behavior: Python has a pymalloc allocator optimized for small objects (smaller or equal to 512 bytes) with a short lifetime. 
It uses memory mappings called “arenas” with a fixed size of either 256 KiB on 32-bit platforms or 1 MiB on 64-bit platforms. It falls back to PyMem_RawMalloc() and PyMem_RawRealloc() for allocations larger than 512 bytes. Rust is slower than Python with default memory allocator? I suspect that mmap is causing this issue. What would occur if I switched to jemalloc? rust-std-fs-read-with-jemalloc: use std::io::Read; use std::fs::OpenOptions; #[global_allocator] static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc; fn main() { let mut bs = vec![0; 64 * 1024 * 1024]; let mut f = OpenOptions::new().read(true).open("/tmp/file").unwrap(); let mut ts = 0; loop { let buf = &mut bs[ts..]; let n = f.read(buf).unwrap(); let n = n as usize; if n == 0 { break } ts += n; } assert_eq!(ts, 64 * 1024 * 1024); } Wooooooooooooooow?! Benchmark 1: rust-std-fs-read-with-jemalloc/target/release/test Time (mean ± σ): 9.7 ms ± 0.6 ms [User: 0.3 ms, System: 9.4 ms] Range (min … max): 9.0 ms … 12.4 ms 259 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 15.8 ms ± 0.9 ms [User: 5.9 ms, System: 9.8 ms] Range (min … max): 15.0 ms … 21.8 ms 169 runs Summary rust-std-fs-read-with-jemalloc/target/release/test ran 1.64 ± 0.14 times faster than python-fs-read/test.py What?! I understand that jemalloc is a proficient memory allocator, but how can it be this exceptional? This is baffling. Rust is slower than Python only on my machine! As more friends joined the discussion, we discovered that rust runs slower than python only on my machine. My CPU: > lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 5950X 16-Core Processor CPU family: 25 Model: 33 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU(s) scaling MHz: 53% CPU max MHz: 5083.3979 CPU min MHz: 2200.0000 BogoMIPS: 6787.49 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm con stant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f 16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpex t perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wb noinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap Virtualization features: Virtualization: AMD-V Caches (sum of all): L1d: 512 KiB (16 instances) L1i: 512 KiB (16 instances) L2: 8 MiB (16 instances) L3: 64 MiB (2 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerabilities: Gather data sampling: Not affected Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Not affected Spec rstack overflow: Vulnerable Spec store bypass: Vulnerable Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no 
swapgs barriers Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected Srbds: Not affected Tsx async abort: Not affected My memory: > sudo dmidecode --type memory # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.3.0 present. Handle 0x0014, DMI type 16, 23 bytes Physical Memory Array Location: System Board Or Motherboard Use: System Memory Error Correction Type: None Maximum Capacity: 64 GB Error Information Handle: 0x0013 Number Of Devices: 4 Handle 0x001C, DMI type 17, 92 bytes Memory Device Array Handle: 0x0014 Error Information Handle: 0x001B Total Width: 64 bits Data Width: 64 bits Size: 16 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL A Type: DDR4 Type Detail: Synchronous Unbuffered (Unregistered) Speed: 3200 MT/s Manufacturer: Unknown Serial Number: 04904740 Asset Tag: Not Specified Part Number: LMKUFG68AHFHD-32A Rank: 2 Configured Memory Speed: 3200 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 9, Hex 0xC8 Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 16 GB Cache Size: None Logical Size: None So I tried the following things: Enable Mitigations CPUs possess numerous vulnerabilities that could expose private data to attackers, with Spectre being one of the most notable. The Linux kernel has developed various mitigations to counter these vulnerabilities and they are enabled by default. However, these mitigations can impose additional system costs. Therefore, the Linux kernel also offers a mitigations flag for users who wish to disable them. I used to disable all mitigations like the following: title Arch Linux linux /vmlinuz-linux-zen initrd /amd-ucode.img initrd /initramfs-linux-zen.img options root="PARTUUID=206e7750-2b89-419d-978e-db0068c79c52" rw mitigations=off Enable it back didn't change the result. Tune Transparent Hugepage Transparent Hugepage can significantly impact performance. Most modern distributions enable it by default. > cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never Switching to madvise or never alters the absolute outcome, but the relative ratio remains consistent. Tune CPU Core Affinity @Manjusaka guesses this related to CPU Core Spacing. I tried to use core_affinity to bind process to specific CPU, but the result is the same. Measure syscall latency by eBPF @Manjusaka also created an eBPF program for me to gauge the latency of read syscalls. The findings indicate that Rust is also slower than Python at syscall level. There's another lengthy tale about this eBPF program that @Manjusaka should share in a post! # python fs read Process 57555 read file 8134049 ns Process 57555 read file 942 ns # rust std fs read Process 57634 read file 24636975 ns Process 57634 read file 1052 ns Observation: On my computer, Rust operates slower than Python and it doesn't appear to be related to the software. C is slower than Python? I'm quite puzzled and can't pinpoint the difference. I suspect it might have something to do with the CPU, but I'm unsure which aspect: cache? frequency? core spacing? core affinity? architecture? 
Following the guidance from the Telegram group @rust_zh, I've developed a C version: c-fs-read: #include <stdio.h> #include <stdlib.h> #define FILE_SIZE 64 * 1024 * 1024 // 64 MiB int main() { FILE *file; char *buffer; size_t result; file = fopen("/tmp/file", "rb"); if (file == NULL) { fputs("Error opening file", stderr); return 1; } buffer = (char *)malloc(sizeof(char) * FILE_SIZE); if (buffer == NULL) { fputs("Memory error", stderr); fclose(file); return 2; } result = fread(buffer, 1, FILE_SIZE, file); if (result != FILE_SIZE) { fputs("Reading error", stderr); fclose(file); free(buffer); return 3; } fclose(file); free(buffer); return 0; } But... Benchmark 1: c-fs-read/test Time (mean ± σ): 23.8 ms ± 0.9 ms [User: 0.3 ms, System: 23.6 ms] Range (min … max): 23.0 ms … 27.1 ms 120 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 19.1 ms ± 0.3 ms [User: 8.6 ms, System: 10.4 ms] Range (min … max): 18.6 ms … 20.6 ms 146 runs Summary python-fs-read/test.py ran 1.25 ± 0.05 times faster than c-fs-read/test The C version is also slower than Python! Does python have magic? C is slower than Python without specified offset! At this time, @lilydjwg has joined the discussion and noticed a difference in the memory region offset between C and Python. strace -e raw=read,mmap ./program is used to print the undecoded arguments for the syscalls: the pointer address. strace for c-fs-read: > strace -e raw=read,mmap ./c-fs-read/test ... mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f96d1a18000 read(0x3, 0x7f96d1a18010, 0x4000000) = 0x4000000 close(3) = 0 strace for python-fs-read > strace -e raw=read,mmap ./python-fs-read/test.py ... mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f27dcfbe000 read(0x3, 0x7f27dcfbe030, 0x4000001) = 0x4000000 read(0x3, 0x7f27e0fbe030, 0x1) = 0 close(3) = 0 In c-fs-read, mmap returns 0x7f96d1a18000, but read syscall use 0x7f96d1a18010 as the start address, the offset is 0x10. In python-fs-read, mmap returns 0x7f27dcfbe000, and read syscall use 0x7f27dcfbe030 as the start address, the offset is 0x30. So @lilydjwg tried to calling read with the same offset: :) ./bench c-fs-read c-fs-read-with-offset python-fs-read ['hyperfine', 'c-fs-read/test', 'c-fs-read-with-offset/test', 'python-fs-read/test.py'] Benchmark 1: c-fs-read/test Time (mean ± σ): 23.7 ms ± 0.8 ms [User: 0.2 ms, System: 23.6 ms] Range (min … max): 23.0 ms … 25.5 ms 119 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: c-fs-read-with-offset/test Time (mean ± σ): 8.9 ms ± 0.4 ms [User: 0.2 ms, System: 8.8 ms] Range (min … max): 8.3 ms … 10.6 ms 283 runs Benchmark 3: python-fs-read/test.py Time (mean ± σ): 19.1 ms ± 0.3 ms [User: 8.6 ms, System: 10.4 ms] Range (min … max): 18.6 ms … 20.0 ms 147 runs Summary c-fs-read-with-offset/test ran 2.15 ± 0.11 times faster than python-fs-read/test.py 2.68 ± 0.16 times faster than c-fs-read/test !!! Applying an offset to buffer in c-fs-read enhances its speed, outperforming Python! Additionally, we've verified that this issue is replicable on both the AMD Ryzen 9 5900X and AMD Ryzen 7 5700X. The new information led me to other reports about a similar issue, Std::fs::read slow?. In this post, @ambiso discovered that syscall performance is linked to the offset of the memory region. He noted that this CPU slows down when writing from the first 0x10 bytes of each page: offset milliseconds ... 
14 130 15 130 16 46 <----- 0x10! 17 48 ... AMD Ryzen 9 5900X is slow without specified offset! We've confirmed that this issue is related to the CPU. However, we're still unsure about its potential reasons. @Manjusaka has invited kernel developer @ryncsn to join the discussion. He can reproduce the same outcome using our c-fs-read and c-fs-read-with-offset demos on AMD Ryzen 9 5900HX. He also attempted to profile the two programs using perf. Without offset: perf stat -d -d -d --repeat 20 ./a.out Performance counter stats for './a.out' (20 runs): 30.89 msec task-clock # 0.968 CPUs utilized ( +- 1.35% ) 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 598 page-faults # 19.362 K/sec ( +- 0.05% ) 90,321,344 cycles # 2.924 GHz ( +- 1.12% ) (40.76%) 599,640 stalled-cycles-frontend # 0.66% frontend cycles idle ( +- 2.19% ) (42.11%) 398,016 stalled-cycles-backend # 0.44% backend cycles idle ( +- 22.41% ) (41.88%) 43,349,705 instructions # 0.48 insn per cycle # 0.01 stalled cycles per insn ( +- 1.32% ) (41.91%) 7,526,819 branches # 243.701 M/sec ( +- 5.01% ) (41.22%) 37,541 branch-misses # 0.50% of all branches ( +- 4.62% ) (41.12%) 127,845,213 L1-dcache-loads # 4.139 G/sec ( +- 1.14% ) (39.84%) 3,172,628 L1-dcache-load-misses # 2.48% of all L1-dcache accesses ( +- 1.34% ) (38.46%) <not supported> LLC-loads <not supported> LLC-load-misses 654,651 L1-icache-loads # 21.196 M/sec ( +- 1.71% ) (38.72%) 2,828 L1-icache-load-misses # 0.43% of all L1-icache accesses ( +- 2.35% ) (38.67%) 15,615 dTLB-loads # 505.578 K/sec ( +- 1.28% ) (38.82%) 12,825 dTLB-load-misses # 82.13% of all dTLB cache accesses ( +- 1.15% ) (38.88%) 16 iTLB-loads # 518.043 /sec ( +- 27.06% ) (38.82%) 2,202 iTLB-load-misses # 13762.50% of all iTLB cache accesses ( +- 23.62% ) (39.38%) 1,843,493 L1-dcache-prefetches # 59.688 M/sec ( +- 3.36% ) (39.40%) <not supported> L1-dcache-prefetch-misses 0.031915 +- 0.000419 seconds time elapsed ( +- 1.31% ) With offset: perf stat -d -d -d --repeat 20 ./a.out Performance counter stats for './a.out' (20 runs): 15.39 msec task-clock # 0.937 CPUs utilized ( +- 3.24% ) 1 context-switches # 64.972 /sec ( +- 17.62% ) 0 cpu-migrations # 0.000 /sec 598 page-faults # 38.854 K/sec ( +- 0.06% ) 41,239,117 cycles # 2.679 GHz ( +- 1.95% ) (40.68%) 547,465 stalled-cycles-frontend # 1.33% frontend cycles idle ( +- 3.43% ) (40.60%) 413,657 stalled-cycles-backend # 1.00% backend cycles idle ( +- 20.37% ) (40.50%) 37,009,429 instructions # 0.90 insn per cycle # 0.01 stalled cycles per insn ( +- 3.13% ) (40.43%) 5,410,381 branches # 351.526 M/sec ( +- 3.24% ) (39.80%) 34,649 branch-misses # 0.64% of all branches ( +- 4.04% ) (39.94%) 13,965,813 L1-dcache-loads # 907.393 M/sec ( +- 3.37% ) (39.44%) 3,623,350 L1-dcache-load-misses # 25.94% of all L1-dcache accesses ( +- 3.56% ) (39.52%) <not supported> LLC-loads <not supported> LLC-load-misses 590,613 L1-icache-loads # 38.374 M/sec ( +- 3.39% ) (39.67%) 1,995 L1-icache-load-misses # 0.34% of all L1-icache accesses ( +- 4.18% ) (39.67%) 16,046 dTLB-loads # 1.043 M/sec ( +- 3.28% ) (39.78%) 14,040 dTLB-load-misses # 87.50% of all dTLB cache accesses ( +- 3.24% ) (39.78%) 11 iTLB-loads # 714.697 /sec ( +- 29.56% ) (39.77%) 3,657 iTLB-load-misses # 33245.45% of all iTLB cache accesses ( +- 14.61% ) (40.30%) 395,578 L1-dcache-prefetches # 25.702 M/sec ( +- 3.34% ) (40.10%) <not supported> L1-dcache-prefetch-misses 0.016429 +- 0.000521 seconds time elapsed ( +- 3.17% ) He found the value of L1-dcache-prefetches and L1-dcache-loads differs a lot. 
L1-dcache-prefetches is the prefetches of CPU L1 data cache. L1-dcache-loads is the loads of CPU L1 data cache. Without a specified offset, the CPU will perform more loads and prefetches of L1-dcache, resulting in increased syscall time. He did a further research over the hotspot ASM: Samples: 15K of event 'cycles:P', Event count (approx.): 6078132137 Children Self Command Shared Object Symbol - 94.11% 0.00% a.out [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe ◆ - entry_SYSCALL_64_after_hwframe ▒ - 94.10% do_syscall_64 ▒ - 86.66% __x64_sys_read ▒ ksys_read ▒ - vfs_read ▒ - 85.94% shmem_file_read_iter ▒ - 77.17% copy_page_to_iter ▒ - 75.80% _copy_to_iter ▒ + 19.41% asm_exc_page_fault ▒ 0.71% __might_fault ▒ + 4.87% shmem_get_folio_gfp ▒ 0.76% folio_mark_accessed ▒ + 4.38% __x64_sys_munmap ▒ + 1.02% 0xffffffffae6f6fe8 ▒ + 0.79% __x64_sys_execve ▒ + 0.58% __x64_sys_mmap ▒ Inside _copy_to_iter, the ASM will be: │ copy_user_generic(): 2.19 │ mov %rdx,%rcx │ mov %r12,%rsi 92.45 │ rep movsb %ds:(%rsi),%es:(%rdi) 0.49 │ nop │ nop │ nop The key difference here is the performance of rep movsb. AMD Ryzen 9 5900X is slow for FSRM! At this time, one of my friend sent me a link about Terrible memcpy performance on Zen 3 when using rep movsb. In which also pointed to rep movsb: I've found this using a memcpy benchmark at https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp (compiled with the adjacent Makefile). To demonstrate the issue, run ./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0 This runs: 2113-byte memory copies 1,000,000 times per timing measurement in memory allocated with mmap with the source 0 bytes from the start of the page with the destination 1 byte from the start of the page on core 0. It reports about 3.2 GB/s. Change the -b argument to 2111 and it reports over 100 GB/s. So the REP MOVSB case is about 30× slower! FSRM, short for Fast Short REP MOV, is an innovation originally by Intel, recently incorporated into AMD as well, to enhance the speed of rep movsb and rep movsd. It's designed to boost the efficiency of copying large amounts of memory. CPUs that declare support for it will use FSRM as a default in glibc. @ryncsn has conducted further research and discovered that it's not related to L1 prefetches. It seems that rep movsb performance poorly when DATA IS PAGE ALIGNED, and perform better when DATA IS NOT PAGE ALIGNED, this is very funny... Conclusion In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug. (I can finally get some sleep now.) However, our users continue to struggle with this problem. Unfortunately, features like FSRM will be implemented in ucode, leaving us no choice but to wait for AMD's response. An alternative solution could be not using FSRM or providing a flag to disable it. Rust developers might consider switching to jemallocator for improved performance - a beneficial move even without the presence of AMD CPU bugs. Final Remarks I spent nearly three days addressing this issue, which began with complaints from opendal users and eventually led me to the CPU's ucode. This journey taught me a lot about strace, perf and eBPF. It was my first time using eBPF for diagnostics. I also explored various unfruitful avenues such as studying the implementations of rust's std::fs and Python & CPython's read implementation details. Initially, I hoped to resolve this at a higher level but found it necessary to delve deeper. 
A big thank you to everyone who contributed to finding the answer: @beldathas from opendal's discord for identifying the problem. The team at @datafuselabs for their insightful suggestions. Our friends over at @rust_zh (a rust telegram group mainly in zh-Hans) for their advice and reproduction efforts. @Manjusaka for reproducing the issue and use eBPF to investigate, which helped narrow down the problem to syscall itself. @lilydjwg for pinpointing the root cause: a 0x20 offset in memory @ryncsn for his thorough analysis And a friend who shared useful links about FSRM Looking forward to our next journey! Reference Xuanwo/when-i-find-rust-is-slow has all the code snippets and scripts. Std::fs::read slow? is a report from rust community Terrible memcpy performance on Zen 3 when using rep movsb is a report to ubuntu glibc binding/python: rust std fs is slower than python fs Updates This article, written on 2023-11-29, has gained widespread attention! To prevent any confusion among readers, I've decided not to alter the original content. Instead, I'll provide updates here to keep the information current and address frequently asked questions. 2023-12-01: Does AMD know about this bug? TL;DR: Yes To my knowledge, AMD has been aware of this bug since 2021. After the article was published, several readers forwarded the link to AMD, so I'm confident they're informed about it. I firmly believe that AMD should take responsibility for this bug and address it in amd-ucode. However, unverified sources suggest that a fix via amd-ucode is unlikely (at least for Zen 3) due to limited patch space. If you have more information on this matter, please reach out to me. Our only hope is to address this issue in glibc by disabling FSRM as necessary. Progress has been made on the glibc front: x86: Improve ERMS usage on Zen3. Stay tuned for updates. 2023-12-01: Is jemalloc or pymalloc faster than glibc's malloc? TL;DR: No Apologies for not detailing the connection between the memory allocator and this bug. It may seem from this post that jemalloc (pymalloc, mimalloc) is significantly faster than glibc's malloc. However, that's not the case. The issue doesn't related to the memory allocator. The speed difference where jemalloc or pymalloc outpaces glibc malloc is coincidental due to differing memory region offsets. Let's analyze the strace of rust-std-fs-read and rust-std-fs-read-with-jemalloc: strace for rust-std-fs-read: > strace -e raw=read,mmap ./rust-std-fs-read/target/release/test ... mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f39a6e49000 openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 read(0x3, 0x7f39a6e49010, 0x4000000) = 0x4000000 read(0x3, 0x7f39aae49010, 0) = 0 close(3) = 0 strace for rust-std-fs-read-with-jemalloc: > strace -e raw=read,mmap ./rust-std-fs-read-with-jemalloc/target/release/test ... mmap(0, 0x200000, 0x3, 0x4022, 0xffffffff, 0) = 0x7f7a5a400000 mmap(0, 0x5000000, 0x3, 0x4022, 0xffffffff, 0) = 0x7f7a55400000 openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 read(0x3, 0x7f7a55400740, 0x4000000) = 0x4000000 read(0x3, 0x7f7a59400740, 0) = 0 close(3) = 0 In rust-std-fs-read, mmap returns 0x7f39a6e49000, but read syscall use 0x7f39a6e49010 as the start address, the offset is 0x10. In rust-std-fs-read-with-jemalloc, mmap returns 0x7f7a55400000, and read syscall use 0x7f7a55400740 as the start address, the offset is 0x740. rust-std-fs-read-with-jemalloc outperforms rust-std-fs-read due to its larger offset, which falls outside the problematic range: 0x00..0x10 within a page. 
It's possible to reproduce the same issue with jemalloc.

AI-generated summary: This article discusses a performance issue with the Python binding for OpenDAL, a data access layer. The author compares the performance of reading a file using OpenDAL's Python binding, Python itself, and a Rust implementation. They find that Python is faster than both OpenDAL and Rust. Further investigation reveals that the issue is related to a CPU bug in AMD Ryzen processors, specifically with the rep movsb instruction. The author concludes that this is not a software-related issue and suggests waiting for AMD's response or considering alternative solutions such as disabling the CPU feature causing the problem.


张小凯的博客 -

Creating Global Variables in Rust

A summary of the commonly used ways to define global variables in Rust.

AI-generated summary: In Rust there are several ways to define global variables, falling broadly into compile-time initialization and runtime initialization. Compile-time initialized globals can be created with const, static, and atomic types. Runtime-initialized globals can be created with the lazy_static macro, Box::leak, or OnceLock. In newer Rust versions, prefer the standard library's OnceCell/OnceLock for globals. If a type has thread-safe interior mutability and a const constructor, it can simply be declared as a static. When that is not possible, OnceLock can handle the initialization. When many such globals are needed, once_cell::sync::Lazy can replace lazy_static. Note that existing code using once_cell or lazy_static does not need to change; those crates remain available.
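
A small sketch of the OnceLock approach mentioned above (in std since Rust 1.70), for a global that is initialized lazily and thread-safely on first access:

use std::collections::HashMap;
use std::sync::OnceLock;

// A global map; get_or_init runs the closure exactly once, even across threads.
static CONFIG: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();

fn config() -> &'static HashMap<&'static str, &'static str> {
    CONFIG.get_or_init(|| {
        let mut m = HashMap::new();
        m.insert("mode", "release");
        m
    })
}

fn main() {
    println!("mode = {}", config()["mode"]);
}
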

