豌豆花下猫 | Python猫 -

Python 潮流周刊#29:Rust 会比 Python 慢?!

本期周刊分享了 12 篇文章,12 个开源项目,2 则播客,2 个热门讨论

AI生成摘要 本周刊分享Python、AI及通用技术内容,开源并欢迎投稿。推荐FlowUs平台提升个人生产力。文章包括Rust与Python性能比较、Python时间戳函数、Python与Go的比较、使用Numba提升pandas性能、Flask维护任务清单、子解释器运行Python并行程序、新旧开源库对比、使用Polars替代Pandas、Python软关键字。

相关推荐 去reddit讨论

Planet PostgreSQL -

semab tariq: PostgreSQL with Python – A Developer’s Guide

This guide will delve into the seamless integration of PostgreSQL and Python, covering optimal connection methods and demonstrating how to execute CRUD operations (Create, Read, Update, Delete) efficiently within PostgreSQL using Python. The post PostgreSQL with Python – A Developer’s Guide appeared first on Stormatics.

AI生成摘要 PostgreSQL和Python的结合为开发人员提供了良好的合作伙伴关系。当与Python结合使用时,PostgreSQL作为一个有影响力的开源数据库系统,成为处理数据密集型应用程序的坚实基础。本指南将深入探讨PostgreSQL和Python的无缝集成,涵盖最佳连接方法,并演示如何使用Python在PostgreSQL中高效执行CRUD操作(创建、读取、更新、删除)。

相关推荐 去reddit讨论

Armin Ronacher's Thoughts and Writings -

Untyped Python: The Python That Was

A lot has been said about Python typing. If you have been following me on Twitter (or you have the dubious pleasure of working with me), you probably know my skepticism towards Python typing. This stems from the syntax's complexity, the sluggishness of mypy, the overall cumbersome nature of its implementation and awkwardness of interactions with it. I won't dwell on these details today, instead I want to take you on a little journey back to my early experiences with Python. Why? Because I believe the conflict between the intrinsic philosophy of Python and the concept of typing is fundamental and profound, but also not new. The concept of typed programming languages predates 2015 by a long stretch. They were not invented now. Debates over the necessity of typing are not a recent phenomenon at all. When you wanted to start a new software project, particularly something that resembles a web service you always had a choice of programming language. Back in 2004 when I started diving into programming, there were plenty of languages to chose. The conventional choice was not Python, the obvious choice was not even PHP rather Java. Java was the go-to for serious web application projects, given its typing system and enterprise-grade features. PHP was for toys, Python was nowhere to be found. PHP was popular, but in my circles it was always seen as an entirely ridiculous concept and the idea that someone would build a business on it even more so. I remember in my first year of University the prevalent opinion was that the real world runs on .NET, Java and C++. PHP was ridiculed, Python and Ruby did not appear in conversations and JavaScript on the server was non existent. Yet here I was, I built stuff in PHP and Python. My choice wasn't driven by an aversion to static typing out of laziness but by the exceptional developer experience these languages offered, to a large part because of the lack of types. There was a stellar developer experience. Yes it did not have intellisense, but all the changes that I did appear on the web instantly. I recall directly modifying live websites via FTP in real time. Later editing web sites straight from vim on the production server. Was it terrible and terrifying? Absolutely. But damn it was productive. I learned a lot from that. They taught me valuable lessons about trade-offs. It was not just me that learned that, an entire generation of developers in those languages learned that our biggest weakness (it not being typed, and i wasn't compiled) was also our biggest strength. It required a bit of restraint and it required a slightly different way of programming, but it was incredibly productive. There was the world of XPath, there was the world of DTDs, there was the world of SOAP and WSDL. There was the world where the inherent complexity of the system was so great, that you absolutely required an IDE, code generation and compile time tooling. In contrast there was my world. My world ad me sitting with Vim, CVS and SVN and a basic Linux box and I was able to build things that I was incredibly proud of. I eventually swapped PHP for Python because it had better trade offs for me. But I will never not recognize what PHP gave me: I learned from it that not everything has to be pretty, it has to solve problems. And it did. But in the same way with PHP, the total surface area between me and the Python language runtime was tiny. The code I wrote, was transformed by the interpreter into bytecode instructions (which you could even look at!) and evaluated by a tiny loop in the interpreter. The interpreter was Open Source, it was easy to read, and most importantly I was able to poke around in it. Not only was I able to learn more about computers this way, it also made it incredibly easy for me to understand what exactly was going on. Without doubt I was able to understand everything between the code that I wrote, and the code that ran end to end. Yes, there was no static type checking and intellisense was basically non existing. Companies like Microsoft did not even think that Python was a language yet. But screw it, we were productive! Not only that, we build large software projects. We knew were the tradeoffs were. We had runtime errors flying left and right in production because bad types were passed, but we also had the tools to work with it! I distinctly remember how blown away a colleague from the .NET world was when I showed him some of the tools I had. That after I deployed bad code and it blew up in someone's face, I got an email that not only shows a perfectly readable stack trace, but also a line of source code for the frames. He was even more blown away when I showed him that I had a module that allowed me to attach remotely to the running interpreter and execute Python code on the fly to debug it. The developer experience was built around there being very few layers in the onion. But hear me out: all the arguments against dynamic languages and dynamic typing systems were already there! Nothing new has been invented, nothing really has changed. We all knew that there was value in typing, and we also all collectively said: screw it. We don't need this, we do duck typing. Let's play this to our advantage. Here is what has changed: we no longer trust developers as much and we are re-introducing the complexity that we were fighting. Modern Python can at times be impossible to comprehend for a developer. In a way in some areas we are creating the new Java. We became the people we originally displaced. Just that when we are not careful we are on a path to the world's worst Java. We put typing on a language that does not support it, our interpreter is slow, it has a GIL. We need to be careful not to forget that our roots are somewhere else. We should not collectively throw away the benefits we had. The winds changed, that's undeniable. Other languages have shown that types add value in new and exciting ways. When I had the arguments with folks about Python vs Java typing originally, Java did not even have generics. JavaScript was fighting against its reputation of being an insufferable toy. TypeScript was years away from being created. While nothing new has been invented, some things were popularized. Abstract data types are no longer a toy for researchers. .NET started mixing static and dynamic typing, TypeScript later popularized adding types to languages originally created without them. There are also many more developers in our community who are less likely to understand what made those languages appealing in the first place. So, where does this leave us? Is this a grumpy me complaining about times gone and how types are ruining everything? Hardly. There's undeniable utility in typing, and there is an element that could lead to greater overall productivity. Yet, the inherent trade-offs remain unchanged, and opting for or against typing should be a choice free from stigma. The core principles of this decision have not altered: types add value and they add cost. Post script: Python is in a spot now where the time spent for me typing it, does not pay dividends. TypeScript on the other hand tilts more towards productivity for me. Python could very well reach that point. I will revisit this.

AI生成摘要 这篇文章讨论了Python的类型问题。作者回顾了自己早期使用Python的经历,认为Python的动态特性给开发者带来了极高的开发体验和生产力。然而,现代Python引入了静态类型,增加了复杂性,可能使语言变得难以理解。作者认为我们不应该忘记Python的根源,应该谨慎对待类型的引入。尽管类型有其价值,但选择是否使用类型应该是自由的。作者最后提到,对于他来说,目前使用TypeScript更加高效,但Python也有可能达到这一点。

相关推荐 去reddit讨论

Dorad's Life -

Docker buildx交叉编译Python记录

捣鼓了一下docker buildx交叉编译

AI生成摘要 本文介绍了使用Docker打包Python应用的过程,以及使用docker buildx进行交叉编译的方法。通过构建多平台镜像,可以提高应用的交付效率。

相关推荐 去reddit讨论

FreeBuf网络安全行业门户 -

Black hat 大会13岁天才黑客:从Python到勒索软件


AI生成摘要 13岁的沙特阿拉伯男孩马可·利伯拉尔在中东和非洲黑帽大会上发表了关于如何应对勒索软件的演讲,受到了广泛赞扬。马可从小对物理安全和编程感兴趣,学习Python并编写恶意软件。他计划创办一家初创公司,并与克里斯·罗伯茨合作。马可通过KAUST认识了许多人,他的母亲表示他是一个与众不同的孩子。马可参加黑帽大会是因为KAUST网络安全负责人的推荐。

相关推荐 去reddit讨论

Xuanwo's Blog -

Rust std fs slower than Python!? No, it's hardware!

I'm about to share a lengthy tale that begins with opendal op.read() and concludes with an unexpected twist. This journey was quite enlightening for me, and I hope it will be for you too. I'll do my best to recreate the experience, complete with the lessons I've learned along the way. Let's dive in! All the code snippets and scripts are available in Xuanwo/when-i-find-rust-is-slow TL;DR Jump to Conclusion if you want to know the answer ASAP. OpenDAL Python Binding is slower than Python? OpenDAL is a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified way. We provided python binding for OpenDAL via pyo3. One day, @beldathas reports a case to me at discord that OpenDAL's python binding is slower than python: import pathlib import timeit import opendal root = pathlib.Path(__file__).parent op = opendal.Operator("fs", root=str(root)) filename = "lorem_ipsum_150mb.txt" def read_file_with_opendal() -> bytes: with op.open(filename, "rb") as fp: result = fp.read() return result def read_file_with_normal() -> bytes: with open(root / filename, "rb") as fp: result = fp.read() return result if __name__ == "__main__": print("normal: ", timeit.timeit(read_file_with_normal, number=100)) print("opendal: ", timeit.timeit(read_file_with_opendal, number=100)) The result shows that (venv) $ python benchmark.py normal: 4.470868484000675 opendal: 8.993250704006641 Well, well, well. I'm somewhat embarrassed by these results. Here are a few quick hypotheses: Does Python have an internal cache that can reuse the same memory? Does Python possess some trick to accelerate file reading? Does PyO3 introduce additional overhead? I've refactored the code to: python-fs-read: with open("/tmp/file", "rb") as fp: result = fp.read() assert len(result) == 64 * 1024 * 1024 python-opendal-read: import opendal op = opendal.Operator("fs", root=str("/tmp")) result = op.read("file") assert len(result) == 64 * 1024 * 1024 The result shows that python is much faster than opendal: Benchmark 1: python-fs-read/test.py Time (mean ± σ): 15.9 ms ± 0.7 ms [User: 5.6 ms, System: 10.1 ms] Range (min … max): 14.9 ms … 21.6 ms 180 runs Benchmark 2: python-opendal-read/test.py Time (mean ± σ): 32.9 ms ± 1.3 ms [User: 6.1 ms, System: 26.6 ms] Range (min … max): 31.4 ms … 42.6 ms 85 runs Summary python-fs-read/test.py ran 2.07 ± 0.12 times faster than python-opendal-read/test.py The Python binding for OpenDAL seems to be slower than Python itself, which isn't great news. Let's investigate the reasons behind this. OpenDAL Fs Service is slower than Python? This puzzle involves numerous elements such as rust, opendal, python, pyo3, among others. Let's focus and attempt to identify the root cause. I implement the same logic via opendal fs service in rust: rust-opendal-fs-read: use std::io::Read; use opendal::services::Fs; use opendal::Operator; fn main() { let mut cfg = Fs::default(); cfg.root("/tmp"); let op = Operator::new(cfg).unwrap().finish().blocking(); let mut bs = vec![0; 64 * 1024 * 1024]; let mut f = op.reader("file").unwrap(); let mut ts = 0; loop { let buf = &mut bs[ts..]; let n = f.read(buf).unwrap(); let n = n as usize; if n == 0 { break } ts += n; } assert_eq!(ts, 64 * 1024 * 1024); } However, the result shows that opendal is slower than python even when opendal is implemented in rust: Benchmark 1: rust-opendal-fs-read/target/release/test Time (mean ± σ): 23.8 ms ± 2.0 ms [User: 0.4 ms, System: 23.4 ms] Range (min … max): 21.8 ms … 34.6 ms 121 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 15.6 ms ± 0.8 ms [User: 5.5 ms, System: 10.0 ms] Range (min … max): 14.4 ms … 20.8 ms 166 runs Summary python-fs-read/test.py ran 1.52 ± 0.15 times faster than rust-opendal-fs-read/target/release/test While rust-opendal-fs-read performs slightly better than python-opendal-read, indicating room for improvement in the binding & pyo3, these aren't the core issues. We need to delve deeper. Ah, opendal fs service is slower than python. Rust std fs is slower than Python? OpenDAL implement fs service via std::fs. Could there be additional costs incurred by OpenDAL itself? I implement the same logic via rust std::fs: rust-std-fs-read: use std::io::Read; use std::fs::OpenOptions; fn main() { let mut bs = vec![0; 64 * 1024 * 1024]; let mut f = OpenOptions::new().read(true).open("/tmp/file").unwrap(); let mut ts = 0; loop { let buf = &mut bs[ts..]; let n = f.read(buf).unwrap(); let n = n as usize; if n == 0 { break } ts += n; } assert_eq!(ts, 64 * 1024 * 1024); } But.... Benchmark 1: rust-std-fs-read/target/release/test Time (mean ± σ): 23.1 ms ± 2.5 ms [User: 0.3 ms, System: 22.8 ms] Range (min … max): 21.0 ms … 37.6 ms 124 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 15.2 ms ± 1.1 ms [User: 5.4 ms, System: 9.7 ms] Range (min … max): 14.3 ms … 21.4 ms 178 runs Summary python-fs-read/test.py ran 1.52 ± 0.20 times faster than rust-std-fs-read/target/release/test Wow, Rust's std fs is slower than Python? How can that be? No offense intended, but how is that possible? Rust std fs is slower than Python? Really!? I can't believe the results: rust std fs is surprisingly slower than Python. I learned how to use strace for syscall analysis. strace is a Linux syscall tracer that allows us to monitor syscalls and understand their processes. The strace will encompass all syscalls dispatched by the program. Our attention should be on aspects associated with /tmp/file. Each line of the strace output initiates with the syscall name, followed by input arguments and output. For example: openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 Means we invoke the openat syscall using arguments AT_FDCWD, "/tmp/file", and O_RDONLY|O_CLOEXEC. This returns output 3, which is the file descriptor referenced in the subsequent syscall. Alright, we've mastered strace. Let's put it to use! strace of rust-std-fs-read: > strace ./rust-std-fs-read/target/release/test ... mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f290dd40000 openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 read(3, "\tP\201A\225\366>\260\270R\365\313\220{E\372\274\6\35\"\353\204\220s\2|7C\205\265\6\263"..., 67108864) = 67108864 read(3, "", 0) = 0 close(3) = 0 munmap(0x7f290dd40000, 67112960) = 0 ... strace of python-fs-read: > strace ./python-fs-read/test.py ... openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=67108864, ...}, AT_EMPTY_PATH) = 0 ioctl(3, TCGETS, 0x7ffe9f844ac0) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_CUR) = 0 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=67108864, ...}, AT_EMPTY_PATH) = 0 mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f13277ff000 read(3, "\tP\201A\225\366>\260\270R\365\313\220{E\372\274\6\35\"\353\204\220s\2|7C\205\265\6\263"..., 67108865) = 67108864 read(3, "", 1) = 0 close(3) = 0 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f132be5c710}, {sa_handler=0x7f132c17ac36, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f132be5c710}, 8) = 0 munmap(0x7f13277ff000, 67112960) = 0 ... From analyzing strace, it's clear that python-fs-read has more syscalls than rust-std-fs-read, with both utilizing mmap. So why python is faster than rust? Why we are using mmap here? I initially believed mmap was solely for mapping files to memory, enabling file access through memory. However, mmap has other uses too. It's commonly used to allocate large regions of memory for applications. This can be seen in the strace results: mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f13277ff000 This syscall means NULL: the first arg means start address of the memory region to map. NULL will let OS to pick up a suitable address for us. 67112960: The size of the memory region to map. We are allocating 64MiB + 4KiB memory here, the extra page is used to store the metadata of this memory region. PROT_READ|PROT_WRITE: The memory region is readable and writable. MAP_PRIVATE|MAP_ANONYMOUS: MAP_PRIVATE means changes to this memory region will not be visible to other processes mapping the same region, and are not carried through to the underlying file (if we have). MAP_ANONYMOUS means we are allocating anonymous memory that not related to a file. -1: The file descriptor of the file to map. -1 means we are not mapping a file. 0: The offset in the file to map from. Use 0 here since we are not mapping a file. But I don't use mmap in my code. The mmap syscall is dispatched by glibc. We utilize malloc to solicit memory from the system, and in response, glibc employs both the brk and mmap syscalls to allocate memory according to our request size. If the requested size is sufficiently large, then glibc opts for using mmap, which helps mitigate memory fragmentation issues. By default, all Rust programs compiled with target x86_64-unknown-linux-gnu use the malloc provided by glibc. Does python has the same memory allocator with rust? Python, by default, utilizes pymalloc, a memory allocator optimized for small allocations. Python features three memory domains, each representing different allocation strategies and optimized for various purposes. pymalloc has the following behavior: Python has a pymalloc allocator optimized for small objects (smaller or equal to 512 bytes) with a short lifetime. It uses memory mappings called “arenas” with a fixed size of either 256 KiB on 32-bit platforms or 1 MiB on 64-bit platforms. It falls back to PyMem_RawMalloc() and PyMem_RawRealloc() for allocations larger than 512 bytes. Rust is slower than Python with default memory allocator? I suspect that mmap is causing this issue. What would occur if I switched to jemalloc? rust-std-fs-read-with-jemalloc: use std::io::Read; use std::fs::OpenOptions; #[global_allocator] static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc; fn main() { let mut bs = vec![0; 64 * 1024 * 1024]; let mut f = OpenOptions::new().read(true).open("/tmp/file").unwrap(); let mut ts = 0; loop { let buf = &mut bs[ts..]; let n = f.read(buf).unwrap(); let n = n as usize; if n == 0 { break } ts += n; } assert_eq!(ts, 64 * 1024 * 1024); } Wooooooooooooooow?! Benchmark 1: rust-std-fs-read-with-jemalloc/target/release/test Time (mean ± σ): 9.7 ms ± 0.6 ms [User: 0.3 ms, System: 9.4 ms] Range (min … max): 9.0 ms … 12.4 ms 259 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 15.8 ms ± 0.9 ms [User: 5.9 ms, System: 9.8 ms] Range (min … max): 15.0 ms … 21.8 ms 169 runs Summary rust-std-fs-read-with-jemalloc/target/release/test ran 1.64 ± 0.14 times faster than python-fs-read/test.py What?! I understand that jemalloc is a proficient memory allocator, but how can it be this exceptional? This is baffling. Rust is slower than Python only on my machine! As more friends joined the discussion, we discovered that rust runs slower than python only on my machine. My CPU: > lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 5950X 16-Core Processor CPU family: 25 Model: 33 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU(s) scaling MHz: 53% CPU max MHz: 5083.3979 CPU min MHz: 2200.0000 BogoMIPS: 6787.49 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm con stant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f 16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpex t perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wb noinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap Virtualization features: Virtualization: AMD-V Caches (sum of all): L1d: 512 KiB (16 instances) L1i: 512 KiB (16 instances) L2: 8 MiB (16 instances) L3: 64 MiB (2 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerabilities: Gather data sampling: Not affected Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Not affected Spec rstack overflow: Vulnerable Spec store bypass: Vulnerable Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected Srbds: Not affected Tsx async abort: Not affected My memory: > sudo dmidecode --type memory # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.3.0 present. Handle 0x0014, DMI type 16, 23 bytes Physical Memory Array Location: System Board Or Motherboard Use: System Memory Error Correction Type: None Maximum Capacity: 64 GB Error Information Handle: 0x0013 Number Of Devices: 4 Handle 0x001C, DMI type 17, 92 bytes Memory Device Array Handle: 0x0014 Error Information Handle: 0x001B Total Width: 64 bits Data Width: 64 bits Size: 16 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL A Type: DDR4 Type Detail: Synchronous Unbuffered (Unregistered) Speed: 3200 MT/s Manufacturer: Unknown Serial Number: 04904740 Asset Tag: Not Specified Part Number: LMKUFG68AHFHD-32A Rank: 2 Configured Memory Speed: 3200 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 9, Hex 0xC8 Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 16 GB Cache Size: None Logical Size: None So I tried the following things: Enable Mitigations CPUs possess numerous vulnerabilities that could expose private data to attackers, with Spectre being one of the most notable. The Linux kernel has developed various mitigations to counter these vulnerabilities and they are enabled by default. However, these mitigations can impose additional system costs. Therefore, the Linux kernel also offers a mitigations flag for users who wish to disable them. I used to disable all mitigations like the following: title Arch Linux linux /vmlinuz-linux-zen initrd /amd-ucode.img initrd /initramfs-linux-zen.img options root="PARTUUID=206e7750-2b89-419d-978e-db0068c79c52" rw mitigations=off Enable it back didn't change the result. Tune Transparent Hugepage Transparent Hugepage can significantly impact performance. Most modern distributions enable it by default. > cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never Switching to madvise or never alters the absolute outcome, but the relative ratio remains consistent. Tune CPU Core Affinity @Manjusaka guesses this related to CPU Core Spacing. I tried to use core_affinity to bind process to specific CPU, but the result is the same. Measure syscall latency by eBPF @Manjusaka also created an eBPF program for me to gauge the latency of read syscalls. The findings indicate that Rust is also slower than Python at syscall level. There's another lengthy tale about this eBPF program that @Manjusaka should share in a post! # python fs read Process 57555 read file 8134049 ns Process 57555 read file 942 ns # rust std fs read Process 57634 read file 24636975 ns Process 57634 read file 1052 ns Observation: On my computer, Rust operates slower than Python and it doesn't appear to be related to the software. C is slower than Python? I'm quite puzzled and can't pinpoint the difference. I suspect it might have something to do with the CPU, but I'm unsure which aspect: cache? frequency? core spacing? core affinity? architecture? Following the guidance from the Telegram group @rust_zh, I've developed a C version: c-fs-read: #include <stdio.h> #include <stdlib.h> #define FILE_SIZE 64 * 1024 * 1024 // 64 MiB int main() { FILE *file; char *buffer; size_t result; file = fopen("/tmp/file", "rb"); if (file == NULL) { fputs("Error opening file", stderr); return 1; } buffer = (char *)malloc(sizeof(char) * FILE_SIZE); if (buffer == NULL) { fputs("Memory error", stderr); fclose(file); return 2; } result = fread(buffer, 1, FILE_SIZE, file); if (result != FILE_SIZE) { fputs("Reading error", stderr); fclose(file); free(buffer); return 3; } fclose(file); free(buffer); return 0; } But... Benchmark 1: c-fs-read/test Time (mean ± σ): 23.8 ms ± 0.9 ms [User: 0.3 ms, System: 23.6 ms] Range (min … max): 23.0 ms … 27.1 ms 120 runs Benchmark 2: python-fs-read/test.py Time (mean ± σ): 19.1 ms ± 0.3 ms [User: 8.6 ms, System: 10.4 ms] Range (min … max): 18.6 ms … 20.6 ms 146 runs Summary python-fs-read/test.py ran 1.25 ± 0.05 times faster than c-fs-read/test The C version is also slower than Python! Does python have magic? C is slower than Python without specified offset! At this time, @lilydjwg has joined the discussion and noticed a difference in the memory region offset between C and Python. strace -e raw=read,mmap ./program is used to print the undecoded arguments for the syscalls: the pointer address. strace for c-fs-read: > strace -e raw=read,mmap ./c-fs-read/test ... mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f96d1a18000 read(0x3, 0x7f96d1a18010, 0x4000000) = 0x4000000 close(3) = 0 strace for python-fs-read > strace -e raw=read,mmap ./python-fs-read/test.py ... mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f27dcfbe000 read(0x3, 0x7f27dcfbe030, 0x4000001) = 0x4000000 read(0x3, 0x7f27e0fbe030, 0x1) = 0 close(3) = 0 In c-fs-read, mmap returns 0x7f96d1a18000, but read syscall use 0x7f96d1a18010 as the start address, the offset is 0x10. In python-fs-read, mmap returns 0x7f27dcfbe000, and read syscall use 0x7f27dcfbe030 as the start address, the offset is 0x30. So @lilydjwg tried to calling read with the same offset: :) ./bench c-fs-read c-fs-read-with-offset python-fs-read ['hyperfine', 'c-fs-read/test', 'c-fs-read-with-offset/test', 'python-fs-read/test.py'] Benchmark 1: c-fs-read/test Time (mean ± σ): 23.7 ms ± 0.8 ms [User: 0.2 ms, System: 23.6 ms] Range (min … max): 23.0 ms … 25.5 ms 119 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: c-fs-read-with-offset/test Time (mean ± σ): 8.9 ms ± 0.4 ms [User: 0.2 ms, System: 8.8 ms] Range (min … max): 8.3 ms … 10.6 ms 283 runs Benchmark 3: python-fs-read/test.py Time (mean ± σ): 19.1 ms ± 0.3 ms [User: 8.6 ms, System: 10.4 ms] Range (min … max): 18.6 ms … 20.0 ms 147 runs Summary c-fs-read-with-offset/test ran 2.15 ± 0.11 times faster than python-fs-read/test.py 2.68 ± 0.16 times faster than c-fs-read/test !!! Applying an offset to buffer in c-fs-read enhances its speed, outperforming Python! Additionally, we've verified that this issue is replicable on both the AMD Ryzen 9 5900X and AMD Ryzen 7 5700X. The new information led me to other reports about a similar issue, Std::fs::read slow?. In this post, @ambiso discovered that syscall performance is linked to the offset of the memory region. He noted that this CPU slows down when writing from the first 0x10 bytes of each page: offset milliseconds ... 14 130 15 130 16 46 <----- 0x10! 17 48 ... AMD Ryzen 9 5900X is slow without specified offset! We've confirmed that this issue is related to the CPU. However, we're still unsure about its potential reasons. @Manjusaka has invited kernel developer @ryncsn to join the discussion. He can reproduce the same outcome using our c-fs-read and c-fs-read-with-offset demos on AMD Ryzen 9 5900HX. He also attempted to profile the two programs using perf. Without offset: perf stat -d -d -d --repeat 20 ./a.out Performance counter stats for './a.out' (20 runs): 30.89 msec task-clock # 0.968 CPUs utilized ( +- 1.35% ) 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 598 page-faults # 19.362 K/sec ( +- 0.05% ) 90,321,344 cycles # 2.924 GHz ( +- 1.12% ) (40.76%) 599,640 stalled-cycles-frontend # 0.66% frontend cycles idle ( +- 2.19% ) (42.11%) 398,016 stalled-cycles-backend # 0.44% backend cycles idle ( +- 22.41% ) (41.88%) 43,349,705 instructions # 0.48 insn per cycle # 0.01 stalled cycles per insn ( +- 1.32% ) (41.91%) 7,526,819 branches # 243.701 M/sec ( +- 5.01% ) (41.22%) 37,541 branch-misses # 0.50% of all branches ( +- 4.62% ) (41.12%) 127,845,213 L1-dcache-loads # 4.139 G/sec ( +- 1.14% ) (39.84%) 3,172,628 L1-dcache-load-misses # 2.48% of all L1-dcache accesses ( +- 1.34% ) (38.46%) <not supported> LLC-loads <not supported> LLC-load-misses 654,651 L1-icache-loads # 21.196 M/sec ( +- 1.71% ) (38.72%) 2,828 L1-icache-load-misses # 0.43% of all L1-icache accesses ( +- 2.35% ) (38.67%) 15,615 dTLB-loads # 505.578 K/sec ( +- 1.28% ) (38.82%) 12,825 dTLB-load-misses # 82.13% of all dTLB cache accesses ( +- 1.15% ) (38.88%) 16 iTLB-loads # 518.043 /sec ( +- 27.06% ) (38.82%) 2,202 iTLB-load-misses # 13762.50% of all iTLB cache accesses ( +- 23.62% ) (39.38%) 1,843,493 L1-dcache-prefetches # 59.688 M/sec ( +- 3.36% ) (39.40%) <not supported> L1-dcache-prefetch-misses 0.031915 +- 0.000419 seconds time elapsed ( +- 1.31% ) With offset: perf stat -d -d -d --repeat 20 ./a.out Performance counter stats for './a.out' (20 runs): 15.39 msec task-clock # 0.937 CPUs utilized ( +- 3.24% ) 1 context-switches # 64.972 /sec ( +- 17.62% ) 0 cpu-migrations # 0.000 /sec 598 page-faults # 38.854 K/sec ( +- 0.06% ) 41,239,117 cycles # 2.679 GHz ( +- 1.95% ) (40.68%) 547,465 stalled-cycles-frontend # 1.33% frontend cycles idle ( +- 3.43% ) (40.60%) 413,657 stalled-cycles-backend # 1.00% backend cycles idle ( +- 20.37% ) (40.50%) 37,009,429 instructions # 0.90 insn per cycle # 0.01 stalled cycles per insn ( +- 3.13% ) (40.43%) 5,410,381 branches # 351.526 M/sec ( +- 3.24% ) (39.80%) 34,649 branch-misses # 0.64% of all branches ( +- 4.04% ) (39.94%) 13,965,813 L1-dcache-loads # 907.393 M/sec ( +- 3.37% ) (39.44%) 3,623,350 L1-dcache-load-misses # 25.94% of all L1-dcache accesses ( +- 3.56% ) (39.52%) <not supported> LLC-loads <not supported> LLC-load-misses 590,613 L1-icache-loads # 38.374 M/sec ( +- 3.39% ) (39.67%) 1,995 L1-icache-load-misses # 0.34% of all L1-icache accesses ( +- 4.18% ) (39.67%) 16,046 dTLB-loads # 1.043 M/sec ( +- 3.28% ) (39.78%) 14,040 dTLB-load-misses # 87.50% of all dTLB cache accesses ( +- 3.24% ) (39.78%) 11 iTLB-loads # 714.697 /sec ( +- 29.56% ) (39.77%) 3,657 iTLB-load-misses # 33245.45% of all iTLB cache accesses ( +- 14.61% ) (40.30%) 395,578 L1-dcache-prefetches # 25.702 M/sec ( +- 3.34% ) (40.10%) <not supported> L1-dcache-prefetch-misses 0.016429 +- 0.000521 seconds time elapsed ( +- 3.17% ) He found the value of L1-dcache-prefetches and L1-dcache-loads differs a lot. L1-dcache-prefetches is the prefetches of CPU L1 data cache. L1-dcache-loads is the loads of CPU L1 data cache. Without a specified offset, the CPU will perform more loads and prefetches of L1-dcache, resulting in increased syscall time. He did a further research over the hotspot ASM: Samples: 15K of event 'cycles:P', Event count (approx.): 6078132137 Children Self Command Shared Object Symbol - 94.11% 0.00% a.out [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe ◆ - entry_SYSCALL_64_after_hwframe ▒ - 94.10% do_syscall_64 ▒ - 86.66% __x64_sys_read ▒ ksys_read ▒ - vfs_read ▒ - 85.94% shmem_file_read_iter ▒ - 77.17% copy_page_to_iter ▒ - 75.80% _copy_to_iter ▒ + 19.41% asm_exc_page_fault ▒ 0.71% __might_fault ▒ + 4.87% shmem_get_folio_gfp ▒ 0.76% folio_mark_accessed ▒ + 4.38% __x64_sys_munmap ▒ + 1.02% 0xffffffffae6f6fe8 ▒ + 0.79% __x64_sys_execve ▒ + 0.58% __x64_sys_mmap ▒ Inside _copy_to_iter, the ASM will be: │ copy_user_generic(): 2.19 │ mov %rdx,%rcx │ mov %r12,%rsi 92.45 │ rep movsb %ds:(%rsi),%es:(%rdi) 0.49 │ nop │ nop │ nop The key difference here is the performance of rep movsb. AMD Ryzen 9 5900X is slow for FSRM! At this time, one of my friend sent me a link about Terrible memcpy performance on Zen 3 when using rep movsb. In which also pointed to rep movsb: I've found this using a memcpy benchmark at https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp (compiled with the adjacent Makefile). To demonstrate the issue, run ./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0 This runs: 2113-byte memory copies 1,000,000 times per timing measurement in memory allocated with mmap with the source 0 bytes from the start of the page with the destination 1 byte from the start of the page on core 0. It reports about 3.2 GB/s. Change the -b argument to 2111 and it reports over 100 GB/s. So the REP MOVSB case is about 30× slower! FSRM, short for Fast Short REP MOV, is an innovation originally by Intel, recently incorporated into AMD as well, to enhance the speed of rep movsb and rep movsd. It's designed to boost the efficiency of copying large amounts of memory. CPUs that declare support for it will use FSRM as a default in glibc. @ryncsn has conducted further research and discovered that it's not related to L1 prefetches. It seems that rep movsb performance poorly when DATA IS PAGE ALIGNED, and perform better when DATA IS NOT PAGE ALIGNED, this is very funny... Conclusion In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug. (I can finally get some sleep now.) However, our users continue to struggle with this problem. Unfortunately, features like FSRM will be implemented in ucode, leaving us no choice but to wait for AMD's response. An alternative solution could be not using FSRM or providing a flag to disable it. Rust developers might consider switching to jemallocator for improved performance - a beneficial move even without the presence of AMD CPU bugs. Final Remarks I spent nearly three days addressing this issue, which began with complaints from opendal users and eventually led me to the CPU's ucode. This journey taught me a lot about strace, perf and eBPF. It was my first time using eBPF for diagnostics. I also explored various unfruitful avenues such as studying the implementations of rust's std::fs and Python & CPython's read implementation details. Initially, I hoped to resolve this at a higher level but found it necessary to delve deeper. A big thank you to everyone who contributed to finding the answer: @beldathas from opendal's discord for identifying the problem. The team at @datafuselabs for their insightful suggestions. Our friends over at @rust_zh (a rust telegram group mainly in zh-Hans) for their advice and reproduction efforts. @Manjusaka for reproducing the issue and use eBPF to investigate, which helped narrow down the problem to syscall itself. @lilydjwg for pinpointing the root cause: a 0x20 offset in memory @ryncsn for his thorough analysis And a friend who shared useful links about FSRM Looking forward to our next journey! Reference Xuanwo/when-i-find-rust-is-slow has all the code snippets and scripts. Std::fs::read slow? is a report from rust community Terrible memcpy performance on Zen 3 when using rep movsb is a report to ubuntu glibc binding/python: rust std fs is slower than python fs Updates This article, written on 2023-11-29, has gained widespread attention! To prevent any confusion among readers, I've decided not to alter the original content. Instead, I'll provide updates here to keep the information current and address frequently asked questions. 2023-12-01: Does AMD know about this bug? TL;DR: Yes To my knowledge, AMD has been aware of this bug since 2021. After the article was published, several readers forwarded the link to AMD, so I'm confident they're informed about it. I firmly believe that AMD should take responsibility for this bug and address it in amd-ucode. However, unverified sources suggest that a fix via amd-ucode is unlikely (at least for Zen 3) due to limited patch space. If you have more information on this matter, please reach out to me. Our only hope is to address this issue in glibc by disabling FSRM as necessary. Progress has been made on the glibc front: x86: Improve ERMS usage on Zen3. Stay tuned for updates. 2023-12-01: Is jemalloc or pymalloc faster than glibc's malloc? TL;DR: No Apologies for not detailing the connection between the memory allocator and this bug. It may seem from this post that jemalloc (pymalloc, mimalloc) is significantly faster than glibc's malloc. However, that's not the case. The issue doesn't related to the memory allocator. The speed difference where jemalloc or pymalloc outpaces glibc malloc is coincidental due to differing memory region offsets. Let's analyze the strace of rust-std-fs-read and rust-std-fs-read-with-jemalloc: strace for rust-std-fs-read: > strace -e raw=read,mmap ./rust-std-fs-read/target/release/test ... mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f39a6e49000 openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 read(0x3, 0x7f39a6e49010, 0x4000000) = 0x4000000 read(0x3, 0x7f39aae49010, 0) = 0 close(3) = 0 strace for rust-std-fs-read-with-jemalloc: > strace -e raw=read,mmap ./rust-std-fs-read-with-jemalloc/target/release/test ... mmap(0, 0x200000, 0x3, 0x4022, 0xffffffff, 0) = 0x7f7a5a400000 mmap(0, 0x5000000, 0x3, 0x4022, 0xffffffff, 0) = 0x7f7a55400000 openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3 read(0x3, 0x7f7a55400740, 0x4000000) = 0x4000000 read(0x3, 0x7f7a59400740, 0) = 0 close(3) = 0 In rust-std-fs-read, mmap returns 0x7f39a6e49000, but read syscall use 0x7f39a6e49010 as the start address, the offset is 0x10. In rust-std-fs-read-with-jemalloc, mmap returns 0x7f7a55400000, and read syscall use 0x7f7a55400740 as the start address, the offset is 0x740. rust-std-fs-read-with-jemalloc outperforms rust-std-fs-read due to its larger offset, which falls outside the problematic range: 0x00..0x10 within a page. It's possible to reproduce the same issue with jemalloc.

AI生成摘要 This article discusses a performance issue with the Python binding for OpenDAL, a data access layer. The author compares the performance of reading a file using OpenDAL's Python binding, Python itself, and a Rust implementation. They find that Python is faster than both OpenDAL and Rust. Further investigation reveals that the issue is related to a CPU bug in AMD Ryzen processors, specifically with the rep movsb instruction. The author concludes that this is not a software-related issue and suggests waiting for AMD's response or considering alternative solutions such as disabling the CPU feature causing the problem.

相关推荐 去reddit讨论

刘悦 -



AI生成摘要 Bert-vits2更新了版本V210,修正了日/英的bert对齐问题,效果进一步优化;对底模使用的数据进行优化和加量,减少finetune失败以及电音的可能性;日语bert更换了模型,完善了多语言推理。

相关推荐 去reddit讨论

BriefGPT - AI 论文速递 -

MedGen: 用于医疗文本处理的 Python 自然语言处理工具åŒ

MedGen 是一个专为医学文本处理设计的全面自然语言处理(NLP)工具包,面向生物医学研究人员和医疗专业人员,提供易于使用的、一体化的解决方案,不需要编程专业知识。它包括生成功能、基本 NLP 功能和查询搜索功能,同时提供了领域特定的语言模型和与第三方库的集成。

AI生成摘要 本文介绍了一种嵌入医疗领域知识和启用Few-shot Learning的方法,以满足医疗保健领域对广泛专业知识和语义知识的需求。通过简单有效的检索方法,提取医学背景知识来引导ChatGPT的推理和寻答逻辑。知识增强模型在CNMLE-2022上取得高分70,超过人类平均分61,证明了知识增强ChatGPT作为多功能医疗助手的潜力。

相关推荐 去reddit讨论

解道jdon.com -


这是一个非常全面的 Python 备忘单,适合任何感兴趣的人。它绝对是可用的最详细的之一。点击标题 1. 集合Collections:    list " class='body_href' >List ,

AI生成摘要 这是一个全面的Python备忘单,包括集合、类型、语法、系统、数据、高级、库包、多媒体等内容。还包括列表、字典、字符串、正则表达式、日期时间、类等的常用操作和方法。

相关推荐 去reddit讨论

解道jdon.com -


Python 和 Go 有着截然不同的特质,可以相互补充: Python is Easy. Go is Simple. Simple != Easy 有一种常见的误解是,简单和容易指的是同一件事。毕竟,如果一个东西很容易使用,那么它的内部结构也一定很容易理解,对吗?或者反之亦然?事实上,情况恰恰相反。虽然这两个概念在精神上指向相同的结果,但要使某样东西外表看起来简单,其内部却需要巨大的复杂性。 以 Python 为例,这种语言以入门门槛低而著称,因此也是入门编程

AI生成摘要 Python和Go是两种截然不同的编程语言,各有优势。Python易学易用,适合初学者,但在性能和复杂性方面存在一些问题。Go简洁易懂,语法更新少,代码稳定性高。结合Python的易用性和Go的简单性,可以在开发中发挥优势。Python适合原型开发和数据科学,而Go适合处理后台任务和业务逻辑。可以通过将两者结合使用,实现在同一项目中的协同工作。

相关推荐 去reddit讨论