Some experiments with atomic operations, memory orderings, and spin locks

When I first learned Rust I took only a shallow look at atomic operations, and at the time I could not make sense of the Ordering enum parameter at all. I later read quite a few articles, but unfortunately most of them just recite the standard library documentation like scripture, leaving a beginner as lost as before. The gist is always the same: Ordering::Relaxed is the weakest ordering; Ordering::Acquire and Ordering::Release are generally used as a pair, look like a locking API, and roughly pin down the ordering; and when in doubt, use Ordering::SeqCst, which guarantees a sequentially consistent order of atomic operations at some cost in performance.

That still leaves plenty of questions. In the vast majority of cases, any legal Ordering argument produces the expected result, so when exactly does Ordering::Relaxed actually misbehave? Some articles place Ordering::Acquire and Ordering::Release in different threads, but these are atomic operations: unlike with a lock, one thread never stops to wait for another thread's atomic operation, so what is the point of putting them in different threads? And if Ordering::SeqCst is slower, can anyone show a concrete example?

A couple of days ago the daily newsletter mentioned a book about atomic operations and locks. After skimming through it and reading the code on GitHub, I realized that the way into atomic operations really is through spin locks.

《Rust Atomics and Locks》: https://marabos.nl/atomics/

Let's start by implementing the simplest possible boolean spin lock. The idea is straightforward: for now, store the number to modify, R, in a static global variable and mutate it inside an unsafe block; mark the lock with an AtomicBool atomic variable; and block threads with a while loop. Whichever thread grabs the flag while it is false immediately sets it to true, meaning "locked"; the other threads spin in a long loop because the lock is taken; the winning thread updates the data and then sets the flag back to false, and all threads race for the lock again.

Spawn 10 threads to process the data in bulk, each performing 100000 additions. Then set up a standard-library Arc<Mutex<usize>> to compare timing and correctness, and wrap everything in an outer loop so we can watch several repeated runs.

    use std::thread;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::sync::Mutex;
    use std::time::Instant;

    // 10 threads, 100000 iterations each
    const N_THREADS: usize = 10;
    const N_TIMES: usize = 100000;

    // R is the value being modified, SPIN_LOCK is the lock flag
    static mut R: usize = 0;
    static SPIN_LOCK: AtomicBool = AtomicBool::new(false);

    fn main() {
        for t in 1..=10 {
            unsafe { R = 0; }
            let start_of_spin = Instant::now();
            let handles = (0..N_THREADS).map(|i| {
                thread::spawn(move || {
                    unsafe {
                        for j in i * N_TIMES..(i + 1) * N_TIMES {
                            // spin in a while loop; swap makes the test-and-set atomic;
                            // this version uses the most relaxed ordering, Relaxed
                            while SPIN_LOCK.swap(true, Ordering::Relaxed) {
                            }
                            // this update is not itself thread-safe
                            R += j;
                            // set the flag back to false so all threads race for the lock again
                            SPIN_LOCK.store(false, Ordering::Relaxed);
                        }
                    }
                })
            }).collect::<Vec<_>>();
            for handle in handles {
                handle.join().unwrap();
            }
            let time_of_spin = start_of_spin.elapsed();
            let r = Arc::new(Mutex::new(0));
            let start_of_mutex = Instant::now();
            // the standard multi-threaded way to modify shared data
            let handles = (0..N_THREADS).map(|i| {
                let r = r.clone();
                thread::spawn(move || {
                    for j in i * N_TIMES..
                        (i + 1) * N_TIMES {
                        *r.lock().unwrap() += j;
                    }
                })
            }).collect::<Vec<_>>();
            for handle in handles {
                handle.join().unwrap();
            }
            let time_of_mutex = start_of_mutex.elapsed();
            println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}",
                unsafe { R }, r.lock().unwrap());
        }
    }

Running this gives:

    1: R = 84999950000, r = 499999500000, spin: 116.8523ms, mutex: 20.406ms
    2: R = 84999950000, r = 499999500000, spin: 115.8753ms, mutex: 23.8912ms
    3: R = 84999950000, r = 499999500000, spin: 104.4092ms, mutex: 20.5238ms
    4: R = 44999950000, r = 499999500000, spin: 115.2894ms, mutex: 22.1232ms
    5: R = 14999950000, r = 499999500000, spin: 116.6161ms, mutex: 20.3407ms
    6: R = 34999950000, r = 499999500000, spin: 110.5409ms, mutex: 20.7766ms
    7: R = 14999950000, r = 499999500000, spin: 106.7984ms, mutex: 20.2581ms
    8: R = 64999950000, r = 499999500000, spin: 116.2075ms, mutex: 22.8965ms
    9: R = 94999950000, r = 499999500000, spin: 122.5616ms, mutex: 20.8712ms
    10: R = 94999950000, r = 499999500000, spin: 132.7401ms, mutex: 25.2957ms

Here R is the spin-lock sum and r is the result of the standard multi-threaded version. The final value of R differs from r, so R is clearly wrong.

This reproduces the first puzzle: Ordering::Relaxed really does cause problems, at least in a spin lock, even if it is still hard to say precisely what goes wrong. So, following the usual spin-lock tutorials, replace Ordering::Relaxed with Ordering::Acquire and Ordering::Release:

    // the core section now uses Acquire and Release
    while SPIN_LOCK.swap(true, Ordering::Acquire) {
    }
    R += j;
    SPIN_LOCK.store(false, Ordering::Release);

Running again gives:

    1: R = 499999500000, r = 499999500000, spin: 247.6544ms, mutex: 20.9144ms
    2: R = 499999500000, r = 499999500000, spin: 227.6381ms, mutex: 20.6151ms
    3: R = 499999500000, r = 499999500000, spin: 227.2818ms, mutex: 25.3846ms
    4: R = 499999500000, r = 499999500000, spin: 245.684ms, mutex: 20.8567ms
    5: R = 499999500000, r = 499999500000, spin: 227.0974ms, mutex: 20.0019ms
    6: R = 499999500000, r = 499999500000, spin: 228.9819ms, mutex: 20.8273ms
    7: R = 499999500000, r = 499999500000, spin: 276.0922ms, mutex: 21.8605ms
    8: R = 499999500000, r = 499999500000, spin: 233.6243ms, mutex: 25.1547ms
    9: R = 499999500000, r = 499999500000, spin:
232.5839ms, mutex: 20.2137ms
    10: R = 499999500000, r = 499999500000, spin: 218.4376ms, mutex: 23.0648ms

This time the result is correct: Ordering::Acquire and Ordering::Release did their job. Next, try Ordering::SeqCst:

    // the core section now uses SeqCst
    while SPIN_LOCK.swap(true, Ordering::SeqCst) {
    }
    R += j;
    SPIN_LOCK.store(false, Ordering::SeqCst);

Running gives:

    1: R = 499999500000, r = 499999500000, spin: 353.6353ms, mutex: 24.4979ms
    2: R = 499999500000, r = 499999500000, spin: 334.631ms, mutex: 23.3446ms
    3: R = 499999500000, r = 499999500000, spin: 348.2878ms, mutex: 23.5033ms
    4: R = 499999500000, r = 499999500000, spin: 343.114ms, mutex: 21.7937ms
    5: R = 499999500000, r = 499999500000, spin: 374.2087ms, mutex: 22.905ms
    6: R = 499999500000, r = 499999500000, spin: 350.4543ms, mutex: 24.7428ms
    7: R = 499999500000, r = 499999500000, spin: 358.6213ms, mutex: 23.8559ms
    8: R = 499999500000, r = 499999500000, spin: 368.8898ms, mutex: 21.666ms
    9: R = 499999500000, r = 499999500000, spin: 359.1947ms, mutex: 22.59ms
    10: R = 499999500000, r = 499999500000, spin: 377.3886ms, mutex: 21.306ms

Also correct. In fact, in this example, every combination in which the two Ordering arguments were not both Relaxed produced correct results in my runs (an empirical observation on this machine, not a general guarantee). The timings also confirm the performance ranking: Ordering::Relaxed > Ordering::Acquire/Ordering::Release > Ordering::SeqCst.

Still, this is the simplest possible spin lock, and yet it is far slower than the default Mutex. Reading the source shows that Mutex calls into an operating-system API when it blocks, which is rather a black box for a beginner. My guess was that the remaining cost lies in the while loop: from the scheduler's point of view, such a hot loop can pin the CPU at 100%, hogging thread resources and delaying switches to other threads.

Based on that assumption, we can add something to the core loop: either make thread switches happen more promptly, or stop the thread entirely and have it woken by another thread before it retries the lock.

The first thing to try is std::hint::spin_loop from the standard library, literally a "spin loop" hint; the example code in《Rust Atomics and Locks》uses exactly this API. Modify the core code:

    while SPIN_LOCK.swap(true, Ordering::Relaxed) {
        // insert the standard library's spin-loop hint inside the loop body
        std::hint::spin_loop();
    }
    R += j;
    SPIN_LOCK.store(false, Ordering::Relaxed);

Running gives:

    1: R = 499999500000, r = 499999500000, spin: 207.5686ms, mutex: 22.0559ms
    2: R = 499999500000, r = 499999500000, spin: 209.1491ms, mutex: 21.1985ms
    3: R = 499999500000, r = 499999500000, spin: 207.2068ms, mutex: 21.6455ms
    4: R = 499999500000, r = 499999500000, spin: 208.4946ms, mutex: 20.6212ms
    5: R = 499999500000, r = 499999500000, spin: 181.2423ms,
mutex: 23.6995ms
    6: R = 499999500000, r = 499999500000, spin: 190.0333ms, mutex: 23.6426ms
    7: R = 499999500000, r = 499999500000, spin: 224.0493ms, mutex: 20.0067ms
    8: R = 499999500000, r = 499999500000, spin: 248.7846ms, mutex: 21.4922ms
    9: R = 499999500000, r = 499999500000, spin: 207.7718ms, mutex: 20.822ms
    10: R = 499999500000, r = 499999500000, spin: 225.4686ms, mutex: 20.8056ms

Note the Ordering has gone back to Relaxed, the fastest option. Generally speaking, once there are heavyweight operations between the atomic operations, especially operations involving the scheduler, it becomes hard in practice for reordering to produce an unexpected result, so Relaxed tends to be good enough in this situation. The timings after this change are still not satisfying, though: about the same as the empty loop under Acquire/Release, somewhat better than SeqCst.

Watching the CPU while it runs, however, shows noticeably lower utilization than with the empty loop, so std::hint::spin_loop does mitigate the busy-waiting to some degree. Its source shows it compiles down to a CPU-level spin-wait hint (for example the PAUSE instruction on x86) rather than asking the operating system to reschedule.

Then there is another family of APIs: std::thread::park, with usage documented in the standard library. Roughly, calling thread::park() stops the current thread; some other thread must obtain this thread's handle (a thread::Thread) and call std::thread::Thread::unpark to restart it. To keep a thread from sleeping forever, it is better to use thread::park_timeout, which wakes the thread automatically after the timeout so it can loop and retry the lock.

Because of how this API works, the program gets more complicated. We need an additional array of threads: collect the Thread objects from the spawned JoinHandles, and since the array is touched from multiple threads, it too has to follow the global-static-mutated-in-unsafe strategy. After every update we also have to walk the array and wake the other threads.

The full code:

    use std::thread;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::sync::Mutex;
    use std::time::{Instant, Duration};

    // 10 threads, 100000 iterations each
    const N_THREADS: usize = 10;
    const N_TIMES: usize = 100000;

    // R is the value being modified, SPIN_LOCK is the lock flag
    static mut R: usize = 0;
    static SPIN_LOCK: AtomicBool = AtomicBool::new(false);
    // global storage for the thread objects
    static mut THREADS: Vec<thread::Thread> = Vec::new();

    fn main() {
        for t in 1..=10 {
            unsafe { R = 0; }
            let start_of_spin = Instant::now();
            let handles = (0..N_THREADS).map(|i| {
                thread::spawn(move || {
                    unsafe {
                        for j in i * N_TIMES..
                            (i + 1) * N_TIMES {
                            while SPIN_LOCK.swap(true, Ordering::Relaxed) {
                                // park the thread; if it is not unparked within 1 microsecond, loop again
                                thread::park_timeout(Duration::from_micros(1));
                            }
                            R += j;
                            SPIN_LOCK.store(false, Ordering::Relaxed);
                            // after the update, wake all the threads
                            for thread in THREADS.iter() {
                                thread.unpark();
                            }
                        }
                    }
                })
            }).collect::<Vec<_>>();
            unsafe {
                // clone the corresponding Thread object out of each handle into the global array
                THREADS = handles.iter().map(|handle| handle.thread().clone()).collect::<Vec<_>>();
                // once initialization is done, wake every thread
                for thread in THREADS.iter() {
                    thread.unpark();
                }
            }
            for handle in handles {
                handle.join().unwrap();
            }
            let time_of_spin = start_of_spin.elapsed();
            let r = Arc::new(Mutex::new(0));
            let start_of_mutex = Instant::now();
            // the standard multi-threaded way to modify shared data
            let handles = (0..N_THREADS).map(|i| {
                let r = r.clone();
                thread::spawn(move || {
                    for j in i * N_TIMES..(i + 1) * N_TIMES {
                        *r.lock().unwrap() += j;
                    }
                })
            }).collect::<Vec<_>>();
            for handle in handles {
                handle.join().unwrap();
            }
            let time_of_mutex = start_of_mutex.elapsed();
            println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}",
                unsafe { R }, r.lock().unwrap());
        }
    }

The results:

    1: R = 499999500000, r = 499999500000, spin: 96.835ms, mutex: 23.4733ms
    2: R = 499999500000, r = 499999500000, spin: 106.3914ms, mutex: 19.5431ms
    3: R = 499999500000, r = 499999500000, spin: 97.4153ms, mutex: 24.657ms
    4: R = 499999500000, r = 499999500000, spin: 99.7176ms, mutex: 20.3693ms
    5: R = 499999500000, r = 499999500000, spin: 102.2228ms, mutex: 24.1728ms
    6: R = 499999500000, r = 499999500000, spin: 101.3587ms, mutex: 23.8349ms
    7: R = 499999500000, r = 499999500000, spin: 104.0861ms, mutex: 22.4304ms
    8: R = 499999500000, r = 499999500000, spin: 99.5858ms, mutex: 21.9831ms
    9: R = 499999500000, r = 499999500000, spin: 101.6998ms, mutex: 20.7136ms
    10: R = 499999500000, r = 499999500000, spin: 104.7081ms, mutex: 19.793ms

This is quite a bit faster than the version using the standard library's spin-loop hint, and CPU usage drops as well; a fully parked thread can cost as little CPU as an ordinary blocked one.

But the optimization is not over. Since everything ultimately comes down to how efficiently threads hand over the CPU, it turns out the standard library has an API dedicated to switching threads: thread::yield_now, literally "yield the thread now". Drop the complicated thread::park machinery and make a small change to the core of the original version:

    while
        SPIN_LOCK.swap(true, Ordering::Relaxed) {
        // yield the thread inside the loop body
        thread::yield_now();
    }
    R += j;
    SPIN_LOCK.store(false, Ordering::Relaxed);

The results:

    1: R = 499999500000, r = 499999500000, spin: 19.0725ms, mutex: 21.4468ms
    2: R = 499999500000, r = 499999500000, spin: 19.1969ms, mutex: 24.8842ms
    3: R = 499999500000, r = 499999500000, spin: 18.9377ms, mutex: 24.6436ms
    4: R = 499999500000, r = 499999500000, spin: 17.1335ms, mutex: 20.1235ms
    5: R = 499999500000, r = 499999500000, spin: 17.7659ms, mutex: 21.0874ms
    6: R = 499999500000, r = 499999500000, spin: 18.8314ms, mutex: 24.8805ms
    7: R = 499999500000, r = 499999500000, spin: 19.9078ms, mutex: 22.75ms
    8: R = 499999500000, r = 499999500000, spin: 18.755ms, mutex: 21.2979ms
    9: R = 499999500000, r = 499999500000, spin: 19.7315ms, mutex: 23.4285ms
    10: R = 499999500000, r = 499999500000, spin: 18.1363ms, mutex: 24.1529ms

At long last, we have beaten the standard library's mutex. thread::yield_now clearly works, and the timing variance is small. The downside is just as clear: the high-frequency looping and thread switching will instantly max out every CPU core.

According to some further testing, a plain sleep, thread::sleep, also has the effect of switching threads:

    while SPIN_LOCK.swap(true, Ordering::Relaxed) {
        // yield the thread by sleeping for 1 microsecond inside the loop body
        thread::sleep(Duration::from_micros(1));
    }
    R += j;
    SPIN_LOCK.store(false, Ordering::Relaxed);

The results:

    1: R = 499999500000, r = 499999500000, spin: 117.4703ms, mutex: 23.7323ms
    2: R = 499999500000, r = 499999500000, spin: 133.6754ms, mutex: 21.4141ms
    3: R = 499999500000, r = 499999500000, spin: 136.0562ms, mutex: 43.2387ms
    4: R = 499999500000, r = 499999500000, spin: 128.3459ms, mutex: 24.3047ms
    5: R = 499999500000, r = 499999500000, spin: 130.6175ms, mutex: 25.2669ms
    6: R = 499999500000, r = 499999500000, spin: 113.7597ms, mutex: 32.0094ms
    7: R = 499999500000, r = 499999500000, spin: 136.6121ms, mutex: 21.7237ms
    8: R = 499999500000, r = 499999500000, spin: 136.7076ms, mutex: 62.4027ms
    9: R = 499999500000, r = 499999500000, spin: 140.1101ms, mutex: 32.1292ms
    10: R = 499999500000, r = 499999500000, spin: 123.1714ms, mutex: 20.7013ms

At this scale of workload, sleep is only slightly slower than thread::park.

However, scaling up to 1000000 iterations, sleep gives:

    1: R = 49999995000000, r = 49999995000000, spin:
152.4242ms, mutex: 206.6386ms
    2: R = 49999995000000, r = 49999995000000, spin: 152.5491ms, mutex: 208.9147ms
    3: R = 49999995000000, r = 49999995000000, spin: 197.8593ms, mutex: 241.7115ms
    4: R = 49999995000000, r = 49999995000000, spin: 149.9495ms, mutex: 252.3702ms
    5: R = 49999995000000, r = 49999995000000, spin: 193.2647ms, mutex: 238.9686ms
    6: R = 49999995000000, r = 49999995000000, spin: 157.2534ms, mutex: 232.8807ms
    7: R = 49999995000000, r = 49999995000000, spin: 152.7905ms, mutex: 239.2438ms
    8: R = 49999995000000, r = 49999995000000, spin: 154.7708ms, mutex: 234.9019ms
    9: R = 49999995000000, r = 49999995000000, spin: 145.1315ms, mutex: 225.6531ms
    10: R = 49999995000000, r = 49999995000000, spin: 186.7429ms, mutex: 258.4023ms

And at 1000000 iterations, yield_now gives:

    1: R = 49999995000000, r = 49999995000000, spin: 161.9026ms, mutex: 233.4809ms
    2: R = 49999995000000, r = 49999995000000, spin: 154.3257ms, mutex: 241.573ms
    3: R = 49999995000000, r = 49999995000000, spin: 190.7096ms, mutex: 218.3715ms
    4: R = 49999995000000, r = 49999995000000, spin: 193.1071ms, mutex: 250.1671ms
    5: R = 49999995000000, r = 49999995000000, spin: 188.5109ms, mutex: 224.2478ms
    6: R = 49999995000000, r = 49999995000000, spin: 189.241ms, mutex: 252.0153ms
    7: R = 49999995000000, r = 49999995000000, spin: 188.9498ms, mutex: 244.9358ms
    8: R = 49999995000000, r = 49999995000000, spin: 190.654ms, mutex: 219.5048ms
    9: R = 49999995000000, r = 49999995000000, spin: 194.7326ms, mutex: 209.2761ms
    10: R = 49999995000000, r = 49999995000000, spin: 188.9025ms, mutex: 251.199ms

As the iteration count goes up another order of magnitude, sleep turns the tables: it not only beats Mutex but can even outperform thread::yield_now. thread::park_timeout achieves much the same result if you drop the step that wakes the other threads, and both of these sleep/timeout variants consume noticeably less CPU than thread::yield_now.

To sum up. First, a note on the test setup: my machine is a 2.6GHz, 12-thread Windows 10 box, and on different hardware and operating systems the results will not necessarily be the same. For example, I could not reproduce the incorrect Relaxed sums at all on the official playground, where the addition comes out right no matter what; and on the playground, yield_now runs a good 3-4x faster than Mutex.
This series of experiments gives a basic picture of atomic operations and the behavior of spin locks. thread::yield_now is the most effective on small and medium workloads; thread::sleep keeps a performance edge even on larger ones; and thread::park's ability to suspend a thread makes it fairly easy to build the standard library's various locks and channels yourself, even if actually beating the standard library's performance is another matter. As for hint::spin_loop, it showed no advantage at all in these tests.

Finally, to be clear, this article is nothing more than a set of measurements; as for why these performance differences exist, I will have to leave that to the experts.
