Integrating OpenMP #212

parsifal-47 · 2025-01-05T18:52:34Z

I tried different scheduling policies and restructuring the grid launch loop, but found no benefit from either. There is no one-size-fits-all number of threads, but I found that 4 threads keep the overhead tolerable for short runs. Since these kernels are implemented for GPUs, the threads are assumed to be very lightweight. The runs below use the default configuration of 4 threads and were taken after the initial code generation (warmup) had finished:

$ cat /proc/cpuinfo | grep "model name" | head -1
model name      : 13th Gen Intel(R) Core(TM) i9-13900K

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.029584, min=0.025436, std=0.004337, 50pp=0.027671, max=0.041039
CPU: Avg=0.212293, min=0.116719, std=0.053650, 50pp=0.201428, max=0.367745
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.039623, min=0.027530, std=0.031428, 50pp=0.034365, max=0.176014
CPU: Avg=0.102531, min=0.084439, std=0.050540, 50pp=0.091356, max=0.322346
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.052874, min=0.051575, std=0.004073, 50pp=0.051865, max=0.070552
CPU: Avg=0.263669, min=0.234627, std=0.010503, 50pp=0.265908, max=0.276139
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.072103, min=0.069362, std=0.001693, 50pp=0.072208, max=0.075836
CPU: Avg=0.171505, min=0.158613, std=0.015148, 50pp=0.169424, max=0.233900
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.113488, min=0.105001, std=0.006801, 50pp=0.113382, max=0.129658
CPU: Avg=0.434462, min=0.415685, std=0.011754, 50pp=0.432194, max=0.455476
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.140550, min=0.138820, std=0.001967, 50pp=0.139925, max=0.146632
CPU: Avg=0.298273, min=0.291456, std=0.007572, 50pp=0.296181, max=0.319118

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006476, min=0.004144, std=0.000617, 50pp=0.006703, max=0.006737
CPU: Avg=0.122640, min=0.025484, std=0.024319, 50pp=0.131805, max=0.136226
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.013195, min=0.005891, std=0.029987, 50pp=0.006210, max=0.143893
CPU: Avg=0.038585, min=0.023892, std=0.058095, 50pp=0.024242, max=0.291661
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.015310, min=0.013424, std=0.001702, 50pp=0.014917, max=0.018794
CPU: Avg=0.173618, min=0.118075, std=0.029875, 50pp=0.172012, max=0.245174
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.026125, min=0.023720, std=0.001801, 50pp=0.025985, max=0.029358
CPU: Avg=0.100961, min=0.093416, std=0.022481, 50pp=0.096955, max=0.198676
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.060022, min=0.058142, std=0.002109, 50pp=0.058852, max=0.066454
CPU: Avg=0.326994, min=0.307110, std=0.012973, 50pp=0.332134, max=0.345102
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.108494, min=0.106723, std=0.001424, 50pp=0.108579, max=0.112072
CPU: Avg=0.302443, min=0.295710, std=0.007158, 50pp=0.301341, max=0.331594
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.247161, min=0.236695, std=0.003374, 50pp=0.248375, max=0.250479
CPU: Avg=0.921703, min=0.911507, std=0.007468, 50pp=0.921097, max=0.947755
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.423307, min=0.420167, std=0.003226, 50pp=0.422663, max=0.435443
CPU: Avg=1.087445, min=1.081909, std=0.004594, 50pp=1.086404, max=1.104748

Below are the same runs without OpenMP:

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.028019, min=0.024638, std=0.003926, 50pp=0.025822, max=0.037850
CPU: Avg=0.206308, min=0.087896, std=0.055139, 50pp=0.192890, max=0.363112
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.058581, min=0.050894, std=0.029926, 50pp=0.051244, max=0.188880
CPU: Avg=0.064444, min=0.050892, std=0.055465, 50pp=0.051246, max=0.306130
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045687, min=0.044865, std=0.000342, 50pp=0.045697, max=0.046348
CPU: Avg=0.255593, min=0.165552, std=0.024769, 50pp=0.261888, max=0.285987
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.113111, min=0.111636, std=0.001145, 50pp=0.112891, max=0.115315
CPU: Avg=0.117115, min=0.111639, std=0.017151, 50pp=0.112947, max=0.191720
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.104407, min=0.104051, std=0.000239, 50pp=0.104377, max=0.104971
CPU: Avg=0.421223, min=0.393942, std=0.006962, 50pp=0.421754, max=0.427590
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.238135, min=0.232520, std=0.001440, 50pp=0.238539, max=0.239491
CPU: Avg=0.239498, min=0.236948, std=0.004708, 50pp=0.238564, max=0.259829

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006603, min=0.005808, std=0.000234, 50pp=0.006678, max=0.006708
CPU: Avg=0.123508, min=0.020989, std=0.024401, 50pp=0.131621, max=0.137913
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.021495, min=0.014405, std=0.030274, 50pp=0.014520, max=0.153455
CPU: Avg=0.027889, min=0.014406, std=0.058143, 50pp=0.014520, max=0.281326
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.018824, min=0.013350, std=0.005081, 50pp=0.016140, max=0.027493
CPU: Avg=0.178647, min=0.031018, std=0.039892, 50pp=0.177880, max=0.234586
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.056622, min=0.055601, std=0.000995, 50pp=0.056422, max=0.060713
CPU: Avg=0.062484, min=0.055604, std=0.026486, 50pp=0.056424, max=0.177926
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.058536, min=0.058117, std=0.000341, 50pp=0.058455, max=0.059743
CPU: Avg=0.322326, min=0.276561, std=0.015820, 50pp=0.322121, max=0.345675
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.247781, min=0.244237, std=0.001217, 50pp=0.247492, max=0.250554
CPU: Avg=0.249642, min=0.247209, std=0.007360, 50pp=0.247554, max=0.281481
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.246608, min=0.235026, std=0.009394, 50pp=0.249331, max=0.263891
CPU: Avg=0.920643, min=0.899138, std=0.013884, 50pp=0.921552, max=0.947152
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.989336, min=0.984804, std=0.005962, 50pp=0.987003, max=1.012817
CPU: Avg=0.989328, min=0.984795, std=0.005961, 50pp=0.987010, max=1.012817
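
To make the two sets of Triton results easier to compare, here is a small sketch that derives ratios from the average Wall and CPU times reported above. The numbers are copied from the output in this description; the helper name `summarize` and the metric names (`speedup`, `cpu_overhead`, `busy_threads`) are illustrative and not part of the PR. Averages include the occasional outlier run (see the `max` columns), so treat the ratios as rough.

```python
# Average (wall, cpu) seconds for the 'triton' runs, copied from the
# benchmark output in this description.
WITH_OPENMP = {
    "vecadd 4194304": (0.039623, 0.102531),
    "vecadd 8388608": (0.072103, 0.171505),
    "vecadd 16777216": (0.140550, 0.298273),
    "softmax 1024": (0.013195, 0.038585),
    "softmax 2048": (0.026125, 0.100961),
    "softmax 4096": (0.108494, 0.302443),
    "softmax 8192": (0.423307, 1.087445),
}
WITHOUT_OPENMP = {
    "vecadd 4194304": (0.058581, 0.064444),
    "vecadd 8388608": (0.113111, 0.117115),
    "vecadd 16777216": (0.238135, 0.239498),
    "softmax 1024": (0.021495, 0.027889),
    "softmax 2048": (0.056622, 0.062484),
    "softmax 4096": (0.247781, 0.249642),
    "softmax 8192": (0.989336, 0.989328),
}


def summarize(with_omp, without_omp):
    """Return per-benchmark ratios derived from average wall and CPU times."""
    rows = {}
    for name, (wall_omp, cpu_omp) in with_omp.items():
        wall_serial, cpu_serial = without_omp[name]
        rows[name] = {
            # How much faster the OpenMP run finishes in wall-clock time.
            "speedup": wall_serial / wall_omp,
            # How much more total CPU time the OpenMP run consumes.
            "cpu_overhead": cpu_omp / cpu_serial,
            # CPU seconds per wall second: average number of busy threads.
            "busy_threads": cpu_omp / wall_omp,
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(WITH_OPENMP, WITHOUT_OPENMP).items():
        print(
            f"{name:>16}: speedup={row['speedup']:.2f}x  "
            f"cpu_overhead={row['cpu_overhead']:.2f}x  "
            f"busy_threads={row['busy_threads']:.2f}"
        )
```

On these averages the wall-clock speedup with 4 threads ranges from roughly 1.5x (smallest vecadd) to roughly 2.3x (softmax 8192), and the average number of busy threads stays below 4 in every case.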

There are other library options, but OpenMP is likely the most straightforward. What do you think?

Integrating OpenMP

2bb0216
