Integrating OpenMP #212

Open · wants to merge 1 commit into main
Conversation

parsifal-47
Contributor

I tried playing with different scheduling strategies and with restructuring the grid launch loop, but found no benefit in doing so. There is no one-size-fits-all thread count, but I found 4 threads to give tolerable overhead for short runs. Since these kernels are written for GPUs, the threads are assumed to be very lightweight. The runs below use the default configuration of 4 threads and were taken after initial code generation (warmup) completed:
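For context, a minimal sketch of what the parallelized grid launch loop could look like; the names here (`KernelFn`, `launch_grid`, `grid_x/y/z`) are hypothetical and not taken from this PR:

```cpp
#include <omp.h>
#include <cstdint>

// Hypothetical generated-kernel signature: one call per grid point.
using KernelFn = void (*)(uint32_t x, uint32_t y, uint32_t z);

// Launch the kernel over a 3-D grid, parallelized across the flattened
// grid index. Collapse/scheduling variants showed no benefit, so a plain
// parallel-for with a capped thread count is shown.
void launch_grid(KernelFn kernel_fn, uint32_t grid_x, uint32_t grid_y,
                 uint32_t grid_z) {
  const int64_t total = int64_t(grid_x) * grid_y * grid_z;
#pragma omp parallel for num_threads(4)  // 4 threads: tolerable overhead for short runs
  for (int64_t i = 0; i < total; ++i) {
    uint32_t x = uint32_t(i % grid_x);
    uint32_t y = uint32_t((i / grid_x) % grid_y);
    uint32_t z = uint32_t(i / (int64_t(grid_x) * grid_y));
    kernel_fn(x, y, z);
  }
}
```

Iterating over the flattened index with the default static schedule keeps per-thread bookkeeping minimal, which matters when the individual kernel invocations are cheap.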

$ cat /proc/cpuinfo | grep "model name" | head -1
model name      : 13th Gen Intel(R) Core(TM) i9-13900K

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.029584, min=0.025436, std=0.004337, 50pp=0.027671, max=0.041039
CPU: Avg=0.212293, min=0.116719, std=0.053650, 50pp=0.201428, max=0.367745
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.039623, min=0.027530, std=0.031428, 50pp=0.034365, max=0.176014
CPU: Avg=0.102531, min=0.084439, std=0.050540, 50pp=0.091356, max=0.322346
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.052874, min=0.051575, std=0.004073, 50pp=0.051865, max=0.070552
CPU: Avg=0.263669, min=0.234627, std=0.010503, 50pp=0.265908, max=0.276139
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.072103, min=0.069362, std=0.001693, 50pp=0.072208, max=0.075836
CPU: Avg=0.171505, min=0.158613, std=0.015148, 50pp=0.169424, max=0.233900
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.113488, min=0.105001, std=0.006801, 50pp=0.113382, max=0.129658
CPU: Avg=0.434462, min=0.415685, std=0.011754, 50pp=0.432194, max=0.455476
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.140550, min=0.138820, std=0.001967, 50pp=0.139925, max=0.146632
CPU: Avg=0.298273, min=0.291456, std=0.007572, 50pp=0.296181, max=0.319118

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006476, min=0.004144, std=0.000617, 50pp=0.006703, max=0.006737
CPU: Avg=0.122640, min=0.025484, std=0.024319, 50pp=0.131805, max=0.136226
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.013195, min=0.005891, std=0.029987, 50pp=0.006210, max=0.143893
CPU: Avg=0.038585, min=0.023892, std=0.058095, 50pp=0.024242, max=0.291661
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.015310, min=0.013424, std=0.001702, 50pp=0.014917, max=0.018794
CPU: Avg=0.173618, min=0.118075, std=0.029875, 50pp=0.172012, max=0.245174
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.026125, min=0.023720, std=0.001801, 50pp=0.025985, max=0.029358
CPU: Avg=0.100961, min=0.093416, std=0.022481, 50pp=0.096955, max=0.198676
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.060022, min=0.058142, std=0.002109, 50pp=0.058852, max=0.066454
CPU: Avg=0.326994, min=0.307110, std=0.012973, 50pp=0.332134, max=0.345102
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.108494, min=0.106723, std=0.001424, 50pp=0.108579, max=0.112072
CPU: Avg=0.302443, min=0.295710, std=0.007158, 50pp=0.301341, max=0.331594
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.247161, min=0.236695, std=0.003374, 50pp=0.248375, max=0.250479
CPU: Avg=0.921703, min=0.911507, std=0.007468, 50pp=0.921097, max=0.947755
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.423307, min=0.420167, std=0.003226, 50pp=0.422663, max=0.435443
CPU: Avg=1.087445, min=1.081909, std=0.004594, 50pp=1.086404, max=1.104748

Below are the same runs without OpenMP:

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.028019, min=0.024638, std=0.003926, 50pp=0.025822, max=0.037850
CPU: Avg=0.206308, min=0.087896, std=0.055139, 50pp=0.192890, max=0.363112
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.058581, min=0.050894, std=0.029926, 50pp=0.051244, max=0.188880
CPU: Avg=0.064444, min=0.050892, std=0.055465, 50pp=0.051246, max=0.306130
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045687, min=0.044865, std=0.000342, 50pp=0.045697, max=0.046348
CPU: Avg=0.255593, min=0.165552, std=0.024769, 50pp=0.261888, max=0.285987
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.113111, min=0.111636, std=0.001145, 50pp=0.112891, max=0.115315
CPU: Avg=0.117115, min=0.111639, std=0.017151, 50pp=0.112947, max=0.191720
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.104407, min=0.104051, std=0.000239, 50pp=0.104377, max=0.104971
CPU: Avg=0.421223, min=0.393942, std=0.006962, 50pp=0.421754, max=0.427590
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.238135, min=0.232520, std=0.001440, 50pp=0.238539, max=0.239491
CPU: Avg=0.239498, min=0.236948, std=0.004708, 50pp=0.238564, max=0.259829

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006603, min=0.005808, std=0.000234, 50pp=0.006678, max=0.006708
CPU: Avg=0.123508, min=0.020989, std=0.024401, 50pp=0.131621, max=0.137913
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.021495, min=0.014405, std=0.030274, 50pp=0.014520, max=0.153455
CPU: Avg=0.027889, min=0.014406, std=0.058143, 50pp=0.014520, max=0.281326
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.018824, min=0.013350, std=0.005081, 50pp=0.016140, max=0.027493
CPU: Avg=0.178647, min=0.031018, std=0.039892, 50pp=0.177880, max=0.234586
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.056622, min=0.055601, std=0.000995, 50pp=0.056422, max=0.060713
CPU: Avg=0.062484, min=0.055604, std=0.026486, 50pp=0.056424, max=0.177926
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.058536, min=0.058117, std=0.000341, 50pp=0.058455, max=0.059743
CPU: Avg=0.322326, min=0.276561, std=0.015820, 50pp=0.322121, max=0.345675
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.247781, min=0.244237, std=0.001217, 50pp=0.247492, max=0.250554
CPU: Avg=0.249642, min=0.247209, std=0.007360, 50pp=0.247554, max=0.281481
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.246608, min=0.235026, std=0.009394, 50pp=0.249331, max=0.263891
CPU: Avg=0.920643, min=0.899138, std=0.013884, 50pp=0.921552, max=0.947152
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.989336, min=0.984804, std=0.005962, 50pp=0.987003, max=1.012817
CPU: Avg=0.989328, min=0.984795, std=0.005961, 50pp=0.987010, max=1.012817

There are other library options, but OpenMP is likely the most straightforward. What do you think?
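For comparison, a rough sketch (again with hypothetical names) of the same launch using plain `std::thread`; the manual chunking and joining below is roughly what OpenMP's parallel-for hides:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

using KernelFn = void (*)(uint32_t x, uint32_t y, uint32_t z);

// Same flattened grid walk as the OpenMP version, but thread creation,
// work chunking, and joining are all managed by hand.
void launch_grid_threads(KernelFn kernel_fn, uint32_t grid_x, uint32_t grid_y,
                         uint32_t grid_z, unsigned num_threads = 4) {
  const int64_t total = int64_t(grid_x) * grid_y * grid_z;
  const int64_t chunk = (total + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    const int64_t begin = t * chunk;
    const int64_t end = std::min(begin + chunk, total);
    if (begin >= end) break;
    workers.emplace_back([=] {
      for (int64_t i = begin; i < end; ++i)
        kernel_fn(uint32_t(i % grid_x), uint32_t((i / grid_x) % grid_y),
                  uint32_t(i / (int64_t(grid_x) * grid_y)));
    });
  }
  for (auto &w : workers) w.join();
}
```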
