Changed vectorize function for tile (with a nelts list) in batch_matmul #77
base: master
Conversation
I think the key change here is making … Since the … Edit: In my comments above, I overlooked the case where …
On my machine, 4 x simdwidthof gives a simd_width value of 32, so I don't think the speedup comes from the first entry of nelts_list (32) alone. It is true that the majority of the sizes are divisible by 32, so I don't know in which cases this tile is helping enough to give that small extra bit of performance. That is also why I don't understand why 2 x simdwidthof, or on Mac 4 x simdwidthof, sometimes gives better results.

Testing now on an M1 Pro: 4 x simdwidthof gives 16 there, and using a simd_width of 32 produces incorrect results in the matmul operation, while a simd_width of 16 or less in the nelts list causes no problems. So it seems that a simd_width larger than the cache line causes problems (the M1 cache line is 128 bytes), but that wouldn't explain the AMD CPU (whose cache line is 64 bytes). So I don't know why a size bigger than 4 x simdwidthof causes problems, nor why a size bigger than 2 x simdwidthof in the reduce_add part also causes undefined behavior. Maybe I have to create a tile function that first checks whether the nelts size it is about to use is bigger than 4 x simdwidthof (sketched below).

Extra: on the M1 the SIMD registers are 128-bit, and on my Ryzen they are 256-bit.
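A minimal sketch of that guard, clamping the widths at compile time instead of with a runtime if (max_width and nelts_list are illustrative names, not the PR's; the VariadicList-based tile sizes follow the Mojo release this repo targeted):

```mojo
from sys.info import simdwidthof

# Assumed ceiling from the tests above: 4 x the hardware SIMD width
# for float32 (16 lanes on the M1 Pro, 32 on the Ryzen 3600X).
alias max_width = 4 * simdwidthof[DType.float32]()

# Only widths at or below the ceiling go into the tile list; the
# trailing small sizes handle whatever remainder of the row is left.
alias nelts_list = VariadicList[Int](max_width, max_width // 2, 4, 2, 1)
```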
I think your error with large nelts values comes from the stack allocation size, which also has to change to match; hardcoding the two separately is how you can introduce errors. …
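Purely to illustrate that failure mode (hypothetical names; stack_allocation's import path and signature have moved between Mojo releases):

```mojo
from memory import stack_allocation

fn example():
    # Accumulator buffer sized by a hand-maintained constant...
    alias stack_size = 32
    var acc = stack_allocation[stack_size, DType.float32]()

    # ...so any nelts entry larger than stack_size would make the wide
    # SIMD stores into acc run past the end of the buffer.
    acc.simd_store[stack_size](0, SIMD[DType.float32, stack_size](0))
```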
No, I also change that value: if I change the nelts list to 64, I set the stack size to 64. The same code that works on the AMD CPU (a stack size of 32 and a nelts list of 32) gives an error on the M1, so I am sure the problem is the simd_width size (but I will change that, since as you said it can introduce errors). If I change the nelts list value to 16 on the M1 the error goes away, so the failing operation seems to be the one with the bigger SIMD size; I just wouldn't know why it fails above 4 x simdwidth.

I'm going to test just making the nelts value bigger in the original matmul, though maybe there won't be an error there. As I said before, reduce_add behaves differently: with 4 x simdwidth it returns incorrect values, with 2 x simdwidth it doesn't. Based on that, it looks like using a SIMD value bigger than what the CPU allows can give undefined behavior; the question is why it sometimes happens and sometimes doesn't. (If you have time to test it I would appreciate it. I'm 99% sure what I did is correct and the problem is the SIMD sizes, but a confirmation would help me avoid giving incorrect information.)

Update: with 8 x simdwidth (a SIMD width of 64 on my AMD CPU) the original matmul code also misbehaves; the matmul operation gives incorrect values. So I think this confirms there is undefined behavior when using a SIMD width bigger than what the CPU accepts.
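A minimal sketch of the accumulate-then-reduce pattern being varied here (illustrative names, written against the older DTypePointer/simd_load API this repo used; only the width alias differs between the correct and misbehaving runs):

```mojo
from sys.info import simdwidthof
from memory.unsafe import DTypePointer

# 2 x simdwidthof was reported correct here; 4 x and 8 x produced
# wrong results on the CPUs discussed above.
alias width = 2 * simdwidthof[DType.float32]()

fn dot_row(a: DTypePointer[DType.float32],
           b: DTypePointer[DType.float32], n: Int) -> Float32:
    var acc = SIMD[DType.float32, width](0)
    # Assumes n is a multiple of width, for brevity.
    for j in range(0, n, width):
        acc += a.simd_load[width](j) * b.simd_load[width](j)
    return acc.reduce_add()
```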
Thanks for taking the time to research this topic.
BTW, I don't see any use of …
Hi @andresnowak, any chance you can validate the ideas from this PR on the latest Mojo release? Is there anything we can improve?
This branch implements an idea that @mikowals wanted to try: replacing the vectorize call with tile (driven by a nelts list) in the batch_matmul function. Based on benchmarking with lamatune, the change does seem to help program speed a little. (These tests were run on a Ryzen 3600X.)
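For context, the rough shape of the change (illustrative, not the exact diff; tile's offset/upperbound signature and the VariadicList of descending widths follow the Mojo stdlib of the era):

```mojo
from algorithm import tile

fn matmul_row(cols: Int):
    @parameter
    fn dot[_nelts: Int](j: Int):
        # ... accumulate _nelts elements of the dot product at offset j ...
        pass

    # Before: vectorize[nelts, dot](cols) — a single fixed width plus
    # a scalar tail.
    # After: descending widths; the wide entries cover the bulk of the
    # row, the small ones finish the remainder exactly.
    tile[dot, VariadicList[Int](32, 16, 8, 4, 2, 1)](0, cols)
```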