Given that batch_cast is designed to convert all values in the batch from one type to another as if converted by static_cast, it made sense that, in v7, batch_cast was templated on the sizes of the input and output batches. With the change in v8, where the batch is now templated on the architecture, batch_cast currently only works if the input and output have the same architecture, which in practice means they need to have the same underlying width (at least on x86). This is a regression from v7.
For example, in v7 it used to work fine to cast a batch<int32_t, 8> to a batch<int16_t, 8>. Unfortunately the equivalent code doesn't currently work on v8, as the input and output architectures are different.
In order to make this work, I think we should modify batch_cast to look something like
template <class T_out, class T_in, class A>
inline make_sized_batch_t<T_out, batch<T_in, A>::size> batch_cast(batch<T_in, A> const& x) noexcept
{
    using batch_out = make_sized_batch_t<T_out, batch<T_in, A>::size>;
    return kernel::batch_cast(x, batch_out {}, A {}, typename batch_out::arch {});
}
This change will require significant changes to all the overloads of kernel::batch_cast, kernel::detail::batch_cast, kernel::detail::conversion_type, and kernel::detail::fast_cast (among others) to add support for both an input and output architecture. For example, the generic overload of kernel::batch_cast would have to be modified to look something like
Given that this is my use case, I'm happy to do the implementation work, but I thought I'd touch base first and see whether anyone has strong opinions, since this breaks some of the conventions around handling architecture (i.e. we'll now have to handle both an input and an output arch). Are there any opinions on the code I posted above? Is there a preference on the ordering of the template arguments? Also, how should I handle the overloads of fast_cast? For example, in the batch<int32_t, 8> to batch<int16_t, 8> case above, the input is an __m256i and the output an __m128i, but the underlying intrinsic (_mm256_cvtepi32_epi16) was only introduced with AVX512 (it requires AVX512F + AVX512VL). Which header should this overload go in, and what would its signature look like, given that the implementation arch is unrelated to both the input and output arch? Any guidance on this would be much appreciated.
Hey and thanks for the detailed issue. There are plenty of difficult choices there :-) It's worth trying to split them into independent issues:
How do we support intrinsics that are inherently cross-architecture (like an AVX builtin that operates on __m128 registers, etc.)? We currently have no generic support for this :-/ The only close construct we have is fma5<sse4.2>, fma5<avx>, etc. We could envision an architecture named xsimd::ext::avx2<avx512f>, which would mean avx2 with avx512 extensions.
How do we handle cross-architecture casts? I like your approach here; I'd just order the template arguments as T_out, Arch, T_in, so that one can force the architecture of the output batch without specifying the input batch type.
I'd be happy to read @JohanMabille on that issue too ;-)
Hey @serge-sans-paille, sorry I've taken so long to respond. I guess I've been busier than I anticipated. I think adding both parts might be too much for me to take on, especially part 1 which I think will require some high-level architectural work to get right. I'm not sure I'm best placed to make some of the necessary design decisions, though I'm happy to look into implementing part 2 if you can come up with a concrete approach to part 1.