Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elbrus (e2k) architecture support #700

Open
wants to merge 14 commits into
base: wip/e2k
Choose a base branch
from

Conversation

makise-homura
Copy link
Contributor

This pull request introduces support of Elbrus hardware platform (which is based on Russian Elbrus CPU family) with its native lcc (eLbrus Compiler Collection) compiler.

This is linked to the corresponding PR for obs-studio.

ninja test has been run after building, all tests passed, both on x86_64 and e2k.

There is no ARM instructions support for now, because it looks too hard to make current implementations compile normally on e2k; but it may be introduced in following PRs, if needed.

@nemequ
Copy link
Member

nemequ commented Feb 1, 2021

This is awesome, thank you! I'm not familiar with the architecture, but I'd like to support it as best we can. I'm willing to merge more or less as-is, but maintenance is going to be a bit tricky…

Is it be possible to add a CI build, even if all we can do is cross-compile the tests from x86? If lcc can be installed on an x86 Linux machine it should be possible to add something to the development container as well. For what it's worth I'm happy to help turn generic installation instructions into a CI job.

Even better, of course, would be the ability to actually run the tests. Since no CI providers I'm aware of support e2k, this would likely mean an emulator; I don't see anything about qemu support for e2k, but maybe there is an out-of-tree patch we could use or something? Again, I'm happy to help get this integrated into our CI if possible.

meson.build Outdated Show resolved Hide resolved
simde/simde-arch.h Outdated Show resolved Hide resolved
simde/simde-arch.h Show resolved Hide resolved
simde/simde-common.h Outdated Show resolved Hide resolved
simde/x86/avx2.h Outdated Show resolved Hide resolved
test/meson.build Outdated Show resolved Hide resolved
@makise-homura
Copy link
Contributor Author

makise-homura commented Feb 3, 2021

Thanks! It's nice to see that project maintainer is interested in such a PR.

Unfortunately, there are some problems that may cause trouble for CI. First, there is no adequate e2k emulator (it is WIP, but may take a lot of time until it will be ready). There is instruction-precise simulator though, but it is not available publicly, since it is considered an engineering tool, and even if it was available, it is too slow to run any CPU-heavy stuff in it, like building or testing (it's about 1000 times slower than an actual hardware). There is also cross compiler, but it is available on request only (lcc compilers, both native and cross ones, have EDG frontend, which is proprietary and can't be freely distributed AFAIK). But instead we have three publicly available E2K machines, which can be accessed by users if they provide desired username and public SSH key (still no root access though). If it is enough to setup CI, then I'd be happy to provide access (so you may set up CI not just for building, but also for running tests).

Regarding access, you may contact me through Discord (makise-homura#8793) or join Telegram group which is a discussion related to mentioned SSH-accessible machines (it is primarily in Russian language, but most of us can speak back in English if someone is asking there for something in English). If either one isn't an option for you, you may suggest another option instead I guess.

@makise-homura
Copy link
Contributor Author

I pushed recently some other changes enabling OpenMP (which is implicitly enabled with LCC without any option like -fopenmp-simd) and getting rid of remaining warnings.
Still I have a few last ones:

/tmp/lcc_MouN4b.s: Assembler messages:
/tmp/lcc_MouN4b.s:65716: Warning: use of 'psllqh' with count > 15 leads to a zero result
/tmp/lcc_MouN4b.s:66745: Warning: use of 'psrlql' with count > 15 leads to a zero result
/tmp/lcc_60Ibsd.s: Assembler messages:
/tmp/lcc_60Ibsd.s:66815: Warning: use of 'psllqh' with count > 15 leads to a zero result
/tmp/lcc_60Ibsd.s:67846: Warning: use of 'psrlql' with count > 15 leads to a zero result

But they are probably low-level ones raised when assembler code is packed into VLIW word. Don't know what to do with them, and also have no idea if I should take them into account. Looks like kind of overflowed shifts, but I'm not sure.

@nemequ
Copy link
Member

nemequ commented Feb 3, 2021

Thanks! It's nice to see that project maintainer is interested in such a PR.

Unfortunately, there are some problems that may cause trouble for CI. First, there is no adequate e2k emulator (it is WIP, but may take a lot of time until it will be ready). There is instruction-precise simulator though, but it is not available publicly, since it is considered an engineering tool, and even if it was available, it is too slow to run any CPU-heavy stuff in it, like building or testing (it's about 1000 times slower than an actual hardware). There is also cross compiler, but it is available on request only (lcc compilers, both native and cross ones, have EDG frontend, which is proprietary and can't be freely distributed AFAIK). But instead we have three publicly available E2K machines, which can be accessed by users if they provide desired username and public SSH key (still no root access though). If it is enough to setup CI, then I'd be happy to provide access (so you may set up CI not just for building, but also for running tests).

Regarding access, you may contact me through Discord (makise-homura#8793) or join Telegram group which is a discussion related to mentioned SSH-accessible machines (it is primarily in Russian language, but most of us can speak back in English if someone is asking there for something in English). If either one isn't an option for you, you may suggest another option instead I guess.

Okay, it sounds like requesting access via SSH is the right way to go for now; I'll do that in a bit. That way at least I'll be able to do some periodic builds and debugging. I'm definitely interested in adding a CI job, but that would require either setting up CI servers on e2k hardware or a freely distributable emulator + (cross-)compiler. Hopefully one day :)

I pushed recently some other changes enabling OpenMP (which is implicitly enabled with LCC without any option like -fopenmp-simd) and getting rid of remaining warnings.
Still I have a few last ones:

That could be bugs in SIMDe. Intel tends to accept any 8-bit value for a lot of functions, so we have to handle things like shifting 16-bit lanes by 16+ bits, but IIRC that is UB. If we need to add some checks to do something like (count < 15) ? (a << count) : 0 in the portable code we can, but it sounds like this may be in the _mm_slli_epi16 path, in which case lcc should really handle that (since it is implementing Intel's API), though we could of course add a special case to work around the issue using a SIMDE_BUG_LCC_... macro.

@makise-homura
Copy link
Contributor Author

If we need to add some checks to do something like (count < 15) ? (a << count) : 0 in the portable code we can, but it sounds like this may be in the _mm_slli_epi16 path, in which case lcc should really handle that (since it is implementing Intel's API), though we could of course add a special case to work around the issue using a SIMDE_BUG_LCC_... macro.

Well, culprits are simde_mm_bslli_si128(a, 19); at line 908, and simde_mm_bsrli_si128(a, 19); at line 955 of test/x86/sse2.c when building sse2-native-c and sse2-native-cpp tests. Finally they end up in a call to _mm_slli_si128 and _mm_srli_si128 correspondingly. There is no way to check it inside a #define, because there is no way to use #if inside a #define. Also conditional expression like ((imm8 > 15) ? _mm_setzero_si128() : _mm_srli_si128(a, imm8)) is never optimized to exclude unreachable branch. So I tried to implement it as an inline function, like:

  #if defined(SIMDE_BUG_LCC_WARNING_ON_SHIFTS)
    inline __attribute__((always_inline)) simde__m128i
    simde_lcc_guard_mm_slli_si128(simde__m128i a, const int imm8)
    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
      if (imm8 > 15) {
        return _mm_setzero_si128();
      } else {
        return _mm_slli_si128(a, imm8);
      }
    }
    #define simde_mm_bslli_si128(a, imm8) simde_lcc_guard_mm_slli_si128(a, imm8)
  #else
    #define simde_mm_bslli_si128(a, imm8) _mm_slli_si128(a, imm8)
  #endif

But, if I use static for simde_lcc_guard_mm_slli_si128, no inlining occurs, and compilation fails due to imm8 being a variable inside a function, and not a constant. If no static is used, then there's a warning like an entity with internal linkage cannot be referenced within an inline function with external linkage, and I presume this is not what we want (but still code compiles and produce no warnings).

Have you any ideas what to do with that?
Actually, if our goal is warning-less build, we can just suppress the -Wstatic-reference-in-c99-inline-function warning, but it feels like a very dirty hack here.

@nemequ
Copy link
Member

nemequ commented Feb 9, 2021

I'm merging some of this as 349da2b, 093b2c5, 24ddeba. I'll publish a wip/e2k branch in the SIMDe repository with your changes rebased.

Copy link
Member

@nemequ nemequ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also conditional expression like ((imm8 > 15) ? _mm_setzero_si128() : _mm_srli_si128(a, imm8)) is never optimized to exclude unreachable branch

You mean you see it compiled in the output? What compiler? I don't remember ever having a problem with it before on gcc, clang, or even MSVC, assuming the optimizer is configured to be sufficiently aggressive (for SIMDe I usually only worry about -O3).

Or is the problem that you still get an error because of the value passed to the imm8 parameter, even though that path isn't taken? If that's the case, what you can do is pass imm8 & 15; that will restrict the imm8 param to make the compiler happy, but since the value is going to be known at compile time the AND gets compiled away to nothing too.

Also, I suggest something like (imm8 & ~15) instead of (imm8 > 15) because of negatives. That's not just a completely insane thing to pass, either; Arm uses negatives to reverse direction (e.g., right shift -1 == left shift 1).

meson.build Outdated Show resolved Hide resolved
simde/simde-arch.h Show resolved Hide resolved
simde/simde-arch.h Outdated Show resolved Hide resolved
simde/simde-arch.h Outdated Show resolved Hide resolved
simde/simde-common.h Outdated Show resolved Hide resolved
simde/simde-complex.h Outdated Show resolved Hide resolved
test/meson.build Outdated Show resolved Hide resolved
@makise-homura
Copy link
Contributor Author

makise-homura commented Feb 10, 2021

You mean you see it compiled in the output? What compiler?

Yes, it was in LCC's assembler output.

Or is the problem that you still get an error because of the value passed to the imm8 parameter, even though that path isn't taken? If that's the case, what you can do is pass imm8 & 15

Yes! That worked. That was quite a clever hack. Pushed it in 42ecd54.

Also, I suggest something like (imm8 & ~15) instead of (imm8 > 15) because of negatives.

Hm, that may be an idea. Rewritten my 42ecd54 into 8b0b8fa, sorry for that previous commit.

@nemequ
Copy link
Member

nemequ commented Feb 16, 2021

I've been playing around a bit with this, and I have a pretty small test case for the reduced-alignment issue:

#include <stdint.h>
#include <stdio.h>

typedef union {
  int8_t i8 __attribute__((__vector_size__(32)));
} simde__m256i;

simde__m256i
simde_mm256_set1_epi8(int8_t a) {
  simde__m256i r;
  for (size_t i = 0 ; i < sizeof(r.i8) / sizeof(r.i8[0]) ; i++) {
    r.i8[i] = a;
  }
  return r;
}

int main(void) {
  simde__m256i a = simde_mm256_set1_epi8(42);

  return 0;
}

This is interesting because we're not actually requesting a specific alignment anywhere; LCC aligns 256-bit vectors to 32-byte boundaries by default. I would definitely classify that as an LCC bug.

The other time I ran into this diagnostic today was when I looked into implementing a maximum alignment on LCC (not necessary, it turns out; it was just my first guess for the cause of this). In that case, if you specify the __attribute__((__aligned__(N))) with an N less than the default alignment for that type LCC will emit the alignment-reduced diagnostic.

IMHO it's not really appropriate to emit this as GCC's documentation says: "When used on a struct, or struct member, the aligned attribute can only increase the alignment; in order to decrease it, the packed attribute must be specified as well." It's not unreasonable to use it that way, as SIMDe does. However, I can see a reasonable argument for having it as long as it's off-by-default (which it is).

The real problem is that AFAICT there is no way to disable it in code; no diagnostic number is provided so I have no idea what to suppress. Of course, since lcc doesn't offer a way to pop the warning stack we would be disabling the warning in any code which uses SIMDe, too, which is obviously not optimal… if there were a way to pop the stack I'd be a lot more comfortable just disabling the warning for SIMDe.

I also took a look at the XOP performance. This was a bit more work than I'd hoped since google-benchmark doesn't work on LCC… there are a couple of warnings to suppress (IIRC one in google-benchmark and another in google-test), IIRC about unused functions), but then there is another problem I couldn't find a quick work-around for. However, I tried using Hayai it it worked well. So far I've only tested _mm_permute2_ps, but the "native" version is faster. Here is the code:

#include <sys/random.h>

#define SIMDE_NO_NATIVE
#include <simde/x86/xop.h>

#include <x86intrin.h>

class RandomVectorsFixture
    :   public ::hayai::Fixture
{
public:
    virtual void SetUp()
    {   
        getrandom(&a, sizeof(a), GRND_RANDOM);
        getrandom(&b, sizeof(b), GRND_RANDOM);
        getrandom(&c, sizeof(c), GRND_RANDOM);
    }

    virtual void TearDown()
    { }

    union {
      __m128 native;
      simde__m128 simde;
    } a;
    union {
      __m128 native;
      simde__m128 simde;
    } b;
    union {
      __m128i native;
      simde__m128i simde;
    } c;
    union {
      __m128 native;
      simde__m128 simde;
    } r;
};

BENCHMARK_F(RandomVectorsFixture, test_simde_mm_permute2_ps, 1000, 1) {
  r.simde = simde_mm_permute2_ps(a.simde, b.simde, c.simde, 2);
}

#pragma diag_suppress 1444

BENCHMARK_F(RandomVectorsFixture, test_mm_permute2_ps, 1000, 1) {
  r.native = _mm_permute2_ps(a.native, b.native, c.native, 2);
}

And the results:

l++ -o bench bench.cpp -L../hayai/build/src/ -I../hayai/src/ -I../simde -lhayai_main
[==========] Running 2 benchmarks.
[ RUN      ] RandomVectorsFixture.test_simde_mm_permute2_ps (1000 runs, 1 iteration per run)
[     DONE ] RandomVectorsFixture.test_simde_mm_permute2_ps (0.917222 ms)
[   RUNS   ]        Average time: 0.917 us (~0.058 us)
                    Fastest time: 0.564 us (-0.353 us / -38.510 %)
                    Slowest time: 1.316 us (+0.399 us / +43.477 %)
                     Median time: 0.923 us (1st quartile: 0.908 us | 3rd quartile: 0.932 us)

             Average performance: 1090248.59849 runs/s
                Best performance: 1773049.64539 runs/s (+682801.04690 runs/s / +62.62801 %)
               Worst performance: 759878.41945 runs/s (-330370.17903 runs/s / -30.30228 %)
              Median performance: 1083423.61863 runs/s (1st quartile: 1101321.58590 | 3rd quartile: 1072961.37339)

[ITERATIONS]        Average time: 0.917 us (~0.058 us)
                    Fastest time: 0.564 us (-0.353 us / -38.510 %)
                    Slowest time: 1.316 us (+0.399 us / +43.477 %)
                     Median time: 0.923 us (1st quartile: 0.908 us | 3rd quartile: 0.932 us)

             Average performance: 1090248.59849 iterations/s
                Best performance: 1773049.64539 iterations/s (+682801.04690 iterations/s / +62.62801 %)
               Worst performance: 759878.41945 iterations/s (-330370.17903 iterations/s / -30.30228 %)
              Median performance: 1083423.61863 iterations/s (1st quartile: 1101321.58590 | 3rd quartile: 1072961.37339)
[ RUN      ] RandomVectorsFixture.test_mm_permute2_ps (1000 runs, 1 iteration per run)
[     DONE ] RandomVectorsFixture.test_mm_permute2_ps (0.461097 ms)
[   RUNS   ]        Average time: 0.461 us (~0.046 us)
                    Fastest time: 0.170 us (-0.291 us / -63.131 %)
                    Slowest time: 0.876 us (+0.415 us / +89.982 %)
                     Median time: 0.467 us (1st quartile: 0.451 us | 3rd quartile: 0.467 us)

             Average performance: 2168741.06750 runs/s
                Best performance: 5882352.94118 runs/s (+3713611.87368 runs/s / +171.23353 %)
               Worst performance: 1141552.51142 runs/s (-1027188.55608 runs/s / -47.36336 %)
              Median performance: 2141327.62313 runs/s (1st quartile: 2217294.90022 | 3rd quartile: 2141327.62313)

[ITERATIONS]        Average time: 0.461 us (~0.046 us)
                    Fastest time: 0.170 us (-0.291 us / -63.131 %)
                    Slowest time: 0.876 us (+0.415 us / +89.982 %)
                     Median time: 0.467 us (1st quartile: 0.451 us | 3rd quartile: 0.467 us)

             Average performance: 2168741.06750 iterations/s
                Best performance: 5882352.94118 iterations/s (+3713611.87368 iterations/s / +171.23353 %)
               Worst performance: 1141552.51142 iterations/s (-1027188.55608 iterations/s / -47.36336 %)
              Median performance: 2141327.62313 iterations/s (1st quartile: 2217294.90022 | 3rd quartile: 2141327.62313)
[==========] Ran 2 benchmarks.

I'd like to test each function (shouldn't be hard to modify that code), but I suspect it's going to be better to just ignore that warning.

@nemequ
Copy link
Member

nemequ commented Feb 17, 2021

0366dab, e38fe50, and ad8c7e0 move this along pretty well. With those patches in place I'm able to get to the point where the compilation fails due to the inefficient implementations. I'll try to get working on testing each of those tomorrow to make sure they are all faster than SIMDe's implementations.

e38fe50 is an excellent example of why I don't trust __GNUC__ and friends on non-GCC compilers (clang is the exception, but they only claim compatibility with GCC 4.2.1). I didn't realize that was the problem until I saw the names you chose for the bugs, but now it makes a lot of sense.

@makise-homura makise-homura changed the base branch from master to wip/e2k February 20, 2021 23:21
@makise-homura
Copy link
Contributor Author

makise-homura commented Feb 21, 2021

Sorry for long wait, I was a bit busy with work this week, so had any time to deal with SIMDe just today.
I've rebased my changes on wip/e2k, seems to be better now.

The only thing remaining to do for now, I guess, is to do something with reduced alignment warnings (remove -Wno-reduced-alignment from build system). I've already filed a bug report to LCC developers, I hope they'll deal somehow with it, at least they'll say how to get it over with it gracefully with current version of LCC. Probably I'll file another bug reports for SIMDE_BUG_LCC_TOO_STRICT_VECTOR_SHIFTS_AND_COMPARES, SIMDE_BUG_LCC_XOP_MISSING, SIMDE_BUG_LCC_WARNING_ON_SHIFTS, SIMDE_BUG_LCC_FMA_WRONG_RESULT, and SIMDE_BUG_LCC_AVX_NO_LOAD_STORE_U2, for we'll be able to avoid these hacks in SIMDe when some future version of LCC is used.

Also I hope you approve the way I avoid deprecation warnings in 9ff51ed (with a tiny fix in 7791129), and give an advice on what to do with OPENMP_SIMD.

@nemequ
Copy link
Member

nemequ commented Feb 23, 2021

No worries, I didn't have internet (except on my phone) for a while due to a move, so I wouldn't have been able to review this anyways.

I think I do like your idea for the deprecated diagnostics. I'm conflicted since I'm worried about someone doing something like

#pragma diag_suppress 1215,1444
// ...
#include "path/to/simde/x86/sse2.h"

call_deprecated_function();

But I think it's the best we're going to do for LCC (and a big improvement over just disabling them and leaving them that way), and unfortunately I don't think LCC is going to support push/pop any time soon.

and give an advice on what to do with OPENMP_SIMD.

You mean whether to just #define SIMDE_ENABLE_OPENMP #if defined(HEDLEY_MCST_LCC_VERSION)? I don't see a reason not to do so, do you? It might be a good idea to replace those !defined(SIMDE_ENABLE_OPENMP) checks in simde-common.h with !defined(SIMDE_ENABLE_OPENMP) && !defined(SIMDE_DISABLE_OPENMP) to give people the option of turning it off, but honestly I don't think that is necessary.

It seems like this is basically ready to merge, right? I see a few minor things I'd like to change, but nothing major; I'll just tweak those when I merge everything.

@makise-homura
Copy link
Contributor Author

I'm worried about someone doing something like

Yes, it is the exact case I'm expecting to be affected by using diag_suppress/diag_default. But I think, one who uses diag_suppress in some way in the start of file, might expect there can be diag_default somewhere, if he still get the suppressed warnings. And obviously it's a bit weird way (albeit legit for a "dirty hack") to suppress possible warnings in a single file rather than the whole build; one should use it for a piece of file that is problematic, not for the whole file. So the way I used it, is the best one of all possible ways in the case of not having push/pop, I guess.

It might be a good idea to replace those !defined(SIMDE_ENABLE_OPENMP) checks in simde-common.h with !defined(SIMDE_ENABLE_OPENMP) && !defined(SIMDE_DISABLE_OPENMP) to give people the option of turning it off

Yes, I did like this. So now OpenMP SIMD is enabled by default, and can be disabled if -DSIMDE_DISABLE_OPENMP is given. I edited README according to this also.

It seems like this is basically ready to merge, right?

Yes, now it's really ready to merge I guess. Today's commits fixed the every single thing remaining until E2K support could be considered implemented, so unless you have any change requests for it, I think these changes can land into master.

BTW, github still shows me that there's one more unresolved change requested, but I can't see it here.

@nemequ
Copy link
Member

nemequ commented Apr 30, 2021

Sorry it took so long, but this is almost done. As of a few days ago everything (416c243 and 269db2a) works on e2k, but the tests generate a lot of those -Wreduced-alignment warnings. For now you can get rid of them with -Wno-reduced-alignment and everything will work as expected.

I slightly changed how the calls to "deprecated" functions were made, mostly just wrapping them in statement exprs to squash the error, and I moved some stuff around a bit.

@makise-homura
Copy link
Contributor Author

Oops, sorry for long wait, I've totally forgot about this PR, and only few days ago I've been reminded of it. Actually I was like 'it is being merged, so it's all ok with it', and while I've had no further notifications, I forgot about it.
Yeah, I guess -Wreduced-alignment is a specific of e2k, that might be fixed in some new versions of compiler (1.25.19 seems to have this already fixed, but I'm not sure of completeness), so it looks not a way to fix in SIMDe itself.
Should then I fix something else, or we may proceed to final merge?

@a1batross
Copy link

bump! cc @nemequ @makise-homura

@nemequ
Copy link
Member

nemequ commented Feb 23, 2022

Sorry, I've been away from SIMDe for a while but I'm trying to get back into it now. I'll take a look at this over the weekend.

My memory is a bit foggy on the details, but I seem to remember that everything except for one set of changes for one issue (-Wreduced-alignment false positive?) has been merged... if that's not necessary with the latest version of the compiler my inclination is to not work around it in SIMDe and just ask people to use the latest compiler, especially if it's just to silence a warning and not required to get the tests to compile and pass. For a niche architecture / compiler I think that's okay.

@a1batross
Copy link

Another bump :)

@nemequ @makise-homura

@Torinde
Copy link
Contributor

Torinde commented Apr 9, 2024

Can someone point to place(s) where the supported instructions in the various e2k-v1/2/3/4/... are listed?

  • for Lintel (full emulation of a system, booting into standard x86 or x86-64 operating system like Windows)
  • for RTC (emulation suitable for x86 or x86-64 executables, booting into Elbrus Linux)

What I see in various other tickets and websites is conflicting, e.g. is AVX supported or not, which models support what, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants