VR4300 JITTER #1116

Open
krnlyng opened this issue Jan 20, 2025 · 13 comments
Comments

@krnlyng
Contributor

krnlyng commented Jan 20, 2025

Hi, I don't know if anybody is interested in this at all, but i've been working on a dynamic recompiler for mupen64plus.

It's very much WIP and currently x86_64 only, and some code is based on dolphin. I thought it made sense to share a snapshot, although it still needs lots of improvements. That said, some games do run faster with it.

Here is the code:
https://github.com/krnlyng/mupen64plus-core/tree/vr4300_jitter_snapshot

And some details:

WIP VR4300_JITTER.
It's a dynamic recompiler (currently x86_64 only) which is based on ideas
from the dolphin emulator, also reusing some of its code.

It's very much WIP but i thought i'd share a snapshot in case somebody is
interested.

Features:

  • VR4300_JITTER matches the pure interpreter (when compared via the core-compare
    feature).
    • This means all the registers, cycle count etc. match those of the pure
      interpreter, while running at similar speeds to, and sometimes up to
      10%-15% faster than, the other recompilers in mupen (when running in
      unlimited mode and with a graphics plugin modified to only draw every
      3600 frames or so, to avoid GPU bottlenecks).
    • NOTE: This doesn't mean that the cycle count is fully accurate, it just
      matches that of the pure interpreter of mupen. But i hope to do some tests
      sometime in the future and improve the accuracy.
    • NOTE: Ofc this applies only to games which have been thoroughly tested.
  • VR4300_JITTER has a fast block linker implementation (based on the dolphin
    implementation).
  • VR4300_JITTER has a fast dispatcher in assembly using an entry points lookup
    table (similar to dolphin's).
  • VR4300_JITTER has a register cache (based on the dolphin implementation with
    tweaks for vr4300).
    • Supports both float and general purpose registers.
    • Supports register discarding, immediate detection and propagation, etc.
    • Generates optimized code for float registers depending on the FR status bit.
  • VR4300_JITTER has its own instruction decoder at the moment, but this may
    change in the future.
  • VR4300_JITTER has a fastmem implementation.
    • The memory map is laid out as on real hardware and accesses to special
      addresses or virtual addresses are handled via faults (SIGSEGV on Linux).
    • Even virtual addresses (TLB) can be mapped (see MEMMAP_TLB_REGIONS) but
      this currently performs slower than without, so it's disabled until further
      investigations have been done.
    • Loads and stores to RDRAM via immediate addresses (as determined by the
      register cache) are optimized and do not need fault handling.
    • Similarly mi/rsp register accesses are optimized.
  • VR4300_JITTER does not need any hardcoded assembly files, everything is
    generated from code at runtime.
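
The fault-based fastmem idea from the list above can be illustrated with a small sketch. This is not the VR4300_JITTER code: every name here (`fastmem_create`, `fastmem_read32`, `slow_read32`) is hypothetical, and the sketch recovers from the fault with `siglongjmp`, whereas a real JIT would more likely inspect the faulting context and redirect or patch the generated code. It assumes a POSIX system with `mmap` and `sigaction`.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical sketch: guest RAM is mapped directly; the rest of the reserved
 * window is PROT_NONE, so touching it raises a fault that selects the slow path. */
static sigjmp_buf fault_env;
static volatile sig_atomic_t fault_armed = 0;
static int slow_path_hits = 0;

static void on_fault(int sig)
{
    if (fault_armed) {
        fault_armed = 0;
        siglongjmp(fault_env, 1);   /* resume in fastmem_read32 */
    }
    signal(sig, SIG_DFL);           /* not a fastmem access: crash normally */
}

/* Reserve `window` bytes and make only the first `ram` bytes accessible. */
uint8_t *fastmem_create(size_t ram, size_t window)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);   /* some systems report SIGBUS instead */

    void *p = mmap(NULL, window, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mprotect(p, ram, PROT_READ | PROT_WRITE) != 0) {
        munmap(p, window);
        return NULL;
    }
    return p;
}

/* Stand-in for MMIO/TLB handling (hypothetical return value). */
static uint32_t slow_read32(uint32_t addr)
{
    slow_path_hits++;
    return 0xDEAD0000u | (addr & 0xFFFFu);
}

uint32_t fastmem_read32(uint8_t *base, uint32_t addr)
{
    if (sigsetjmp(fault_env, 1) == 0) {
        fault_armed = 1;
        uint32_t v = *(volatile uint32_t *)(base + addr);  /* may fault */
        fault_armed = 0;
        return v;
    }
    return slow_read32(addr);       /* reached via siglongjmp from on_fault */
}

int fastmem_slow_path_hits(void) { return slow_path_hits; }
```

The point of the design is that the common case (a plain RDRAM access) costs a single load with no range check; only the rare special address pays for the fault.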

Un-features:

  • VR4300_JITTER is currently Linux only, but doesn't have to be if someone wants to take
    up other platforms.
  • Some of the instructions are not yet optimized to their full potential.
  • VR4300_JITTER probably has bugs that need to be ironed out.
  • VR4300_JITTER has only been tested on little-endian hosts.
  • Some of the features, like the fast dispatcher, require huge virtual memory
    mappings, which in itself is not a problem, but there is currently no
    fallback for platforms where this can't be supported.
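
To make the virtual-memory requirement concrete, here is a minimal sketch of an entry-point lookup table indexed directly by guest PC. It is an illustration under assumptions, not the layout used in the snapshot: `entry_table`, `entry_table_create`, `entry_table_set` and `dispatch_lookup` are hypothetical names, and the table here covers a guest window starting at address 0. Covering a full 32-bit guest address space this way would need 2^30 slots of 8 bytes each, i.e. 8 GiB of reserved (but mostly untouched) virtual memory on a 64-bit host.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0   /* not available everywhere; harmless to omit */
#endif

typedef void (*block_entry)(void);

typedef struct {
    block_entry *slots;   /* one slot per 4-byte-aligned guest PC */
    size_t count;
} entry_table;

/* Reserve one slot per instruction word in a guest window of `guest_bytes`.
 * Anonymous pages are zero-filled and only committed when first touched, so
 * untouched slots read as NULL (assuming all-bits-zero is a null pointer). */
int entry_table_create(entry_table *t, size_t guest_bytes)
{
    t->count = guest_bytes / 4;
    void *p = mmap(NULL, t->count * sizeof(block_entry),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        t->slots = NULL;
        t->count = 0;
        return -1;
    }
    t->slots = p;
    return 0;
}

/* Record the host entry point of a compiled block. */
void entry_table_set(entry_table *t, uint32_t pc, block_entry fn)
{
    size_t idx = pc >> 2;
    if (idx < t->count)
        t->slots[idx] = fn;
}

/* Returns the compiled entry point, or NULL if the block must be compiled.
 * A table spanning the whole address space could drop the bounds check. */
block_entry dispatch_lookup(const entry_table *t, uint32_t pc)
{
    size_t idx = pc >> 2;
    return idx < t->count ? t->slots[idx] : NULL;
}

/* Stand-in for a compiled block, used only to exercise the table. */
int demo_runs = 0;
void demo_block(void) { demo_runs++; }
```

A fallback for platforms without large mappings would presumably replace the flat table with a hash map or a multi-level table, at the cost of extra work per dispatch.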

There is probably more i can't think of right now.

@krnlyng
Contributor Author

krnlyng commented Jan 21, 2025

If somebody wants to try it (linux only for now) you can build it by specifying VR4300_JITTER=1 as an argument to make.

@m4xw
Contributor

m4xw commented Jan 22, 2025

Hey, cool to see someone else bothering with writing a new JIT as well.
But I have doubts that it would even come close to new dynarec. The 10-15% you see might come from block recompilation, which can happen all the time and has a bit more overhead. To get accurate perf stats you should log all recompiled blocks, have them compiled on startup, and then benchmark afterwards.
Also, there are quite a few perf options inside new dynarec; currently the register optimization passes are quite limited because they are expensive, but they perform really well.
Fwiw, yes, there are a bunch of divergences; we actually fixed quite a lot of them already.

Now I just need to find some people to help me wrap up my AOT recompiler lol

@krnlyng
Contributor Author

krnlyng commented Jan 22, 2025

To test performance, i modified the graphics plugin to render normally for ~60 seconds and then render only every 3600th frame, so that i'm not GPU bottlenecked.

Then i build an unmodified version of the emulator and run it with this plugin and some game.
Then i take the average VI/s in different locations in the game once the 60 seconds are over.

Then i build and run my version of the emulator with the same plugin and game and take the same measurements.
I'm not sure if that's wrong or not, but it gives consistent results even when repeated multiple times (to avoid bias).
I have also seen one game run slightly slower, which i still need to investigate.

I recently pushed a fix which also improves games which modify the FR status bit a lot and some other minor fixes.

Note: I set ACCURATE_FPU=0 for both new_dynarec and vr4300_jitter. new_dynarec does not seem to implement this feature so to be fair i turned it off (please correct me if i'm wrong).

Also i'm comparing the pure interpreter and my jit using the core compare feature (except not while doing performance tests). In some games there are still issues but in some games they are equal (the other recompilers aren't afaict - unless i'm doing something wrong). I have run some games for multiple hours with the compare feature to see if any errors pop up. I haven't done that recently though so it might have regressed.
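
The comparison described above boils down to running both cores in lockstep and diffing their architectural state. A minimal sketch of that diff, assuming a hypothetical `cpu_state` struct; the actual core-compare feature in mupen64plus exchanges state through its own mechanism, and none of these names come from its source:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the state that both cores must agree on. */
typedef struct {
    uint64_t gpr[32];
    uint64_t pc;
    uint32_t count;   /* cycle counter */
} cpu_state;

enum { FIELD_PC = 32, FIELD_COUNT = 33 };

/* Returns -1 if the states match, otherwise the first differing field:
 * 0..31 for a general purpose register, FIELD_PC or FIELD_COUNT. */
int cpu_state_first_mismatch(const cpu_state *ref, const cpu_state *jit)
{
    for (int i = 0; i < 32; i++)
        if (ref->gpr[i] != jit->gpr[i])
            return i;
    if (ref->pc != jit->pc)
        return FIELD_PC;
    if (ref->count != jit->count)
        return FIELD_COUNT;
    return -1;
}

/* Prints a one-line report on mismatch; returns 1 if the states matched. */
int cpu_state_check(const cpu_state *ref, const cpu_state *jit)
{
    int f = cpu_state_first_mismatch(ref, jit);
    if (f < 0)
        return 1;
    if (f < 32)
        fprintf(stderr, "mismatch in r%d: ref=%016llx jit=%016llx\n", f,
                (unsigned long long)ref->gpr[f],
                (unsigned long long)jit->gpr[f]);
    else if (f == FIELD_PC)
        fprintf(stderr, "mismatch in pc: ref=%016llx jit=%016llx\n",
                (unsigned long long)ref->pc, (unsigned long long)jit->pc);
    else
        fprintf(stderr, "mismatch in count: ref=%u jit=%u\n",
                (unsigned)ref->count, (unsigned)jit->count);
    return 0;
}
```

Including the cycle counter in the comparison is what makes this check strict: a recompiler can produce correct register values and still diverge on timing.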

Tbh new_dynarec and vr4300_jitter have comparable speeds. That said, as an example, mario 64 runs (with the above 60 seconds method) at about 4650 VI/s after the intro cutscene with new_dynarec and about 5400 VI/s with vr4300_jitter on my machine, and it's similar in other situations. It may very well be an outlier.
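
For reference, the two figures quoted above work out to a difference of roughly 16% (5400 / 4650 ≈ 1.161). A small sketch of the arithmetic, with hypothetical helper names, for averaging VI/s samples and expressing one core's result relative to another:

```c
#include <stddef.h>

/* Average of VI/s samples taken at different locations in a game. */
double mean_vis(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return n ? sum / (double)n : 0.0;
}

/* Relative speedup of `candidate` over `baseline`, in percent.
 * Positive means the candidate is faster. */
double speedup_percent(double baseline, double candidate)
{
    return (candidate / baseline - 1.0) * 100.0;
}
```

Calling `speedup_percent(4650.0, 5400.0)` gives the roughly 16% mentioned above.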

There are probably still lots of instructions that have untested cases. I hope to implement some sort of unit testing which would compare the instructions (maybe with some randomizations) against what the pure interpreter does at some point, such that i can find all the cases and fix them.
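
The kind of randomized per-instruction test described above could look roughly like the sketch below. It is only an illustration: in a real harness the candidate would be the code emitted by the JIT and the reference would be the pure interpreter, whereas here both are plain C functions with hypothetical names. ADDU is used as the example because its semantics are simple: a 32-bit wrap-around add whose result is sign-extended to 64 bits.

```c
#include <stdint.h>

typedef uint64_t (*binop_fn)(uint64_t rs, uint64_t rt);

/* Reference semantics of ADDU. */
uint64_t ref_addu(uint64_t rs, uint64_t rt)
{
    uint32_t sum = (uint32_t)rs + (uint32_t)rt;
    return (uint64_t)(int64_t)(int32_t)sum;
}

/* Candidate under test (stand-in for JIT output): same result, computed
 * with an explicit sign-extension trick instead of a cast. */
uint64_t cand_addu(uint64_t rs, uint64_t rt)
{
    uint64_t sum = (uint32_t)((uint32_t)rs + (uint32_t)rt);
    return (sum ^ 0x80000000ull) - 0x80000000ull;
}

/* Deliberately wrong candidate: forgets the sign extension. */
uint64_t buggy_addu(uint64_t rs, uint64_t rt)
{
    return (uint32_t)((uint32_t)rs + (uint32_t)rt);
}

/* xorshift64 PRNG so that failures are reproducible from the seed. */
static uint64_t next_rand(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Runs `iterations` random operand pairs through both implementations
 * and returns the number of mismatches. `seed` must be non-zero. */
unsigned differential_test(binop_fn ref, binop_fn cand,
                           unsigned iterations, uint64_t seed)
{
    unsigned mismatches = 0;
    uint64_t state = seed;
    for (unsigned i = 0; i < iterations; i++) {
        uint64_t rs = next_rand(&state);
        uint64_t rt = next_rand(&state);
        if (ref(rs, rt) != cand(rs, rt))
            mismatches++;
    }
    return mismatches;
}
```

Purely random operands tend to miss boundary cases, so a real test would likely also feed fixed edge values such as 0, 0x7FFFFFFF, 0x80000000 and all-ones.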

Ofc there are still glaring issues with vr4300_jitter: no arm64 support at the moment, no support outside of linux (as of yet), lack of widespread testing. However all these things can be addressed.

Also vr4300_jitter does not have some of the optimizations that new_dynarec has, like treating certain blocks differently depending on whether they use 32 or 64 bit registers only and such. I haven't understood this optimization fully yet (as you can probably tell from my ramblings) and therefore it's not implemented yet, still performance is comparable and sometimes better.

I can do other performance comparisons if anybody is interested or has a more rigorous testing method for me.

@krnlyng
Contributor Author

krnlyng commented Jan 22, 2025

And thank you for your reply, i was worried nobody would even look at it. And if there is a more direct way of communication let me know as well.

@m4xw
Contributor

m4xw commented Jan 22, 2025

You might want to test with stop_after_jal = 1 (see new_dynarec.c); this causes the blocks to be shorter and recompilation times to be shorter too. Usually that reduces perf, but in your test it might improve it. In sm64 things even get loaded, invalidated and recompiled all the time due to the overlays, and block size / block linking etc. do have a considerable impact on those measurements.

@loganmc10
Member

@krnlyng Don't worry about comparing it to the new dynarec. That code is an unmaintainable mess and most people involved with mupen64plus would be happy with a better replacement.

If you want to discuss, most of the people involved with mupen64plus are in this Discord server (#emu_mupen channel):
https://discord.gg/esPB38SJ5u

@krnlyng
Contributor Author

krnlyng commented Jan 22, 2025

diff --git a/src/device/r4300/new_dynarec/new_dynarec.c b/src/device/r4300/new_dynarec/new_dynarec.c
index ee5282de..2af5e5c4 100644
--- a/src/device/r4300/new_dynarec/new_dynarec.c
+++ b/src/device/r4300/new_dynarec/new_dynarec.c
@@ -8719,7 +8719,7 @@ void new_dynarec_init(void)
   // Copy this into local area so we don't have to put it in every literal pool
   g_dev.r4300.new_dynarec_hot_state.invc_ptr=g_dev.r4300.cached_interp.invalid_code;
 #endif
-  stop_after_jal=0;
+  stop_after_jal=1;
   // TLB
   using_tlb=0;
   for(n=0;n<524288;n++) // 0 .. 0x7FFFFFFF

Like this? It runs at around 4300VI/s then.

@m4xw
Contributor

m4xw commented Jan 22, 2025

yea like that. I thought that would reduce the overhead for this kind of measurement, but guess it doesn't.
I still have some doubts tho, but don't get me wrong, if this runs faster consider me impressed. New dynarec is basically as fast as it is a total mess. Otherwise just some interrupt-related handling would come to mind

@m4xw
Contributor

m4xw commented Jan 22, 2025

Whats your implementation of the counter reg? did u orient yourself on new dynarec?

@krnlyng
Contributor Author

krnlyng commented Jan 22, 2025

> Whats your implementation of the counter reg? did u orient yourself on new dynarec?

I tried to keep it matching the pure interpreter; the core compare feature wouldn't work otherwise. It for sure can and should be improved.

@m4xw
Contributor

m4xw commented Jan 22, 2025

iirc the ones that desync with new_dynarec had some interpret_ flag, but i can't remember if @Gillou68310 made it fully match back then. We started a bit of an effort back then but we didn't complete it

@m4xw
Contributor

m4xw commented Jan 22, 2025

One thing you could test is how both JITs perform at varying count-per-op levels.
I suspect we might not be triggering enough interrupts to explain this perf gap

@m4xw
Contributor

m4xw commented Jan 22, 2025

oh i didn't see that you did a fastmem implementation. That would explain it. We actually started implementing it for new dynarec before too, but ran into some issues


3 participants