Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Switch config values to UTF8 #109418

Merged
merged 17 commits into from
Nov 14, 2024
Merged

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Oct 31, 2024

  • Switch JitConfig strings from UTF16 to UTF8
  • Switch relevant JIT-EE and host methods to UTF8
  • Remove getJitTimeLogFilename JIT-EE call, which seemingly existed to avoid having a config parser in the JIT in release builds, but that ship has sailed

Subsumes #104805

* Switch JitConfig strings from UTF16 to UTF8
* Switch relevant JIT-EE and host methods to UTF8
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 31, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

{
WRAPPER_NO_CONTRACT;

SString str(SString::Utf8, name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SString str(SString::Utf8, name);
StackSString str;
SString(SString::Utf8Literal, name).ConvertToUnicode(str);

This is more efficient way to do the conversions (avoids heap allocations and unnecessary copies).

@jakobbotsch jakobbotsch marked this pull request as ready for review November 1, 2024 08:23
@jakobbotsch
Copy link
Member Author

Hmm, this seems to break the ability to specify file paths with unicode characters on Windows. Probably can't switch to fopen.

@jakobbotsch
Copy link
Member Author

Hmm, this seems to break the ability to specify file paths with unicode characters on Windows. Probably can't switch to fopen.

Actually it works fine with UTF8 enabled, this was operator error.

@jakobbotsch
Copy link
Member Author

PTAL @dotnet/jit-contrib @AaronRobinsonMSFT

@jakobbotsch jakobbotsch requested review from AaronRobinsonMSFT and a team November 1, 2024 10:15
@jkotas
Copy link
Member

jkotas commented Nov 1, 2024

Actually it works fine with UTF8 enabled, this was operator error.

Do you mean that you have to have your Windows configured to CP_ACP == CP_UTF8?.

@jakobbotsch
Copy link
Member Author

Actually it works fine with UTF8 enabled, this was operator error.

Do you mean that you have to have your Windows configured to CP_ACP == CP_UTF8?.

I think so, but I haven't confirmed. I can try to confirm whether this is the case.
Do you think we should continue using _wfopen for Windows to avoid this regression? I thought it might be ok since most of the JIT env vars involving file paths are pretty obscure / power user scenarios.

@jkotas
Copy link
Member

jkotas commented Nov 1, 2024

I thought it might be ok since most of the JIT env vars involving file paths are pretty obscure / power user scenarios.

I think it is ok for debug-only configs. If it is used for documented env vars in release builds, it is a compliance issue. All features have to handle globalization properly, even the obscure ones.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Nov 4, 2024

I think it is ok for debug-only configs. If it is used for documented env vars in release builds, it is a compliance issue. All features have to handle globalization properly, even the obscure ones.

I asked @EgorBo to test this since he has UTF8 disabled, but it appears that this fails much earlier on the VM side when you try to use non-ANSI paths:

Assert failure(PID 864836 [0x000d3244], Thread: 864760 [0xd31f8]): SString::_stricmp(valueNoCache, temp.GetUTF8()) == 0

CORECLR! `anonymous namespace'::EnvGetString + 0x3B8 (0x00007ff8`747f40d8)
CORECLR! `anonymous namespace'::GetConfigString + 0xDB (0x00007ff8`747f46eb)
CORECLR! CLRConfig::GetConfigValue + 0xD1 (0x00007ff8`747f4951)
CORECLR! CLRConfig::GetConfigValue + 0xCF (0x00007ff8`747f4eef)
CORECLR! JitHost::getStringConfigValue + 0x8F (0x00007ff8`741ed90f)
CLRJIT! JitConfigValues::initialize + 0x145C (0x00007ff9`206e70fc)
CLRJIT! jitStartup + 0x7E (0x00007ff9`205d9d6e)
CORECLR! LoadAndInitializeJIT + 0xEE7 (0x00007ff8`7496b267)
CORECLR! EEJitManager::LoadJIT + 0x1C0 (0x00007ff8`7496b650)
CORECLR! UnsafeJitFunction + 0x20D (0x00007ff8`741fe3ed)
    File: C:\prj\runtime-main\src\coreclr\utilcode\clrconfig.cpp:211
    Image: C:\prj\runtime-main\artifacts\bin\coreclr\windows.x64.Checked\CoreRun.exe
ERROR:
Assert failure(PID 864836 [0x000d3244], Thread: 864760 [0xd31f8]): SString::_stricmp(valueNoCache, temp.GetUTF8()) == 0

CORECLR! `anonymous namespace'::EnvGetString + 0x3B8 (0x00007ff8`747f40d8)
CORECLR! `anonymous namespace'::GetConfigString + 0xDB (0x00007ff8`747f46eb)
CORECLR! CLRConfig::GetConfigValue + 0xD1 (0x00007ff8`747f4951)
CORECLR! CLRConfig::GetConfigValue + 0xCF (0x00007ff8`747f4eef)
CORECLR! JitHost::getStringConfigValue + 0x8F (0x00007ff8`741ed90f)
CLRJIT! JitConfigValues::initialize + 0x145C (0x00007ff9`206e70fc)
CLRJIT! jitStartup + 0x7E (0x00007ff9`205d9d6e)
CORECLR! LoadAndInitializeJIT + 0xEE7 (0x00007ff8`7496b267)
CORECLR! EEJitManager::LoadJIT + 0x1C0 (0x00007ff8`7496b650)
CORECLR! UnsafeJitFunction + 0x20D (0x00007ff8`741fe3ed)
    File: C:\prj\runtime-main\src\coreclr\utilcode\clrconfig.cpp:211
    Image: C:\prj\runtime-main\artifacts\bin\coreclr\windows.x64.Checked\CoreRun.exe

(fails on main too)

Edit: This appears to just be a buggy assert, in release builds main seems to work ok, and indeed this PR regresses it.

@jkotas
Copy link
Member

jkotas commented Nov 4, 2024

Assert failure(PID 864836 [0x000d3244], Thread: 864760 [0xd31f8]): SString::_stricmp(valueNoCache, temp.GetUTF8()) == 0
Edit: This appears to just be a buggy assert, in release builds main seems to work ok, and indeed this PR regresses it

This came from https://github.com/dotnet/runtime/pull/59513/files#diff-ca28f8b9cc75904774104e92e308462d908149b3e95288e37acd5b719bf6e0e9R211 cc @AaronRobinsonMSFT

FILE* fopen_utf8(const char* path, const char* mode)
{
#ifdef HOST_WINDOWS
Utf16String<256> pathWide(path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paths on Windows can be longer than 256 characters

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a dynamic allocation inside Utf16String for the longer case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_PATH would be more appropriate here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really agree, this constant is just an optimization to avoid dynamic allocation in the common cases, it is not intended to have any relation to MAX_PATH.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really agree, this constant is just an optimization to avoid dynamic allocation in the common cases, it is not intended to have any relation to MAX_PATH.

The fact that 256 was choosen on Windows means it is precisely for a path and the common case. That is why the MAX_PATH is what is it, 260, because it was the most common path on Windows and indicates as such. If this was 1024 because that was common, then sure, but someone chose 256 and that aligns with MAX_PATH. The runtime is littered with MAX_PATH precisely for the reason being made.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose 256 because that is a 512 byte stack allocation, which was the largest size stack allocation I wanted to go with for the buffer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Given this is defined to only be on Windows and there is a macro with a predefined size for the "common" path on the platform, using MAX_PATH in a variable called pathWide seems like the most appropriate to me. If you think an explicit 256 is clearer then feel free to go with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the 256 simply because the value I picked here only coincidentally is close to MAX_PATH and wasn't intended to have any semantic relevance to its value (it seems like an outdated define nowadays given that the max is much higher).

@jakobbotsch
Copy link
Member Author

With help from @EgorBo we've verified that unicode paths now work again, even without UTF8 code page. @AaronRobinsonMSFT @jkotas can you please take another look?

@jakobbotsch
Copy link
Member Author

Ping @dotnet/jit-contrib for a review on the JIT changes

}
printf("Trying to unroll MemoryExtensions.Equals|SequenceEqual|StartsWith(op1, \"%s\")...\n", utf8Str);
}
#endif
Copy link
Member

@EgorBo EgorBo Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like too much effort to just dump a string in JitDump, I wouldn't bother. We should either remove it or move into some utility, I think it decreases code readability in this file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I added a Compiler::convertUtf16ToUtf8ForPrinting and switched this to use it.

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@amanasifkhalid amanasifkhalid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JIT changes LGTM too, thanks!

src/coreclr/jit/utils.cpp Outdated Show resolved Hide resolved
@@ -1883,29 +1881,15 @@ AssemblyNamesList2::AssemblyNamesList2(const WCHAR* list, HostAllocator alloc)

AssemblyName* newName = new (m_alloc) AssemblyName();

// Null out the current character so we can do zero-terminated string work; we'll restore it later.
*listWalk = W('\0');
size_t nameLen = listWalk - nameStart;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This logic looks safe to me, though I'm a bit surprised that none of the native compilers are complaining about implicit cast from ptrdiff_t to size_t, since the former is signed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched it to ptrdiff_t

@jakobbotsch jakobbotsch merged commit c653208 into dotnet:main Nov 14, 2024
92 of 94 checks passed
@jakobbotsch jakobbotsch deleted the jit-config-utf8 branch November 14, 2024 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants