Database collation concerns from Spookerton #3646
MistakeNot4892
started this conversation in
General
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Copied from Spook's ticket: #3018
Meta, I guess.
Database collation is not exactly a modern performance concern at ss13 scale. Because nebula aims to serve communities across languages and character sets it might still be a discussion worth having: collation is important for symbol comparison.
In #1318 back at the start of 2021 I mentioned utf8mb4_unicode_520_ci as a more up to date collation for text-centric tables, which are the most commonly used for ss13ish purposes. The PR's base utf8mb4 was still an upgrade over latin_swedish, and good. In the last couple of years things have moved on - but sadly not for the common better.
Mysql is still iterating on 8.0 and still supports the same utf8mb4_0900_ai_ci. Mariadb has recently (mid 2022) released 10.10 (the new lts, last month) 10.11. They also skipped matching mysql after all that time by releasing a unicode 14 collation set with 10.10 onward, preferring uca1400_ai_ci.
On premise: the later the unicode version referenced in the collation, in theory the more natural the sorting is and the better the grouping of those symbols is. To add - derived from mysql naming, the "ai/s" and "ci/s" suffixes stand for accent and case insensitivity/sensitivity.
As an example, in cases like "ted" and Téd", you probably want them to show up together in a search for either. Base utf8mb4 is completely unaware of both similarities, and that's just within extended latin.
So!
utf8mb4_unicode_520_ci is sadly still the last common best text collation. Given the choice of shipping least-worst, or duplicate-best, least-worst probably comes out on top for convenience? So, utf8mb4_unicode_520_ci is a good upgrade path for the collation indicated by the sql files the repo ships in sql/*.
Beta Was this translation helpful? Give feedback.
All reactions