Added a ton of Russian systems (no single standard :/) and reworked the way some of the processing is done.
zedseven · Jan 31, 2021 · cd90387
1 parent 7f54621
commit cd90387
Showing 37 changed files with 2,252 additions and 303 deletions.
diff --git a/Documentation/articles/supported.md b/Documentation/articles/supported.md
@@ -1,7 +1,7 @@
 # Supported Languages and Systems
 The goal of Romanization.NET is to provide a simple, extensive way to romanize widely-used languages as accurately as possible.
 
-Below is a list of all supported languages and systems, with explanations of caveats and limitations if necessary.
+Below is a list of all supported languages and systems, with explanations of caveats and limitations if necessary. Languages are ordered lexicographically.
 
 
 
@@ -71,3 +71,71 @@ Only one reading type is supported, which is the Hangeul equivalent pronunciatio
 
 #### Additional Notes
 Because the goal of this package is, as the name suggests, romanization, the implementation also includes a function for first converting the Hanja to Hangeul, then romanizing the Hangeul using the system of your choice.
+
+
+
+## Russian
+At the time of writing, Russian has no single international standard for romanization/transliteration. Instead, different groups use different systems for different purposes. As a result, this library implements many systems whose transliterations are very similar.
+
+### [BGN/PCGN](https://en.wikipedia.org/wiki/BGN/PCGN_romanization_of_Russian)
+Developed jointly by the United States Board on Geographic Names and the Permanent Committee on Geographical Names for British Official Use, this system is designed to be easier for anglophones to pronounce.
+
+Because of this, it's likely a solid choice for romanizing text specifically for English speakers (US/CA/UK audience).
+
+
+### [GOST 7.79-2000 System A](https://en.wikipedia.org/wiki/GOST_7.79-2000) / [ISO 9](https://en.wikipedia.org/wiki/ISO_9)
+GOST 7.79-2000(A) focuses on mapping one Cyrillic character to one Latin character, potentially with diacritics.
+
+ISO 9:1995 is the current standard for Slavic transliteration from the ISO, and is based on ISO/R 9:1968.
+
+The two systems are functionally identical and in this library are combined into one, under the name of GOST 7.79-2000 System A. This is to retain consistency with the other GOST systems included, as it may be strange to have GOST 7.79-2000 System B but have A under a different name.
+
+
+### [GOST 7.79-2000 System B](https://en.wikipedia.org/wiki/GOST_7.79-2000)
+In contrast to the above, GOST 7.79-2000(B) focuses on mapping one Cyrillic character to potentially several Latin characters (e.g. `щ -> shh`), but without the use of diacritics.
+
+
+### [GOST 16876-71 Table 1 (UNGEGN)](https://en.wikipedia.org/wiki/GOST_16876-71)
+GOST 16876-71(1) focuses on mapping one Cyrillic character to one Latin character, potentially with diacritics.
+
+It was recommended by the [United Nations Group of Experts on Geographical Names (UNGEGN)](https://en.wikipedia.org/wiki/United_Nations_Group_of_Experts_on_Geographical_Names) in 1987.
+
+GOST 16876-71 was most recently updated in 1980, and was abandoned in favour of GOST 7.79-2000 in 2002 by the Russian Federation.
+
+
+### [GOST 16876-71 Table 2](https://en.wikipedia.org/wiki/GOST_16876-71)
+GOST 16876-71(2) is another table in GOST 16876-71, and focuses on mapping one Cyrillic character to potentially several Latin characters (e.g. `щ -> shh`), but without the use of diacritics.
+
+
+### [Scholarly/Scientific Transliteration](https://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic)
+The Scholarly transliteration system covers many Slavic languages, Russian being one of them. It tries to preserve the pronunciation of the original characters while remaining unambiguous about its transformations.
+
+
+### [ISO Recommendation No. 9 (ISO/R 9:1968)](https://en.wikipedia.org/wiki/ISO_9#ISO/R_9)
+Similar to the scholarly system, ISO/R 9 was created in 1954 and updated in 1968. It also supports many Slavic languages, and was the ISO's earliest adoption of scholarly transliteration.
+
+
+### [American Library Association and Library of Congress (ALA-LC) System](https://en.wikipedia.org/wiki/ALA-LC_romanization_for_Russian)
+This system was initially established in 1904, and has remained largely unchanged since 1941. It is primarily used in US, Canadian, and British libraries.
+
+This system uses some diacritics and uses two-letter tie characters for some Cyrillic characters.
+
+
+### [British Standard 2979:1958](https://en.wikipedia.org/wiki/Romanization_of_Russian#British_Standard)
+It is the main system of Oxford University Press, and was used by the British Library up until 1975.
+
+The ALA-LC system is now used by the British Library instead.
+
+
+### [ICAO Doc 9303](https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf)
+Created by the International Civil Aviation Organization, a UN agency, the document is designed to make travel documents machine-readable.
+
+It contains tables for transliteration to Latin characters from many alphabets, including Cyrillic. The system uses no diacritics whatsoever, only standard ASCII characters.
+
+The system was put into effect by the Russian government in 2013 for all citizen passports.
+
+
+### [General Road Signs](https://en.wikipedia.org/wiki/Romanization_of_Russian#Road_signs_note)
+This is the system generally used to romanize text on road signs and the like.
+
+This originally followed GOST 10807-78 (tables 17, 18), but now follows GOST R 52290-2004 (tables Г.4, Г.5).
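
The contrast the article draws between GOST 7.79-2000 System A (one Latin character per Cyrillic character, possibly with a diacritic) and System B (possibly several Latin characters, no diacritics) can be sketched as below. This is only an illustration: `GostSketch`, `Apply`, `ToSystemA`, and `ToSystemB` are hypothetical names, the tables are four-letter excerpts rather than the library's real tables, and only lowercase input is handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

internal static class GostSketch
{
    // Excerpt in the style of System A: exactly one Latin character per
    // Cyrillic character, using diacritics (written as escapes so the
    // precomposed code points are unambiguous).
    private static readonly Dictionary<char, string> SystemA = new Dictionary<char, string>
    {
        ['ж'] = "\u017E", // ž
        ['ч'] = "\u010D", // č
        ['ш'] = "\u0161", // š
        ['щ'] = "\u015D", // ŝ
    };

    // Excerpt in the style of System B: plain ASCII, possibly several
    // Latin characters per Cyrillic character.
    private static readonly Dictionary<char, string> SystemB = new Dictionary<char, string>
    {
        ['ж'] = "zh",
        ['ч'] = "ch",
        ['ш'] = "sh",
        ['щ'] = "shh",
    };

    // Replaces every character found in the table and passes the rest through unchanged.
    private static string Apply(Dictionary<char, string> table, string text)
    {
        var result = new StringBuilder(text.Length);
        foreach (char c in text)
            result.Append(table.TryGetValue(c, out string latin) ? latin : c.ToString());
        return result.ToString();
    }

    public static string ToSystemA(string text) => Apply(SystemA, text);
    public static string ToSystemB(string text) => Apply(SystemB, text);

    private static void Main()
    {
        Console.WriteLine(ToSystemA("щ").Length); // 1
        Console.WriteLine(ToSystemB("щ"));        // shh
    }
}
```

The output length is the practical difference: System A keeps a one-to-one character count at the cost of non-ASCII output, while System B stays within ASCII but lengthens the text.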
diff --git a/README.md b/README.md
@@ -6,10 +6,10 @@
 
 A library for [romanization](https://en.wikipedia.org/wiki/Romanization) of widely-used languages using common romanization systems.
 
-Still a work in progress. Originally made as part of the [NUSRipper](https://github.com/zedseven/NusRipper) project.
+Still a work in progress.
 
 ## Supported Languages & Documentation
-At the moment Romanization.NET supports Chinese, Japanese, and Korean, with individual romanization systems supported for each.
+At the moment Romanization.NET supports Chinese, Japanese, Korean, and Russian, with individual romanization systems supported for each.
 
 For a comprehensive breakdown of supported languages and systems, [check out the full article](Documentation/articles/supported.md).
 

diff --git a/Romanization/IRomanizationSystem.cs b/Romanization/IRomanizationSystem.cs
@@ -8,6 +8,14 @@ namespace Romanization
 	/// </summary>
 	public interface IRomanizationSystem
 	{
+		/// <summary>
+		/// Whether this is a transliteration system, which is more concerned with preserving the characters of a language than the sounds.<br />
+		/// Some languages only have transliteration systems.<br />
+		/// For more information, visit:
+		/// <a href='https://en.wikipedia.org/wiki/Transliteration'>https://en.wikipedia.org/wiki/Transliteration</a>
+		/// </summary>
+		public bool TransliterationSystem { get; }
+
 		/// <summary>
 		/// The system-specific function that romanizes text according to the system's rules.
 		/// </summary>

diff --git a/Romanization/LanguageAgnostic.cs b/Romanization/LanguageAgnostic.cs
diff --git a/Romanization/LanguageAgnostic/CharSub.cs b/Romanization/LanguageAgnostic/CharSub.cs
@@ -0,0 +1,43 @@
+using System.Text.RegularExpressions;
+
+namespace Romanization.LanguageAgnostic
+{
+	internal interface ISub
+	{
+		public string Replace(string text);
+	}
+
+	internal class CharSub : ISub
+	{
+		private readonly Regex  _findRegex;
+		private readonly string _substitution;
+
+		public CharSub(string pattern, string substitution, bool ignoreCase = true)
+		{
+			_findRegex    = new Regex(pattern, ignoreCase ? RegexOptions.Compiled | RegexOptions.IgnoreCase : RegexOptions.Compiled);
+			_substitution = substitution;
+		}
+
+		public string Replace(string text)
+			=> _findRegex.Replace(text, _substitution);
+	}
+
+	internal class CharSubCased : ISub
+	{
+		private readonly Regex  _findRegexUpper;
+		private readonly Regex  _findRegexLower;
+		private readonly string _substitutionUpper;
+		private readonly string _substitutionLower;
+
+		public CharSubCased(string patternUpper, string patternLower, string substitutionUpper, string substitutionLower)
+		{
+			_findRegexUpper    = new Regex(patternUpper, RegexOptions.Compiled);
+			_findRegexLower    = new Regex(patternLower, RegexOptions.Compiled);
+			_substitutionUpper = substitutionUpper;
+			_substitutionLower = substitutionLower;
+		}
+
+		public string Replace(string text)
+			=> _findRegexLower.Replace(_findRegexUpper.Replace(text, _substitutionUpper), _substitutionLower);
+	}
+}
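
A short sketch of how the substitution classes above might be chained. With `ignoreCase` left at its default, `CharSub` replaces either case of a match with the same string, so capitalization is lost. `CharSubCased` avoids that by taking a separate pattern and replacement per case. The snippet repeats `ISub` and `CharSubCased` so that it compiles on its own; `SubChainDemo`, `Chain`, and `Apply` are hypothetical names, and the two letter pairs follow GOST 7.79-2000 System B.

```csharp
using System;
using System.Text.RegularExpressions;

internal interface ISub
{
    string Replace(string text);
}

// Standalone copy of CharSubCased from the diff above.
internal sealed class CharSubCased : ISub
{
    private readonly Regex  _findRegexUpper;
    private readonly Regex  _findRegexLower;
    private readonly string _substitutionUpper;
    private readonly string _substitutionLower;

    public CharSubCased(string patternUpper, string patternLower, string substitutionUpper, string substitutionLower)
    {
        _findRegexUpper    = new Regex(patternUpper, RegexOptions.Compiled);
        _findRegexLower    = new Regex(patternLower, RegexOptions.Compiled);
        _substitutionUpper = substitutionUpper;
        _substitutionLower = substitutionLower;
    }

    // Uppercase matches are replaced first, then lowercase ones.
    public string Replace(string text)
        => _findRegexLower.Replace(_findRegexUpper.Replace(text, _substitutionUpper), _substitutionLower);
}

internal static class SubChainDemo
{
    // Hypothetical two-entry chain; a real system would list the whole alphabet.
    private static readonly ISub[] Chain =
    {
        new CharSubCased("Щ", "щ", "Shh", "shh"),
        new CharSubCased("И", "и", "I", "i"),
    };

    // Runs every substitution in order over the text.
    public static string Apply(string text)
    {
        foreach (ISub sub in Chain)
            text = sub.Replace(text);
        return text;
    }

    private static void Main()
    {
        Console.WriteLine(Apply("Щи щи")); // Shhi shhi
    }
}
```

Because each substitution runs over the output of the previous one, order matters as soon as a replacement can produce text that a later pattern matches.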
diff --git a/Romanization/LanguageAgnostic/Constants.cs b/Romanization/LanguageAgnostic/Constants.cs
@@ -0,0 +1,26 @@
+using System.Text.RegularExpressions;
+
+// ReSharper disable CommentTypo
+
+namespace Romanization.LanguageAgnostic
+{
+	/// <summary>
+	/// A global class for language-agnostic functions and constants (things that are independent of specific languages).
+	/// </summary>
+	internal static class Constants
+	{
+		// General Constants
+		public const string Vowels              = "aeiouy";
+		public const string Consonants          = "bcdfghjklmnpqrstvwxz";
+		public const string Punctuation         = @"\.?!";
+		public const char   IdeographicFullStop = '。';
+		public const char   Interpunct          = '・';
+
+		// Replacement Characters
+		public const string MacronA = "ā";
+		public const string MacronE = "ē";
+		public const string MacronI = "ī";
+		public const string MacronO = "ō";
+		public const string MacronU = "ū";
+	}
+}