The normalize function doesn't return 'm' for 'rn', 'r' followed by 'n' #36

Open
lazydog2 opened this issue Oct 12, 2023 · 2 comments

@lazydog2

Calling normalize on 'rn' ('r' followed by 'n') does not include 'm' in the returned list, even though confusable_characters('m') includes 'rn' in its results. Presumably normalize only considers single characters. For example:

>>> normalize('rn')
['rn']  # should return ['m', 'rn']
>>> confusable_characters('m')
['𝑚', 'μ', 'ᗰ', 'Ḿ', '𝘔', 'Ⲙ', '𝓂', 'ℳ', '𝑴', '𝓜', 'м', '𝚳', '𝚖', '𝙼', '𝜧', 'Ꮇ', '𝑀', '𝘮', '𝕞', '𝓶', '𑜀', 'm', '𝗺', '𝞛', '𝗆', 'ᛖ', '𝛭', 'Ⅿ', 'M', 'rn', 'µ', '𝙢', 'ṃ', '𝐦', 'ꭑ', 'Ṃ', '𝕸', 'ⲙ', '𝒎', 'ḿ', 'ϻ', 'ꓟ', 'M', 'm', 'ꮇ', 'ⅿ', '𐊰', '𝔐', 'Ϻ', 'Μ', '𝝡', 'Ṁ', '𝖬', '𝐌', '𐌑', '𝔪', '𝙈', '𝕄', 'ṁ', '𑣣', 'М', '𝖒', '𝗠']
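
The suspected cause can be illustrated with a toy lookup table. The map and function below are hypothetical stand-ins, not the library's actual data or implementation. They only show why a lookup that walks the input one character at a time never consults a two-character key such as 'rn'.

```python
# Hypothetical toy map from a confusable sequence to its "normal" forms.
# This is NOT the library's CONFUSABLE_MAP, only an illustration.
TOY_MAP = {
    "rn": ["m"],
    "r": ["r"],
    "n": ["n"],
    "м": ["m"],  # Cyrillic small letter em
}


def per_char_normalize(text):
    """Normalize one character at a time, so multi-character keys are never looked up."""
    results = [""]
    for ch in text:
        # Characters missing from the map are kept unchanged.
        options = TOY_MAP.get(ch, [ch])
        results = [prefix + option for prefix in results for option in options]
    return results


print(per_char_normalize("rn"))  # ['rn']  ('m' is missed)
print(per_char_normalize("м"))   # ['m']
```

If normalize works along these lines, the 'rn' entry is present in the data but can never match, which would be consistent with the output above.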
@lazydog2
Author

lazydog2 commented Oct 16, 2023

This version of normalize seems to handle multi-character confusables and closely matches the behavior of the existing normalize function:

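import string
from copy import copy
from itertools import product  # used below but not imported in the snippet as posted

# is_ascii, NON_NORMAL_ASCII_CHARS and CONFUSABLE_MAP are assumed to already be in
# scope, as they would be inside the library module that defines normalize.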

def normalize(string, prioritize_chars=False, prioritized_char_set=string.ascii_lowercase):
    cache = {}
    for (k, v) in {key:set([value2.lower() for value2 in value if all(is_ascii(char) and char not in NON_NORMAL_ASCII_CHARS and (not prioritize_chars or char in prioritized_char_set) for char in value2) and (len(key) == 1 or key != value2)]) for (key,value) in CONFUSABLE_MAP.items() if string.lower().startswith(key.lower()) and (len(key) == 1 or key != value)}.items():
        cache[k.lower()].extend(v) if k.lower() in cache else cache.update({k.lower():list(v)})
    for x in range(1, len(string)):
        completed_string = string[0:x]
        remaining_string = string[x:]
        matching_confusables = {}
        for (k, v) in {key:set([value2.lower() for value2 in value if all(is_ascii(char) and char not in NON_NORMAL_ASCII_CHARS and (not prioritize_chars or char in prioritized_char_set) for char in value2) and (len(key) == 1 or key != value2)]) for (key,value) in CONFUSABLE_MAP.items() if remaining_string.lower().startswith(key.lower())}.items():
            matching_confusables[k.lower()].update(v) if k.lower() in matching_confusables else matching_confusables.update({k.lower():set(v)})
        for (k, v) in matching_confusables.items():
            normal_forms = [product(cache[completed_string], v)]
            cache_key = f'{completed_string}{k}'
            cache[cache_key].extend(normal_forms) if cache_key in cache else cache.update({cache_key:normal_forms})
        del cache[completed_string]
    
    for temp in next_string(cache[string]):
        yield temp

def next_string(node):
    if isinstance(node, tuple) and isinstance(node[0], str):
        yield f'{node[0]}{node[1]}'
    elif isinstance(node, tuple) and isinstance(node[0], product):
        for temp in next_string(copy(node[0])):
            yield f'{temp}{node[1]}'
    else:
        for temp in node:
            if isinstance(temp, tuple) and isinstance(temp[0], str):
                yield f'{temp[0]}{temp[1]}'
            elif isinstance(temp, tuple) and isinstance(temp[0], product):
                for temp2 in next_string(copy(temp[0])):
                    yield f'{temp2}{temp[1]}'
            else:
                for temp2 in copy(temp):
                    if isinstance(temp2, tuple) and isinstance(temp2[0], str):
                        yield f'{temp2[0]}{temp2[1]}'
                    else:
                        for temp3 in next_string(copy(temp2)):
                            yield f'{temp3}'
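
For comparison, the same idea (try every key that matches at the current position, including multi-character ones, then combine with the normal forms of the rest of the string) can be written as a short recursive function. This is only a sketch under stated assumptions: TOY_MAP and normalize_sketch are hypothetical names, the map is a tiny stand-in for the library's data, and unlike the snippet above it does no lowercasing or ASCII filtering.

```python
from functools import lru_cache

# Hypothetical toy map: confusable sequence -> acceptable normal forms.
TOY_MAP = {
    "rn": ["m", "rn"],
    "r": ["r"],
    "n": ["n"],
    "vv": ["w", "vv"],
    "v": ["v"],
}
MAX_KEY_LEN = max(len(key) for key in TOY_MAP)


@lru_cache(maxsize=None)
def normalize_sketch(text):
    """Return every normal form of text, trying keys of each length at each position."""
    if not text:
        return ("",)
    forms = set()
    for size in range(1, min(MAX_KEY_LEN, len(text)) + 1):
        head, tail = text[:size], text[size:]
        # Unknown single characters map to themselves; unknown longer chunks are skipped.
        options = TOY_MAP.get(head, [head] if size == 1 else [])
        for option in options:
            for rest in normalize_sketch(tail):
                forms.add(option + rest)
    return tuple(sorted(forms))


print(normalize_sketch("rn"))    # ('m', 'rn')
print(normalize_sketch("burn"))  # ('bum', 'burn')
```

Memoizing on the remaining suffix keeps repeated suffixes from being recomputed, although the number of results can still grow quickly for long inputs with many confusable segments.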

@drothlis

drothlis commented Feb 27, 2024

To clarify, is_confusable('rn', 'm') does work as expected (it returns True). It's only normalize that doesn't.
