-
Added sub-styling into the prototype. It does basically the same thing as the initial styling, just removes connections with weight 1 to set second-level styles, with weight 2 to set third-level ones, etc. With the weights as in the example (which can be fixed, actually, but it doesn't matter much), the result looks like this:

Two households, both alike in dignity, | Однажды две веронские семьи,
^^^ ~~~~~~~~~~~ ```` ````` `` ```````` | ^^^ ''''''''' ~~~~~~
^^ ~~~~~~~~ | `````````
In fair Verona, where we lay our scene, | Во всем имея равные заслуги,
'' '''' ''''''' | `` ```` ```` `````` ````````
``````` | ^^^^^^ ~~~~~~~~
From ancient grudge break to new mutiny, | Умыли руки в собственной крови,
---- ------- ------ """"" "" """ """"""" | """"" """" " """"""""""" """"""
| ---- ''''''
Where civil blood makes civil hands unclean. | Храня предубежденье друг о друге.
""""" """"" """"" """"" """"" """""""" | ----- ------------- ---- - ------
''''' ----- |

Perhaps more interesting calculations can be done if it'd also take into account how many original words a translated word references, but the current one is rather simple, and apparently works well enough to achieve the goal.
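As a rough illustration of the idea (a made-up toy example in Python, not the prototype's code): each translated word carries (original word, weight) references, and level k keeps only the references with weight greater than k before grouping connected words.

# Toy data: translated word 0 references original word 0 (weight 1),
# translated word 1 references original words 1 (weight 2) and 2 (weight 1).
token_map = [[(0, 1)], [(1, 2), (2, 1)], [(3, 1)]]

def level_edges(token_map, min_weight):
    # (translated word, original word) pairs that survive at a given level
    return [(t, o) for t, refs in enumerate(token_map)
            for (o, w) in refs if w > min_weight]

print(level_edges(token_map, 0))  # all references: first-level groups
print(level_edges(token_map, 1))  # only weight > 1: second-level groups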
-
Here is a possible example of the mapping's concrete syntax (the parser/grammar is defined below):
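For illustration (a made-up mapping based on the grammar below, not one taken from the texts above), a mapping for a three-word translation could be

0 1,2 +1:2

meaning: the first translated word references original word 0; the second references original words 1 and 2; the third references original word 3 (written relatively, as the previous position 2 plus 1) with weight 2. Positions are zero-based indices of whitespace-separated words, and a missing weight defaults to 1.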
And the whole prototype, for textual styling as in the examples above:

import Data.Graph
import Data.List
import Data.Maybe
import Text.Parsec
import Numeric
import System.Environment
-- A graph node: a word position in either the original text (Orig) or the
-- translation (Trans).
data Key = Orig Int | Trans Int
  deriving (Eq, Ord, Show)
-- Connected components (of more than one word) of the graph linking each
-- translated word to the original words it references, keeping only the
-- references with weight above minWeight.
subgr :: [[(Int, Int)]] -> Int -> [[Key]]
subgr tokenMap minWeight =
  let maxOrig = foldl max 0 $ map fst $ concat tokenMap
      (graph, nodeFromVertex, vertexFromKey) = graphFromEdges $
        map (\n -> (Orig n, Orig n, [])) [0..maxOrig] ++
        zipWith (\refs n -> (Trans n, Trans n,
                             mapMaybe (\(to, w) -> if w > minWeight
                                                   then Just (Orig to)
                                                   else Nothing) refs))
                tokenMap [0..]
      toKey :: Tree Vertex -> [Key]
      toKey (Node n []) = [(\(k, _, _) -> k) $ nodeFromVertex n]
      toKey (Node n f) =
        ((\(k, _, _) -> k) $ nodeFromVertex n) : concatMap toKey f
  in filter (\x -> length x > 1) $ map toKey $ components graph
-- Pick a style for the given word at every level: level d cycles through its
-- style alphabet over the components of the graph restricted to weights
-- above d.
stylesForKey :: [[(Int, Int)]] -> [[a]] -> Key -> [Maybe a]
stylesForKey tokenMap styles k = zipWith
  (\s d -> snd <$> find (elem k . fst) (zip (subgr tokenMap d) (cycle s)))
  styles [0..]

-- Pair every token of a text with its per-level styles.
styledTokens :: [[(Int, Int)]] -> (Int -> Key) -> [String]
             -> [[a]] -> [(String, [Maybe a])]
styledTokens tm kf ts styles =
  zipWith (\t n -> (t, stylesForKey tm styles (kf n))) ts [0..]
type Parser = Parsec String Int

-- Mapping syntax: space-separated groups of comma-separated references, each
-- reference being a (possibly relative) position with an optional ":weight"
-- suffix.
pMapping :: Parser [[(Int, Int)]]
pMapping = refs `sepBy` space
  where
    refs :: Parser [(Int, Int)]
    refs = ref `sepBy` char ','
    ref :: Parser (Int, Int)
    ref = (,) <$> num <*> option 1 (char ':' >> num)
    num :: Parser Int
    num = do
      rel <- optionMaybe $ oneOf "+-"
      digits <- many1 digit
      prev <- getState
      case readDec digits of
        [(n, "")] ->
          let n' = case rel of
                     Just '+' -> prev + n
                     Just '-' -> prev - n
                     Nothing -> n
          in putState n' >> pure n'
        _ -> error "Failed to read a number"
-- Walk through a text, accumulating the current text line (au) and one line
-- of styling marks per level (ad) underneath it; flush them on newlines.
applyStyles :: [(String, [Maybe Char])] -> String
            -> (String, [String]) -> String
applyStyles _ "" (au, ad) = unlines (au:ad)
applyStyles toks (' ':s) (au, ad) =
  applyStyles toks s (au ++ " ", map (++ " ") ad)
applyStyles toks ('\n':s) (au, ad) =
  unlines (au:ad) ++ applyStyles toks s ("", map (const "") ad)
applyStyles ((tok, style):toks) s (au, ad) =
  case stripPrefix tok s of
    Just s' -> applyStyles toks s'
      (au ++ tok,
       zipWith (\ad' st -> ad' ++ replicate (length tok) st)
               ad $ map (fromMaybe ' ') style)
    Nothing -> "error"

-- Put two multi-line strings side by side, padding the left one.
sideBySide :: String -> String -> String
sideBySide s1 s2 = unlines $ zipWith
  (\l1 l2 -> l1 ++ replicate (maxLen - length l1) ' ' ++ " | " ++ l2)
  (lines s1) (lines s2)
  where maxLen = foldl max 0 $ map length (lines s1)
main :: IO ()
main = do
  args <- getArgs
  case args of
    [origFn, transFn, mapFn] -> do
      str1 <- readFile origFn
      str2 <- readFile transFn
      mappingStr <- readFile mapFn
      let tokenMap' = runParser pMapping 0 mapFn mappingStr
      case tokenMap' of
        Left err -> print err
        Right tokenMap -> do
          let maxWeight = foldl max 0 $ map snd $ concat tokenMap
              t1 = applyStyles
                     (styledTokens tokenMap Orig (words str1)
                                   (replicate maxWeight "^~`'-\",._"))
                     str1 ("", replicate maxWeight "")
              t2 = applyStyles
                     (styledTokens tokenMap Trans (words str2)
                                   (replicate maxWeight "^~`'-\",._"))
                     str2 ("", replicate maxWeight "")
          putStr $ sideBySide t1 t2
    _ -> do
      pn <- getProgName
      putStrLn $ "Usage: " ++ pn ++ " <original> <translation> <mapping>"
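Assuming the source is saved as, say, prototype.hs (the file names here are just for illustration), it can be run with something like "runghc prototype.hs original.txt translation.txt mapping.txt", which prints the styled original text and its translation side by side.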
-
Had to adjust the prototype, including adding more styles, to handle actual song lyrics (on which I've finally tried it; picked the ones that are more or less word for word, from https://lyricstranslate.com/en/gruppa-krovi-gruppa-krovi-blood-type.html-0#songtranslation):
Here's the mapping it took, with no weights higher than 1 this time:
Now I'm thinking that it may be handy to allow relative references (e.g., "130 +1 +1" instead of "130 131 132"), especially for the repeating bits, which are relatively common in song lyrics. Update: introduced relative positions (which can be mixed with absolute ones), now that mapping looks like this:
-
This is super cool. Two questions:
-
> 1. what happens in case the original text gets edited/changed? (the
> whole mapping will break, or only a part of it)?
It maps positions into positions, so if words are added or removed,
positions will change, and the mapping should be adjusted accordingly.
With the relative mappings (the "[+-]n" ones) it'd take fewer edits than
with absolute ones.
> 2. how do you think we should store this map data along the
> translations: as base64, in metadata, as a separate file?
A separate file may be more handy for editing and manual inspection.
Base-64 doesn't seem very useful here, since there's a textual syntax,
and it can be put on a single line as it is.
Yet another option is to combine it with the translations themselves (e.g.,
for translation "foo bar baz" and mapping "1 3 2", a combined version
may be "foo/1 bar/3 baz/2"), but that'd make translations less readable
as plaintext files, which isn't nice.
Perhaps we should also consider how to add
footnotes/explanations/translation notes (similar to those on
genius.com: explaining references, slang terms, etc.) and try to use a
unified (or at least somewhat consistent) style/approach for all these
additional bits. Actually this makes me wonder once again whether XML
would be more suitable than a custom syntax here (maybe could define
both, and a translation between them), since XML grammars for footnotes
and mappings can be combined rather easily, without defining special
rules for that.
Though as long as it'd be machine-readable, we can change it and update
files automatically later.
-
Non-relative references may not be needed at all: if one word gets added/removed on line one, the whole map will need to be adjusted. I'm also thinking about indefinite articles:
may likely have to be
since definite articles seem to be marked along with the noun they belong to
I'm also trying to figure out if punctuation marks should be counted as part of the word or not (e.g. in Spanish there's that inverted question mark in front of the sentence, that may have to be mapped to the question mark at the end of the sentence when translated into English). How often do you think it may happen that references would jump between line numbers and paragraphs?
-
> I'm also trying to figure out if punctuation marks should be counted as part of
> the word or not (e.g. in Spanish there's that inverted question mark in front
> of the sentence, that may have to be mapped to the question mark at the end of
> the sentence when translated into English).
I don't think that how punctuation is handled is that important: it's
perhaps the easiest part of a foreign language to understand, and not
exactly pronounced on its own in songs. But if we were more pedantic,
there are other things to worry about -- such as languages that don't
split words with spaces, RTL scripts, better handling of compound words.
But this relatively rough and simple approach would cover many cases
reasonably well anyway, and can be improved later.
> How often do you think it may happen that references would jump between line
> numbers and paragraphs?
Not very often if we pick word for word translations, but more often
with poetic ones (as evident from the examples above).
-
I'm ready to implement this, but need to confirm first if we want to make the point of relativity to be in the beginning of every line, or rather keep it to be the beginning of the document.
-
> I'm ready to implement this, but need to confirm first if we want to make the
> point of relativity to be in the beginning of every line, or rather keep it to
> be the beginning of the document. +1/+0 works, but I'm trying to think about
> potential changes in the original text and how they may affect the text's
> translation map. Wouldn't want the diff to be the size of the map, not like
> there're GUI tools to create these maps, and doing that by hand because of one
> word being added/removed could be very frustrating.
Taking lines into account would add constraints for translations and
complicate the model a bit, making it less flexible. Perhaps a simple
version (only taking words into account) would suffice at first, and if
we want to switch to positions relative to line beginnings, it can be
done automatically later. Besides, as shown in the first example above,
some of the translated words may be on a line other than the ones they
are translations of.
Could also consider referencing lines + words inside those (absolutely
or relatively for line numbers, e.g., "12:4", "-1:3"), but more complex
syntax and semantics may make it more confusing.
-
The translation directory structure is in place, and I'm ready to enhance lyrics-website-generator.git to implement this. Where should we start?
-
Could start with rewriting this in Python. The inputs are the original text, the translation, and the mapping, and the output is a formatted/styled original text along with its translation. Apparently my Python installation is not in a usable state (the system repositories only provide Python 2 packages, and pip says that "PyPI's XMLRPC API is currently disabled due to unmanageable load"), so here is an outline instead of code. According to an old SO answer, the major Python graph libraries are igraph bindings and NetworkX; the latter may be preferable for possibly being easier to set up. Not sure about Python parsing libraries; I found them to be a pain to use about 5 years ago (but the grammar here is simple, so perhaps even regexps will do).
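For what it's worth, a minimal sketch of the component-finding part on top of NetworkX might look roughly like this (the function and node naming are made up here, and it assumes the mapping is already parsed into a list of (position, weight) reference lists, one per translated word):

import networkx as nx

def mapping_components(token_map, min_weight=1):
    # Words of the original text and of the translation become graph nodes,
    # sufficiently heavy references become edges, and connected components
    # are the groups of words to style together.
    g = nx.Graph()
    for trans_idx, refs in enumerate(token_map):
        for orig_idx, weight in refs:
            if weight >= min_weight:
                g.add_edge(('trans', trans_idx), ('orig', orig_idx))
    return list(nx.connected_components(g))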
-
Don't remember if I mentioned this before, but I think that after the POC is implemented, we may want to see if it's feasible to add an annotation for what kind of language the original line is in; that way the mapping could indicate that info for multilingual texts.
-
Not sure how to add it into the generator (should

import re
from enum import Enum, auto

# A node is a word either of the original text or of the translation.
class NodeType(Enum):
    ORIG = auto()
    TRANS = auto()

# Reference groups are separated by whitespace, references within a group by
# commas; each reference is an optionally relative position with an optional
# ":weight" suffix.
p_inter_refs = re.compile(r'\s')
p_inter_ref = re.compile(',')
p_ref = re.compile(r'^(?P<sign>[-+]?)(?P<position>\d+)(?::(?P<weight>\d+))?$')
# Parse the textual mapping into a list of reference lists: one list of
# (position, weight) pairs per translated word.
def parse_mapping(s):
    refs_num = 0
    ref_num = 0
    cur_pos = 0
    result = list()
    for refs in p_inter_refs.split(s):
        l = list()
        if refs:
            for ref in p_inter_ref.split(refs):
                m = p_ref.match(ref)
                if m:
                    (sign, pos, weight) = m.groups()
                    p = int(pos)
                    w = int(weight) if weight else 1
                    if sign == '+':
                        cur_pos += p
                    elif sign == '-':
                        cur_pos -= p
                    else:
                        cur_pos = p
                    l.append((cur_pos, w))
                else:
                    raise Exception('Failed to parse a reference at {}:{}'.
                                    format(refs_num, ref_num))
                ref_num += 1
        result.append(l)
        refs_num += 1
        ref_num = 0
    return result
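# An added usage illustration (not part of the original comment): a mapping
# with an absolute, a grouped, and a relative weighted reference parses as
#   parse_mapping('0 1,+1:2')  ->  [[(0, 1)], [(1, 1), (2, 2)]]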
def get_components(l, threshold):
    # build a graph
    g = dict()
    for i in range(len(l)):
        g[(NodeType.TRANS, i)] = list()
        for (to, w) in l[i]:
            if w >= threshold:
                g[(NodeType.TRANS, i)].append((NodeType.ORIG, to))
                if (NodeType.ORIG, to) in g:
                    g[(NodeType.ORIG, to)].append((NodeType.TRANS, i))
                else:
                    g[(NodeType.ORIG, to)] = list([(NodeType.TRANS, i)])
    # find connected components
    components = list()
    while g:
        (key, queue) = g.popitem()
        if queue:
            component = set([key])
            while queue:
                k = queue.pop()
                component.add(k)
                if k in g:
                    queue += g.pop(k)
            components.append(component)
    return components

Each component points at words in the original text and its translation, so we can assign styles to components and assign those to corresponding words while rendering (perhaps by setting classes, and then tweaking it via CSS), and iterate a few times to handle different weights.
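As a possible next step (just a sketch under assumptions made here, including the class naming), the rendering side could map each word to a list of CSS classes derived from the components at each weight level, leaving the actual appearance to a stylesheet:

def token_classes(token_map, max_weight):
    # Hypothetical helper: for every (NodeType, index) node, collect one class
    # name per component it belongs to at each weight level.
    classes = dict()
    for level in range(1, max_weight + 1):
        for n, component in enumerate(get_components(token_map, level)):
            for node in component:
                classes.setdefault(node, []).append(
                    'group-{}-level-{}'.format(n, level))
    return classes

classes[(NodeType.TRANS, 3)] would then list the classes for the fourth translated word, and similarly (NodeType.ORIG, i) keys cover the original text.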