-
Added sub-styling into the prototype. It does basically the same thing as the initial styling, just removes connections with weight 1 to set second-level styles, with weight 2 to set third-level ones, etc. With the weights as in the example (which can be fixed, actually, but it doesn't matter much), the result looks like this:

Two households, both alike in dignity, | Однажды две веронские семьи,
^^^ ~~~~~~~~~~~ ```` ````` `` ```````` | ^^^ ''''''''' ~~~~~~
^^ ~~~~~~~~ | `````````
In fair Verona, where we lay our scene, | Во всем имея равные заслуги,
'' '''' ''''''' | `` ```` ```` `````` ````````
``````` | ^^^^^^ ~~~~~~~~
From ancient grudge break to new mutiny, | Умыли руки в собственной крови,
---- ------- ------ """"" "" """ """"""" | """"" """" " """"""""""" """"""
| ---- ''''''
Where civil blood makes civil hands unclean. | Храня предубежденье друг о друге.
""""" """"" """"" """"" """"" """""""" | ----- ------------- ---- - ------
''''' ----- |

Perhaps more interesting calculations can be done if it'd also take into account how many original words a translated word references, but the current one is rather simple, and apparently works well enough to achieve the goal.
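As a rough illustration of the idea (a made-up toy example in Python, not the prototype's code): each translated word carries (original word, weight) references, and level k keeps only the references with weight greater than k before grouping connected words.

# Toy data: translated word 0 references original word 0 (weight 1),
# translated word 1 references original words 1 (weight 2) and 2 (weight 1).
token_map = [[(0, 1)], [(1, 2), (2, 1)], [(3, 1)]]

def level_edges(token_map, min_weight):
    # (translated word, original word) pairs that survive at a given level
    return [(t, o) for t, refs in enumerate(token_map)
            for (o, w) in refs if w > min_weight]

print(level_edges(token_map, 0))  # all references: first-level groups
print(level_edges(token_map, 1))  # only weight > 1: second-level groups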
-
Here is a possible example of the mapping's concrete syntax (the parser/grammar is defined below):
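For illustration (a made-up mapping based on the grammar below, not one taken from the texts above), a mapping for a three-word translation could be

0 1,2 +1:2

meaning: the first translated word references original word 0; the second references original words 1 and 2; the third references original word 3 (written relatively, as the previous position 2 plus 1) with weight 2. Positions are zero-based indices of whitespace-separated words, and a missing weight defaults to 1.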
And the whole prototype, for textual styling as in the examples above:

import Data.Graph
import Data.List
import Data.Maybe
import Text.Parsec
import Numeric
import System.Environment
-- A graph node: a word position in either the original text (Orig) or the
-- translation (Trans).
data Key = Orig Int | Trans Int
  deriving (Eq, Ord, Show)
-- Connected components (of more than one word) of the graph linking each
-- translated word to the original words it references, keeping only the
-- references with weight above minWeight.
subgr :: [[(Int, Int)]] -> Int -> [[Key]]
subgr tokenMap minWeight =
  let maxOrig = foldl max 0 $ map fst $ concat tokenMap
      (graph, nodeFromVertex, vertexFromKey) = graphFromEdges $
        map (\n -> (Orig n, Orig n, [])) [0..maxOrig] ++
        zipWith (\refs n -> (Trans n, Trans n,
                             mapMaybe (\(to, w) -> if w > minWeight
                                                   then Just (Orig to)
                                                   else Nothing) refs))
                tokenMap [0..]
      toKey :: Tree Vertex -> [Key]
      toKey (Node n []) = [(\(k, _, _) -> k) $ nodeFromVertex n]
      toKey (Node n f) =
        ((\(k, _, _) -> k) $ nodeFromVertex n) : concatMap toKey f
  in filter (\x -> length x > 1) $ map toKey $ components graph
-- Pick a style for the given word at every level: level d cycles through its
-- style alphabet over the components of the graph restricted to weights
-- above d.
stylesForKey :: [[(Int, Int)]] -> [[a]] -> Key -> [Maybe a]
stylesForKey tokenMap styles k = zipWith
  (\s d -> snd <$> find (elem k . fst) (zip (subgr tokenMap d) (cycle s)))
  styles [0..]

-- Pair every token of a text with its per-level styles.
styledTokens :: [[(Int, Int)]] -> (Int -> Key) -> [String]
             -> [[a]] -> [(String, [Maybe a])]
styledTokens tm kf ts styles =
  zipWith (\t n -> (t, stylesForKey tm styles (kf n))) ts [0..]
type Parser = Parsec String Int

-- Mapping syntax: space-separated groups of comma-separated references, each
-- reference being a (possibly relative) position with an optional ":weight"
-- suffix.
pMapping :: Parser [[(Int, Int)]]
pMapping = refs `sepBy` space
  where
    refs :: Parser [(Int, Int)]
    refs = ref `sepBy` char ','
    ref :: Parser (Int, Int)
    ref = (,) <$> num <*> option 1 (char ':' >> num)
    num :: Parser Int
    num = do
      rel <- optionMaybe $ oneOf "+-"
      digits <- many1 digit
      prev <- getState
      case readDec digits of
        [(n, "")] ->
          let n' = case rel of
                     Just '+' -> prev + n
                     Just '-' -> prev - n
                     Nothing -> n
          in putState n' >> pure n'
        _ -> error "Failed to read a number"
-- Walk through a text, accumulating the current text line (au) and one line
-- of styling marks per level (ad) underneath it; flush them on newlines.
applyStyles :: [(String, [Maybe Char])] -> String
            -> (String, [String]) -> String
applyStyles _ "" (au, ad) = unlines (au:ad)
applyStyles toks (' ':s) (au, ad) =
  applyStyles toks s (au ++ " ", map (++ " ") ad)
applyStyles toks ('\n':s) (au, ad) =
  unlines (au:ad) ++ applyStyles toks s ("", map (const "") ad)
applyStyles ((tok, style):toks) s (au, ad) =
  case stripPrefix tok s of
    Just s' -> applyStyles toks s'
      (au ++ tok,
       zipWith (\ad' st -> ad' ++ replicate (length tok) st)
               ad $ map (fromMaybe ' ') style)
    Nothing -> "error"

-- Put two multi-line strings side by side, padding the left one.
sideBySide :: String -> String -> String
sideBySide s1 s2 = unlines $ zipWith
  (\l1 l2 -> l1 ++ replicate (maxLen - length l1) ' ' ++ " | " ++ l2)
  (lines s1) (lines s2)
  where maxLen = foldl max 0 $ map length (lines s1)
main :: IO ()
main = do
  args <- getArgs
  case args of
    [origFn, transFn, mapFn] -> do
      str1 <- readFile origFn
      str2 <- readFile transFn
      mappingStr <- readFile mapFn
      let tokenMap' = runParser pMapping 0 mapFn mappingStr
      case tokenMap' of
        Left err -> print err
        Right tokenMap -> do
          let maxWeight = foldl max 0 $ map snd $ concat tokenMap
              t1 = applyStyles
                     (styledTokens tokenMap Orig (words str1)
                                   (replicate maxWeight "^~`'-\",._"))
                     str1 ("", replicate maxWeight "")
              t2 = applyStyles
                     (styledTokens tokenMap Trans (words str2)
                                   (replicate maxWeight "^~`'-\",._"))
                     str2 ("", replicate maxWeight "")
          putStr $ sideBySide t1 t2
    _ -> do
      pn <- getProgName
      putStrLn $ "Usage: " ++ pn ++ " <original> <translation> <mapping>"
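Assuming the source is saved as, say, prototype.hs (the file names here are just for illustration), it can be run with something like "runghc prototype.hs original.txt translation.txt mapping.txt", which prints the styled original text and its translation side by side.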
-
Had to adjust the prototype, including adding more styles, to handle actual song lyrics (on which I've finally tried it; picked the ones that are more or less word for word, from https://lyricstranslate.com/en/gruppa-krovi-gruppa-krovi-blood-type.html-0#songtranslation):
Here's the mapping it took, with no weights higher than 1 this time:
Now I'm thinking that it may be handy to allow relative references (e.g., "130 +1 +1" instead of "130 131 132"), especially for the repeating bits, which are relatively common in song lyrics. Update: introduced relative positions (which can be mixed with absolute ones), now that mapping looks like this:
-
This is super cool. Two questions:
-
> 1. what happens in case the original text gets edited/changed? (the
> whole mapping will break, or only a part of it)?
It maps positions into positions, so if words are added or removed,
positions will change, and the mapping should be adjusted accordingly.
With the relative mappings (the "[+-]n" ones) it'd take fewer edits than
with absolute ones.
> 2. how do you think we should store this map data along the
> translations: as base64, in metadata, as a separate file?
A separate file may be more handy for editing and manual inspection.
Base-64 doesn't seem very useful here, since there's a textual syntax,
and it can be put on a single line as it is.
Yet another option is to combine it with the translations themselves (e.g.,
for translation "foo bar baz" and mapping "1 3 2", a combined version
may be "foo/1 bar/3 baz/2"), but that'd make translations less readable
as plaintext files, which isn't nice.
Perhaps we should also consider how to add
footnotes/explanations/translation notes (similar to those on
genius.com: explaining references, slang terms, etc.) and try to use a
unified (or at least somewhat consistent) style/approach for all these
additional bits. Actually this makes me wonder once again whether XML
would be more suitable than a custom syntax here (maybe could define
both, and a translation between them), since XML grammars for footnotes
and mappings can be combined rather easily, without defining special
rules for that.
Though as long as it'd be machine-readable, we can change it and update
files automatically later.
-
Non-relative references may not be needed at all: if one word gets added/removed on line one, the whole map will need to be adjusted. I'm also thinking about indefinite articles:
may likely have to be
since definite articles seem to be marked along with the noun they belong to
I'm also trying to figure out if punctuation marks should be counted as part of the word or not (e.g. in Spanish there's that inverted question mark in front of the sentence, that may have to be mapped to the question mark at the end of the sentence when translated into English). How often do you think it may happen that references would jump between line numbers and paragraphs?
-
> I'm also trying to figure out if punctuation marks should be counted as part of
> the word or not (e.g. in Spanish there's that inverted question mark in front
> of the sentence, that may have to be mapped to the question mark at the end of
> the sentence when translated into English).
I don't think that how punctuation is handled is that important: it's
perhaps the easiest part of a foreign language to understand, and not
exactly pronounced on its own in songs. But if we were more pedantic,
there are other things to worry about -- such as languages that don't
split words with spaces, RTL scripts, better handling of compound words.
But this relatively rough and simple approach would cover many cases
reasonably well anyway, and can be improved later.
> How often do you think it may happen that references would jump between line
> numbers and paragraphs?
Not very often if we pick word for word translations, but more often
with poetic ones (as evident from the examples above).
-
I'm ready to implement this, but need to confirm first if we want to make the point of relativity to be in the beginning of every line, or rather keep it to be the beginning of the document.
-
> I'm ready to implement this, but need to confirm first if we want to make the
> point of relativity to be in the beginning of every line, or rather keep it to
> be the beginning of the document. +1/+0 works, but I'm trying to think about
> potential changes in the original text and how they may affect the text's
> translation map. Wouldn't want the diff to be the size of the map, not like
> there're GUI tools to create these maps, and doing that by hand because of one
> word being added/removed could be very frustrating.
Taking lines into account would add constraints for translations and
complicate the model a bit, making it less flexible. Perhaps a simple
version (only taking words into account) would suffice at first, and if
we want to switch to positions relative to line beginnings, it can be
done automatically later. Besides, as shown in the first example above,
some of the translated words may be on a line other than the ones they
are translations of.
Could also consider referencing lines + words inside those (absolutely
or relatively for line numbers, e.g., "12:4", "-1:3"), but more complex
syntax and semantics may make it more confusing.
-
The translation directory structure is in place, and I'm ready to enhance lyrics-website-generator.git to implement this. Where should we start?
-
Could start with rewriting this in Python. The inputs are the original text, the translation, and the mapping, and the output is a formatted/styled original text along with its translation. Apparently my Python installation is not in a usable state (the system repositories only provide Python 2 packages, and pip says that "PyPI's XMLRPC API is currently disabled due to unmanageable load"), so here is an outline instead of code. According to an old SO answer, the major Python graph libraries are igraph bindings and NetworkX; the latter may be preferable for possibly being easier to set up. Not sure about Python parsing libraries; I found them to be a pain to use about 5 years ago (but the grammar here is simple, so perhaps even regexps will do).
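For what it's worth, a minimal sketch of the component-finding part on top of NetworkX might look roughly like this (the function and node naming are made up here, and it assumes the mapping is already parsed into a list of (position, weight) reference lists, one per translated word):

import networkx as nx

def mapping_components(token_map, min_weight=1):
    # Words of the original text and of the translation become graph nodes,
    # sufficiently heavy references become edges, and connected components
    # are the groups of words to style together.
    g = nx.Graph()
    for trans_idx, refs in enumerate(token_map):
        for orig_idx, weight in refs:
            if weight >= min_weight:
                g.add_edge(('trans', trans_idx), ('orig', orig_idx))
    return list(nx.connected_components(g))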
-
Don't remember if I mentioned this before, but I think that after the POC is implemented, we may want to see if it's feasible to add an annotation for what kind of language the original line is in; that way the mapping could indicate that info for multilingual texts.
-
Not sure how to add it into the generator (should

import re
from enum import Enum, auto

# A node is a word either of the original text or of the translation.
class NodeType(Enum):
    ORIG = auto()
    TRANS = auto()

# Reference groups are separated by whitespace, references within a group by
# commas; each reference is an optionally relative position with an optional
# ":weight" suffix.
p_inter_refs = re.compile(r'\s')
p_inter_ref = re.compile(',')
p_ref = re.compile(r'^(?P<sign>[-+]?)(?P<position>\d+)(?::(?P<weight>\d+))?$')
# Parse the textual mapping into a list of reference lists: one list of
# (position, weight) pairs per translated word.
def parse_mapping(s):
    refs_num = 0
    ref_num = 0
    cur_pos = 0
    result = list()
    for refs in p_inter_refs.split(s):
        l = list()
        if refs:
            for ref in p_inter_ref.split(refs):
                m = p_ref.match(ref)
                if m:
                    (sign, pos, weight) = m.groups()
                    p = int(pos)
                    w = int(weight) if weight else 1
                    if sign == '+':
                        cur_pos += p
                    elif sign == '-':
                        cur_pos -= p
                    else:
                        cur_pos = p
                    l.append((cur_pos, w))
                else:
                    raise Exception('Failed to parse a reference at {}:{}'.
                                    format(refs_num, ref_num))
                ref_num += 1
        result.append(l)
        refs_num += 1
        ref_num = 0
    return result
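# An added usage illustration (not part of the original comment): a mapping
# with an absolute, a grouped, and a relative weighted reference parses as
#   parse_mapping('0 1,+1:2')  ->  [[(0, 1)], [(1, 1), (2, 2)]]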
def get_components(l, threshold):
    # build a graph
    g = dict()
    for i in range(len(l)):
        g[(NodeType.TRANS, i)] = list()
        for (to, w) in l[i]:
            if w >= threshold:
                g[(NodeType.TRANS, i)].append((NodeType.ORIG, to))
                if (NodeType.ORIG, to) in g:
                    g[(NodeType.ORIG, to)].append((NodeType.TRANS, i))
                else:
                    g[(NodeType.ORIG, to)] = list([(NodeType.TRANS, i)])
    # find connected components
    components = list()
    while g:
        (key, queue) = g.popitem()
        if queue:
            component = set([key])
            while queue:
                k = queue.pop()
                component.add(k)
                if k in g:
                    queue += g.pop(k)
            components.append(component)
    return components

Each component points at words in the original text and its translation, so we can assign styles to components and assign those to corresponding words while rendering (perhaps by setting classes, and then tweaking it via CSS), and iterate a few times to handle different weights.
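As a possible next step (just a sketch under assumptions made here, including the class naming), the rendering side could map each word to a list of CSS classes derived from the components at each weight level, leaving the actual appearance to a stylesheet:

def token_classes(token_map, max_weight):
    # Hypothetical helper: for every (NodeType, index) node, collect one class
    # name per component it belongs to at each weight level.
    classes = dict()
    for level in range(1, max_weight + 1):
        for n, component in enumerate(get_components(token_map, level)):
            for node in component:
                classes.setdefault(node, []).append(
                    'group-{}-level-{}'.format(n, level))
    return classes

classes[(NodeType.TRANS, 3)] would then list the classes for the fourth translated word, and similarly (NodeType.ORIG, i) keys cover the original text.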