Create alignment.md #459
@@ -21,8 +21,7 @@ Parallel data sets can be created manually, automatically, or created synthetica
- Human [post-editing](../workflows/post-editing.md)
- [Crawling](crawling.md)
- [Alignment](alignment.md)

Parallel data can be created by crawling and aligned monolingual test, and by [back-translation](back-translation.md) or [back-copying](back-translation.md).
I suggest removing the whole sentence here, as it mostly repeats the previously mentioned points. As for creating parallel data by aligned monolingual text, I'm not sure if it's relevant here, as monolingual data alignment is used to create comparable corpora in a single language.
Yes, you are right!
Thank you so much for your PR, @liashahnazaryan!
I've added some comments, especially to try to avoid repetitions. Let me know what you think!
customisation/alignment.md
Outdated

Alignment can be used to create [parallel data](/customisation/parallel-data.md).
The aligned parallel corpora are then used to train machine translation models.
The goal is to help the machine translation system accurately translate text from one language to another by recognising patterns and regularities in the data.
This sentence may be too long. Perhaps we can rephrase it to something like this (it doesn't have to be exactly like this):
The goal of this task is to allow the machine translation system to recognize patterns and regularities, and its equivalents.
customisation/alignment.md
Outdated
The aligned parallel corpora are then used to train machine translation models.
The goal is to help the machine translation system accurately translate text from one language to another by recognising patterns and regularities in the data.

#### Example
We tend not to use `####`.
Perhaps it's better to have titles introduced with `##` and examples with `###`?
customisation/alignment.md
Outdated

German: `Das` `Buch` `liegt` `auf` `dem` `Tisch` `.`

By identifying the corresponding words, such as `book` and `Buch` or `table` and `Tisch`, the two example sentences are aligned and used as [training data](/customisation/training-data.md) for the machine translation system.
This may be a repetition of what we have said in the previous part, when defining `alignment`.
Perhaps we can rephrase it so that it just introduces new info:
In word-level alignment, the corresponding words, such as `book` and `Buch`, or `table` and `Tisch`, are identified, aligned and used as training data.
Should we explain phrase- and sentence-level alignment with this sentence too?
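To make the word-level example in this thread concrete, here is a minimal sketch of how such links could be written down as index pairs. The English sentence is an assumption (only the German tokens are visible in the hunk above), and `links` and `aligned_words` are hypothetical names used for illustration only, not something the article defines.

```python
# Assumed English sentence; only the German tokens appear in the diff above.
english = ["The", "book", "is", "on", "the", "table", "."]
german = ["Das", "Buch", "liegt", "auf", "dem", "Tisch", "."]

# Hypothetical hand-written links as (english_index, german_index) pairs.
# This pair of sentences happens to align one-to-one and in order,
# which is not true of sentence pairs in general.
links = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]


def aligned_words(source_tokens, target_tokens, links):
    """Turn index links into (source_word, target_word) pairs."""
    return [(source_tokens[i], target_tokens[j]) for i, j in links]


pairs = aligned_words(english, german, links)
```

Storing indices rather than the words themselves keeps repeated tokens (such as `The` and `the`) unambiguous.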
customisation/alignment.md
Outdated
---

**Alignment** is the process of identifying and linking the corresponding text units in the source and target languages.
Data sets can be aligned at the word, phrase, or sentence level.
I think the dash is necessary:
"... at the word-, phrase-, or sentence-level"
I didn't use a hyphen in that sentence, as I wanted "word", "phrase" and "sentence" to modify the word "level", while in other cases where I use, e.g., "word-, phrase-, and sentence-level alignment", "word-, phrase-, and sentence-level" are used as adjectives to modify "alignment".
Yes, I think no hyphen is correct in this case.
customisation/alignment.md
Outdated

### Approaches

Machine translation systems use various alignment approaches to link two data sets at different granularity levels.
Perhaps just "Alignment approaches are based on different granularity levels."? Or does it delete important information?
"Granularity levels" here are used to describe the two data sets that should be aligned. But now that I think about this, the whole sentence sounds redundant, as we wouldn't need an alignment if we knew that the data sets were identical. So I think we can delete the sentence and just pass to enumerating the approaches. What do you think?
Less is more, haha :)
Sure, go ahead.
customisation/alignment.md
Outdated

Machine translation systems use various alignment approaches to link two data sets at different granularity levels.

- In manual alignment, bilingual human translators align corresponding text [segments](/concepts/segment.md) in the source and target languages.
I'd avoid the "bilingual" in "bilingual human translators", as it is implied.
In other articles, we tried to avoid "source" and "target", although it's correct, and use "input" and "output" languages.
Do you think it would be a good idea to add these term preferences to the Style Guide?
I think it's a good idea as it can help to avoid confusion around such cases where the terminology varies depending on the contributor's preferences.
customisation/alignment.md
Outdated
Machine translation systems use various alignment approaches to link two data sets at different granularity levels.

- In manual alignment, bilingual human translators align corresponding text [segments](/concepts/segment.md) in the source and target languages.
- [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns to align words and phrases in two languages.
To avoid repetition:
- Rule-based machine translation uses linguistic rules and patterns.
Thank you for the comments, @cefoo!
I think this article should be only about aligning sentences between a pair of documents, not about aligning words within a pair of sentences. Or, we should have 2 separate articles, Sentence alignment and Word alignment.
Made several changes to eliminate word and phrase-level alignment from the article, as sentence alignment is more relevant to machine translation.
Deleted parts about the word and phrase-level alignment from the article to be more relevant to machine translation.
Hi, @cefoo! I've made several minor changes to the article. Please let me know what you think :)
Hi @liashahnazaryan!
Thank you so much for this update! The article is looking good!!
Tagging @bittlingmayer for his review as well.
customisation/alignment.md
Outdated
**Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages.

Alignment can be used to create [parallel data](/parallel-data).
The aligned parallel corpora are then used to train machine translation models.
Could we link the term "train" to `training`, even though it doesn't exist yet?
customisation/alignment.md
Outdated

Alignment can be used to create [parallel data](/parallel-data).
The aligned parallel corpora are then used to train machine translation models.
The goal is to improve machine translation accuracy through pattern and regularity recognition in data.
Maybe to make it simpler:
The goal is to improve machine translation accuracy by recognizing patterns and their frequency in data.
It may be a silly update, but the term "regularity", although accurate, made me think of academic/research speech.
The statistical relationships are based on the likelihood of observing alignments in a training corpus.
- With neural approaches, alignment is predicted automatically through [neural networks](/neural-machine-translation#neural-networks) by mapping the input and output sentences into [vectors](/vector).

## Challenges
Do you think examples would be helpful? I am thinking specifically of the second and, especially, the last item in this list.
- Aligning sentences with varying lengths, punctuation, and complex structures can be challenging for alignment algorithms.
- Many words and phrases can have multiple meanings or form idiomatic expressions.
Semantic ambiguity can trigger inaccurate sentence alignments.
- Typological similarities of languages can result in sentence pairs that share highly similar linguistic properties but have different meanings and translations.
Maybe a comma before "but"?
I think there's no need for it as the subject doesn't change.
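On the question of examples: the neural approach quoted earlier (mapping the input and output sentences into vectors) might be illustrated with something like the sketch below. It is only a toy under stated assumptions. The vectors are hand-made stand-ins for what a real sentence encoder would produce, and `cosine`, `align_sentences`, and the `threshold` value are hypothetical names and settings, not part of the article or of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(input_vectors, output_vectors, threshold=0.5):
    """For each input sentence, pick the most similar output sentence.

    Returns (input_index, output_index, score) triples and skips input
    sentences whose best score does not exceed the threshold.
    """
    pairs = []
    for i, u in enumerate(input_vectors):
        best_j, best_score = None, threshold
        for j, v in enumerate(output_vectors):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((i, best_j, best_score))
    return pairs


# Hypothetical, hand-made vectors: three input sentences, two output sentences.
input_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output_vectors = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]

alignment = align_sentences(input_vectors, output_vectors)
```

Here the third input sentence has no similar counterpart and is left unaligned. This greedy matching ignores sentence order and allows two inputs to pick the same output, so it is a simplification rather than a description of how production aligners work.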
Hey, @cefoo! Thanks for the comments. I've made several changes. Please let me know what you think about the examples. Do they need more explanation, or are they good to go as they are?
Description
Fixes # 71
Type of PR
Checklist: