Create alignment.md #459
base: master
Changes from 8 commits
b1fbd37
17d9b3c
0e9e61e
92a8aff
c41f547
219b038
20a259b
8dd0b40
7157343
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,27 @@ | ||
--- | ||
parent: Customisation | ||
layout: coming_soon | ||
title: Alignment | ||
description: | ||
description: Linking corresponding sentences in the input and output languages | ||
--- | ||
|
||
**Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages. | ||
|
||
Alignment can be used to create [parallel data](/parallel-data). | ||
The aligned parallel corpora are then used to train machine translation models. | ||
The goal is to improve machine translation accuracy through pattern and regularity recognition in data. | ||
Maybe to make it simpler:
It may be a silly update, but the term "regularity", although accurate, made me think of academic/research speech. |
||
|
||
## Approaches | ||
|
||
- In manual alignment, human translators align corresponding [segmented sentences](/sentence-splitting) in the input and output languages. | ||
- Rule-based approaches use explicit heuristic rules, such as sentence length, word order, or other patterns observed in parallel data. | ||
- Statistical models rely on statistical algorithms that find and analyse relationship patterns in comparable corpora. | ||
The statistical relationships are based on the likelihood of observing alignments in a training corpus. | ||
- With neural approaches, alignment is predicted automatically through [neural networks](/neural-machine-translation#neural-networks) by mapping the input and output sentences into [vectors](/vector). | ||
|
||
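As a rough illustration of the rule-based approach in the list above, the sketch below pairs sentences in order and keeps only the pairs whose character-length ratio looks plausible. The names `length_ratio` and `align_by_length`, the threshold of 2.0, and the sample sentences are assumptions made for this example and are not part of the proposed article.

```python
def length_ratio(source: str, target: str) -> float:
    """Return the length of the longer sentence divided by the shorter one."""
    a, b = len(source), len(target)
    if min(a, b) == 0:
        # An empty sentence can never be a plausible match.
        return float("inf")
    return max(a, b) / min(a, b)


def align_by_length(sources, targets, max_ratio=2.0):
    """Pair sentences position by position and drop implausible pairs.

    Assumes both lists are already sentence-split and in the same order.
    The max_ratio threshold is an illustrative value, not a recommendation.
    """
    return [
        (s, t)
        for s, t in zip(sources, targets)
        if length_ratio(s, t) <= max_ratio
    ]


sources = ["The cat sleeps.", "It is raining heavily today."]
targets = ["Le chat dort.", "Oui."]

# The second pair is rejected because one side is far longer than the other.
pairs = align_by_length(sources, targets)
```

A real rule-based aligner would also consider one-to-many and many-to-one matches; this sketch only filters one-to-one candidates.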
## Challenges | ||
Do you think examples would be helpful? I am thinking specifically of the second and, especially, the last item in this list. |
||
|
||
- Aligning sentences with varying lengths, punctuation, and complex structures can be challenging for alignment algorithms. | ||
- Many words and phrases can have multiple meanings or form idiomatic expressions. | ||
Semantic ambiguity can trigger inaccurate sentence alignments. | ||
- Typological similarities of languages can result in sentence pairs that share highly similar linguistic properties but have different meanings and translations. | ||
Maybe a comma before "but"?
I think there's no need for it as the subject doesn't change. |
||
Similarity-based interference can lead to incorrect alignments. |
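
To make the last item more concrete, here is a toy sketch of similarity-based alignment over sentence vectors. The vectors are hand-written placeholders rather than the output of a real model, and `cosine` and `best_match` are hypothetical helpers. Two candidate targets have almost the same vector, so the margin between the best and the second-best score is very small and the chosen alignment is fragile.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(source_vec, target_vecs):
    """Return the index of the most similar target and its margin
    over the runner-up. Expects at least two target vectors."""
    scores = [cosine(source_vec, t) for t in target_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, scores[best] - scores[runner_up]


source_vec = [1.0, 0.0, 0.0]
target_vecs = [
    [0.9, 0.1, 0.0],   # candidate 0
    [0.9, 0.0, 0.12],  # candidate 1: a different sentence with a near-identical vector
    [0.0, 1.0, 0.0],   # candidate 2: clearly unrelated
]

index, margin = best_match(source_vec, target_vecs)
```

Checking the margin, and not only the top score, is one simple way to flag pairs that may need manual review.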
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,8 +21,7 @@ Parallel data sets can be created manually, automatically, or created synthetica | |
- Human [post-editing](../workflows/post-editing.md) | ||
- [Crawling](crawling.md) | ||
- [Alignment](alignment.md) | ||
|
||
Parallel data can be created by crawling and aligned monolingual test, and by [back-translation](back-translation.md) or [back-copying](back-translation.md). | ||
I suggest removing the whole sentence here, as it mostly repeats the previously mentioned points. As for creating parallel data by aligned monolingual text, I'm not sure if it's relevant here, as monolingual data alignment is used to create comparable corpora in a single language.
Yes, you are right! |
||
- [Back-translation](back-translation.md) or [back-copying](back-translation.md) | ||
|
||
### Goals | ||
|
||
|
Could we link the term "train" to training, even though it doesn't exist yet?