From b1fbd37d4961d8fe5281b5d5758f712b01350cb6 Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Tue, 28 Mar 2023 23:20:12 +0200 Subject: [PATCH 1/8] Create alignment.md --- customisation/alignment.md | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index 5e6cf222d..a02d9d5f2 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -1,6 +1,40 @@ --- parent: Customisation -layout: coming_soon title: Alignment -description: +description: Linking corresponding units in the source and target languages --- + +**Alignment** is the process of identifying and linking the corresponding text units in the source and target languages. +Data sets can be aligned at the word, phrase, or sentence level. + +Alignment can be used to create [parallel data](/customisation/parallel-data.md). +The aligned parallel corpora are then used to train machine translation models. +The goal is to help the machine translation system accurately translate text from one language to another by recognising patterns and regularities in the data. + +#### Example + +Sentences are [split](/concepts/sentence-splitting.md) into smaller [tokens](/concepts/token.md). + +English: `The` `book` `is` `on` `the` `table` `.` + +German: `Das` `Buch` `liegt` `auf` `dem` `Tisch` `.` + +By identifying the corresponding words, the two example sentences are aligned and used as [training data](/customisation/training-data.md) for the machine translation system. + +### Approaches + +Machine translation systems use various alignment approaches to link two data sets at different granularity levels. + +- In manual alignment, bilingual human translators align corresponding text [segments](/concepts/segment.md) in the source and target languages. 
+- [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns to align words and phrases in two languages. +- The [statistical machine translation](/approaches/statistical-machine-translation.md) models rely on statistical algorithms that find relationships between words and phrases in the source and target languages. +The statistical relationships are based on the likelihood of observing alignments in a training corpus. +- In [neural machine translation](/approaches/neural-machine-translation.md), alignment is learned automatically through [neural networks](/approaches/neural-machine-translation#neural-networks.md). +The neural models can be based on various encoder-decoder architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or [transformer](/approaches/transformers.md) models. + +### Challenges + +- The word order, sentence structure, and punctuation can differ significantly between languages, making it challenging to align sentences at the word or phrase level. +- Many words and phrases can have multiple meanings or form idiomatic expressions. +Semantic ambiguity can result in inaccurate sentence alignments. +- Out-of-vocabulary (OOV) words that are not present in the machine translation system [vocabulary](/concepts/vocabulary.md) can result in errors in the translation. 
From 17d9b3c2937072ca8a7aeddaf88657226a595cb6 Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Wed, 29 Mar 2023 10:23:19 +0200 Subject: [PATCH 2/8] Update alignment.md --- customisation/alignment.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index a02d9d5f2..694d59cee 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -19,7 +19,7 @@ English: `The` `book` `is` `on` `the` `table` `.` German: `Das` `Buch` `liegt` `auf` `dem` `Tisch` `.` -By identifying the corresponding words, the two example sentences are aligned and used as [training data](/customisation/training-data.md) for the machine translation system. +By identifying the corresponding words, such as `book` and `Buch` or `table` and `Tisch`, the two example sentences are aligned and used as [training data](/customisation/training-data.md) for the machine translation system. ### Approaches From 0e9e61e5774297c14b4d8ceaa2e95b112cc5aeaf Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Wed, 29 Mar 2023 17:41:01 +0200 Subject: [PATCH 3/8] Minor edits --- customisation/parallel-data.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/customisation/parallel-data.md b/customisation/parallel-data.md index 9ffc3775e..4dbc49e7f 100644 --- a/customisation/parallel-data.md +++ b/customisation/parallel-data.md @@ -21,8 +21,7 @@ Parallel data sets can be created manually, automatically, or created synthetica - Human [post-editing](../workflows/post-editing.md) - [Crawling](crawling.md) - [Alignment](alignment.md) - -Parallel data can be created by crawling and aligned monolingual test, and by [back-translation](back-translation.md) or [back-copying](back-translation.md). 
+- [Back-translation](back-translation.md) or [back-copying](back-translation.md) ### Goals From c41f5473747a08c82f42848bc0cf82cd98c556e1 Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Wed, 29 Mar 2023 21:10:59 +0200 Subject: [PATCH 4/8] Fixes --- customisation/alignment.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index 694d59cee..4ca277beb 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -29,7 +29,7 @@ Machine translation systems use various alignment approaches to link two data se - [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns to align words and phrases in two languages. - The [statistical machine translation](/approaches/statistical-machine-translation.md) models rely on statistical algorithms that find relationships between words and phrases in the source and target languages. The statistical relationships are based on the likelihood of observing alignments in a training corpus. -- In [neural machine translation](/approaches/neural-machine-translation.md), alignment is learned automatically through [neural networks](/approaches/neural-machine-translation#neural-networks.md). +- In [neural machine translation](/approaches/neural-machine-translation.md), alignment is learned automatically through [neural networks](/approaches/neural-machine-translation.md#neural-networks). The neural models can be based on various encoder-decoder architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or [transformer](/approaches/transformers.md) models. 
### Challenges From 219b0388b22248576fae3fb0c9c8d17d46b8a59e Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Thu, 30 Mar 2023 17:07:09 +0200 Subject: [PATCH 5/8] New edits --- customisation/alignment.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index 4ca277beb..a7cf8897d 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -1,17 +1,17 @@ --- parent: Customisation title: Alignment -description: Linking corresponding units in the source and target languages +description: Linking corresponding units in the input and output languages --- -**Alignment** is the process of identifying and linking the corresponding text units in the source and target languages. +**Alignment** is the process of identifying and linking the corresponding text units in the input and output languages. Data sets can be aligned at the word, phrase, or sentence level. Alignment can be used to create [parallel data](/customisation/parallel-data.md). The aligned parallel corpora are then used to train machine translation models. -The goal is to help the machine translation system accurately translate text from one language to another by recognising patterns and regularities in the data. +The goal is to improve machine translation accuracy through pattern and regularity recognition in data. -#### Example +### Example Sentences are [split](/concepts/sentence-splitting.md) into smaller [tokens](/concepts/token.md). @@ -19,20 +19,22 @@ English: `The` `book` `is` `on` `the` `table` `.` German: `Das` `Buch` `liegt` `auf` `dem` `Tisch` `.` -By identifying the corresponding words, such as `book` and `Buch` or `table` and `Tisch`, the two example sentences are aligned and used as [training data](/customisation/training-data.md) for the machine translation system. 
+In word-level alignment, the corresponding words, such as `book` and `Buch`, or `table` and `Tisch` are identified, aligned and used as [training data](/customisation/training-data.md). -### Approaches +Phrase-level alignment matches relative phrases, such as `on the table` and `auf dem Tisch`. -Machine translation systems use various alignment approaches to link two data sets at different granularity levels. +In sentence-level alignment, the entire sentences are linked, such as `The book is on the table.` and `Das Buch liegt auf dem Tisch.` -- In manual alignment, bilingual human translators align corresponding text [segments](/concepts/segment.md) in the source and target languages. -- [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns to align words and phrases in two languages. -- The [statistical machine translation](/approaches/statistical-machine-translation.md) models rely on statistical algorithms that find relationships between words and phrases in the source and target languages. +## Approaches + +- In manual alignment, human translators align corresponding text [segments](/concepts/segment.md) in the input and output languages. +- [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns. +- The [statistical machine translation](/approaches/statistical-machine-translation.md) models rely on statistical algorithms that find relationships between words and phrases in two languages. The statistical relationships are based on the likelihood of observing alignments in a training corpus. - In [neural machine translation](/approaches/neural-machine-translation.md), alignment is learned automatically through [neural networks](/approaches/neural-machine-translation.md#neural-networks). 
The neural models can be based on various encoder-decoder architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or [transformer](/approaches/transformers.md) models. -### Challenges +## Challenges - The word order, sentence structure, and punctuation can differ significantly between languages, making it challenging to align sentences at the word or phrase level. - Many words and phrases can have multiple meanings or form idiomatic expressions. From 20a259bba31e7b2277d47e4c988c590584f4ef06 Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Thu, 22 Jun 2023 19:34:44 +0200 Subject: [PATCH 6/8] Update alignment.md Made several changes to eliminate word and phrase-level alignment from the article, as sentence alignment is more relevant to machine translation. --- customisation/alignment.md | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index a7cf8897d..910c64315 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -1,42 +1,28 @@ --- parent: Customisation title: Alignment -description: Linking corresponding units in the input and output languages +description: Linking corresponding sentences in the input and output languages --- -**Alignment** is the process of identifying and linking the corresponding text units in the input and output languages. -Data sets can be aligned at the word, phrase, or sentence level. +**Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages. Alignment can be used to create [parallel data](/customisation/parallel-data.md). The aligned parallel corpora are then used to train machine translation models. The goal is to improve machine translation accuracy through pattern and regularity recognition in data. 
-### Example - -Sentences are [split](/concepts/sentence-splitting.md) into smaller [tokens](/concepts/token.md). - -English: `The` `book` `is` `on` `the` `table` `.` - -German: `Das` `Buch` `liegt` `auf` `dem` `Tisch` `.` - -In word-level alignment, the corresponding words, such as `book` and `Buch`, or `table` and `Tisch` are identified, aligned and used as [training data](/customisation/training-data.md). - -Phrase-level alignment matches relative phrases, such as `on the table` and `auf dem Tisch`. - -In sentence-level alignment, the entire sentences are linked, such as `The book is on the table.` and `Das Buch liegt auf dem Tisch.` - ## Approaches -- In manual alignment, human translators align corresponding text [segments](/concepts/segment.md) in the input and output languages. -- [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns. -- The [statistical machine translation](/approaches/statistical-machine-translation.md) models rely on statistical algorithms that find relationships between words and phrases in two languages. +- In manual alignment, human translators align corresponding [segmented sentences](/concepts/sentence-splitting.md) in the input and output languages. +- Rule-based approaches use linguistic rules and patterns, such as word order, syntactic properties, punctuation, and sentence boundaries. +- The statistical models rely on statistical algorithms that find and analyse relationship patterns in comparable corpora. The statistical relationships are based on the likelihood of observing alignments in a training corpus. -- In [neural machine translation](/approaches/neural-machine-translation.md), alignment is learned automatically through [neural networks](/approaches/neural-machine-translation.md#neural-networks). +- With neural approaches, alignment is learned automatically through [neural networks](/approaches/neural-machine-translation.md#neural-networks). 
The neural models can be based on various encoder-decoder architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or [transformer](/approaches/transformers.md) models. ## Challenges -- The word order, sentence structure, and punctuation can differ significantly between languages, making it challenging to align sentences at the word or phrase level. +- Aligning sentences with varying lengths, punctuation, and complex structures can be challenging for alignment algorithms. - Many words and phrases can have multiple meanings or form idiomatic expressions. -Semantic ambiguity can result in inaccurate sentence alignments. -- Out-of-vocabulary (OOV) words that are not present in the machine translation system [vocabulary](/concepts/vocabulary.md) can result in errors in the translation. +Semantic ambiguity can trigger inaccurate sentence alignments. +- Typological similarities of languages can result in sentence pairs that share highly similar linguistic properties but have different meanings and translations. +Similarity-based interference can lead to incorrect alignments. From 8dd0b409f173e016e0519d57efd63776c229ea04 Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Fri, 23 Feb 2024 13:33:26 +0100 Subject: [PATCH 7/8] Minor changes --- customisation/alignment.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index 910c64315..ba6de8d2d 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -6,18 +6,17 @@ description: Linking corresponding sentences in the input and output languages **Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages. -Alignment can be used to create [parallel data](/customisation/parallel-data.md). +Alignment can be used to create [parallel data](/parallel-data). 
The aligned parallel corpora are then used to train machine translation models. The goal is to improve machine translation accuracy through pattern and regularity recognition in data. ## Approaches -- In manual alignment, human translators align corresponding [segmented sentences](/concepts/sentence-splitting.md) in the input and output languages. -- Rule-based approaches use linguistic rules and patterns, such as word order, syntactic properties, punctuation, and sentence boundaries. -- The statistical models rely on statistical algorithms that find and analyse relationship patterns in comparable corpora. +- In manual alignment, human translators align corresponding [segmented sentences](/sentence-splitting) in the input and output languages. +- Rule-based approaches use explicit heuristic rules, such as sentence length, word order, or other patterns observed in parallel data. +- Statistical models rely on statistical algorithms that find and analyse relationship patterns in comparable corpora. The statistical relationships are based on the likelihood of observing alignments in a training corpus. -- With neural approaches, alignment is learned automatically through [neural networks](/approaches/neural-machine-translation.md#neural-networks). -The neural models can be based on various encoder-decoder architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or [transformer](/approaches/transformers.md) models. +- With neural approaches, alignment is predicted automatically through [neural networks](/neural-machine-translation#neural-networks) by mapping the input and output sentences into [vectors](/vector). 
## Challenges From 715734313bb95c75111e3cacef17095e8b2884f4 Mon Sep 17 00:00:00 2001 From: Lia Shahnazaryan Date: Thu, 29 Feb 2024 17:29:04 +0100 Subject: [PATCH 8/8] Adding examples.md --- customisation/alignment.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/customisation/alignment.md b/customisation/alignment.md index ba6de8d2d..089587570 100644 --- a/customisation/alignment.md +++ b/customisation/alignment.md @@ -7,8 +7,8 @@ description: Linking corresponding sentences in the input and output languages **Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages. Alignment can be used to create [parallel data](/parallel-data). -The aligned parallel corpora are then used to train machine translation models. -The goal is to improve machine translation accuracy through pattern and regularity recognition in data. +The aligned parallel corpora are then used to [train](/training) machine translation models. +The goal is to improve machine translation accuracy by recognizing patterns and their frequency in data. ## Approaches @@ -22,6 +22,8 @@ The statistical relationships are based on the likelihood of observing alignment - Aligning sentences with varying lengths, punctuation, and complex structures can be challenging for alignment algorithms. - Many words and phrases can have multiple meanings or form idiomatic expressions. -Semantic ambiguity can trigger inaccurate sentence alignments. +Semantic ambiguity can trigger inaccurate sentence alignments. +For example, the English idiom `to make a mountain out of a molehill` corresponds to the German phrase `aus einer Mücke einen Elefanten zu machen`, resulting in misalignment. - Typological similarities of languages can result in sentence pairs that share highly similar linguistic properties but have different meanings and translations. Similarity-based interference can lead to incorrect alignments. 
+For example, the sentence `I saw a man with a telescope.` can be interpreted in two ways (either the speaker or the man has the telescope), leading to different translations.
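
The rule-based bullet in the patched article names sentence length as one heuristic for aligning sentences. A minimal sketch of that idea follows, assuming plain character counts and a monotonic alignment. The function name `align_by_length`, the `skip_cost` parameter, and the cost formula are hypothetical illustrations, not part of the patch or of any particular alignment tool.

```python
def align_by_length(src, tgt, skip_cost=1.0):
    """Align two lists of sentences using a length-based heuristic.

    Hypothetical helper for illustration only. A 1-1 link costs the
    relative difference in character length; leaving a sentence
    unaligned costs `skip_cost`. Returns (src_index, tgt_index) pairs.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: cheapest way to align the first i source sentences
    # with the first j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                a, b = len(src[i - 1]), len(tgt[j - 1])
                link = abs(a - b) / max(a, b, 1)
                c = cost[i - 1][j - 1] + link
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i > 0:  # source sentence i-1 stays unaligned
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j > 0:  # target sentence j-1 stays unaligned
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)

    # Walk the back-pointers from the end and keep only the 1-1 links.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = ["Hello.", "The book is on the table."]
german = ["Hallo.", "Ja.", "Das Buch liegt auf dem Tisch."]
print(align_by_length(english, german))
```

Length alone cannot tell whether two sentences mean the same thing, which is why the challenges listed in the article (idioms, ambiguity, similar-looking sentences) still apply to a heuristic like this one.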