Create alignment.md #459

Open
wants to merge 9 commits into
base: master
Changes from 7 commits
26 changes: 24 additions & 2 deletions customisation/alignment.md
@@ -1,6 +1,28 @@
---
parent: Customisation
layout: coming_soon
title: Alignment
description:
description: Linking corresponding sentences in the input and output languages
---

**Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages.

Alignment can be used to create [parallel data](/customisation/parallel-data.md).
The aligned parallel corpora are then used to train machine translation models.
Collaborator

Could we link the term "train" to training, even though it doesn't exist yet?

The goal is to improve machine translation accuracy through pattern and regularity recognition in data.
Collaborator

Maybe to make it simpler:

The goal is to improve machine translation accuracy by recognizing patterns and their frequency in data.

It may be a silly update, but the term "regularity", although accurate, made me think of academic/research speech.


## Approaches

- In manual alignment, human translators align corresponding [segmented sentences](/concepts/sentence-splitting.md) in the input and output languages.
- Rule-based approaches use linguistic rules and patterns, such as word order, syntactic properties, punctuation, and sentence boundaries.
- Statistical approaches rely on algorithms that find and analyse relationship patterns in comparable corpora.
The statistical relationships are based on the likelihood of observing alignments in a training corpus.
- With neural approaches, alignment is learned automatically through [neural networks](/approaches/neural-machine-translation.md#neural-networks).
The neural models can be based on various encoder-decoder architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or [transformer](/approaches/transformers.md) models.
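
As a rough illustration of automatic alignment, the sketch below pairs sentences using only their character lengths and a small dynamic program. It is a toy heuristic in the spirit of length-based aligners, not the method of any particular tool; the function name `align`, the `BEADS` list, and the `SKIP_PENALTY` value are assumptions made for this example.

```python
# Toy length-based sentence aligner (illustrative assumptions throughout).

# Allowed alignment units: (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]
SKIP_PENALTY = 50  # hypothetical extra cost for leaving a sentence unaligned


def align(src, tgt):
    """Return (source indices, target indices) pairs that minimise length mismatch."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    # Fill the table row by row; every predecessor of a cell is visited first.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                step = abs(src_len - tgt_len)
                if di == 0 or dj == 0:
                    step += SKIP_PENALTY
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk back from the final cell to recover the chosen alignment units.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


source = ["Hello.", "How are you today? I am fine."]
target = ["Hallo.", "Wie geht es dir heute?", "Mir geht es gut."]
print(align(source, target))
```

Here the second source sentence is paired with two target sentences because their combined length is the closest match. Length alone ignores meaning, so a sketch like this is subject to the challenges listed in the next section.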

## Challenges
Collaborator

Do you think examples would be helpful? I am thinking specifically of the second and, especially, the last item in this list.


- Aligning sentences with varying lengths, punctuation, and complex structures can be challenging for alignment algorithms.
- Many words and phrases can have multiple meanings or form idiomatic expressions.
Semantic ambiguity can trigger inaccurate sentence alignments.
- Typological similarities of languages can result in sentence pairs that share highly similar linguistic properties but have different meanings and translations.
Collaborator

Maybe a comma before "but"?

Contributor Author

I think there's no need for it as the subject doesn't change.

Similarity-based interference can lead to incorrect alignments.
3 changes: 1 addition & 2 deletions customisation/parallel-data.md
@@ -21,8 +21,7 @@ Parallel data sets can be created manually, automatically, or created synthetica
- Human [post-editing](../workflows/post-editing.md)
- [Crawling](crawling.md)
- [Alignment](alignment.md)

Parallel data can be created by crawling and aligned monolingual text, and by [back-translation](back-translation.md) or [back-copying](back-translation.md).
Contributor Author

I suggest removing the whole sentence here, as it mostly repeats the previously mentioned points. As for creating parallel data by aligned monolingual text, I'm not sure if it's relevant here, as monolingual data alignment is used to create comparable corpora in a single language.

Collaborator

Yes, you are right!

- [Back-translation](back-translation.md) or [back-copying](back-translation.md)

### Goals
