Create alignment.md #459

Open
wants to merge 9 commits into
base: master

Contributor

@liashahnazaryan liashahnazaryan commented Mar 28, 2023

Description

Fixes #71

Type of PR

  • Creates the article [Alignment]

Checklist:

@@ -21,8 +21,7 @@ Parallel data sets can be created manually, automatically, or created synthetica
- Human [post-editing](../workflows/post-editing.md)
- [Crawling](crawling.md)
- [Alignment](alignment.md)

Parallel data can be created by crawling and aligned monolingual text, and by [back-translation](back-translation.md) or [back-copying](back-translation.md).
Contributor Author

I suggest removing the whole sentence here, as it mostly repeats the previously mentioned points. As for creating parallel data by aligned monolingual text, I'm not sure if it's relevant here, as monolingual data alignment is used to create comparable corpora in a single language.

Collaborator

Yes, you are right!

Collaborator

@cefoo cefoo left a comment

Thank you so much for your PR, @liashahnazaryan!

I've added some comments, especially to try to avoid repetitions. Let me know what you think!


Alignment can be used to create [parallel data](/customisation/parallel-data.md).
The aligned parallel corpora are then used to train machine translation models.
The goal is to help the machine translation system accurately translate text from one language to another by recognising patterns and regularities in the data.
Collaborator

This sentence may be too long. Perhaps we can rephrase it to something like this (it doesn't have to be exactly like this):


The goal of this task is to allow the machine translation system to recognize patterns and regularities, and their equivalents.

The aligned parallel corpora are then used to train machine translation models.
The goal is to help the machine translation system accurately translate text from one language to another by recognising patterns and regularities in the data.

#### Example
Collaborator

We tend not to use ####.
Perhaps it's better to introduce titles with ## and examples with ###?


German: `Das` `Buch` `liegt` `auf` `dem` `Tisch` `.`

By identifying the corresponding words, such as `book` and `Buch` or `table` and `Tisch`, the two example sentences are aligned and used as [training data](/customisation/training-data.md) for the machine translation system.
Collaborator

This may be a repetition of what we have said in the previous part, when defining alignment.

Perhaps we can rephrase it so that it just introduces new info:


In word-level alignment, the corresponding words, such as book and Buch or table and Tisch, are identified, aligned, and used as training data.

Collaborator

Should we explain phrase- and sentence-level alignment with this sentence too?

---

**Alignment** is the process of identifying and linking the corresponding text units in the source and target languages.
Data sets can be aligned at the word, phrase, or sentence level.
Collaborator

I think the dash is necessary:


"... at the word-, phrase-, or sentence-level"

Contributor Author

I didn't use a hyphen in that sentence because "word", "phrase" and "sentence" modify the noun "level" there. In other cases, e.g. "word-, phrase-, and sentence-level alignment", the hyphenated forms act as adjectives modifying "alignment".

Collaborator

Yes, I think no hyphen is correct in this case.


### Approaches

Machine translation systems use various alignment approaches to link two data sets at different granularity levels.
Collaborator

Perhaps just "Alignment approaches are based on different granularity levels."? Or does it delete important information?

Contributor Author

"Granularity levels" here are used to describe the two data sets that should be aligned. But now that I think about this, the whole sentence sounds redundant, as we wouldn't need an alignment if we knew that the data sets were identical. So I think we can delete the sentence and just pass to enumerating the approaches. What do you think?

Collaborator

Less is more, haha :)
Sure, go ahead.


Machine translation systems use various alignment approaches to link two data sets at different granularity levels.

- In manual alignment, bilingual human translators align corresponding text [segments](/concepts/segment.md) in the source and target languages.
Collaborator

I'd avoid the "bilingual" in "bilingual human translators", as it is implied.

Collaborator

In other articles, we tried to avoid "source" and "target", although they are correct, and to use "input" and "output" languages instead.
Do you think it would be a good idea to add these term preferences to the Style Guide?

Contributor Author

I think it's a good idea, as it can help avoid confusion in cases where the terminology varies with the contributor's preferences.

Machine translation systems use various alignment approaches to link two data sets at different granularity levels.

- In manual alignment, bilingual human translators align corresponding text [segments](/concepts/segment.md) in the source and target languages.
- [Rule-based machine translation](/approaches/rule-based-machine-translation.md) uses linguistic rules and patterns to align words and phrases in two languages.
Collaborator

To avoid repetition:

@liashahnazaryan
Contributor Author

Thank you so much for your PR, @liashahnazaryan!

I've added some comments, especially to try to avoid repetitions. Let me know what you think!

Thank you for the comments, @cefoo!
I've made several changes and responded to your comments where relevant. Hope I haven't missed anything.

@bittlingmayer
Collaborator

I think this article should be only about aligning sentences between a pair of documents, not about aligning words within a pair of sentences.

Or, we should have 2 separate articles, Sentence alignment and Word alignment.

@liashahnazaryan liashahnazaryan marked this pull request as draft April 7, 2023 19:14
Made several changes to eliminate word- and phrase-level alignment from the article, as sentence alignment is more relevant to machine translation.
@liashahnazaryan liashahnazaryan marked this pull request as ready for review June 22, 2023 17:35
Contributor Author

@liashahnazaryan liashahnazaryan left a comment

Deleted the parts about word- and phrase-level alignment so that the article is more relevant to machine translation.

@liashahnazaryan
Contributor Author

Hi, @cefoo! I've made several minor changes to the article. Please let me know what you think :)

Collaborator

@cefoo cefoo left a comment

Hi @liashahnazaryan!
Thank you so much for this update! The article is looking good!!
Tagging @bittlingmayer for his review as well.

**Alignment** is the process of identifying and linking the corresponding sentences in the input and output languages.

Alignment can be used to create [parallel data](/parallel-data).
The aligned parallel corpora are then used to train machine translation models.
Collaborator

Could we link the term "train" to training, even though it doesn't exist yet?


Alignment can be used to create [parallel data](/parallel-data).
The aligned parallel corpora are then used to train machine translation models.
The goal is to improve machine translation accuracy through pattern and regularity recognition in data.
Collaborator

Maybe to make it simpler:

The goal is to improve machine translation accuracy by recognizing patterns and their frequency in data.

It may be a silly update, but the term "regularity", although accurate, made me think of academic/research speech.

The statistical relationships are based on the likelihood of observing alignments in a training corpus.
- With neural approaches, alignment is predicted automatically through [neural networks](/neural-machine-translation#neural-networks) by mapping the input and output sentences into [vectors](/vector).

## Challenges
Collaborator

Do you think examples would be helpful? I am thinking specifically of the second and, especially, the last item in this list.

- Aligning sentences with varying lengths, punctuation, and complex structures can be challenging for alignment algorithms.
- Many words and phrases can have multiple meanings or form idiomatic expressions.
Semantic ambiguity can trigger inaccurate sentence alignments.
- Typological similarities of languages can result in sentence pairs that share highly similar linguistic properties but have different meanings and translations.
Collaborator

Maybe a comma before "but"?

Contributor Author

I think there's no need for it as the subject doesn't change.

@liashahnazaryan
Contributor Author

Hey, @cefoo! Thanks for the comments. I've made several changes. Please let me know what you think about the examples. Do they need more explanation, or are they good to go as they are?
