
Link for downloading the back translation code is not working #108

Open
sgmoo opened this issue Jun 19, 2021 · 5 comments

Comments

@sgmoo

sgmoo commented Jun 19, 2021

While trying to run back_translate/download.sh, I get the following error:

> bash download.sh

--2021-06-19 12:36:11--  https://storage.googleapis.com/uda_model/text/back_trans_checkpoints.zip 
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.8.16, 172.217.9.208, 172.217.12.240, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.8.16|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-06-19 12:36:11 ERROR 404: Not Found.
unzip:  cannot find or open back_trans_checkpoints.zip, back_trans_checkpoints.zip.zip or back_trans_checkpoints.zip.ZIP.

It seems that the storage.googleapis.com/uda_model bucket is no longer valid. Is there an alternate link I can use to download the back-translation checkpoints?

@JosephElHachem

Hello, I am experiencing the same issue and I hope it will be resolved soon!

@sebamenabar

Hi, I have the same problem. Has anybody managed to get the checkpoints?

@YuandZhang

Same issue. Have you solved that problem?

@sebamenabar

sebamenabar commented Sep 23, 2021

Maybe this could be of help: I wrote a small script that produces back-translations with HuggingFace. I have not tested the quality of the generated data, whether it performs well with UDA, or how long it would take to translate the whole dataset, but visually the outputs look good. It works with transformers==4.4.2 and may require some modifications on newer versions.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

torch.cuda.empty_cache()

# English -> French and French -> English Marian models for the round trip.
en_fr_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
en_fr_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr").cuda()

fr_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
fr_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en").cuda()

src_text = [
    "Hi how are you?",
]

# English -> French, sampling with top-k and a high temperature for diversity.
translated_tokens = en_fr_model.generate(
    **{
        k: v.cuda()
        for k, v in en_fr_tokenizer(
            src_text, return_tensors="pt", padding=True, truncation=True, max_length=512
        ).items()
    },
    do_sample=True,
    top_k=10,
    temperature=2.0,
)
in_fr = [en_fr_tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

# French -> English: the round trip yields the paraphrased (back-translated) text.
bt_tokens = fr_en_model.generate(
    **{
        k: v.cuda()
        for k, v in fr_en_tokenizer(
            in_fr, return_tensors="pt", padding=True, truncation=True, max_length=512
        ).items()
    },
    do_sample=True,
    top_k=10,
    temperature=2.0,
)
in_en = [fr_en_tokenizer.decode(t, skip_special_tokens=True) for t in bt_tokens]
```

For the arguments passed to generate, please refer to https://huggingface.co/blog/how-to-generate.
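As a rough intuition for those flags, here is a toy, pure-Python sketch of what top-k sampling with temperature does to a single vector of logits (the function name and implementation are illustrative only, not part of transformers):

```python
import math
import random

def top_k_temperature_sample(logits, k=10, temperature=2.0, rng=None):
    """Sample one token index from `logits`, mimicking the effect of
    do_sample=True, top_k=k, temperature=temperature in generate()."""
    rng = rng or random.Random()
    # Keep only the k highest-scoring token indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature > 1 flattens the distribution (more diverse paraphrases);
    # temperature < 1 sharpens it toward greedy decoding.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    # Draw one index in proportion to the softmax weights.
    return rng.choices(top, weights=weights, k=1)[0]
```

With k=1 this reduces to greedy decoding; raising k and temperature trades fidelity for diversity, which is what back-translation-based augmentation wants.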

Example of input data and backtranslation:

Input: I lived in Tokyo for 7 months. Knowing the reality of long train commutes, bike rides from the train station, soup stands, and other typical scenes depicted so well, certainly added to my own appreciation for this film which I really, really liked. There are aspects of Japanese life in this film painted with vivid colors but you don't have to speak Japanese to enjoy this movie. Director Suo's tricks were subtle for the most part; I found his highlighting the character called Tamako Tamura with a soft filter, making her sublime, a tiny bit contrived but most of the directors tricks were so gentle that I was fully pulled in and just danced with his characters. Or cried. Or laughed aloud. Wonderful. A+.
---
Output: I lived in Tokyo for seven months. I know the reality of train rides, bike rides from the train station, soup stands, and other typical scenes shown so nicely, probably added to my own appreciation of this film I really, really loved. There are aspects of Japanese life in this film painted with vivid colors but you don't have to speak Japanese to enjoy this movie. The pieces of the director Suo have been subtle to most, I found that he highlights the character called Tamaki Tamura with a sweet filter, which makes her sublime, a bit confused but most of the movie-makers' tricks were so soft that I was completely shot in it and just dancing with his characters. Or wept. or laughed aloud. Wonderful. A+.
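On the open question of how long translating a whole dataset would take: calling generate() on batches rather than one sentence at a time helps a lot. A minimal, model-agnostic batching sketch (batched and back_translate are hypothetical helper names; translate_fn stands in for either Marian round-trip step above):

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of `items`."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def back_translate(texts, translate_fn, batch_size=32):
    """Run `translate_fn` over `texts` batch by batch and flatten the
    results; `translate_fn` maps a list of sentences to a list of
    translations (e.g. the en->fr then fr->en round trip above)."""
    out = []
    for batch in batched(texts, batch_size):
        out.extend(translate_fn(batch))
    return out
```

Batch size is bounded by GPU memory, since padding makes every batch as long as its longest sentence; sorting the dataset by length first reduces wasted padding.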

@Liu-Jingyao

> Maybe this could be of help, I made a small code to make the backtranslations with HuggingFace […]

Thanks! I'll try it as a substitute for the source code.
