word_level_augmentation #101

Open
stellaywu opened this issue Dec 22, 2020 · 6 comments
@stellaywu
Hey, I'm running tf_idf augmentation on the IMDB dataset and noticed that tf_idf-0.7 or tf_idf-0.9 leads to more than half of the tokens in a sentence being replaced. Is that the desired outcome?

Thanks!

@spolavar
spolavar commented Jan 6, 2021

Hi Stella,
I am also trying to test tf-idf augmentation on a new dataset (Twitter data). Did you use word-level augmentation? How are you analyzing the word replacement in the augmented datasets? Are you converting the TensorFlow records to text?

Thank you!

@spolavar
spolavar commented Jan 6, 2021

I was hoping that word_level_augmentation would generate minimally altered sentences, but when I used unif-0.9 and tf_idf-0.9 on the IMDB dataset I got completely gibberish examples. tf_idf is much better, but still nowhere close to a good match to the original example.
Has anyone tested the word_level augmentation before?

@stellaywu
Author

@spolavar I saved the augmented texts to check the results. If you are just trying to do augmentation, maybe nlpaug will help?

I eventually reproduced the results on IMDB and DBPedia with tf_idf and unif augmentation. In my understanding, the value 0.7 is the probability of a token being replaced, so 0.9 will lead to more tokens being replaced than 0.7. For tf-idf this value is a scaler rather than an exact probability, but the larger it is, the more tokens get replaced. For unif the best value was 0.3 according to the paper (and that was indeed the case in my experiments). The reason may be that tf-idf replaces unimportant words, so it is more tolerant of replacement. (Authors, please correct me if I'm wrong.)
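To illustrate the idea discussed above, here is a minimal sketch (not the repo's actual code) of TF-IDF-weighted token replacement: informative (high-TF-IDF) tokens get a low replacement probability, common tokens a high one, and a `strength` knob plays the role of the 0.7/0.9 value. The `idf` dict and `vocab` list are placeholders you would compute from your corpus.

```python
import random
from collections import Counter

def tfidf_replace_probs(tokens, idf, strength=0.7):
    """Per-token replacement probability, inversely related to TF-IDF.
    `strength` scales how aggressively tokens are replaced on average,
    analogous to the 0.7 in tf_idf-0.7."""
    tf = Counter(tokens)
    scores = [tf[t] * idf.get(t, 0.0) for t in tokens]
    max_s = max(scores) or 1.0
    # Invert: informative tokens (high score) get a low replacement prob.
    return [strength * (max_s - s) / max_s for s in scores]

def augment(tokens, idf, vocab, strength=0.7, seed=0):
    """Replace each token with a random vocabulary word with its
    TF-IDF-derived probability."""
    rng = random.Random(seed)
    probs = tfidf_replace_probs(tokens, idf, strength)
    return [rng.choice(vocab) if rng.random() < p else t
            for t, p in zip(tokens, probs)]
```

Under this scheme a larger `strength` raises every token's replacement probability, which matches the observation that tf_idf-0.9 rewrites more of the sentence than tf_idf-0.7.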

@spolavar

@stellaywu thank you for the clarification. I agree with you: when I lowered the probability, the augmented examples were comparable. However, I still haven't managed to test the performance of UDA against the built-in BERT model. My processed data is ready, but I run into TF compatibility issues. I am running the code on TF 2.3.1 and Python 3.6 and have decided to use tf.compat.v1 as a bridge. That allowed me to run the data-processing part, but I am still having issues with the modeling part. Can you share a bit more about how you tackled the code-porting issues and reproduced the results? Thank you again!

@spolavar
spolavar commented Jan 12, 2021

To be specific, when I run the command bash scripts/run_base.sh --max_seq_length=${MAX_SEQ_LENGTH} for the baseline model on my custom data, I get the following error:

  File "uda/text/utils/proc_data_utils.py", line 188, in training_input_fn_builder
    sup_total_data_files = tf.contrib.slim.parallel_reader.get_data_files(
AttributeError: module 'tensorflow.compat.v1' has no attribute 'contrib'

This is triggered by the training input function trying to read the tensorflow data files from main.py

train_input_fn = proc_data_utils.training_input_fn_builder(
        FLAGS.sup_train_data_dir,
        FLAGS.unsup_data_dir,
        FLAGS.aug_ops,
        FLAGS.aug_copy,
        FLAGS.unsup_ratio)

Any ideas how to handle the tf.contrib error?
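One possible workaround, sketched below (this is not the repo's official fix): tf.contrib was removed in TF 2.x, and slim's get_data_files essentially expands a file pattern, a comma-separated string of patterns, or a list of patterns into a file list. A small stand-in based on that documented behavior can be dropped into proc_data_utils.py in place of the tf.contrib.slim.parallel_reader.get_data_files call:

```python
# Stand-in for tf.contrib.slim.parallel_reader.get_data_files (removed with
# tf.contrib in TF 2.x). A sketch based on slim's documented behavior, not
# the repo's own fix.
try:
    from tensorflow.io import gfile   # preferred: also resolves gs:// paths
    _glob = gfile.glob
except ImportError:
    from glob import glob as _glob    # local-filesystem fallback

def get_data_files(data_sources):
    """Return the list of files matching data_sources (a glob pattern,
    a comma-separated string of patterns, or a list of patterns)."""
    if isinstance(data_sources, (list, tuple)):
        files = [f for src in data_sources for f in get_data_files(src)]
    elif "," in data_sources:
        files = get_data_files(data_sources.split(","))
    elif "*" in data_sources or "?" in data_sources or "[" in data_sources:
        files = _glob(data_sources)
    else:
        files = [data_sources]
    if not files:
        raise ValueError("No data files found in %s" % (data_sources,))
    return files
```

With this defined, the line in proc_data_utils.py can call get_data_files(...) directly instead of tf.contrib.slim.parallel_reader.get_data_files(...).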

@stellaywu
Author

@spolavar what is your TF version? The code only supports TF 1.
Try adding this as your TensorFlow import:

    import tensorflow as tf
    import tensorboard as tb
    tf.io.gfile = tb.compat.tensorflow_stub.io.gfile
