AssertionError: Spacy and Stanza word mismatch #7

Open · AnnaWegmann opened this issue Aug 29, 2024 · 2 comments


AnnaWegmann commented Aug 29, 2024

Hi! Maybe you can help me with the following:

After creating a conda environment with

conda create --name aw_value python=3.10.13
conda activate aw_value
pip install value-nlp
pip install datasets==2.20.0

I am calling

export TASK_NAME=sst2
export PYTHONHASHSEED=1234
python run_glue.py --model_name_or_path roberta-base --task_name $TASK_NAME --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 --output_dir output/$TASK_NAME/roberta_base --dialect "aave" --morphosyntax --do_train

and get the error

  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 743, in <module>
    main()
  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 553, in main
    raw_datasets = raw_datasets.map(
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/dataset_dict.py", line 869, in map
    {
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/dataset_dict.py", line 870, in <dictcomp>
    k: dataset.map(
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3161, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3421, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 517, in preprocess_function
    conversions1 = [
  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 518, in <listcomp>
    dialect.convert_sae_to_dialect(example)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/multivalue/BaseDialect.py", line 193, in convert_sae_to_dialect
    self.update(string)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/multivalue/BaseDialect.py", line 218, in update
    self.coref_clusters = self.create_coref_cluster(string)
File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/multivalue/BaseDialect.py", line 237, in create_coref_cluster
    assert [tok.text for tok in tokens] == [
AssertionError: Spacy and Stanza word mismatch
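
For reference, the failing assertion appears to compare spaCy's and Stanza's tokenizations of the same string. Here is a minimal sketch of the equivalent check (my reconstruction, not the library's exact code; it assumes en_core_web_sm and the Stanza English model have been downloaded, and the example sentence is just an illustration):

import spacy
import stanza

# Assumes both English models are available:
#   python -m spacy download en_core_web_sm
#   python -c "import stanza; stanza.download('en')"
nlp_spacy = spacy.load("en_core_web_sm")
nlp_stanza = stanza.Pipeline("en", processors="tokenize")

text = "Y'all shouldn't've done that."  # contractions are a typical divergence point

spacy_tokens = [tok.text for tok in nlp_spacy(text)]
stanza_tokens = [word.text
                 for sent in nlp_stanza(text).sentences
                 for word in sent.words]

# This mirrors the check that raises in create_coref_cluster: the two
# pipelines must produce identical token sequences for the same input.
assert spacy_tokens == stanza_tokens, "Spacy and Stanza word mismatch"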

Any experience with this error? Does run_glue.py still work for you in your environment? I also had to delete the mapping argument in AfricanAmericanVernacular(mapping, ...).

FYI: I renamed run_glue.py to run_value.py.
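
Concretely, by "delete the mapping" I mean instantiating the dialect without that argument, roughly like this (a sketch assuming the class is exposed via multivalue.Dialects; the example sentence is just an illustration):

from multivalue import Dialects

# Instantiate without the old positional `mapping` argument, which the
# pip-installed version no longer seems to accept (assumption from my env).
dialect = Dialects.AfricanAmericanVernacular()

# convert_sae_to_dialect is the method shown in the traceback above.
print(dialect.convert_sae_to_dialect("I talked with them yesterday."))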

@AnnaWegmann (Author)

My current goal is to run VALUE with a model I trained. Would you recommend the source install of this repo, or the older https://github.com/SALT-NLP/value/ repo?

Helw150 (Member) commented Oct 8, 2024

Hi Anna!

Thanks for your patience - we don't get a ton of issues so I'm not in the habit of checking!

This repo is updated beyond the original VALUE repo, so it's the preferred one. The main reason you'd want to use the original VALUE repo is if you were trying to reproduce the experiments from that paper exactly.

Let me look into the bug you are seeing and try to resolve it!
