Cannot reproduce the results of Llama7B dora_r32. #14

Open
xiaoshingshing2 opened this issue Jul 2, 2024 · 7 comments

xiaoshingshing2 commented Jul 2, 2024

First of all, using the official checkpoint is fine: the result on BoolQ is 69.63, while the official result is 69.7.

However, when I try to reproduce the results, I encounter two problems.

The first is about Llama7B dora_r32 without dora_simple. I changed three settings in llama_7B_Dora.sh: micro_batch_size from 16 to 4, learning_rate from 2e-4 to 1e-4, and added --dora_simple False to avoid using dora_simple. I ran sh llama_7B_Dora.sh 32 64 ./finetuned_result/dora_r32 0, and the results are

BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average
--- | --- | --- | --- | --- | --- | --- | --- | ---
69.3 | 78.9 | 78.3 | 54.3 | 80.0 | 82.6 | 66.1 | 81.0 | 73.8

which are worse than the official results.

The second problem is that when I remove --dora_simple False to accelerate training with dora_simple, the results are even worse (see the sketch after the table for my understanding of what dora_simple changes):

BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average
--- | --- | --- | --- | --- | --- | --- | --- | ---
32.9 | 75.5 | 71.8 | 9.9 | 41.3 | 81.9 | 66.3 | 75.8 | 56.9
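
For context, my understanding of what dora_simple toggles (based on the DoRA paper's note about treating the weight norm as a constant to reduce training overhead; the actual code in this repo may differ) is roughly the sketch below, where detaching the per-column norm from the autograd graph is the only difference:

```python
import torch

def dora_weight(w0, lora_A, lora_B, magnitude, simple=True):
    # Illustrative sketch of the DoRA weight composition, not the repo's code:
    # magnitude * column-wise direction of (W0 + B @ A).
    delta = lora_B @ lora_A                          # low-rank update, same shape as w0
    directed = w0 + delta
    norm = directed.norm(p=2, dim=0, keepdim=True)   # per-column L2 norm
    if simple:
        # My understanding of --dora_simple: treat the norm as a constant,
        # so no gradient flows through it (faster, less memory).
        norm = norm.detach()
    return magnitude * directed / norm
```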
xiaoshingshing2 (Author) commented Jul 2, 2024

These are the training log and adapter_config with --dora_simple False:
trainer_state.json
adapter_config.json

xiaoshingshing2 changed the title from "Cannot reproduce the results of Llama7B." to "Cannot reproduce the results of Llama7B dora_r32." on Jul 2, 2024
nbasyl (Collaborator) commented Jul 2, 2024

Did you install all the packages following requirements.txt?

xiaoshingshing2 (Author) commented Jul 3, 2024

Hi, I did not install bitsandbytes, and my PyTorch version is 2.1.0. The transformers package was installed with pip install transformers==4.36.0. The other packages are the same as in requirements.txt.

Will that hurt the performance?

The packages I use are listed below:

Package Version


accelerate 0.25.0
aiofiles 23.2.1
aiohttp 3.9.5
aiosignal 1.3.1
altair 5.3.0
annotated-types 0.7.0
anyio 4.4.0
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
black 23.12.0
certifi 2024.6.2
charset-normalizer 3.3.2
click 8.1.7
cmake 3.29.6
contourpy 1.2.1
cycler 0.12.1
datasets 2.15.0
decorator 5.1.1
dill 0.3.7
dnspython 2.6.1
email_validator 2.2.0
exceptiongroup 1.2.1
executing 2.0.1
fastapi 0.111.0
fastapi-cli 0.0.4
ffmpy 0.3.2
filelock 3.15.4
fire 0.5.0
fonttools 4.53.0
frozenlist 1.4.1
fsspec 2023.10.0
gradio 4.9.0
gradio_client 0.7.2
h11 0.14.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.23.4
idna 3.7
importlib_resources 6.4.0
ipython 8.25.0
jedi 0.19.1
Jinja2 3.1.4
jsonschema 4.22.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lit 18.1.8
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.9.0
matplotlib-inline 0.1.7
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
mypy-extensions 1.0.0
networkx 3.3
numpy 1.26.4
orjson 3.10.5
packaging 24.1
pandas 2.2.2
parso 0.8.4
pathspec 0.12.1
pexpect 4.9.0
pillow 10.3.0
pip 24.0
platformdirs 4.2.2
prompt_toolkit 3.0.47
protobuf 5.27.2
psutil 6.0.0
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
pydub 0.25.1
Pygments 2.18.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytorch-triton-rocm 2.1.0
pytz 2024.1
PyYAML 6.0.1
referencing 0.35.1
regex 2024.5.15
requests 2.32.3
rich 13.7.1
rpds-py 0.18.1
safetensors 0.4.3
scipy 1.11.4
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 69.5.1
shellingham 1.5.4
six 1.16.0
sniffio 1.3.1
stack-data 0.6.3
starlette 0.37.2
sympy 1.12.1
termcolor 2.4.0
tokenize-rt 5.2.0
tokenizers 0.15.2
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.1.0+rocm5.6
tqdm 4.66.4
traitlets 5.14.3
transformers 4.36.0
typer 0.12.3
typing_extensions 4.12.2
tzdata 2024.1
ujson 5.10.0
urllib3 2.2.2
uvicorn 0.30.1
uvloop 0.19.0
watchfiles 0.22.0
wcwidth 0.2.13
websockets 11.0.3
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
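
As a quick sanity check, the installed versions can be diffed against requirements.txt with something like the sketch below (it assumes the file uses plain name==version pins and skips any other lines):

```python
from importlib.metadata import version, PackageNotFoundError

# Report every package whose installed version differs from the pin in requirements.txt.
with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip comments and non-pinned requirements
        name, wanted = line.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "<not installed>"
        if installed != wanted:
            print(f"{name}: requirements.txt wants {wanted}, installed {installed}")
```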

I am trying to use exactly the same packages as in requirements.txt, and will update my results when the fine-tuning and testing processes finish.

xiaoshingshing2 (Author) commented Jul 4, 2024

I used exactly the packages in requirements.txt, and the results are:

BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average
--- | --- | --- | --- | --- | --- | --- | --- | ---
68.7 | 83.3 | 79.4 | 85.5 | 81.3 | 80.8 | 66.0 | 78.8 | 78.0

which leaves a 2.1% average accuracy gap from the results reported in the README.

xiaoshingshing2 (Author) commented Jul 8, 2024

New updates:

I used exactly the packages in requirements.txt, and the results for r = 8, 16, 32, and 64 still show a gap from those reported in the README, while the result for r = 4 is better.

Average acc:

r | original | reproduce
--- | --- | ---
4 | 61.9 | 65.2
8 | 77.9 | 72.5
16 | 77.5 | 62.7
32 | 78.4 | 78.0
64 | 76.8 | 76.3

Is this a normal result?

zhanqiqi77 commented

@xiaoshingshing2 I have encountered a similar issue. Did you manage to resolve it? Could you provide your package versions?

xiaoshingshing2 (Author) commented

> @xiaoshingshing2 I have encountered a similar issue. Did you manage to resolve it? Could you provide your package versions?

In the latest update, I used exactly the packages in requirements.txt with the same versions, and I still have the problem.
