Checkpoint path should be absolute #111
Comments
As the error notes, the checkpoint path should be absolute.
This is for compatibility with cloud storage.
@luyug I am using a Google Cloud TPU v4-8 as suggested. Can you help with the issue?
change to something like
@MXueguang Thanks, that resolved the issue. I am now facing a problem with encoding. I am using the command below to encode msmarco: python -m tevatron.tevax.experimental.mp.encode However, it does not save the embeddings at the output path. Please find the screenshot below. @MXueguang @luyug can you help me with this?
At training time I am getting this error.
Code:
python -m tevatron.tevax.experimental.mp.train_lora \
  --checkpoint_dir retriever-mistral-jax \
  --train_file Tevatron/msmarco-passage-aug \
  --model_name mistralai/Mistral-7B-v0.1 \
  --model_type mistral \
  --batch_size 128 \
  --num_target_passages 16 \
  --learning_rate 1e-4 \
  --seed 12345 \
  --mesh_shape 1 -1 \
  --weight_decay 0.00001 \
  --num_epochs 1 \
  --max_query_length 64 \
  --max_passage_length 128 \
  --pooling eos \
  --scale_by_dim True \
  --grad_cache \
  --passage_num_chunks 32 \
  --query_num_chunks 4
Error:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 394, in <module>
main()
File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 375, in main
checkpoint_manager.save(
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 515, in save
self._checkpointers[k].save(item_dir, item, **kwargs)
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/async_checkpointer.py", line 281, in save
commit_ops = asyncio.run(self._handler.async_save(tmpdir, args=ckpt_args))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 835, in async_save
commit_futures = await asyncio.gather(*serialize_ops)
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1376, in serialize
tspec = self._get_json_tspec_write(
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1273, in _get_json_tspec_write
tspec = self._get_json_tspec(
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1253, in _get_json_tspec
tspec: Dict[str, Any] = get_tensorstore_spec(
File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 821, in get_tensorstore_spec
raise ValueError(f'Checkpoint path should be absolute. Got {directory}')
ValueError: Checkpoint path should be absolute. Got retriever-mistral-jax/0.orbax-checkpoint-tmp-1711610071493337/lora.orbax-checkpoint-tmp-1711610103114668
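The traceback shows the relative value `retriever-mistral-jax` reaching orbax unchanged, which is what triggers the ValueError. One way to avoid this is to resolve the directory to an absolute path before it is handed to the checkpoint manager. The following is only a minimal sketch: `resolve_checkpoint_dir` is a hypothetical helper, not part of Tevatron or orbax, and treating anything containing `://` as a cloud storage URI is an assumption made for illustration.

```python
import os


def resolve_checkpoint_dir(path: str) -> str:
    """Return an absolute checkpoint directory.

    Hypothetical helper, not part of Tevatron or orbax.
    Assumption: values containing "://" (for example gs://bucket/ckpt)
    are cloud storage URIs and are returned unchanged.
    """
    if "://" in path:
        return path
    # Expand "~" and resolve relative paths against the current directory.
    return os.path.abspath(os.path.expanduser(path))


# The relative value from the command in this issue becomes absolute.
print(resolve_checkpoint_dir("retriever-mistral-jax"))
```

Passing an already absolute directory to `--checkpoint_dir` on the command line, for example one built from `$(pwd)`, should have the same effect without any code change.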