We follow the BOLAA environment to design the HotpotQA benchmark, and we designed the individual agent with AgentLite. Run the evaluation with:

```shell
pip install joblib
cd hotpotqa
python evaluate_hotpot_qa.py --llm gpt-4-0613 --agent_arch act
```
We follow the AgentBoard environment to set up the WebShop benchmark, and we designed the individual agent via AgentLite with Search and Click actions.
- Follow the setup in AgentBoard to run WebShop in the backend. Go to your home directory first and run the following scripts from AgentBoard:

  ```shell
  conda create -n agentboard python=3.8.13  # python version should be 3.8.13
  conda activate agentboard
  git clone https://github.com/hkust-nlp/AgentBoard.git
  cd AgentBoard
  mkdir data
  wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
  tar -zxvf data.tar.gz
  INSTALL_WEBARENA=false bash ./setup.sh
  cd ./agentboard/environment/WebShop
  bash ./run_dev.sh
  ```

  It is highly recommended to run WebShop in the background with tmux.
- Since AgentLite uses a different Python version, you should create a new environment for AgentLite.
- Run the AgentLite evaluation in this folder with:

  ```shell
  cd webshop
  python evaluate_webshop.py --llm gpt-4-0613 --agent_arch act
  ```
We follow the AgentBoard environment to set up the tool-query benchmark, and we designed the individual agent via AgentLite with all the corresponding function calls as actions. You should first get a `data/tool-query` folder, which is a copy of the data from AgentBoard:

```shell
cp -rf [AgentBoardPath]/data/tool-query ./tool/
```

Then you can directly run the evaluation in this folder with:

```shell
cd tool
python evaluate_tools.py --llm gpt-4-0613 --agent_arch react
```
We follow the AgentBoard environment to set up the tool-operation benchmark, and we designed the individual agent via AgentLite with all the corresponding function calls as actions. You should first get a `data/tool-operation` folder, which is a copy of the data from AgentBoard:

```shell
cp -rf [AgentBoardPath]/data/tool-operation ./tool-operation/data/
```
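Before running, it can help to confirm that the copied data landed where the evaluation expects it. A minimal sanity-check sketch, where the checked path is an assumption based on the copy command above:

```shell
# Sanity check (sketch): the path is assumed from the cp command above,
# not taken from the evaluation code itself.
data_dir="./tool-operation/data/tool-operation"
if [ -d "$data_dir" ]; then
  echo "data ready: $data_dir"
else
  echo "missing $data_dir; re-run the copy step" >&2
fi
```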
Then you can directly run the evaluation in this folder with:

```shell
cd tool-operation
python evaluate_tool_operation.py --llm gpt-4-32k --agent_arch react
```
You can substitute `--agent_arch` with different architectures as in BOLAA, including `react`, `act`, `planact`, `planreact`, `zs`, `zst`, and `bolaa`. The multi-agent architecture for HotpotQA will be released soon.
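The single-agent architectures can be swept in one shell loop. A minimal sketch (the script and LLM names follow the WebShop example above; the loop only echoes the commands so you can review them before running):

```shell
# Sweep the single-agent architectures from BOLAA (sketch; substitute the
# evaluation script and LLM you are actually using).
ARCHS="react act planact planreact zs zst"
for arch in $ARCHS; do
  echo "python evaluate_webshop.py --llm gpt-4-0613 --agent_arch $arch"
done
```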
Note that the `bolaa` implementation is slightly different from the original paper, because the communication template in the AgentLite implementation is different. You can change it to fit your best practice.
We suggest using a URL-based API for inference, such as FastChat, to get responses from a local model. We provide example code lines in the benchmark evaluation code; you only need to change `base_url` and `llm_name` for your local model inference.
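As an illustration, the two values to change can be kept together in one place. A minimal sketch, assuming FastChat's OpenAI-compatible server on its default local port and an example model name (both the URL and the model name are assumptions, not values from the benchmark code):

```shell
# Local-inference settings (sketch): base_url assumes FastChat's
# OpenAI-compatible server on localhost:8000; the model name is an example.
base_url="http://localhost:8000/v1"
llm_name="vicuna-7b-v1.5"
echo "set base_url=$base_url and llm_name=$llm_name in the evaluation code"
```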