This guide provides detailed instructions for converting various datasets from their public sources into our required format, including LLaVA-In-Context, Dense Captions, Visual Storytelling, TV Captions, Scene Navigation, Spot The Difference, and EGO4D. By following the steps below, users can easily set up these datasets. The output for each dataset is saved to a JSON file named <dataset_name>.json in the output folder.
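After any of the conversion commands below finishes, you can sanity-check the resulting file with a short script such as the following. This is a minimal sketch, not part of the converter; it only assumes the output is a JSON file in the output folder (output/LA.json below is just an example).

```python
import json
import os

def check_output(path):
    """Load a converted JSON file and report its size and top-level shape."""
    size_mb = os.path.getsize(path) / 1e6
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # len() works whether the file holds a list of samples or a dict of them.
    print(f"{path}: {size_mb:.1f} MB, {len(data)} top-level entries, type={type(data).__name__}")

check_output("output/LA.json")  # e.g. after the LLaVA-In-Context conversion
```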
Download the COCO 2017 images (COCO 2014 may also work) and put them in a folder at the path <image_root>. Download the metadata file containing the training image IDs and put it at the path <meta>.
The folder structure should be like this:
```
<image_root>/
    annotations/
    val2017/
    train2017/
        000000498792.jpg
        XXXXXXXXXXXX.jpg
        ...
```
Run the following command (the --num_threads flag is optional):
```
python main.py --name=2d.Llava --image_path=<meta> --image_root=<image_root>/train2017 [--num_threads=<num_threads>]
```
The output will be saved in output/LA.json.
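If the conversion reports missing images, the usual cause is a mismatch between the IDs in <meta> and the files under <image_root>/train2017. Below is a hedged sketch for checking this, assuming the meta file is a JSON list of COCO image IDs (adjust the loading step if your meta file uses a different layout):

```python
import json
import os

image_root = "<image_root>/train2017"  # same value as --image_root
meta_path = "<meta>"                   # same value as --image_path

# Assumption: the meta file is a JSON list of COCO image IDs.
with open(meta_path, "r", encoding="utf-8") as f:
    image_ids = json.load(f)

# COCO 2017 files are named as zero-padded 12-digit IDs, e.g. 000000498792.jpg.
missing = [i for i in image_ids
           if not os.path.exists(os.path.join(image_root, f"{int(i):012d}.jpg"))]
print(f"{len(missing)} of {len(image_ids)} meta images missing from {image_root}")
```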
Download the Dense Captions videos from ActivityNet and put them in a folder at the path <image_path>.
The folder structure should be like this:
```
<image_path>/
    <video_id>.mp4
    ...
```
Run the following command:
```
python main.py --name=video.DenseCaptions --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/DC.json.
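Corrupt or partially downloaded ActivityNet videos are a common failure mode. The following minimal sketch (assuming opencv-python is installed) confirms that each .mp4 can be opened and decoded:

```python
import glob
import os

import cv2  # pip install opencv-python

image_path = "<image_path>"  # folder containing the <video_id>.mp4 files

for video in sorted(glob.glob(os.path.join(image_path, "*.mp4"))):
    cap = cv2.VideoCapture(video)
    ok = cap.isOpened() and cap.read()[0]  # can we decode at least one frame?
    cap.release()
    if not ok:
        print("unreadable:", video)
```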
Download the Visual Storytelling Dataset and extract train.story-in-sequence.json to a path. Let <json_path> be the path of the JSON file, and run the following command:
```
python main.py --name=video.VisualStoryTelling --image_path=<json_path> [--num_threads=<num_threads>]
```
The output will be saved in output/VST.json.
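If you are unsure whether the extracted file is the right one, you can peek at its top-level structure first. A minimal sketch that assumes only that the file is a JSON object; it prints the key names rather than presuming a schema:

```python
import json

json_path = "<json_path>"  # path to train.story-in-sequence.json

with open(json_path, "r", encoding="utf-8") as f:
    sis = json.load(f)

# Report each top-level section and its size without assuming a schema.
for key, value in sis.items():
    size = len(value) if hasattr(value, "__len__") else value
    print(key, "->", size)
```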
Download the TV Captions video frames (3 FPS) and extract the zip to a path. Let <image_path> be the path of the extracted folder.
The folder structure should be like this:
```
<image_path>/
    bbt_frames/
        ...
    castle_frames/
        ...
    house_frames/
        ...
    met_frames/
        ...
```
Run the following command:
```
python main.py --name=video.TVCaptions --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/TV.json.
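Before running the converter, you can confirm the zip extracted to the expected layout. A minimal sketch that checks for the show folders listed above (extend the list if your extraction contains additional shows):

```python
import os

image_path = "<image_path>"
expected = ["bbt_frames", "castle_frames", "house_frames", "met_frames"]

for folder in expected:
    full = os.path.join(image_path, folder)
    print(f"{full}: {'ok' if os.path.isdir(full) else 'MISSING'}")
```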
Download the ScanNet v2 dataset from the official website and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    scene0000_00/
        color/
            000000.jpg
            ...
        ...
    ...
```
Run the following command:
```
python main.py --name=3d.SceneNavigation --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/SN.json.
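Since ScanNet scenes are exported individually, it is easy to end up with a scene whose color/ folder is empty. A minimal sketch that counts the color frames per scene:

```python
import glob
import os

image_path = "<image_path>"  # root folder containing the scene*/ directories

for scene in sorted(glob.glob(os.path.join(image_path, "scene*"))):
    if not os.path.isdir(scene):
        continue
    frames = glob.glob(os.path.join(scene, "color", "*.jpg"))
    print(os.path.basename(scene), len(frames), "color frames")
```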
Download the Spot The Difference Dataset from Google Drive and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    <image>.jpg
    ...
```
Run the following command:
```
python main.py --name=change.SpotTheDifference --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/SD.json.
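Spot The Difference samples are built from image pairs, so an incomplete download typically shows up as images with no partner. Below is a sketch of a pairing check; the naming scheme (a second view suffixed with _2) is an assumption for illustration, so adjust it to match the files in your copy:

```python
import glob
import os

image_path = "<image_path>"
images = glob.glob(os.path.join(image_path, "*.jpg"))

# Assumption (hypothetical naming): the second image of a pair ends in "_2".
firsts = [p for p in images if not p.endswith("_2.jpg")]
unpaired = [p for p in firsts if not os.path.exists(p[:-4] + "_2.jpg")]
print(f"{len(images)} images, {len(firsts)} first views, {len(unpaired)} without a _2 partner")
```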
Download the COCO 2017 train dataset from the COCO website and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    <image>.jpg
    ...
```
Run the following command:
```
python main.py --name=change.CocoGeneralDifference --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/CGD.json.
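This step reuses the flat COCO train2017 image folder, so a quick decode check over a random sample can catch truncated downloads before the converter does. A minimal sketch, assuming Pillow is installed:

```python
import glob
import os
import random

from PIL import Image  # pip install Pillow

image_path = "<image_path>"
images = glob.glob(os.path.join(image_path, "*.jpg"))
sample = random.sample(images, min(100, len(images)))

for path in sample:
    try:
        with Image.open(path) as im:
            im.verify()  # cheap integrity check without a full decode
    except Exception as exc:
        print("corrupt:", path, exc)
print(f"checked {len(sample)} of {len(images)} images")
```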
Download the EGO4D dataset and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    <video_id>.mp4
    ...
```
Run the following command:
```
python main.py --name=fpv.EGO4D --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/E4D.json.
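EGO4D downloads are large, so a quick inventory helps confirm the folder is complete before you start a long conversion. A minimal sketch:

```python
import glob
import os

image_path = "<image_path>"
videos = glob.glob(os.path.join(image_path, "*.mp4"))
total_gb = sum(os.path.getsize(v) for v in videos) / 1e9
print(f"{len(videos)} videos, {total_gb:.1f} GB total in {image_path}")
```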