This guide provides detailed instructions for converting various datasets from their public sources into our required format, including LLaVA-In-Context, Dense Captions, Visual Storytelling, TV Captions, Scene Navigation, Spot The Difference, and EGO4D. By following the steps below, users can easily set up these datasets. The output for each dataset is saved to a JSON file named <dataset_name>.json in the output folder.
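After any of the conversion commands below finishes, you can sanity-check the resulting file with a short script such as the following. This is a minimal sketch, not part of the converter; it only assumes the output is a JSON file in the output folder (output/LA.json below is just an example).

```python
import json
import os

def check_output(path):
    """Load a converted JSON file and report its size and top-level shape."""
    size_mb = os.path.getsize(path) / 1e6
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # len() works whether the file holds a list of samples or a dict of them.
    print(f"{path}: {size_mb:.1f} MB, {len(data)} top-level entries, type={type(data).__name__}")

check_output("output/LA.json")  # e.g. after the LLaVA-In-Context conversion
```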
Download the COCO 2017 images (COCO 2014 may also work) and put them in a folder at the path <image_root>. Download the metadata file containing the training image IDs and put it at the path <meta>.
The folder structure should be like this:
```
<image_root>/
    annotations/
    val2017/
    train2017/
        000000498792.jpg
        XXXXXXXXXXXX.jpg
        ...
```
Run the following command (the --num_threads flag is optional):
```
python main.py --name=2d.Llava --image_path=<meta> --image_root=<image_root>/train2017 [--num_threads=<num_threads>]
```
The output will be saved in output/LA.json.
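If the conversion reports missing images, the usual cause is a mismatch between the IDs in <meta> and the files under <image_root>/train2017. Below is a hedged sketch for checking this, assuming the meta file is a JSON list of COCO image IDs (adjust the loading step if your meta file uses a different layout):

```python
import json
import os

image_root = "<image_root>/train2017"  # same value as --image_root
meta_path = "<meta>"                   # same value as --image_path

# Assumption: the meta file is a JSON list of COCO image IDs.
with open(meta_path, "r", encoding="utf-8") as f:
    image_ids = json.load(f)

# COCO 2017 files are named as zero-padded 12-digit IDs, e.g. 000000498792.jpg.
missing = [i for i in image_ids
           if not os.path.exists(os.path.join(image_root, f"{int(i):012d}.jpg"))]
print(f"{len(missing)} of {len(image_ids)} meta images missing from {image_root}")
```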
Download the Dense Captions videos from ActivityNet and put them in a folder at the path <image_path>.
The folder structure should be like this:
```
<image_path>/
    <video_id>.mp4
    ...
```
Run the following command:
```
python main.py --name=video.DenseCaptions --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/DC.json.
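Corrupt or partially downloaded ActivityNet videos are a common failure mode. The following minimal sketch (assuming opencv-python is installed) confirms that each .mp4 can be opened and decoded:

```python
import glob
import os

import cv2  # pip install opencv-python

image_path = "<image_path>"  # folder containing the <video_id>.mp4 files

for video in sorted(glob.glob(os.path.join(image_path, "*.mp4"))):
    cap = cv2.VideoCapture(video)
    ok = cap.isOpened() and cap.read()[0]  # can we decode at least one frame?
    cap.release()
    if not ok:
        print("unreadable:", video)
```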
Download the Visual Storytelling Dataset and extract train.story-in-sequence.json to a path. Let <json_path> be the path of the JSON file, and run the following command:
```
python main.py --name=video.VisualStoryTelling --image_path=<json_path> [--num_threads=<num_threads>]
```
The output will be saved in output/VST.json.
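If you are unsure whether the extracted file is the right one, you can peek at its top-level structure first. A minimal sketch that assumes only that the file is a JSON object; it prints the key names rather than presuming a schema:

```python
import json

json_path = "<json_path>"  # path to train.story-in-sequence.json

with open(json_path, "r", encoding="utf-8") as f:
    sis = json.load(f)

# Report each top-level section and its size without assuming a schema.
for key, value in sis.items():
    size = len(value) if hasattr(value, "__len__") else value
    print(key, "->", size)
```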
Download the TV Captions video frames (3 FPS) and extract the zip to a path. Let <image_path> be the path of the extracted folder.
The folder structure should be like this:
```
<image_path>/
    bbt_frames/
        ...
    castle_frames/
        ...
    house_frames/
        ...
    met_frames/
        ...
```
Run the following command:
```
python main.py --name=video.TVCaptions --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/TV.json.
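Before running the converter, you can confirm the zip extracted to the expected layout. A minimal sketch that checks for the show folders listed above (extend the list if your extraction contains additional shows):

```python
import os

image_path = "<image_path>"
expected = ["bbt_frames", "castle_frames", "house_frames", "met_frames"]

for folder in expected:
    full = os.path.join(image_path, folder)
    print(f"{full}: {'ok' if os.path.isdir(full) else 'MISSING'}")
```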
Download the ScanNet v2 dataset from the official website and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    scene0000_00/
        color/
            000000.jpg
            ...
        ...
    ...
```
Run the following command:
```
python main.py --name=3d.SceneNavigation --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/SN.json.
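Since ScanNet scenes are exported individually, it is easy to end up with a scene whose color/ folder is empty. A minimal sketch that counts the color frames per scene:

```python
import glob
import os

image_path = "<image_path>"  # root folder containing the scene*/ directories

for scene in sorted(glob.glob(os.path.join(image_path, "scene*"))):
    if not os.path.isdir(scene):
        continue
    frames = glob.glob(os.path.join(scene, "color", "*.jpg"))
    print(os.path.basename(scene), len(frames), "color frames")
```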
Download the Spot The Difference Dataset from Google Drive and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    <image>.jpg
    ...
```
Run the following command:
```
python main.py --name=change.SpotTheDifference --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/SD.json.
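Spot The Difference samples are built from image pairs, so an incomplete download typically shows up as images with no partner. Below is a sketch of a pairing check; the naming scheme (a second view suffixed with _2) is an assumption for illustration, so adjust it to match the files in your copy:

```python
import glob
import os

image_path = "<image_path>"
images = glob.glob(os.path.join(image_path, "*.jpg"))

# Assumption (hypothetical naming): the second image of a pair ends in "_2".
firsts = [p for p in images if not p.endswith("_2.jpg")]
unpaired = [p for p in firsts if not os.path.exists(p[:-4] + "_2.jpg")]
print(f"{len(images)} images, {len(firsts)} first views, {len(unpaired)} without a _2 partner")
```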
Download the COCO 2017 train dataset from the COCO website and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    <image>.jpg
    ...
```
Run the following command:
```
python main.py --name=change.CocoGeneralDifference --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/CGD.json.
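This step reuses the flat COCO train2017 image folder, so a quick decode check over a random sample can catch truncated downloads before the converter does. A minimal sketch, assuming Pillow is installed:

```python
import glob
import os
import random

from PIL import Image  # pip install Pillow

image_path = "<image_path>"
images = glob.glob(os.path.join(image_path, "*.jpg"))
sample = random.sample(images, min(100, len(images)))

for path in sample:
    try:
        with Image.open(path) as im:
            im.verify()  # cheap integrity check without a full decode
    except Exception as exc:
        print("corrupt:", path, exc)
print(f"checked {len(sample)} of {len(images)} images")
```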
Download the EGO4D dataset and let <image_path> be the path of the dataset.
The folder structure should be like this:
```
<image_path>/
    <video_id>.mp4
    ...
```
Run the following command:
```
python main.py --name=fpv.EGO4D --image_path=<image_path> [--num_threads=<num_threads>]
```
The output will be saved in output/E4D.json.
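EGO4D downloads are large, so a quick inventory helps confirm the folder is complete before you start a long conversion. A minimal sketch:

```python
import glob
import os

image_path = "<image_path>"
videos = glob.glob(os.path.join(image_path, "*.mp4"))
total_gb = sum(os.path.getsize(v) for v in videos) / 1e9
print(f"{len(videos)} videos, {total_gb:.1f} GB total in {image_path}")
```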