Dev/onevision (#149)
* claude auto detect json mode

* extract information

* use claude to generate

* fix bugs

* fix

* generate data

* chore: Update dataset name and version for live_bench task

* gpt-4-turbo => gpt-4o

* chore: Update dataset capture settings in create_dataset.py

* everything use gpt-4o

* websites

* livebench_july

* Refactor code to simplify data assignment in example.ipynb

* chore: Update dataset name for live_bench task

* Update lmms_eval configuration to use process_results_use_image and full_docs options

* split tasks

* chore: Update live_bench task dataset names and versions

* feat: Add multi-choice parsing and processing functions

This commit adds two new functions to the `egoschema/utils.py` file: `get_multi_choice_info` and `parse_multi_choice_response`. These functions are used to parse and process multi-choice responses in the Egoschema task.

The `get_multi_choice_info` function extracts information about the available choices from the input document, while the `parse_multi_choice_response` function parses the generated response and returns the predicted index.

These functions are essential for accurately processing multi-choice answers in the Egoschema task.
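As a rough illustration of what such parsing can look like (a minimal sketch; the function names follow the description above, but the exact signatures and matching rules here are assumptions, not the code added to egoschema/utils.py):

import re

def get_multi_choice_info(options):
    # Map option letters ("A", "B", ...) to the corresponding answer texts.
    all_choices = [chr(ord("A") + i) for i in range(len(options))]
    index2ans = {letter: text for letter, text in zip(all_choices, options)}
    return index2ans, all_choices

def parse_multi_choice_response(response, all_choices, index2ans):
    # Prefer an explicit option letter such as "B" or "(B)" in the response.
    for letter in all_choices:
        if re.search(rf"\(?\b{letter}\b\)?", response):
            return letter
    # Otherwise fall back to matching the answer text itself.
    for letter, text in index2ans.items():
        if text and text.lower() in response.lower():
            return letter
    # No recognizable choice found.
    return ""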

* feat: Add regex-based parsing for multi-choice predictions

This commit enhances the `perceptiontest_val_process_results_mc` function in the `utils.py` file. It introduces regex-based parsing to extract the predicted choice from the raw text prediction. If a match is found for A, B, C, or D, the matched letter is used as the prediction. Otherwise, an empty string is set as the prediction.

This improvement ensures accurate processing of multi-choice predictions in the perception test validation.
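The extraction described here amounts to a single regex pass over the raw prediction; a minimal sketch (illustrative only, assuming free-form text that contains at most one standalone option letter):

import re

def extract_choice(raw_pred):
    # Use the first standalone A/B/C/D in the prediction, else an empty string.
    match = re.search(r"\b([ABCD])\b", raw_pred)
    return match.group(1) if match else ""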

Co-authored-by: [Co-author Name] <[[email protected]]>

* refactor: Improve accuracy calculation in perception test validation

This commit refactors the `perceptiontest_val_aggregate_accuracy` function in the `utils.py` file. Instead of comparing the string representations of `answer_id` and `pred_id`, it now directly checks the `correct` field in the `accuracy` dictionary. This change ensures more accurate calculation of the overall accuracy in the perception test validation.
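In effect, the aggregation now averages a per-example correctness flag instead of re-comparing ids; a minimal sketch, with field names assumed from the description above:

def aggregate_accuracy(accuracies):
    # Each entry has a "correct" flag set by the per-example processing step.
    hits = sum(int(item["correct"]) for item in accuracies)
    return hits / len(accuracies) if accuracies else 0.0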

Co-authored-by: [Co-author Name] <[[email protected]]>

* Refactor accuracy calculation in perception test validation

* feat: Add SRT_API model to available models

This commit adds the SRT_API model to the list of available models in the `__init__.py` file. This model can now be used for evaluation and testing purposes.
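The registration itself is typically a single entry in the model registry; a sketch assuming the registry in `lmms_eval/models/__init__.py` is a plain name-to-class mapping (the file's actual contents are not shown in this diff):

AVAILABLE_MODELS = {
    # ... existing model entries ...
    "srt_api": "SRT_API",  # assumed key/class-name pair for the new model
}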

Co-authored-by: [Co-author Name] <[[email protected]]>

* chore: Update live_bench task dataset names and versions

(cherry picked from commit 46a85b40b013503e52d007c96ca0607bd4604a3e)

* refactor: Update question_for_eval key in MathVerseEvaluator

This commit updates the MathVerseEvaluator class so that the "question_for_eval" entry is populated from `inst["question"]` instead of `inst["question_for_eval"]`. This keeps the evaluation input consistent with the question field carried in the results.

Co-authored-by: [Co-author Name] <[[email protected]]>

* refactor: Update generate_submission_file function in mathverse utils

This commit updates the submission-file handling in the `mathverse/utils.py` file. The `problem_version` is now included in the filenames passed to `generate_submission_file`, so results and scores for different problem versions are written to separate files instead of overwriting one another.
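For example, with hypothetical values the filenames produced by this change look like the following (a sketch based on the diff below, not additional code in the repository):

split_flag = "testmini"            # assumed example split
problem_version = "vision_only"    # lower-cased, spaces replaced by "_"
filename = f"mathverse_{split_flag}_{problem_version}_results.json"
# -> "mathverse_testmini_vision_only_results.json"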

Co-authored-by: [Co-author Name] <[[email protected]]>

* refactor: Update default template YAML files

This commit updates the default template YAML files in the `lmms_eval/tasks` directory. It adjusts the `generation_kwargs` and `metadata` sections, moving shared defaults into the template files and adding explicit version metadata, so that task configuration and versioning stay consistent.

Co-authored-by: [Co-author Name] <[[email protected]]>

---------

Co-authored-by: Fanyi Pu <[email protected]>
Co-authored-by: [Co-author Name] <[[email protected]]>
Co-authored-by: kcz358 <[email protected]>
4 people authored Jul 30, 2024
1 parent 5d9e5f6 commit 2ddedaf
Showing 21 changed files with 122 additions and 90 deletions.
2 changes: 2 additions & 0 deletions lmms_eval/api/task.py
@@ -67,8 +67,10 @@ class TaskConfig(dict):
validation_split: str = None
test_split: str = None
fewshot_split: str = None # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
full_docs: bool = False
# formatting / prompting options.
# see docs/advanced_task_guide.md for more info
process_results_use_image: bool = False
process_docs: Callable = None
doc_to_visual: Union[Callable, str] = None
doc_to_text: Union[Callable, str] = None
7 changes: 2 additions & 5 deletions lmms_eval/evaluator.py
@@ -327,7 +327,7 @@ def evaluate(
# hack: remove image columns to speed avoid loading images and speed up postprocessing
# reason: doc_iterator will actually load image if it's in the doc.
docs = task.test_docs() if task.has_test_docs() else task.validation_docs()
if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name and "llava_wilder" not in task_name and "live_bench" not in task_name and "wildvision" not in task_name:
if not task.config["process_results_use_image"]:
remove_cols = []
features = docs.features
# If it is an Image instance or a Sequence of Image instance. Remove it
@@ -340,10 +340,7 @@
docs = docs.remove_columns(remove_cols)

####################### Processing with Full Docs Mode #######################
if task_name in ["videochatgpt_consistency"]:
full_docs = True
else:
full_docs = False
full_docs = task.config["full_docs"]

doc_iterator = itertools.islice(enumerate(docs), lm.rank, limit, lm.world_size)
# Instead of converting the iterator to a list, use `itertools.tee` to create a parallel iterator for counting
@@ -1,4 +1,5 @@
model_specific_prompt_kwargs:
default:
pre_prompt: ""
post_prompt: ""
post_prompt: ""
process_results_use_image: true
33 changes: 5 additions & 28 deletions lmms_eval/tasks/live_bench/live_bench.yaml
100644 → 100755
@@ -1,31 +1,8 @@
dataset_path: lmms-lab/LiveBench
dataset_kwargs:
token: True
task: "live_bench"
test_split: test
dataset_name: 2024-07
output_type: generate_until
doc_to_visual: !function utils.livebench_doc_to_visual
doc_to_text: !function utils.livebench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 1024
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
process_results: !function utils.livebench_process_results
metric_list:
- metric: gpt4_eval_score
aggregation: !function utils.livebench_aggregate_results
higher_is_better: true
# - metric: gpt4_eval_score_mini
# aggregation: !function utils.livebench_aggregate_results
# higher_is_better: true
model_specific_prompt_kwargs:
default:
pre_prompt: ""
post_prompt: ""
group: live_bench
task:
- live_bench_2406
- live_bench_2407

metadata:
api_type : openai
eval_with_mini: false
3 changes: 3 additions & 0 deletions lmms_eval/tasks/live_bench/live_bench_2406.yaml
@@ -0,0 +1,3 @@
task: "live_bench_2406"
dataset_name: 2024-06
include: live_bench_template_yaml
3 changes: 3 additions & 0 deletions lmms_eval/tasks/live_bench/live_bench_2407.yaml
@@ -0,0 +1,3 @@
task: "live_bench_2407"
dataset_name: 2024-07
include: live_bench_template_yaml
28 changes: 28 additions & 0 deletions lmms_eval/tasks/live_bench/live_bench_template_yaml
@@ -0,0 +1,28 @@
dataset_path: lmms-lab/LiveBench
dataset_kwargs:
token: True
test_split: test
dataset_name: 2024-07
output_type: generate_until
doc_to_visual: !function utils.livebench_doc_to_visual
doc_to_text: !function utils.livebench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 1024
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
process_results: !function utils.livebench_process_results
process_results_use_image: true
metric_list:
- metric: gpt4_eval_score
aggregation: !function utils.livebench_aggregate_results
higher_is_better: true
# - metric: gpt4_eval_score_mini
# aggregation: !function utils.livebench_aggregate_results
# higher_is_better: true
model_specific_prompt_kwargs:
default:
pre_prompt: ""
post_prompt: ""
1 change: 1 addition & 0 deletions lmms_eval/tasks/llava_wilder/_default_template_wilder_yaml
@@ -9,6 +9,7 @@ generation_kwargs:
num_beams: 1
do_sample: false
process_results: !function utils.llava_process_results
process_results_use_image: true
metric_list:
- metric: gpt_eval_llava_all
aggregation: !function utils.llava_all_aggregation
2 changes: 1 addition & 1 deletion lmms_eval/tasks/mathverse/mathverse_evals.py
@@ -265,7 +265,7 @@ def eval_results(self, results, config):
problem = {
"question_type": inst["question_type"],
"answer": inst["answer"] if "answer" in inst else None,
"question_for_eval": inst["question_for_eval"],
"question_for_eval": inst["question"],
}
if config["metadata"].get("trunk_response", -1) > 0:
prediction = " ".join(full_prediction.split(" ")[-config["metadata"]["trunk_response"] :])
10 changes: 10 additions & 0 deletions lmms_eval/tasks/mathverse/mathverse_testmini_vision.yaml
@@ -0,0 +1,10 @@
group: mathverse_testmini_vision
task:
- mathverse_testmini_vision_intensive
- mathverse_testmini_vision_dominant
- mathverse_testmini_vision_only
metadata:
version: 0.0
gpt_eval_model_name: "gpt-3.5-turbo"
trunk_response: 30
quick_match: false
7 changes: 4 additions & 3 deletions lmms_eval/tasks/mathverse/utils.py
@@ -75,18 +75,19 @@ def mathverse_aggregate_results_submission(results, args, *, calculate_gain=False, random_scores=None):

def mathverse_aggregate_results_eval(results, args, *, calculate_gain=False, random_scores=None):
split_flag = results[0]["metadata"]["split"]
problem_version = results[0]["metadata"]["problem_version"].lower().replace(" ", "_")
# save the result first, in case the gpt evaluation fails
path = generate_submission_file(f"mathverse_{split_flag}_results.json", args)
path = generate_submission_file(f"mathverse_{split_flag}_{problem_version}_results.json", args)
with open(path, "w") as f:
json.dump(results, f, indent=4)
# gpt evaluation
results_dict, scores = mathverse_evaluator.eval_results(results, config)
# save results
path = generate_submission_file(f"mathverse_{split_flag}_results.json", args)
path = generate_submission_file(f"mathverse_{split_flag}_{problem_version}_results.json", args)
with open(path, "w") as f:
json.dump(results_dict, f, indent=4)
# save scores
path = generate_submission_file(f"mathverse_{split_flag}_scores.json", args)
path = generate_submission_file(f"mathverse_{split_flag}_{problem_version}_scores.json", args)
with open(path, "w") as f:
json.dump(scores, f, indent=4)
eval_logger.info(f"Saved scores to {path}")
6 changes: 6 additions & 0 deletions lmms_eval/tasks/mmmu/_default_template_yaml
@@ -0,0 +1,6 @@
generation_kwargs:
max_new_tokens: 16

metadata:
version: 0.0
interleaved_format: false
9 changes: 3 additions & 6 deletions lmms_eval/tasks/mmmu/mmmu_test.yaml
@@ -7,13 +7,10 @@ doc_to_text: !function utils.mmmu_doc_to_text
doc_to_target: "answer"
# The return value of process_results will be used by metrics
process_results: !function utils.mmmu_process_results
# Note that the metric name can be either a registed metric function (such as the case for GQA) or a key name returned by process_results
generation_kwargs:
max_new_tokens: 16
image_aspect_ratio: original

metric_list:
- metric: submission
aggregation: !function utils.mmmu_test_aggregate_results_for_submission
higher_is_better: true
metadata:
- version: 0.0

include: _default_template_yaml
11 changes: 3 additions & 8 deletions lmms_eval/tasks/mmmu/mmmu_val.yaml
@@ -7,15 +7,10 @@ doc_to_text: !function utils.mmmu_doc_to_text
doc_to_target: "answer"
# The return value of process_results will be used by metrics
process_results: !function utils.mmmu_process_results
# Note that the metric name can be either a registed metric function (such as the case for GQA) or a key name returned by process_results
generation_kwargs:
max_new_tokens: 128
model_specific_generation_kwargs:
llava:
image_aspect_ratio: original

metric_list:
- metric: mmmu_acc
aggregation: !function utils.mmmu_aggregate_results
higher_is_better: true
metadata:
- version: 0.0

include: _default_template_yaml
27 changes: 20 additions & 7 deletions lmms_eval/tasks/mmmu/utils.py
@@ -5,7 +5,8 @@
import numpy as np
import os
import json

from pathlib import Path
import yaml

from lmms_eval.tasks._task_utils.file_utils import generate_submission_file

@@ -14,13 +15,23 @@
MULTI_CHOICE_PROMPT = "Answer with the option's letter from the given choices directly."
OPEN_ENDED_PROMPT = "Answer the question using a single word or phrase."

with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)

config = yaml.safe_load("".join(safe_data))


def replace_images_tokens(input_string):
# for i in range(1, 8):
# question_text = f"<image {i}>"
# query_text = "<image>"
# if question_text in input_string:
# input_string = input_string.replace(question_text, query_text)
for i in range(1, 8):
question_text = f"<image {i}>"
query_text = "<image>"
if question_text in input_string:
input_string = input_string.replace(question_text, query_text)
return input_string


@@ -44,7 +55,9 @@ def construct_prompt(doc):

def mmmu_doc_to_text(doc):
question = construct_prompt(doc)
return replace_images_tokens(question)
if config["metadata"]["interleaved_format"]:
question = replace_images_tokens(question)
return question


def mmmu_doc_to_visual(doc):
3 changes: 3 additions & 0 deletions lmms_eval/tasks/nextqa/_default_template_yaml
@@ -3,3 +3,6 @@ dataset_kwargs:
token: True
video: True
cache_dir: nextqa
metadata:
version: 0.0.1
load_package: False
42 changes: 17 additions & 25 deletions lmms_eval/tasks/nextqa/utils.py
@@ -1,40 +1,15 @@
import os
import yaml

import random
import pandas as pd

from pathlib import Path

from loguru import logger as eval_logger

try:
from pywsd.utils import lemmatize_sentence
except ImportError:
eval_logger.debug("pywsd not installed. Please install pywsd to use this module. You can install it by running 'pip install pywsd'")

try:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

try:
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)
except Exception as e:
eval_logger.debug(f"nltk download failed: {e}")
except ImportError:
eval_logger.debug("nltk not installed. Please install nltk to use this module. You can install it by running 'pip install nltk'")

from lmms_eval.tasks._task_utils.video_loader import get_cache_dir, get_video
import numpy as np


OPTIONS = ["A", "B", "C", "D", "E"]


with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
@@ -45,6 +20,23 @@

config = yaml.safe_load("".join(safe_data))

if config["metadata"]["load_package"]:
try:
from pywsd.utils import lemmatize_sentence
except ImportError:
eval_logger.debug("pywsd not installed. Please install pywsd to use this module. You can install it by running 'pip install pywsd'")

try:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)
except ImportError:
eval_logger.debug("nltk not installed. Please install nltk to use this module. You can install it by running 'pip install nltk'")

stopwords = set(pd.read_csv(Path(__file__).parent / "stopwords.csv").squeeze())

cache_dir = get_cache_dir(config, "NExTVideo")
4 changes: 1 addition & 3 deletions lmms_eval/tasks/videochatgpt/videochatgpt_consistency.yaml
@@ -11,11 +11,9 @@ metric_list:
aggregation: !function utils.videochatgpt_aggregate_consistency
higher_is_better: true
include: _default_template_yaml
full_docs: true

generation_kwargs:
until:
- "ASSISTANT:"
image_aspect_ratio: original
max_new_tokens: 1024
temperature: 0
top_p: 1.0
7 changes: 7 additions & 0 deletions lmms_eval/tasks/videochatgpt/videochatgpt_generic.yaml
@@ -17,3 +17,10 @@ metric_list:
aggregation: !function utils.videochatgpt_aggregate_score
higher_is_better: true
include: _default_template_yaml

generation_kwargs:
max_new_tokens: 1024
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
3 changes: 0 additions & 3 deletions lmms_eval/tasks/videochatgpt/videochatgpt_temporal.yaml
@@ -13,9 +13,6 @@ metric_list:
include: _default_template_yaml

generation_kwargs:
until:
- "ASSISTANT:"
image_aspect_ratio: original
max_new_tokens: 1024
temperature: 0
top_p: 1.0
1 change: 1 addition & 0 deletions lmms_eval/tasks/wild_vision_bench/_default_template_yaml
@@ -5,6 +5,7 @@ output_type: generate_until
doc_to_visual: !function utils.wild_vision_doc_to_visual
doc_to_text: !function utils.wild_vision_doc_to_text
doc_to_target: !function utils.wild_vision_doc_to_target
process_results_use_image: true
generation_kwargs:
max_new_tokens: 4096
temperature: 0
