Create benchmark_diff_analysis.py #2099

Open
wants to merge 5 commits into
base: main

@elifarley elifarley commented Oct 21, 2024

This PR introduces a script that compares two benchmark run directories and analyzes the changes in individual test performance.

For each test, the script compares the two runs by the number of failed attempts before the test passed (or, if the test never passed in any attempt, by the total number of failed attempts). It then categorizes the test as improved, worsened, stable, or present in only one of the two runs. By contrast, the original --diffs switch simply fails if a test is not present in both runs.
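The categorization described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `classify` and `compare_runs` are hypothetical, and the signed-integer encoding (a value `n >= 0` means the test passed after `n` failed attempts, a value `n < 0` means it never passed after `|n|` failed attempts) is an assumption inferred from the sample output further down.

```python
from typing import Dict, List, Optional


def classify(first: Optional[int], second: Optional[int]) -> str:
    """Categorize one test from its outcome in each run.

    Assumed encoding (inferred from the sample output, not confirmed):
    n >= 0 -> passed after n failed attempts
    n <  0 -> never passed; abs(n) failed attempts
    None   -> the test is absent from that run
    """
    if first is None and second is None:
        raise ValueError("test is missing from both runs")
    if second is None:
        return "only in first run"
    if first is None:
        return "only in second run"

    first_passed, second_passed = first >= 0, second >= 0
    if first_passed and second_passed:
        if second < first:
            return "improved, minor"
        if second > first:
            return "worsened, still passed"
        return "stable: passed"
    if second_passed:
        return "improved, now passed"
    if first_passed:
        return "worsened, now failed"
    # Both runs failed; the attempt counts may differ (e.g. -3 -> -4).
    return "stable: failed"


def compare_runs(first_run: Dict[str, int],
                 second_run: Dict[str, int]) -> Dict[str, List[str]]:
    """Group test names by category, tolerating tests present in one run only."""
    groups: Dict[str, List[str]] = {}
    for name in sorted(first_run.keys() | second_run.keys()):
        category = classify(first_run.get(name), second_run.get(name))
        groups.setdefault(category, []).append(name)
    return groups
```

For example, `compare_runs({"anagram": -3, "sieve": 1}, {"anagram": 0, "sieve": 0})` places `anagram` under "improved, now passed" and `sieve` under "improved, minor".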

Comparison with the existing --diffs switch

Let's compare the output of the built-in --diffs switch with that of the new tool in this PR when both are pointed at the same pair of benchmark runs:

Original (benchmark.py --diffs)

./benchmark/benchmark.py --diffs tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/

all-your-base
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/all-your-base/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/all-your-base/.aider.chat.history.md

allergies
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/allergies/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/allergies/.aider.chat.history.md

anagram
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/anagram/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/anagram/.aider.chat.history.md

bank-account
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/bank-account/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/bank-account/.aider.chat.history.md

bob
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/bob/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/bob/.aider.chat.history.md

change
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/change/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/change/.aider.chat.history.md

complex-numbers
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/complex-numbers/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/complex-numbers/.aider.chat.history.md

house
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/house/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/house/.aider.chat.history.md

ledger
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/ledger/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/ledger/.aider.chat.history.md

linked-list
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/linked-list/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/linked-list/.aider.chat.history.md

list-ops
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/list-ops/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/list-ops/.aider.chat.history.md

luhn
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/luhn/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/luhn/.aider.chat.history.md

matrix
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/matrix/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/matrix/.aider.chat.history.md

nth-prime
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/nth-prime/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/nth-prime/.aider.chat.history.md

pascals-triangle
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/pascals-triangle/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/pascals-triangle/.aider.chat.history.md

perfect-numbers
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/perfect-numbers/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/perfect-numbers/.aider.chat.history.md

queen-attack
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/queen-attack/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/queen-attack/.aider.chat.history.md

resistor-color
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/resistor-color/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/resistor-color/.aider.chat.history.md

robot-name
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/robot-name/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/robot-name/.aider.chat.history.md

robot-simulator
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/robot-simulator/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/robot-simulator/.aider.chat.history.md

rotational-cipher
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/rotational-cipher/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/rotational-cipher/.aider.chat.history.md

run-length-encoding
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/run-length-encoding/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/run-length-encoding/.aider.chat.history.md

satellite
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/satellite/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/satellite/.aider.chat.history.md

say
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/say/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/say/.aider.chat.history.md

secret-handshake
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/secret-handshake/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/secret-handshake/.aider.chat.history.md

series
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/series/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/series/.aider.chat.history.md

simple-cipher
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/simple-cipher/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/simple-cipher/.aider.chat.history.md

space-age
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/space-age/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/space-age/.aider.chat.history.md

tournament
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/tournament/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/tournament/.aider.chat.history.md

triangle
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/triangle/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/triangle/.aider.chat.history.md

yacht
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/yacht/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/yacht/.aider.chat.history.md

changed: 31 all-your-base,allergies,anagram,bank-account,bob,change,complex-numbers,house,ledger,linked-list,list-ops,luhn,matrix,nth-prime,pascals-triangle,perfect-numbers,queen-attack,resistor-color,robot-name,robot-simulator,rotational-cipher,run-length-encoding,satellite,say,secret-handshake,series,simple-cipher,space-age,tournament,triangle,yacht

unchanged: 102 accumulate,acronym,affine-cipher,alphametics,armstrong-numbers,atbash-cipher,beer-song,binary,binary-search,binary-search-tree,book-store,bottle-song,bowling,circular-buffer,clock,collatz-conjecture,connect,crypto-square,custom-set,darts,diamond,difference-of-squares,diffie-hellman,dnd-character,dominoes,dot-dsl,eliuds-eggs,error-handling,etl,flatten-array,food-chain,forth,gigasecond,go-counting,grade-school,grains,grep,hamming,hangman,hello-world,hexadecimal,high-scores,isbn-verifier,isogram,killer-sudoku-helper,kindergarten-garden,knapsack,largest-series-product,leap,markdown,matching-brackets,meetup,minesweeper,ocr-numbers,octal,paasio,palindrome-products,pangram,phone-number,pig-latin,point-mutations,poker,pov,prime-factors,protein-translation,proverb,pythagorean-triplet,rail-fence-cipher,raindrops,rational-numbers,react,rectangles,resistor-color-duo,resistor-color-expert,resistor-color-trio,rest-api,reverse-string,rna-transcription,roman-numerals,saddle-points,scale-generator,scrabble-score,sgf-parsing,sieve,simple-linked-list,spiral-matrix,square-root,strain,sublist,sum-of-multiples,transpose,tree-building,trinary,twelve-days,two-bucket,two-fer,variable-length-quantity,word-count,word-search,wordy,zebra-puzzle,zipper

New format (in this PR)

--- 2024-10-21-12-58-15--whole-editing-gemini1.5flash
+++ 2024-10-20-18-00-06--cedarscript-0.3.2-editing-gemini1.5flash
# ============= Failed Attempts per Test =============

@@ Improved, now PASSED (6) @@
++anagram: -3 -> 0
++ledger: -3 -> 0
++list-ops: -3 -> 0
++perfect-numbers: -3 -> 0
++secret-handshake: -3 -> 0
++triangle: -3 -> 0

@@ Improved, minor (5) @@
+ acronym: 2 -> 1
+ roman-numerals: 2 -> 0
+ sieve: 1 -> 0
+ two-fer: 2 -> 1
+ zebra-puzzle: 2 -> 1

@@ Worsened, now FAILED (25) @@
--all-your-base: 1 -> -4
--allergies: 0 -> -4
--bank-account: 1 -> -4
--bob: 2 -> -4
--change: 2 -> -4
--complex-numbers: 2 -> -4
--house: 2 -> -4
--linked-list: 0 -> -4
--luhn: 0 -> -4
--matrix: 0 -> -4
--nth-prime: 0 -> -4
--pascals-triangle: 1 -> -4
--queen-attack: 1 -> -4
--resistor-color: 0 -> -4
--robot-name: 1 -> -4
--robot-simulator: 2 -> -4
--rotational-cipher: 0 -> -4
--run-length-encoding: 2 -> -4
--satellite: 1 -> -4
--say: 1 -> -4
--series: 1 -> -4
--simple-cipher: 0 -> -4
--space-age: 0 -> -4
--tournament: 1 -> -4
--yacht: 1 -> -4

@@ Worsened, still PASSED (7) @@
- custom-set: 0 -> 1
- largest-series-product: 0 -> 1
- octal: 1 -> 2
- point-mutations: 1 -> 3
- protein-translation: 0 -> 1
- rational-numbers: 0 -> 1
- sum-of-multiples: 0 -> 1

@@ Stable: PASSED (34) @@
=+accumulate: 0
=+armstrong-numbers: 0
=+binary: 0
=+binary-search: 0
=+collatz-conjecture: 0
=+darts: 0
=+difference-of-squares: 0
=+diffie-hellman: 0
=+eliuds-eggs: 0
=+etl: 0
=+flatten-array: 0
=+gigasecond: 0
=+grains: 0
=+hamming: 0
=+hello-world: 0
=+hexadecimal: 1
=+isogram: 0
=+knapsack: 0
=+leap: 0
=+markdown: 0
=+matching-brackets: 0
=+pangram: 0
=+prime-factors: 0
=+pythagorean-triplet: 0
=+raindrops: 0
=+resistor-color-duo: 0
=+reverse-string: 0
=+rna-transcription: 0
=+saddle-points: 1
=+scrabble-score: 0
=+spiral-matrix: 0
=+square-root: 0
=+strain: 0
=+trinary: 0

@@ Stable: FAILED (56) @@
=-affine-cipher: -3 -> -4
=-alphametics: -3 -> -4
=-atbash-cipher: -3 -> -4
=-beer-song: -3 -> -4
=-binary-search-tree: -3 -> -4
=-book-store: -3 -> -4
=-bottle-song: -3 -> -4
=-bowling: -3 -> -4
=-circular-buffer: -3 -> -4
=-clock: -3 -> -4
=-connect: -3 -> -4
=-crypto-square: -3 -> -4
=-diamond: -3 -> -4
=-dnd-character: -3 -> -4
=-dominoes: -3 -> -4
=-dot-dsl: -3 -> -4
=-error-handling: -3 -> -4
=-food-chain: -3 -> -4
=-forth: -3 -> -4
=-go-counting: -3 -> -4
=-grade-school: -3 -> -4
=-grep: -3 -> -4
=-hangman: -3 -> -4
=-high-scores: -3 -> -4
=-isbn-verifier: -3 -> -4
=-killer-sudoku-helper: -3 -> -4
=-kindergarten-garden: -3 -> -4
=-meetup: -3 -> -4
=-minesweeper: -3 -> -4
=-ocr-numbers: -3 -> -4
=-paasio: -3 -> -4
=-palindrome-products: -3 -> -4
=-phone-number: -3 -> -4
=-pig-latin: -3 -> -4
=-poker: -3 -> -4
=-pov: -3 -> -4
=-proverb: -3 -> -4
=-rail-fence-cipher: -3 -> -4
=-react: -3 -> -4
=-rectangles: -3 -> -4
=-resistor-color-expert: -3 -> -4
=-resistor-color-trio: -3 -> -4
=-rest-api: -3 -> -4
=-scale-generator: -3 -> -4
=-sgf-parsing: -3 -> -4
=-simple-linked-list: -3 -> -4
=-sublist: -3 -> -4
=-transpose: -3 -> -4
=-tree-building: -3 -> -4
=-twelve-days: -3 -> -4
=-two-bucket: -3 -> -4
=-variable-length-quantity: -3 -> -4
=-word-count: -3 -> -4
=-word-search: -3 -> -4
=-wordy: -3 -> -4
=-zipper: -3 -> -4

# =============          TOTALS          =============
# IMPROVED: 11
#    Now PASSES: 6
#    Minor     : 5
# WORSENED: 32
#    Now FAILED: 25
#    Minor     : 7
# STABLE  : 90
#    PASSED: 34
#    FAILED: 56
# TOTAL  : 133
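As a cross-check, the six categories can be folded back into the two buckets that --diffs reports. The sketch below assumes, based only on the sample output above and not on the benchmark.py source, that --diffs counts a test as changed when its pass/fail outcome flips between the runs. The variable names are illustrative.

```python
# Counts copied from the TOTALS block above.
totals = {
    "improved_now_passed": 6,
    "improved_minor": 5,
    "worsened_now_failed": 25,
    "worsened_still_passed": 7,
    "stable_passed": 34,
    "stable_failed": 56,
}

# Tests whose pass/fail outcome flipped between the two runs.
flipped = totals["improved_now_passed"] + totals["worsened_now_failed"]

# Tests whose pass/fail outcome is the same in both runs, even if the
# number of failed attempts differs.
same_outcome = sum(totals.values()) - flipped

print(f"changed: {flipped}")         # matches "changed: 31" from --diffs
print(f"unchanged: {same_outcome}")  # matches "unchanged: 102" from --diffs
print(f"total: {sum(totals.values())}")
```

The 12 tests in the two "minor" categories are the extra detail the new format surfaces: --diffs lists them as unchanged because they pass in both runs.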

More details

The two diffs above correspond to these two benchmark runs:

- dirname: 2024-10-21-12-58-15--whole-editing-gemini1.5flash
  test_cases: 133
  model: gemini/gemini-1.5-flash-latest
  edit_format: whole
  commit_hash: 95df622-dirty
  pass_rate_1: 34.6
  pass_rate_2: 45.9
  pass_rate_3: 53.4
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 21
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 3
  exhausted_context_windows: 0
  test_timeouts: 2
  command: aider --model gemini/gemini-1.5-flash-latest
  date: 2024-10-21
  versions: 0.59.2.dev
  seconds_per_case: 8.9
  total_cost: 0.1169
- dirname: 2024-10-20-18-00-06--cedarscript-0.3.2-editing-gemini1.5flash
  test_cases: 133
  model: gemini/gemini-1.5-flash-latest
  edit_format: cedarscript-g
  commit_hash: df4c352-dirty
  pass_rate_1: 30.1
  pass_rate_2: 37.6
  pass_rate_3: 38.3
  pass_rate_4: 39.1
  percent_cases_well_formed: 69.9
  error_outputs: 803
  num_malformed_responses: 297
  num_with_malformed_responses: 40
  user_asks: 186
  lazy_comments: 0
  syntax_errors: 30
  indentation_errors: 58
  exhausted_context_windows: 0
  test_timeouts: 4
  command: aider --model gemini/gemini-1.5-flash-latest
  date: 2024-10-20
  versions: 0.59.2.dev
  seconds_per_case: 39.7
  total_cost: 1.3729
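The category totals are consistent with the final pass rates listed above. The following sketch recomputes them, assuming that the last `pass_rate_N` entry of each run is the share of tests that passed by the final attempt and that the first run is the `whole` run (the `---` side of the diff). The variable names are illustrative.

```python
TOTAL_TESTS = 133

# Category counts from the TOTALS block of the new diff format.
improved_now_passed = 6     # failed in the first run, passed in the second
improved_minor = 5          # passed in both runs
worsened_now_failed = 25    # passed in the first run, failed in the second
worsened_still_passed = 7   # passed in both runs
stable_passed = 34          # passed in both runs

passed_in_both = improved_minor + worsened_still_passed + stable_passed

# First run (whole): everything that passed in both, plus the tests that
# only failed later in the second run.
passed_first = passed_in_both + worsened_now_failed

# Second run (cedarscript-g): everything that passed in both, plus the
# tests that newly pass.
passed_second = passed_in_both + improved_now_passed

first_rate = round(100 * passed_first / TOTAL_TESTS, 1)
second_rate = round(100 * passed_second / TOTAL_TESTS, 1)

print(passed_first, first_rate)    # compare with pass_rate_3 of the whole run
print(passed_second, second_rate)  # compare with pass_rate_4 of the cedarscript-g run
```

With the counts above this gives 71 passing tests (53.4%) for the first run and 52 (39.1%) for the second, which agrees with `pass_rate_3: 53.4` and `pass_rate_4: 39.1`.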

@elifarley elifarley marked this pull request as draft October 21, 2024 16:29
@paul-gauthier
Collaborator

The existing benchmark script already has a --diffs switch.

@elifarley
Author

Thanks, I will look into it!

@elifarley elifarley closed this Oct 21, 2024
@elifarley elifarley changed the title from "Create benchmark_diff.py" to "Create benchmark_diff_analysis.py" Oct 21, 2024
@elifarley elifarley reopened this Oct 23, 2024
@elifarley elifarley marked this pull request as ready for review October 23, 2024 13:35