Create benchmark_diff_analysis.py #2099
Open
+352
−0
This PR introduces a script that compares two benchmark run directories and analyzes how the performance of individual tests changed between them.
For each test, it compares the number of failed attempts before the test passed in the first run and in the second run (if the test never passed, it uses the total number of failed attempts). It then categorizes the test as improved, worsened, stable, or present in only the first or only the second run. The original --diffs switch, by contrast, simply fails if a test is not present in both runs.
Comparison with the existing --diffs switch
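To make the contrast with --diffs concrete, the categorization described above can be sketched roughly as follows. This is a minimal illustration, not the code added by this PR: the function names (failed_attempts, categorize, compare_runs), the category labels, and the input shape (a dict mapping each test name to a list of per-attempt pass/fail booleans) are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def failed_attempts(outcomes: List[bool]) -> int:
    """Count failed attempts before the first pass.

    If no attempt passed, every attempt counts as failed.
    (Assumed input: one boolean per attempt, True meaning the test passed.)
    """
    for index, passed in enumerate(outcomes):
        if passed:
            return index
    return len(outcomes)


def categorize(first: Optional[List[bool]], second: Optional[List[bool]]) -> str:
    """Classify one test; None means the test is absent from that run."""
    if first is None:
        return "only_in_second"
    if second is None:
        return "only_in_first"
    before, after = failed_attempts(first), failed_attempts(second)
    if after < before:
        return "improved"
    if after > before:
        return "worsened"
    return "stable"


def compare_runs(
    run_1: Dict[str, List[bool]], run_2: Dict[str, List[bool]]
) -> Dict[str, str]:
    """Categorize every test that appears in at least one of the two runs."""
    names = sorted(set(run_1) | set(run_2))
    return {name: categorize(run_1.get(name), run_2.get(name)) for name in names}


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    first_run = {"alpha": [True], "beta": [False, True]}
    second_run = {"beta": [True], "gamma": [False, False]}
    for name, category in compare_runs(first_run, second_run).items():
        print(f"{name}: {category}")
```

Because the test names are taken from the union of both runs, a test missing from one side gets its own category instead of raising an error.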
Let's compare the output of the built-in --diffs switch and the new tool in this PR when both are pointed at the same pair of benchmark runs:
Original benchmark.py --diffs
./benchmark/benchmark.py --diffs tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/
all-your-base
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/all-your-base/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/all-your-base/.aider.chat.history.md
allergies
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/allergies/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/allergies/.aider.chat.history.md
anagram
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/anagram/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/anagram/.aider.chat.history.md
bank-account
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/bank-account/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/bank-account/.aider.chat.history.md
bob
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/bob/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/bob/.aider.chat.history.md
change
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/change/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/change/.aider.chat.history.md
complex-numbers
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/complex-numbers/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/complex-numbers/.aider.chat.history.md
house
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/house/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/house/.aider.chat.history.md
ledger
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/ledger/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/ledger/.aider.chat.history.md
linked-list
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/linked-list/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/linked-list/.aider.chat.history.md
list-ops
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/list-ops/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/list-ops/.aider.chat.history.md
luhn
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/luhn/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/luhn/.aider.chat.history.md
matrix
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/matrix/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/matrix/.aider.chat.history.md
nth-prime
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/nth-prime/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/nth-prime/.aider.chat.history.md
pascals-triangle
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/pascals-triangle/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/pascals-triangle/.aider.chat.history.md
perfect-numbers
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/perfect-numbers/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/perfect-numbers/.aider.chat.history.md
queen-attack
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/queen-attack/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/queen-attack/.aider.chat.history.md
resistor-color
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/resistor-color/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/resistor-color/.aider.chat.history.md
robot-name
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/robot-name/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/robot-name/.aider.chat.history.md
robot-simulator
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/robot-simulator/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/robot-simulator/.aider.chat.history.md
rotational-cipher
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/rotational-cipher/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/rotational-cipher/.aider.chat.history.md
run-length-encoding
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/run-length-encoding/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/run-length-encoding/.aider.chat.history.md
satellite
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/satellite/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/satellite/.aider.chat.history.md
say
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/say/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/say/.aider.chat.history.md
secret-handshake
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/secret-handshake/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/secret-handshake/.aider.chat.history.md
series
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/series/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/series/.aider.chat.history.md
simple-cipher
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/simple-cipher/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/simple-cipher/.aider.chat.history.md
space-age
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/space-age/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/space-age/.aider.chat.history.md
tournament
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/tournament/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/tournament/.aider.chat.history.md
triangle
True tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/triangle/.aider.chat.history.md
False tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/triangle/.aider.chat.history.md
yacht
False tmp.benchmarks/2024-10-20-18-00-06--gemini1.5-flash-editing-cedarscript-e0.3.2/yacht/.aider.chat.history.md
True tmp.benchmarks/2024-10-21-12-58-15--gemini-1.5-flash-editing-whole/yacht/.aider.chat.history.md
changed: 31 all-your-base,allergies,anagram,bank-account,bob,change,complex-numbers,house,ledger,linked-list,list-ops,luhn,matrix,nth-prime,pascals-triangle,perfect-numbers,queen-attack,resistor-color,robot-name,robot-simulator,rotational-cipher,run-length-encoding,satellite,say,secret-handshake,series,simple-cipher,space-age,tournament,triangle,yacht
unchanged: 102 accumulate,acronym,affine-cipher,alphametics,armstrong-numbers,atbash-cipher,beer-song,binary,binary-search,binary-search-tree,book-store,bottle-song,bowling,circular-buffer,clock,collatz-conjecture,connect,crypto-square,custom-set,darts,diamond,difference-of-squares,diffie-hellman,dnd-character,dominoes,dot-dsl,eliuds-eggs,error-handling,etl,flatten-array,food-chain,forth,gigasecond,go-counting,grade-school,grains,grep,hamming,hangman,hello-world,hexadecimal,high-scores,isbn-verifier,isogram,killer-sudoku-helper,kindergarten-garden,knapsack,largest-series-product,leap,markdown,matching-brackets,meetup,minesweeper,ocr-numbers,octal,paasio,palindrome-products,pangram,phone-number,pig-latin,point-mutations,poker,pov,prime-factors,protein-translation,proverb,pythagorean-triplet,rail-fence-cipher,raindrops,rational-numbers,react,rectangles,resistor-color-duo,resistor-color-expert,resistor-color-trio,rest-api,reverse-string,rna-transcription,roman-numerals,saddle-points,scale-generator,scrabble-score,sgf-parsing,sieve,simple-linked-list,spiral-matrix,square-root,strain,sublist,sum-of-multiples,transpose,tree-building,trinary,twelve-days,two-bucket,two-fer,variable-length-quantity,word-count,word-search,wordy,zebra-puzzle,zipper
New format (in this PR)
More details
The two run diffs above correspond to the diff of these two benchmark runs: