In this repository, we investigate the capabilities of selected Large Language Models in understanding structural code execution. Specifically, we consider the task of reproducing the sequence of code lines executed by a given function (or set of functions) for a specific set of input arguments. This task is non-trivial for several reasons: the model must recall relevant parameter values, evaluate expressions, resolve method calls, and keep track of stacked executions. In addition, we investigate the execution of advanced code concepts such as OOP, Concurrency, and Recursion. An example of one such task is shown below:
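As a minimal constructed sketch of the task format (not an example taken from our dataset), consider a small Python function together with a concrete call; the model is asked to reproduce, in order, the lines executed for that call:

```python
# Constructed illustration: given the function below and the call clamp(7, 0, 5),
# the task is to list the executed lines in order.
def clamp(x, lo, hi):   # line 1
    if x < lo:          # line 2
        return lo       # line 3
    if x > hi:          # line 4
        return hi       # line 5
    return x            # line 6

# Expected trace for clamp(7, 0, 5): lines 2 -> 4 -> 5
# (x < lo is False, x > hi is True, so the function returns hi = 5).
```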
This repository contains the scripts we used to generate and elicit traces from Java and Python programs. The trace dataset is also shared on HuggingFace.
Under Dataset, we have saved the short programs, organized by their respective functionality. This folder also contains the majority of the traces, as well as the code to produce them.
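As a rough sketch of how line-level traces can be elicited in Python (a minimal illustration assuming a `sys.settrace`-based tracer; the actual scripts in this folder may differ), one can register a trace function and record every `line` event of the traced function:

```python
import sys
from typing import Callable, List

def collect_line_trace(func: Callable, *args, **kwargs) -> List[int]:
    """Run func(*args, **kwargs) and record the line numbers it executes."""
    executed: List[int] = []

    def tracer(frame, event, arg):
        # Record only 'line' events belonging to the traced function itself
        # (callees are not recorded in this simplified sketch).
        if event == "line" and frame.f_code is func.__code__:
            executed.append(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(None)
    return executed

def example(n):
    total = 0
    for i in range(n):
        total += i
    return total

print(collect_line_trace(example, 3))
# Prints the absolute line numbers of the executed statements,
# with the loop body lines repeated once per iteration.
```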
The Results directory contains the traces produced by the different models, split between HumanEval and complex tasks (i.e., advanced concepts).
Models contains configuration files for running the different language models locally. Visualizations contains a number of basic visualizations of the findings.
A basic overview of the findings on HumanEval can be found below:
The tables below summarize the performance of the different models across tasks, evaluated under two settings: CoT (Chain-of-Thought reasoning) and Direct.
**CoT**

Model | Acc Hard (%) | Acc Mean (%) | Sim | False Sim |
---|---|---|---|---|
Gemini1.5-Pro 002 | 47.2 | 66.2 | 0.88 | 0.37 |
Claude3.5-Sonnet | 41.0 | 61.6 | 0.87 | 0.43 |
GPT4o | 16.8 | 39.4 | 0.75 | 0.50 |
Qwen2.5-Coder 32B | 26.1 | 44.3 | 0.81 | 0.50 |
LLama3.1 70B | 16.2 | 38.1 | 0.76 | 0.52 |
Codestral 22B | 9.3 | 25.0 | 0.71 | 0.57 |
LLama3.1 8B | 1.9 | 12.6 | 0.56 | 0.51 |
Qwen2.5-Coder 7B | 1.9 | 11.0 | 0.61 | 0.56 |
CodeLLama 34B | 1.2 | 7.6 | 0.46 | 0.43 |
CodeLLama 7B | 0.0 | 0.1 | 0.28 | 0.28 |

**Direct**

Model | Acc Hard (%) | Acc Mean (%) | Sim | False Sim |
---|---|---|---|---|
Gemini1.5-Pro 002 | 47.0 | 65.7 | 0.89 | 0.37 |
Claude3.5-Sonnet | 41.0 | 58.7 | 0.88 | 0.44 |
GPT4o | 21.2 | 38.8 | 0.75 | 0.50 |
Qwen2.5-Coder 32B | 32.7 | 42.4 | 0.78 | 0.44 |
LLama3.1 70B | 25.5 | 36.0 | 0.71 | 0.42 |
Codestral 22B | 3.1 | 17.8 | 0.66 | 0.59 |
LLama3.1 8B | 0.6 | 10.4 | 0.53 | 0.48 |
Qwen2.5-Coder 7B | 0.0 | 4.1 | 0.56 | 0.55 |
CodeLLama 34B | 2.5 | 10.0 | 0.57 | 0.52 |
CodeLLama 7B | 0.0 | 0.0 | 0.41 | 0.41 |
- Acc Hard (%): Percentage of tasks whose traces are correct across all tests.
- Acc Mean (%): Mean accuracy across all tests of all examples, as a percentage.
- Sim: Average similarity score between predicted and ground-truth traces.
- False Sim: Average similarity score computed over incorrect predictions only.
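As a minimal sketch of how these metrics could be computed from predicted and ground-truth traces (the sequence-matching similarity used here is an assumption for illustration; the evaluation code in this repository defines the exact measures):

```python
from difflib import SequenceMatcher
from statistics import mean

def trace_similarity(pred, gold):
    # Assumed similarity measure: sequence-matching ratio over the two traces
    # (lists of executed line identifiers).
    return SequenceMatcher(None, pred, gold).ratio()

def summarize(results):
    """results maps each task to a list of (predicted_trace, ground_truth_trace)
    pairs, one pair per test."""
    per_test_correct, sims, false_sims, task_all_correct = [], [], [], []
    for tests in results.values():
        flags = []
        for pred, gold in tests:
            correct = (pred == gold)
            flags.append(correct)
            per_test_correct.append(correct)
            sim = trace_similarity(pred, gold)
            sims.append(sim)
            if not correct:
                false_sims.append(sim)
        task_all_correct.append(all(flags))
    return {
        "Acc Hard (%)": 100 * mean(task_all_correct),  # all tests of a task correct
        "Acc Mean (%)": 100 * mean(per_test_correct),  # per-test accuracy
        "Sim": mean(sims),                             # over all predictions
        "False Sim": mean(false_sims) if false_sims else float("nan"),  # incorrect only
    }
```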
We further investigated performance on advanced concepts:
Model | CoT Acc Mean (%) | CoT Sim | Direct Acc Mean (%) | Direct Sim |
---|---|---|---|---|
Gemini1.5-Pro 002 | 14.0 | 0.79 | 20.0 | 0.81 |
Claude3.5-Sonnet | 0.0 | 0.77 | 1.0 | 0.69 |
GPT4o | 4.5 | 0.82 | 4.0 | 0.73 |
Qwen2.5-Coder 32B | 14.5 | 0.78 | 4.0 | 0.73 |
LLama3.1 70B | 15.0 | 0.74 | 10.0 | 0.75 |
Codestral 22B | 1.5 | 0.62 | 1.5 | 0.6 |
LLama3.1 8B | 0.5 | 0.58 | 1.0 | 0.48 |
Qwen2.5-Coder 7B | 0.0 | 0.58 | 0.0 | 0.56 |
CodeLLama 34B | 0.0 | 0.37 | 0.0 | 0.37 |
CodeLLama 7B | 0.0 | 0.30 | 0.0 | 0.40 |

Model | CoT Acc Mean (%) | CoT Sim | Direct Acc Mean (%) | Direct Sim |
---|---|---|---|---|
Gemini1.5-Pro 002 | 2.7 | 0.47 | 0.9 | 0.41 |
Claude3.5-Sonnet | 0.3 | 0.42 | 1.2 | 0.41 |
GPT4o | 2.7 | 0.49 | 1.8 | 0.38 |
Qwen2.5-Coder 32B | 3.0 | 0.35 | 1.8 | 0.30 |
LLama3.1 70B | 1.2 | 0.36 | 0.6 | 0.27 |
Codestral 22B | 1.0 | 0.29 | 0.0 | 0.29 |
LLama3.1 8B | 0.3 | 0.21 | 0.0 | 0.35 |
Qwen2.5-Coder 7B | 0.0 | 0.15 | 0.0 | 0.16 |
CodeLLama 34B | 0.0 | 0.29 | 0.0 | 0.27 |
CodeLLama 7B | 0.0 | 0.23 | 0.0 | 0.26 |

Model | CoT Acc Mean (%) | CoT Sim | Direct Acc Mean (%) | Direct Sim |
---|---|---|---|---|
Gemini1.5-Pro 002 | 1.0 | 0.41 | 1.0 | 0.39 |
Claude3.5-Sonnet | 0.0 | 0.4 | 0.0 | 0.42 |
GPT4o | 0.0 | 0.39 | 0.0 | 0.4 |
Qwen2.5-Coder 32B | 0.0 | 0.36 | 0.0 | 0.37 |
LLama3.1 70B | 1.0 | 0.34 | 1.0 | 0.33 |
Codestral 22B | 0.0 | 0.29 | 0.0 | 0.38 |
LLama3.1 8B | 0.0 | 0.28 | 0.0 | 0.26 |
Qwen2.5-Coder 7B | 0.0 | 0.21 | 0.0 | 0.18 |
CodeLLama 34B | 0.0 | 0.24 | 0.0 | 0.27 |
CodeLLama 7B | 0.0 | 0.23 | 0.0 | 0.28 |
```bibtex
@misc{beger2025coconutstructuralcodeunderstanding,
  title={CoCoNUT: Structural Code Understanding does not fall out of a tree},
  author={Claas Beger and Saikat Dutta},
  year={2025},
  eprint={2501.16456},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.16456},
}
```