A Benchmark for AI Coding Assistants


Programmers are turning to AI coding assistants to answer questions about their code, but it is unclear how well models perform at answering contextualized questions. Do state-of-the-art LLMs answer questions about contextualized code correctly? Do they hallucinate or lie about API- or project-specific facts? Do certain models perform better than others? We aim to answer these questions with RubberDuckBench, a benchmark for AI coding assistants. It includes 15 questions across Java, Python, and C++, derived from real-world PR review comments. Each question is paired with a detailed rubric that was manually developed and applied to ensure reliable evaluation. We evaluate a diverse set of 20 LLMs, both proprietary and open-source, on these questions.
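To illustrate how rubric-based grading of an answer might work, here is a minimal sketch. The criterion names, point weights, and example question are hypothetical; the paper describes the actual rubrics and how they were applied.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One manually written rubric item with a point weight."""
    description: str
    points: float

@dataclass
class GradedAnswer:
    """An answer graded against a rubric; `met` flags which criteria it satisfied."""
    criteria: list[RubricCriterion]
    met: list[bool]

    def score(self) -> float:
        """Fraction of rubric points earned, in [0, 1]."""
        total = sum(c.points for c in self.criteria)
        earned = sum(c.points for c, ok in zip(self.criteria, self.met) if ok)
        return earned / total if total else 0.0

    def is_perfect(self) -> bool:
        """True only if every rubric criterion was satisfied."""
        return all(self.met)

# Hypothetical rubric for a question about an off-by-one error in a loop bound.
rubric = [
    RubricCriterion("Identifies the off-by-one error", 2.0),
    RubricCriterion("References the correct project-specific API", 1.0),
    RubricCriterion("Proposes a working fix", 1.0),
]
graded = GradedAnswer(rubric, met=[True, True, False])
print(f"{graded.score():.2f}")  # 0.75
```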

Check out our paper for more information!

Key Highlights

Explore the full methodology and results in our paper.

Benchmark Results

LLM Leaderboard

The leaderboard below presents the performance of state-of-the-art LLMs across several metrics: average score (Performance), perfect answers (shown as stars), and cost per query.

| # | Model | LLM Family | Performance | Perfect Answers | Cost |
|---|-------|------------|-------------|-----------------|------|
| 1 | Grok 4 | xAI | 69.29% | ★★★ | $0.050 |
| 2 | Claude Opus 4 | Anthropic | 68.53% | ★★ | $0.597 |
| 3 | GPT-5 | OpenAI | 67.80% | ★★★ | $0.030 |
| 4 | Claude Opus 4.1 | Anthropic | 67.02% | ★★★ | $0.614 |
| 5 | o3 | OpenAI | 64.93% | ★★★ | $0.039 |
| 6 | Gemini 2.5 Flash | Google | 64.30% | ★ | $0.021 |
| 7 | Gemini 2.5 Pro | Google | 64.01% | ★★ | $0.087 |
| 8 | gpt-oss-20 | OpenAI | 63.63% | ★★★ | N/A |
| 9 | Claude Sonnet 4 | Anthropic | 61.66% | | $0.117 |
| 10 | Claude Sonnet 3.7 | Anthropic | 61.47% | ★★ | $0.110 |
| 11 | Qwen 3 | Alibaba | 61.14% | ★★ | N/A |
| 12 | gpt-oss-120 | OpenAI | 59.54% | ★ | N/A |
| 13 | GPT-4.1 | OpenAI | 59.47% | ★★ | $0.060 |
| 14 | Llama3.3 70 | Meta | 56.36% | ★★ | N/A |
| 15 | Grok 3 | xAI | 54.74% | ★ | $0.131 |
| 16 | Deepseek-R1 | Deepseek | 54.40% | ★ | N/A |
| 17 | Gemini 2.0 Flash | Google | 53.78% | ★ | $0.003 |
| 18 | Llama 4 Scout | Meta | 52.96% | | N/A |
| 19 | Qwen 3 Coder | Alibaba | 49.73% | ★ | N/A |
| 20 | Mistral Large | Mistral AI | 48.67% | | N/A |
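The aggregate metrics above could be computed roughly as follows. This is a sketch with made-up trial scores; question IDs and the exact aggregation rule are assumptions, and the paper defines the actual methodology.

```python
from statistics import mean

# Hypothetical per-question trial scores in [0, 1] for one model:
# {question_id: [score per trial]}
trial_scores = {
    "java-q1": [1.0, 1.0, 1.0],
    "py-q2":   [0.8, 0.6, 0.7],
    "cpp-q3":  [0.5, 0.5, 0.4],
}

# Performance: mean of the per-question averages, reported as a percentage.
performance = mean(mean(scores) for scores in trial_scores.values()) * 100

# Perfect answers: questions where every trial earned full marks.
perfect = sum(all(s == 1.0 for s in scores) for scores in trial_scores.values())

print(f"Performance: {performance:.2f}%  Perfect answers: {perfect}")
```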
Detailed View

Heatmap Showing Performance Across Question Types

This heatmap shows LLM model performance on questions across different languages. Rows are models, columns are language and question numbers, and cell colors reflect average scores across trials. Click a cell to view details like the question, average score, trial answers, and type.

If you would like to use RubberDuckBench in your work, please cite our paper.

@inproceedings{mohammad2026rubberduckbench,
  title={RubberDuckBench: A Benchmark for AI Coding Assistants},
  author={Ferida Mohammad and Fatma Ayad and Petros Maniatis and Satish Chandra and Elizabeth Dinella},
  booktitle={Proceedings of the Workshop on Large Language Models for Code (LLM4Code) at ICSE 2026},
  year={2026},
  doi={10.1145/3786181.3788710},
  eprint={2601.16456},
  archivePrefix={arXiv},
  note={arXiv:2601.16456 [cs.SE]}
}