A Benchmark for AI Coding Assistants


Programmers are turning to AI coding assistants to answer questions about their code, but it is unclear how well models perform at answering contextualized questions. Do state-of-the-art LLMs answer questions about contextualized code correctly? Do they hallucinate or lie about API- or project-specific facts? Do certain models perform better than others? We aim to answer these questions with RubberDuckBench, a benchmark for AI coding assistants. It includes 15 questions across Java, Python, and C++, derived from real-world PR review comments. Each question is paired with a detailed rubric that was manually developed and applied to ensure reliable evaluation. We evaluate a diverse set of 20 LLMs, both proprietary and open-source, on these questions.
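To illustrate how rubric-based grading of an answer might work, here is a minimal sketch. The criterion names, point weights, and example question are hypothetical; the paper describes the actual rubrics and how they were applied.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One manually written rubric item with a point weight."""
    description: str
    points: float

@dataclass
class GradedAnswer:
    """An answer graded against a rubric; `met` flags which criteria it satisfied."""
    criteria: list[RubricCriterion]
    met: list[bool]

    def score(self) -> float:
        """Fraction of rubric points earned, in [0, 1]."""
        total = sum(c.points for c in self.criteria)
        earned = sum(c.points for c, ok in zip(self.criteria, self.met) if ok)
        return earned / total if total else 0.0

    def is_perfect(self) -> bool:
        """True only if every rubric criterion was satisfied."""
        return all(self.met)

# Hypothetical rubric for a question about an off-by-one error in a loop bound.
rubric = [
    RubricCriterion("Identifies the off-by-one error", 2.0),
    RubricCriterion("References the correct project-specific API", 1.0),
    RubricCriterion("Proposes a working fix", 1.0),
]
graded = GradedAnswer(rubric, met=[True, True, False])
print(f"{graded.score():.2f}")  # 0.75
```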

Check out our paper for more information!

Key Highlights

Explore the full methodology and results in our paper.

Benchmark Results

LLM Leaderboard

The leaderboard below presents the performance of state-of-the-art LLMs across several metrics: average score (Performance), perfect answers (shown as stars), and cost per query.

| # | Model | LLM Family | Performance | Perfect Answers | Cost |
|---|-------|------------|-------------|-----------------|------|
| 1 | Grok 4 | xAI | 69.29% | ★★★ | $0.050 |
| 2 | Claude Opus 4 | Anthropic | 68.53% | ★★ | $0.597 |
| 3 | GPT-5 | OpenAI | 67.80% | ★★★ | $0.030 |
| 4 | Claude Opus 4.1 | Anthropic | 67.02% | ★★★ | $0.614 |
| 5 | o3 | OpenAI | 64.93% | ★★★ | $0.039 |
| 6 | Gemini 2.5 Flash | Google | 64.30% | ★ | $0.021 |
| 7 | Gemini 2.5 Pro | Google | 64.01% | ★★ | $0.087 |
| 8 | gpt-oss-20 | OpenAI | 63.63% | ★★★ | N/A |
| 9 | Claude Sonnet 4 | Anthropic | 61.66% | | $0.117 |
| 10 | Claude Sonnet 3.7 | Anthropic | 61.47% | ★★ | $0.110 |
| 11 | Qwen 3 | Alibaba | 61.14% | ★★ | N/A |
| 12 | gpt-oss-120 | OpenAI | 59.54% | ★ | N/A |
| 13 | GPT-4.1 | OpenAI | 59.47% | ★★ | $0.060 |
| 14 | Llama3.3 70 | Meta | 56.36% | ★★ | N/A |
| 15 | Grok 3 | xAI | 54.74% | ★ | $0.131 |
| 16 | Deepseek-R1 | Deepseek | 54.40% | ★ | N/A |
| 17 | Gemini 2.0 Flash | Google | 53.78% | ★ | $0.003 |
| 18 | Llama 4 Scout | Meta | 52.96% | | N/A |
| 19 | Qwen 3 Coder | Alibaba | 49.73% | ★ | N/A |
| 20 | Mistral Large | Mistral AI | 48.67% | | N/A |
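The aggregate metrics above could be computed roughly as follows. This is a sketch with made-up trial scores; question IDs and the exact aggregation rule are assumptions, and the paper defines the actual methodology.

```python
from statistics import mean

# Hypothetical per-question trial scores in [0, 1] for one model:
# {question_id: [score per trial]}
trial_scores = {
    "java-q1": [1.0, 1.0, 1.0],
    "py-q2":   [0.8, 0.6, 0.7],
    "cpp-q3":  [0.5, 0.5, 0.4],
}

# Performance: mean of the per-question averages, reported as a percentage.
performance = mean(mean(scores) for scores in trial_scores.values()) * 100

# Perfect answers: questions where every trial earned full marks.
perfect = sum(all(s == 1.0 for s in scores) for scores in trial_scores.values())

print(f"Performance: {performance:.2f}%  Perfect answers: {perfect}")
```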
Detailed View

Heatmap Showing Performance Across Question Types

This heatmap shows LLM model performance on questions across different languages. Rows are models, columns are language and question numbers, and cell colors reflect average scores across trials. Click a cell to view details like the question, average score, trial answers, and type.

If you would like to use RubberDuckBench in your work, please cite our paper.

@inproceedings{mohammad2026rubberduckbench,
  title={RubberDuckBench: A Benchmark for AI Coding Assistants},
  author={Ferida Mohammad and Fatma Ayad and Petros Maniatis and Satish Chandra and Elizabeth Dinella},
  booktitle={Proceedings of the Workshop on Large Language Models for Code (LLM4Code) at ICSE 2026},
  year={2026},
  doi={10.1145/3786181.3788710},
  eprint={2601.16456},
  archivePrefix={arXiv},
  note={arXiv:2601.16456 [cs.SE]}
}