RubberDuckBench: A Benchmark for AI Coding Assistants


Programmers increasingly turn to AI coding assistants to answer questions about their code, but it is unclear how well models answer such contextualized questions. Do state-of-the-art LLMs answer questions about code in context correctly? Do they hallucinate or lie about API- or project-specific facts? Do some models perform better than others? We aim to answer these questions with RubberDuckBench, a benchmark for AI coding assistants. It includes 15 questions across Java, Python, and C++, derived from real-world PR review comments. Each question is paired with a detailed rubric that was manually developed and applied to ensure reliable evaluation. We evaluate a diverse set of 20 LLMs (proprietary and open-source) on these questions.
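
To make the setup concrete, here is a minimal sketch (in Python) of what a single benchmark item might look like. The field names, rubric structure, and point weights below are illustrative assumptions, not the released data format; see the paper for the actual schema.

# Hypothetical shape of a RubberDuckBench item: a contextualized question
# drawn from a real-world PR review comment, plus a manually developed
# rubric. All field names here are illustrative, not the released schema.
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    description: str  # e.g. "notes that the flag is ignored on this path"
    points: float     # credit awarded when the answer meets the criterion

@dataclass
class BenchmarkItem:
    language: str     # "Java", "Python", or "C++"
    context: str      # surrounding project code shown to the model
    question: str     # the reviewer's question about that code
    rubric: list[RubricCriterion] = field(default_factory=list)

    def max_score(self) -> float:
        # Full rubric credit would correspond to a "perfect answer".
        return sum(c.points for c in self.rubric)

Under this reading, an answer is graded criterion by criterion against the rubric, and an answer earning the full max_score() counts as perfect.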

Check out our paper for more information!

Key Highlights

Explore the full methodology and results in our paper.

LLM Leaderboard

The leaderboard below reports the performance of state-of-the-art LLMs on three metrics: average rubric score (Performance), the number of questions answered perfectly (Perfect Answers, shown as stars), and average cost per query in USD (Cost); a sketch for recomputing these columns from per-question results follows the table.

 #  Model              LLM Family  Performance  Perfect Answers  Cost/query
 1  Grok 4             xAI         69.29%       ★★★              $0.050
 2  Claude Opus 4      Anthropic   68.53%       ★★               $0.597
 3  GPT-5              OpenAI      67.80%       ★★★              $0.030
 4  Claude Opus 4.1    Anthropic   67.02%       ★★★              $0.614
 5  o3                 OpenAI      64.93%       ★★★              $0.039
 6  Gemini 2.5 Flash   Google      64.30%       ★                $0.021
 7  Gemini 2.5 Pro     Google      64.01%       ★★               $0.087
 8  gpt-oss-20b        OpenAI      63.63%       ★★★              N/A
 9  Claude Sonnet 4    Anthropic   61.66%       –                $0.117
10  Claude Sonnet 3.7  Anthropic   61.47%       ★★               $0.110
11  Qwen 3             Alibaba     61.14%       ★★               N/A
12  gpt-oss-120b       OpenAI      59.54%       ★                N/A
13  GPT-4.1            OpenAI      59.47%       ★★               $0.060
14  Llama 3.3 70B      Meta        56.36%       ★★               N/A
15  Grok 3             xAI         54.74%       ★                $0.131
16  DeepSeek-R1        DeepSeek    54.40%       ★                N/A
17  Gemini 2.0 Flash   Google      53.78%       ★                $0.003
18  Llama 4 Scout      Meta        52.96%       –                N/A
19  Qwen 3 Coder       Alibaba     49.73%       ★                N/A
20  Mistral Large      Mistral AI  48.67%       –                N/A
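
For readers who want to recompute the aggregate columns, the sketch below shows one plausible derivation from per-question results. The field names are hypothetical, and the rule that a "perfect answer" means full rubric credit is an assumption for illustration; the paper defines the official metrics.

# Illustrative aggregation of per-question results into leaderboard
# columns. Field names are hypothetical; we assume a "perfect answer"
# earns every rubric point available for its question.
from dataclasses import dataclass

@dataclass
class QuestionResult:
    score: float      # rubric points the model earned
    max_score: float  # rubric points available for the question
    cost_usd: float   # API cost of answering this question

def leaderboard_row(results: list[QuestionResult]) -> dict:
    # Average normalized rubric score across all questions.
    performance = sum(r.score / r.max_score for r in results) / len(results)
    # Count of questions answered with full rubric credit.
    perfect = sum(1 for r in results if r.score == r.max_score)
    # Average API cost per query.
    avg_cost = sum(r.cost_usd for r in results) / len(results)
    return {
        "Performance": f"{performance:.2%}",  # e.g. "69.29%"
        "Perfect Answers": perfect,
        "Cost/query": f"${avg_cost:.3f}",     # e.g. "$0.050"
    }

Under these assumptions, applying leaderboard_row to a model's 15 per-question results would reproduce its row above (e.g., 69.29% / ★★★ / $0.050 for Grok 4).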

If you would like to use RubberDuckBench in your work, please cite our paper:

@inproceedings{dinella2026rubberduckbench,
  title={RubberDuckBench: A Benchmark for AI Coding Assistants},
  author={Elizabeth Dinella and Ferida Mohammad and Fatma Ayad and Petros Maniatis and Satish Chandra},
  booktitle={LLM4Code at ICSE},
  year={2026}
}