RubberDuckBench: A Benchmark for AI Coding Assistants
Programmers are turning to AI coding assistants to answer questions about their code, but it is unclear how well models perform at answering contextualized questions. Do state-of-the-art LLMs answer questions about contextualized code correctly? Do they hallucinate or lie about API- or project-specific facts? Do certain models perform better than others? We aim to answer these questions with RubberDuckBench, a benchmark for AI coding assistants. It includes 15 questions across Java, Python, and C++, derived from real-world PR review comments. Each question is paired with a detailed rubric that was manually developed and applied to ensure reliable evaluation. We evaluate a diverse set of 20 LLMs (proprietary and open-source) on these questions.
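To make the setup concrete, here is a minimal sketch of what a rubric-graded benchmark item could look like. The class and field names below are illustrative assumptions, not the actual RubberDuckBench data format; see the paper for the real methodology.

```python
# Hypothetical illustration only: these class and field names are assumptions,
# not the actual RubberDuckBench schema described in the paper.
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    description: str   # e.g. "explains why the flagged API call is unsafe here"
    weight: float      # contribution of this criterion to the total score

@dataclass
class BenchmarkItem:
    language: str      # one of "Java", "Python", "C++"
    question: str      # question derived from a real PR review comment
    context: str       # the code the model is asked about
    rubric: list[RubricCriterion] = field(default_factory=list)

def rubric_score(satisfied: list[bool], rubric: list[RubricCriterion]) -> float:
    """Weighted fraction of rubric criteria an answer satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c, ok in zip(rubric, satisfied) if ok)
    return earned / total if total else 0.0
```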
Check out our paper for more information!
Key Highlights
Explore the full methodology and results in our paper.
The leaderboard below presents the performance of state-of-the-art LLMs across multiple metrics, including average score, binary correctness, and cost per query.
| # | Model | LLM Family | Performance | Perfect Answers | Cost per Query |
|---|---|---|---|---|---|
| 1 | Grok 4 | xAI | 69.29% | ✓✓✓ | $0.050 |
| 2 | | Anthropic | 68.53% | ✓✓ | $0.597 |
| 3 | GPT-5 | OpenAI | 67.80% | ✓✓✓ | $0.030 |
| 4 | | Anthropic | 67.02% | ✓✓✓ | $0.614 |
| 5 | o3 | OpenAI | 64.93% | ✓✓✓ | $0.039 |
| 6 | Gemini 2.5 Flash | Google | 64.30% | ✓ | $0.021 |
| 7 | Gemini 2.5 Pro | Google | 64.01% | ✓✓ | $0.087 |
| 8 | gpt-oss-20b | OpenAI | 63.63% | ✓✓✓ | N/A |
| 9 | | Anthropic | 61.66% | | $0.117 |
| 10 | | Anthropic | 61.47% | ✓✓ | $0.110 |
| 11 | Qwen 3 | Alibaba | 61.14% | ✓✓ | N/A |
| 12 | gpt-oss-120b | OpenAI | 59.54% | ✓ | N/A |
| 13 | GPT-4.1 | OpenAI | 59.47% | ✓✓ | $0.060 |
| 14 | Llama 3.3 70B | Meta | 56.36% | ✓✓ | N/A |
| 15 | Grok 3 | xAI | 54.74% | ✓ | $0.131 |
| 16 | | DeepSeek | 54.40% | ✓ | N/A |
| 17 | Gemini 2.0 Flash | Google | 53.78% | ✓ | $0.003 |
| 18 | Llama 4 Scout | Meta | 52.96% | | N/A |
| 19 | Qwen 3 Coder | Alibaba | 49.73% | ✓ | N/A |
| 20 | | Mistral AI | 48.67% | | N/A |
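For readers who want to see how the leaderboard columns relate to per-question results, below is a minimal sketch under the assumption that Performance is the mean rubric score, Perfect Answers counts questions scored 1.0, and Cost per Query is the mean API cost per question. The function and example numbers are illustrative, not taken from our results; the paper defines the exact metrics.

```python
# Illustrative only: assumes Performance = mean rubric score, Perfect Answers =
# number of questions scored 1.0, and Cost per Query = mean API cost per question.
def leaderboard_row(scores: list[float], costs_usd: list[float]) -> dict:
    """scores: one rubric score in [0, 1] per question; costs_usd: API cost per question."""
    n = len(scores)
    return {
        "performance_pct": 100.0 * sum(scores) / n,        # average score, as a percentage
        "perfect_answers": sum(s == 1.0 for s in scores),   # questions answered perfectly
        "cost_per_query_usd": sum(costs_usd) / n,           # average dollars per question
    }

# Example with made-up numbers: 3 perfect answers and 12 partial answers out of 15.
print(leaderboard_row(scores=[1.0] * 3 + [0.6] * 12, costs_usd=[0.05] * 15))
# -> performance_pct ~= 68.0, perfect_answers = 3, cost_per_query_usd ~= 0.05
```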
Citation
@inproceedings{dinella2026rubberduckbench,
title={RubberDuckBench: A Benchmark for AI Coding Assistants},
author={Elizabeth Dinella and Ferida Mohammad and Fatma Ayad and Petros Maniatis and Satish Chandra},
booktitle={LLM4Code at ICSE},
year={2026}
}