Evaluating the problem-solving capabilities of modern Large Language Models in competitive programming
Abstract
in educational settings. The primary objective is to analyze how problem-solving performance varies across tasks
of differing structure, difficulty, and source, using automated test-based scoring with partial acceptance. By examining
aggregate performance metrics, task-level difficulty, acceptance rates, failure patterns, and discriminative properties,
the study characterizes which types of problems are consistently solvable, which remain challenging, and which most effectively
differentiate levels of problem-solving capability. The results reveal substantial variation across tasks, highlighting
the influence of task difficulty and structure on acceptance outcomes and demonstrating that medium-difficulty problems
tend to provide the strongest discrimination. These findings contribute empirical insight into the characteristics of formally
designed Olympiad-style programming problems and their suitability for structured performance assessment in educational
and competitive programming contexts.
DOI: https://doi.org/10.52846/ami.v53i1.2394