LLMs Exposed: Are They Just Cheating on Math Tests?


Large Language Models (LLMs) are natural language processing models that have posted strong scores on several mathematical reasoning benchmarks. They are designed to process and understand human language, which lets them handle tasks such as question answering, language translation, and text generation. LLMs are typically trained on large datasets scraped from the internet, from which they learn complex language patterns and structures. But do their math scores reflect genuine reasoning, or are they merely adept at cheating on math tests? Let’s find out!


What are Large Language Models (LLMs)?

Large Language Models (LLMs) are state-of-the-art natural language processing models trained on vast amounts of data to understand and process human language. These models can perform various language-related tasks, including question-answering, translation, and text generation. They have achieved impressive success in various benchmarks for mathematical reasoning, showcasing their ability to comprehend and reason with mathematical concepts.

Why are LLMs Important?

Large Language Models (LLMs) are important due to their potential applications across various domains. These models can revolutionize natural language processing tasks, including language translation, text summarization, and conversational agents. Additionally, they can be utilized in educational settings to assist with learning and understanding complex concepts. Furthermore, LLMs have the potential to enhance human-computer interaction and automate language-related tasks, leading to increased efficiency and productivity.

The Problem of Benchmark Bias: Can LLMs Think?

There is growing concern about benchmark bias and data contamination in the training of Large Language Models (LLMs). Because popular benchmarks are public, examples that closely resemble benchmark questions can inadvertently end up in the training data. A contaminated model may appear to reason better than it actually does, since it can simply repeat correct answers it encountered during training. This raises questions about the true reasoning abilities of LLMs and calls for rigorous evaluation of how well they understand and reason with language and mathematical concepts.


The Experiment: Putting LLMs to the Test

Large language models (LLMs) have garnered significant attention for their mathematical reasoning capabilities. The research paper describes an experiment designed to evaluate these models’ true reasoning abilities by comparing their performance on two benchmarks: Grade School Math 1000 (GSM1k) and Grade School Math 8K (GSM8k).

Evaluation of LLM Performance on GSM1k and GSM8k

The experimental setup involved meticulously evaluating leading open- and closed-source LLMs on GSM1k and GSM8k. The evaluation process utilized a standardized prompt, drawing 5 randomly selected examples from the GSM8k train set for each question. This approach ensured a consistent and fair evaluation across all models. The evaluation harness extracted the last numeric answer in the response and compared it to the correct answer, enabling a precise assessment of model performance.
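
The harness described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the function names, the "Question:/Answer:" prompt layout, and the number-matching regular expression are all assumptions made for the example.

```python
import random
import re

# Assumed pattern: integers or decimals, optional sign, optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def build_prompt(train_examples, question, k=5, seed=None):
    """Build a few-shot prompt from k randomly drawn (question, answer) pairs.

    The "Question:/Answer:" layout is a hypothetical format for illustration.
    """
    rng = random.Random(seed)
    shots = rng.sample(train_examples, k)
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def extract_last_number(response):
    """Return the last number found in the response, or None if there is none."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(response, gold, tol=1e-6):
    """Compare the last numeric answer in the response with the gold answer."""
    predicted = extract_last_number(response)
    return predicted is not None and abs(predicted - gold) < tol
```

Taking the last number rather than the first matters because a step-by-step response usually contains intermediate values before the final answer.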

The study used a temperature of 0 for reproducibility and ran inference through vLLM for models compatible with the library. Closed-source models were queried through the LiteLLM library, which unifies the API call format across the proprietary models evaluated.


Results Revealed: Did LLMs Pass the Test?

The evaluation revealed accuracy drops of up to 13% across certain model families when moving from GSM8k to GSM1k, indicating potential overfitting and limitations in reasoning abilities. There were exceptions, however, particularly among models on the frontier, such as Gemini, GPT, and Claude, which exhibited minimal signs of overfitting.

These exceptions shed light on the nuanced performance of LLMs and the varying degrees of reasoning capabilities across different model families. The experiment results provide valuable insights into the true reasoning abilities of LLMs and their performance on grade school arithmetic benchmarks.
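
To make the notion of an accuracy drop concrete, here is a minimal sketch of how such a gap could be computed from per-question results. It assumes the drop is measured in percentage points between the two benchmarks; the function names are hypothetical and the usage values are made up, not results from the study.

```python
def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)


def overfit_gap(gsm8k_results, gsm1k_results):
    """GSM8k accuracy minus GSM1k accuracy, in percentage points.

    A large positive gap suggests the model does better on the public
    benchmark than on comparable unseen questions.
    """
    return 100 * (accuracy(gsm8k_results) - accuracy(gsm1k_results))


# Made-up illustration: 9/10 correct on the public set, 8/10 on the new set.
public_set = [True] * 9 + [False]
fresh_set = [True] * 8 + [False] * 2
gap = overfit_gap(public_set, fresh_set)
```

With these toy inputs the gap is about 10 percentage points; a gap near zero would indicate little sign of overfitting.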

LLM Overfitting: A Cause for Concern?


Large language models (LLMs) have achieved impressive success on many mathematical reasoning benchmarks. However, there is growing concern that some of this performance reflects dataset contamination rather than true reasoning ability: data closely resembling benchmark questions leaks into the training data. Grade School Math 1000 (GSM1k) was commissioned in response to this concern. It is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning.

The evaluation of leading open- and closed-source LLMs on GSM1k revealed substantial evidence that many models have been contaminated by benchmark data, with accuracy drops of up to 13%. Notably, several families of models, such as the Mistral and Phi families, showed consistent overfitting across almost all model sizes and versions. This raises significant concerns about the true reasoning capabilities of these models and the potential impact of dataset contamination on their performance.

The Future of LLM Development

The findings from the evaluation of LLMs on GSM1k highlight the need for improvements in LLM training and evaluation to ensure the development of more robust AI. One key aspect that needs to be addressed is mitigating data contamination, which has been identified as a significant issue in the field. Methods such as removing data with high n-gram overlap with benchmark data and using embedding similarity to remove contaminated data have been proposed to minimize the likelihood of data contamination.
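
The n-gram overlap idea can be sketched as follows. This is an illustrative filter under stated assumptions, not the procedure used by any particular lab: the word-level tokenization, the n-gram size, the 0.5 threshold, and all function names are choices made for the example.

```python
import re


def ngrams(text, n):
    """Set of word-level n-grams in the lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(document, benchmark_item, n=8):
    """Share of the benchmark item's n-grams that also occur in the document."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(document, n)) / len(bench)


def decontaminate(corpus, benchmark, n=8, threshold=0.5):
    """Keep only documents whose overlap with every benchmark item is below threshold."""
    return [
        doc for doc in corpus
        if all(overlap_ratio(doc, item, n) < threshold for item in benchmark)
    ]
```

A filter like this only catches near-verbatim copies. Paraphrased questions would pass through, which is where the embedding-similarity approach mentioned above comes in.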

Additionally, functional evaluations have been suggested, in which a benchmark is written as functions that can generate an unlimited number of specific evaluation data points. Because no data point ever needs to be used twice, this reduces the risk of data contamination. Both approaches aim to improve the quality and integrity of benchmark datasets, thereby enhancing the reliability of LLM training and evaluation.
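
A functional evaluation can be as simple as a function that fills a problem template with fresh random values and computes the answer alongside it. The sketch below is a toy example; the template, the value ranges, and the function names are invented for illustration and do not come from any published benchmark.

```python
import random


def make_problem(rng):
    """Generate one grade-school style word problem and its exact answer."""
    price = rng.randint(2, 20)
    count = rng.randint(3, 12)
    # The amount paid always exceeds the total cost, so the change is positive.
    paid = price * count + rng.randint(1, 50)
    question = (
        f"A notebook costs ${price}. Sam buys {count} notebooks "
        f"and pays with ${paid}. How much change does Sam get?"
    )
    answer = paid - price * count
    return question, answer


def generate_eval_set(size, seed):
    """Build an evaluation set from a seed; a new seed yields a new set."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(size)]
```

Because the answer is computed from the same sampled values as the question, every generated item is correct by construction. Random sampling can still repeat an instance by chance, so a real benchmark would likely draw from a much larger space of templates and values.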


The study of LLM overfitting on grade school arithmetic benchmarks has revealed important insights into the reasoning abilities of these models. The findings suggest that systematic overfitting exists in certain model families, such as Phi and Mistral, indicating potential limitations in their reasoning capabilities. On the other hand, frontier models, including Gemini, GPT, and Claude, show minimal signs of overfitting, pointing towards stronger reasoning abilities. These observations raise questions about the true reasoning capacity of LLMs and the factors influencing their performance on mathematical reasoning benchmarks.

The study’s key takeaways emphasize the need for rigorous benchmarking and evaluation of LLMs to ensure that progress in enhancing reasoning abilities is accurately measured. Future directions should focus on developing benchmarks that are less susceptible to data contamination and exploring alternative evaluation methods, such as functional evaluations, to mitigate overfitting. Additionally, investigating the training processes of LLMs to understand how they acquire reasoning abilities and generalize to new problems will be crucial in determining the true extent of their reasoning capabilities. Overall, the road ahead involves addressing the challenges posed by overfitting and data contamination while striving to uncover the genuine reasoning capacity of LLMs.
