The race to achieve Artificial General Intelligence (AGI) is heating up, but a fundamental question remains unanswered: what exactly is AGI? This lack of a unified definition is creating deep divisions within the AI community, hindering the development of universally accepted benchmarks and creating challenges for businesses aiming to leverage its potential. While tech giants like OpenAI, Google DeepMind, and Anthropic announce increasingly sophisticated AI models, the absence of clear criteria for evaluating true AGI makes it difficult to assess progress and manage expectations. Is AGI simply matching human performance across a range of tasks, or does it require something more – genuine understanding, adaptability, and ethical reasoning? The answer to this question will shape the future of AI development and its impact on society.
Defining the Elusive Goal: Why AGI Remains a Moving Target
The very definition of Artificial General Intelligence (AGI) is a battleground. Some researchers define AGI as an AI’s ability to perform any intellectual task that a human being can. Others focus on economic impact, internal mechanisms, or even subjective assessments. This lack of consensus has created a significant obstacle in the development of effective testing methodologies.
The core of the problem lies in the nature of intelligence itself. As Geoffrey Hinton, a renowned AI researcher, aptly put it: “We are building alien intelligences.” Comparing machines to humans becomes increasingly challenging as AI systems develop capabilities that diverge from human strengths and weaknesses.
This divergence complicates the creation of universal tests, as AI excels at tasks where humans falter, and vice versa. Capacities that seem innate to humans, such as navigating complex social situations, demonstrating common sense, and exercising ethical judgment, remain significant hurdles for AI systems.

The Turing Test and Beyond: Limitations of Traditional Benchmarks
The quest to measure machine intelligence has a long history, marked by both milestones and limitations. The Turing Test, proposed by Alan Turing in 1950, challenged machines to convincingly imitate human conversation. While groundbreaking, the test has been criticized for focusing on deception rather than genuine understanding. An AI could theoretically pass the Turing Test by simply mimicking human language patterns without possessing any real comprehension.
Later achievements, such as Deep Blue’s victory over Garry Kasparov in chess, demonstrated the power of AI in specific domains. However, these victories didn’t address the broader challenge of general intelligence. Chess-playing AI excels within the confines of the game’s rules, but it lacks the capacity to apply its reasoning skills to other areas.
Even advanced models like GPT-4.5, capable of generating remarkably human-like text, can make elementary errors that no human would commit. For instance, these models might struggle with simple counting tasks, highlighting the difference between superficial imitation and genuine understanding. These shortcomings have spurred the search for benchmarks that cannot be easily circumvented through clever programming or statistical shortcuts.
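To make that failure mode concrete, the sketch below shows the kind of trivial counting probe that trips such models up. The check itself is plain, deterministic Python; the model answers shown are hypothetical examples rather than outputs from any particular system.

```python
# Minimal sketch of a counting probe of the kind that exposes "superficial
# imitation": the ground truth is computed deterministically, and a model's
# free-text answer only passes if it contains the correct count.
def ground_truth_letter_count(word: str, letter: str) -> int:
    """Deterministic answer the model is expected to reproduce."""
    return word.lower().count(letter.lower())

def check_counting_probe(word: str, letter: str, model_answer: str) -> bool:
    """Return True if the model's free-text answer contains the right count."""
    expected = ground_truth_letter_count(word, letter)
    digits = [int(tok) for tok in model_answer.split() if tok.isdigit()]
    return expected in digits

# Hypothetical model replies: the first fails the check, the second passes.
print(check_counting_probe("strawberry", "r", "There are 2 r's in strawberry."))  # False
print(check_counting_probe("strawberry", "r", "The letter r appears 3 times."))   # True
```

A human never needs a harness like this; the fact that frontier models sometimes do is precisely the gap such benchmarks try to quantify.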
The Abstraction and Reasoning Corpus (ARC): A New Frontier in AI Evaluation
Recognizing the limitations of traditional benchmarks, researchers have developed new approaches to evaluate Artificial General Intelligence (AGI) with greater rigor. One notable example is the Abstraction and Reasoning Corpus (ARC), created by François Chollet.
The ARC test focuses on an AI’s ability to learn new skills from limited examples. It presents visual puzzles that require the AI to infer abstract rules and apply them to novel situations. These puzzles, seemingly trivial for humans, pose a significant challenge for AI systems.
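To give a sense of what these puzzles look like under the hood, here is a minimal sketch of an ARC-style task in the train/test layout the public dataset uses, together with a trivial candidate rule checked against the training pairs. The task and the flip_horizontal rule are illustrative inventions, not items from the benchmark itself.

```python
# Illustrative ARC-style task: each grid is a list of rows of small integers
# (colour codes), with a few training input/output pairs and a held-out test input.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[0, 3], [0, 0]]}],
}

def flip_horizontal(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def rule_fits_training(rule, task):
    """A rule is only plausible if it reproduces every training output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

if rule_fits_training(flip_horizontal, task):
    # Apply the inferred rule to the held-out test input.
    print(flip_horizontal(task["test"][0]["input"]))  # [[3, 0], [0, 0]]
```

The hard part, of course, is not checking a rule but inferring it from two or three examples; humans do this almost instantly, while machines must search an enormous space of possible transformations.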
While humans typically solve these puzzles with ease, machines often struggle. OpenAI achieved a notable milestone when one of its models surpassed human-level performance on ARC. However, this achievement came at a considerable computational cost, raising questions about the scalability and efficiency of the approach.
Chollet and the ARC Prize Foundation have since launched a more challenging version of the test, ARC-AGI-2, backed by a $1 million prize for teams whose AI systems achieve an accuracy rate of over 85% under stringent conditions. Currently, the highest-performing AI systems achieve only about 16% accuracy, compared to roughly 60% for human test-takers, demonstrating the substantial gap that remains in abstract reasoning capabilities. The test highlights the difference between narrow AI, which excels at specific tasks, and AGI, which should be able to generalize knowledge and apply it to new situations.
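As a rough illustration of how that headline number relates to the prize bar, the sketch below computes accuracy as the fraction of tasks solved, assuming one pass/fail result per task; the real competition adds further constraints (the stringent conditions mentioned above), so this is a simplification.

```python
# Sketch of headline accuracy versus the prize threshold, assuming one boolean
# pass/fail result per evaluation task.
PRIZE_THRESHOLD = 0.85  # grand-prize bar cited for ARC-AGI-2

def accuracy(results: list[bool]) -> float:
    """Fraction of evaluation tasks solved."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical run: 19 of 120 tasks solved, roughly the ~16% figure quoted above.
results = [True] * 19 + [False] * 101
score = accuracy(results)
print(f"accuracy = {score:.1%}, qualifies = {score >= PRIZE_THRESHOLD}")
# accuracy = 15.8%, qualifies = False
```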
Critiques and Evolution of Benchmarking: Beyond Abstract Reasoning
The ARC test, while influential, has also faced criticism. Jiaxuan You, from the University of Illinois, acknowledges its value as a theoretical benchmark but cautions that it doesn’t fully represent the complexities of the real world or encompass social reasoning abilities.
Melanie Mitchell, of the Santa Fe Institute, recognizes its strengths in evaluating the ability to extract rules from limited examples. However, she emphasizes that it “does not reflect what people understand by general intelligence.” This highlights the subjective nature of intelligence and the difficulty of creating a single test that captures all its facets.
In response to these critiques, Chollet is developing a new version of ARC that incorporates tasks inspired by mini-games, broadening the range of skills being evaluated. This iterative approach reflects the ongoing effort to refine benchmarks and better capture the multifaceted nature of AGI.
Other tests have emerged to address different aspects of AGI. General-Bench, for example, combines text, images, video, audio, and 3D data to assess performance in areas such as recognition, reasoning, creativity, and ethical judgment.
No existing system currently excels across all these dimensions in an integrated manner. Dreamer, an algorithm developed by Google DeepMind, has demonstrated proficiency in over 150 virtual tasks, but its ability to handle the unpredictability of the physical world remains unclear.
The Tong test takes a different approach, proposing the assignment of random tasks to “virtual people” to assess not only their comprehension and skills but also their values and adaptability. The authors of this test argue that a comprehensive evaluation of AGI must encompass autonomous exploration, alignment with human values, causal understanding, physical control, and a continuous stream of unpredictable tasks. This highlights the need for AI systems to be both intelligent and ethical.
The Physical Embodiment Debate: Does AGI Require a Body?
A fundamental debate persists: must AGI demonstrate physical capabilities, or are cognitive abilities sufficient? A study by Google DeepMind argued that software alone is sufficient for AGI, while Melanie Mitchell maintains that evaluating an AI’s ability to complete real-world tasks and respond to unexpected problems is essential.
Jeff Clune, from the University of British Columbia, suggests that measuring observable performance isn’t enough: an AI’s internal processes should also be examined, because these systems tend to find ingenious but unreliable shortcuts. This points to the importance of transparency and explainability in AI systems. Understanding how an AI arrives at a decision is crucial for ensuring its reliability and trustworthiness.
“The real test for AI is its impact on the real world,” Clune asserted. For him, the automation of labor and the generation of scientific discoveries provide more reliable indicators than any benchmark. This perspective emphasizes the practical value of AGI and its potential to solve real-world problems.

The Ever-Evolving Definition: A Moving Target
Despite progress and the emergence of new tests, achieving a consensus on the definition of AGI and how to demonstrate its existence remains elusive. Anna Ivanova, a psychologist at Georgia Tech, emphasizes that societal perceptions of intelligence and what is considered valuable are constantly evolving.
The detailed report from IEEE Spectrum concluded that the term AGI serves as a useful shorthand for expressing aspirations and fears. However, it always requires precise clarification and a specific benchmark. This highlights the importance of context and clear communication when discussing AGI.
Ultimately, the pursuit of AGI is a journey of continuous discovery, pushing the boundaries of what’s possible with artificial intelligence. While the destination remains uncertain, the challenges and debates surrounding AGI are driving innovation and shaping the future of technology. As we strive to create more intelligent machines, it’s crucial to maintain a clear understanding of our goals, limitations, and the ethical implications of our work.
The lack of a universally accepted definition of AGI might seem like a setback, but it also represents an opportunity. It encourages us to think critically about the nature of intelligence, to explore different approaches to AI development, and to consider the broader societal implications of creating truly intelligent machines. As AI continues to evolve, so too must our understanding of what it means to be intelligent.