This section provides a framework and background information for assessing the abilities of current AI technologies in performing human tasks, and explores foreseeable improvements based on current research activity.
AI capabilities are commonly measured on academic ‘benchmarks’ designed specifically for AI systems, most of which are adapted from tests applied to humans. Examples of common benchmarks include:
- MMLU, a multiple-choice test of knowledge and reasoning across 57 academic subjects;
- HellaSwag, a test of commonsense reasoning about everyday situations;
- GSM8K, a set of grade-school math word problems;
- HumanEval, a suite of programming problems used to assess code generation;
- TruthfulQA, a test of whether models reproduce common human misconceptions.
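As a rough illustration of how such benchmarks are typically scored, the Python sketch below computes accuracy on a multiple-choice test. The `ask_model` callable and the two sample items are hypothetical placeholders, not drawn from any real benchmark.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `ask_model` is a hypothetical stand-in for any LLM API call;
# the sample items below are illustrative only.

from typing import Callable

# Each item: question text, answer options, index of the correct option.
ITEMS = [
    ("Which gas makes up most of Earth's atmosphere?",
     ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"], 1),
    ("What is 12 * 12?",
     ["124", "142", "144", "148"], 2),
]

def accuracy(ask_model: Callable[[str, list[str]], int]) -> float:
    """Fraction of items where the model picks the correct option."""
    correct = 0
    for question, options, answer_idx in ITEMS:
        if ask_model(question, options) == answer_idx:
            correct += 1
    return correct / len(ITEMS)

if __name__ == "__main__":
    # Trivial baseline "model" that always picks the first option.
    print(f"accuracy = {accuracy(lambda q, opts: 0):.2f}")
```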
In addition, advanced AI models, notably LLMs, are now commonly tested on professional exams designed for humans, such as bar exams and Advanced Placement exams in environmental science, history, psychology, economics, mathematics, and physics.
Results of GPT-4 and GPT-3.5 by OpenAI on different exams designed for humans (source: OpenAI 2023).
Results of GPT-4 and GPT-3.5 on standard benchmarks (source: OpenAI 2023).
Results of Anthropic models on standard benchmarks (source: Anthropic 2023).
Comparison of OpenAI models with an Anthropic model on TruthfulQA, an AI safety benchmark (source: OpenAI 2023).
Various organizations have developed or are developing taxonomies that map abilities common to humans and AI. This field is likely to expand significantly over the next few years.
Building on these benchmarks and ability taxonomies, efforts are underway to systematically map occupational tasks to AI benchmarks, making it possible to monitor in near real time how progress in AI systems affects the future of jobs.
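As a minimal sketch of this mapping idea, the example below links occupational tasks to benchmark scores and averages them as a crude proxy for task-level AI exposure. All task names, benchmark names, and scores are invented for illustration, under the assumption that benchmark results can be normalized to a common 0–1 scale.

```python
# Illustrative sketch: mapping occupational tasks to AI benchmarks so that
# rising benchmark scores can be read as a signal of task-level exposure.
# Task names, benchmark names, and scores are hypothetical placeholders.

TASK_TO_BENCHMARKS = {
    "Draft routine legal documents": ["bar_exam", "mmlu_law"],
    "Summarize scientific articles": ["mmlu_science", "reading_comprehension"],
}

# Latest published scores, normalized to 0-1 (placeholder values).
BENCHMARK_SCORES = {
    "bar_exam": 0.75,
    "mmlu_law": 0.68,
    "mmlu_science": 0.80,
    "reading_comprehension": 0.72,
}

def task_exposure(task: str) -> float:
    """Average score on the benchmarks linked to a task; a crude proxy
    for how well current AI systems cover that task."""
    scores = [BENCHMARK_SCORES[b] for b in TASK_TO_BENCHMARKS[task]]
    return sum(scores) / len(scores)

for task in TASK_TO_BENCHMARKS:
    print(f"{task}: {task_exposure(task):.2f}")
```

Re-running such a mapping whenever new benchmark results are published would allow occupational exposure estimates to be updated continuously rather than through one-off studies.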
There are, however, many hurdles (not discussed comprehensively here). For instance, and counter-intuitively, the limitations of AI systems often stem from abilities that are common among most humans yet never formally evaluated in educational tests, such as extracting data from a chart, a task any person with standard reading, visual, and cognitive abilities can perform.