News

With AI models clobbering every benchmark, it's time for human evaluation
The latest frontier in AI research is having more humans in the loop assessing just how good the models are.
OpenAI announced that its tuned o3 models have broken the ARC-AGI benchmark, a critical test of human-like reasoning ability for AI systems.
OpenAI’s GPT-4.5 and Meta’s Llama-3.1 models have passed the Turing Test, a benchmark proposed by Alan Turing in the 1950s to assess whether machines can exhibit intelligent behaviour ...