As large language models (LLMs) gain momentum worldwide, there’s a growing need for reliable ways to measure their performance. Benchmarks that evaluate LLM outputs allow developers to track ...
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
A group of researchers has developed a new benchmark, dubbed LiveBench, to ease the task of evaluating large language models’ question-answering capabilities. The researchers released the benchmark on ...
SAN FRANCISCO--(BUSINESS WIRE)--MLCommons today released AILuminate, a first-of-its-kind safety test for large language models (LLMs). The v1.0 benchmark – which provides a series of safety grades for ...
As more AI models show evidence of being able to deceive their creators, researchers from the Center for AI Safety and Scale AI have developed a first-of-its-kind lie detector. On Wednesday, the ...
OpenAI wants to retire the leading AI coding benchmark—and the reasons reveal a deeper problem with how the whole industry measures itself.