OpenAI Launches GeneBench-Pro to Test AI Agents on Biology Research

June 30, 2026 at 13:10 EDT

On June 30, 2026, OpenAI released GeneBench-Pro, a research-level benchmark that measures how well AI agents handle messy biological data, choose the right analysis path, and make the judgment calls that real computational research depends on. Even the latest GPT-5.6 Sol Pro scored just 31.5%, which shows how far current models remain from research-grade judgment in computational biology. According to the official announcement, the benchmark comprises 129 evaluation tasks spanning 10 major domains and 21 subdomains. It is centered on genetics and covers functional genomics, spatial transcriptomics, proteomics, epigenomics, and cancer somatic genomics.

June 30, 2026 · OpenAI

GeneBench-Pro: AI Still Scores ~30% on Real Computational Biology

A 129-problem benchmark of messy biological data and judgment-heavy, multi-stage analysis. Even the strongest models clear less than a third of the problems, evidence that real-world bioinformatics remains hard.

31.5%: best score, GPT-5.6 Sol (Pro) on GeneBench-Pro

129: multi-stage problems on synthetic data

60%+: share of problems where Pro models score under 20%

Pass rate by model and setting (chart, scale 0–40%)

GPT-5.5 Pro, GeneBench: 33.2%
GPT-5.6 Sol (Pro), GeneBench-Pro: 31.5%
5.6 Sol (max reason), GeneBench-Pro: 28.7%
GPT-5.5 (xhigh), GeneBench: 25.0%
Gemini 3.1 Pro, GeneBench: 11.2%

Rapid climb in months

Scores rose from under 5% with early GPT-5 to 31.5% with GPT-5.6 Sol Pro.

What makes it hard

The problems are built on real-world practice, such as UK Biobank cohorts and GWAS, with obstacles that cascade if a model picks the wrong path mid-stream.

Obstacles include measurement error, selection bias, confounding, QC failures, and multi-stage inferential forks.

The promise

A problem takes human experts 20–40 hours; AI inference costs just a few dollars. The benchmark captures judgment-heavy work that prior benchmarks miss.

The limit

A "noticing vs. acting" gap persists: models spot a diagnostic issue but fail to act on it. OpenAI warns that the benchmark may saturate by year-end.
