Dr. Enrique Aguilar · Clinical × Compute

Written by Dr. Enrique Aguilar · June 14, 2026

General-Purpose Beats Purpose-Built: a Nature Medicine head-to-head on clinical AI

GPT-5.2, Gemini 3.1 and Claude outscored OpenEvidence and UpToDate's own AI on medical knowledge, clinician alignment, and real physician queries. On live clinical questions the specialist tools barely matched a free Google AI summary. A note on why the medical label may be selling certainty, not accuracy.

Update, 14 June 2026. Within hours of publication, OpenEvidence posted a detailed rebuttal: an alleged undisclosed conflict of interest, plus screenshots of a frontier model reciting benchmark answers verbatim. It lands real hits. My full read is at the end of this note, under The rebuttal. The short version: it guts two of the three benchmarks, and pointedly leaves the third one alone.

The assumption this paper breaks

There is a quiet consensus in medicine: general-purpose chatbots are fine for drafting a discharge summary, but for real clinical reasoning you want a medical model: something trained on the literature, wrapped in retrieval, sold with a stethoscope on the logo. OpenEvidence and Wolters Kluwer's UpToDate Expert AI are built on exactly that promise.

A new Nature Medicine Brief Communication from NYU Langone tested the promise instead of assuming it. The result is uncomfortable for the specialist camp: the general-purpose frontier models won every stage.

What they actually measured

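Two commercial clinical AI tools (OpenEvidence, UpToDate Expert AI) against three frontier LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6), with Google Search's auto-enabled AI Overview thrown in as a real-world control. Three stages:

Knowledge: MedQA, a public set of USMLE-style multiple-choice questions.

Alignment: HealthBench, OpenAI's public rubric-graded benchmark.

Real clinical queries (RCQ): 100 real, private physician queries from a single institution, scored on a 1–4 scale.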

That third benchmark is the one that matters. MedQA and HealthBench are public and may have leaked into training; the real-query set is private, which makes it the hardest part to game.

The numbers

Knowledge (MedQA): Gemini 97.4%, GPT 94.2%, Claude 90.2%, all ahead of OpenEvidence (89.6%) and UpToDate (88.4%).

Alignment (HealthBench): GPT 88.0, Gemini 79.3, Claude 77.0, versus OpenEvidence 62.6 and UpToDate 61.3. Not a rounding error; a chasm.

Real clinical queries (1–4 scale): the three frontier models formed the top tier (3.52–3.62). The specialist tools landed in a second tier (OpenEvidence 3.24, UpToDate 3.17), statistically indistinguishable from Google's free AI Overview (3.27).

Read that last line again. On real questions physicians actually ask, the purpose-built medical products did not beat the summary box at the top of a Google search.

A few more telling details: UpToDate's tool refused 19% of queries (the others, 1–6%). OpenEvidence scored lowest on clarity: its weakness was communication, not knowledge. And critically, no model produced more harmful content or hallucinations than the others. The specialists weren't safer. They were just narrower.
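For readers who want the gaps rather than the raw scores, here is a minimal sketch that does the subtraction. It uses only the figures quoted in this note; the function name `tier_gaps` and the tier groupings are my own labels, not the paper's.

```python
# Scores exactly as quoted in this note.
medqa = {"Gemini": 97.4, "GPT": 94.2, "Claude": 90.2,
         "OpenEvidence": 89.6, "UpToDate": 88.4}
healthbench = {"GPT": 88.0, "Gemini": 79.3, "Claude": 77.0,
               "OpenEvidence": 62.6, "UpToDate": 61.3}

FRONTIER = ("GPT", "Gemini", "Claude")
SPECIALIST = ("OpenEvidence", "UpToDate")


def tier_gaps(scores):
    """Return (best frontier - best specialist, worst frontier - best specialist)."""
    frontier = [scores[name] for name in FRONTIER]
    specialist = [scores[name] for name in SPECIALIST]
    best_gap = round(max(frontier) - max(specialist), 1)
    narrowest_gap = round(min(frontier) - max(specialist), 1)
    return best_gap, narrowest_gap


for label, scores in (("MedQA", medqa), ("HealthBench", healthbench)):
    best_gap, narrowest_gap = tier_gaps(scores)
    print(f"{label}: best vs best {best_gap:+.1f}, narrowest {narrowest_gap:+.1f}")

# Real clinical queries: best specialist tool minus Google's AI Overview.
rcq_gap = round(3.24 - 3.27, 2)
print(f"RCQ: OpenEvidence vs Google AI Overview {rcq_gap:+.2f}")
```

Run as written, this prints a MedQA gap of +7.8 points at the top and only +0.6 at the narrowest (Claude over OpenEvidence), a HealthBench gap of +25.4 and +14.4, and an RCQ difference of -0.03 between OpenEvidence and the free search summary. The MedQA spread is much tighter than the headline suggests; the HealthBench spread is not.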

Why the specialists lose

The authors' best explanation is the most interesting part. Retrieval-augmented generation, the RAG layer that is supposed to be the specialist's edge, may actively hurt when irrelevant material gets retrieved and poorly integrated by the base model. Meanwhile the frontier labs win on the boring fundamentals: bigger training corpora, faster iteration cycles, and heavier alignment work. Scale and cross-domain reasoning are beating domain-specific tuning.
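To make the retrieval failure mode concrete, here is a toy sketch of the obvious countermeasure: drop retrieved passages that barely relate to the question before they reach the model. This is my illustration, not the paper's method and not how either product works. The token-overlap score, the `min_score` threshold, and the sample passages are all assumptions chosen to keep the example self-contained; a real system would use a proper retriever and reranker.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real retriever's representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage):
    """Fraction of query tokens that also appear in the passage (0.0 to 1.0)."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0


def build_prompt(query, passages, min_score=0.5):
    """Keep only passages that clear the threshold; fall back to the bare question."""
    kept = [p for p in passages if overlap_score(query, p) >= min_score]
    if not kept:
        return f"Question: {query}"
    context = "\n".join(f"- {p}" for p in kept)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical inputs, for illustration only.
QUERY = "first-line treatment for uncomplicated cystitis"
RELEVANT = "Guideline excerpt on first-line treatment for uncomplicated cystitis."
IRRELEVANT = "Billing codes for outpatient orthopaedic procedures."

prompt = build_prompt(QUERY, [RELEVANT, IRRELEVANT])
print(prompt)
```

The design choice worth noticing is the fallback: when nothing relevant is retrieved, the sketch sends the bare question rather than padding the prompt with noise. That is the scenario the authors suggest may be hurting the specialist tools.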

There's a second, sharper point buried in the discussion: industry-built benchmarks tend to favor their builders. HealthBench is an OpenAI artifact; GPT both competed and helped judge. Which is the whole argument for independent, real-world evaluation, and the whole reason the RCQ stage carries the weight here.

The part that ages this paper before you finish reading it

The models tested (Claude Opus 4.6, GPT-5.2, Gemini 3.1 Pro) are already superseded. By the time you read this, the frontier has moved to the next Claude, the next GPT, and models like Fable. The general-purpose tier doesn't just lead; it re-releases every few months. A medical-specialist model shipped on an annual cadence is racing a treadmill that speeds up.

Which leads to the thing I keep turning over. If a "medical AI" can't out-reason the general model it's built on, what exactly are you buying when you buy the medical one? Often: a certainty premium. The stethoscope on the logo, the citations, the institutional branding: they manufacture trust faster than they manufacture accuracy. The risk isn't that specialist tools are bad. It's that the medical label makes clinicians trust an answer because of the brand rather than because it was measured to be better. That's a bias, and it's one we'd flag instantly in a drug trial.

Caveats, because honesty is the point

This is a Brief Communication, not a definitive verdict. The real-query benchmark comes from a single institution, with n = 100. The clinical tools were accessed through their browser UIs, not APIs, which may have shaped output. The public benchmarks carry contamination risk. And the authors are explicit that deeply subspecialized tasks may still favor curated retrieval and domain tuning. The generalist advantage is not a law of nature, but a snapshot of a fast-moving field.

But the snapshot is clear enough to act on: don't buy the medical label on faith. Ask for the independent, real-world evaluation. If a tool can't show it beats the frontier model underneath it, and the free search box beside it, then what you're paying for is confidence, not capability.

The rebuttal: OpenEvidence responds (14 June 2026)

Hours after the paper appeared, OpenEvidence published a point-by-point rebuttal. It is worth taking seriously, because part of it is right.

The contamination receipts. OpenEvidence pasted benchmark questions into a frontier model (Google's Gemini) and asked it to name the source. The model did not just identify the dataset; it recited the answer. For MedQA's most famous item (the orthopaedic resident who cuts a flexor tendon and is told to hide it), the model called it "the canonical first example used in USMLE-style benchmarks" and gave the letter answer outright. The implication is hard to dodge: a 97% on MedQA can be recall, not reasoning. And contamination flatters a raw model that answers from memory more than a retrieval tool like OpenEvidence that reformulates from a knowledge base. So part of the headline gap (Gemini 97.4% versus OpenEvidence 89.6%) may be an artifact rather than a capability difference.
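OpenEvidence's check was to ask the model to name the source. A related but different probe, sketched below, shows the model only the first part of a benchmark item and measures how closely its continuation matches the withheld remainder. This is a rough sketch under stated assumptions: `complete` is a hypothetical callable standing in for whatever model API you use, the 0.8 threshold is arbitrary, and the item text is a placeholder rather than a real benchmark question.

```python
from difflib import SequenceMatcher


def recall_ratio(expected_tail, model_output):
    """Similarity (0.0 to 1.0) between the withheld text and the model's continuation."""
    a = " ".join(expected_tail.lower().split())
    b = " ".join(model_output.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def probe_item(item_text, complete, visible_fraction=0.5, threshold=0.8):
    """Show the model the start of an item and test whether it reproduces the rest."""
    words = item_text.split()
    cut = max(1, int(len(words) * visible_fraction))
    prefix, tail = " ".join(words[:cut]), " ".join(words[cut:])
    ratio = recall_ratio(tail, complete(prefix))
    return {"ratio": ratio, "flagged": ratio >= threshold}


# Placeholder item and two stub "models", for illustration only.
ITEM = ("A placeholder benchmark stem with enough words to split into "
        "a visible half and a hidden half for the probe")


def memorizer(prefix):
    """Stub that has 'seen' the item: returns the exact withheld remainder."""
    return ITEM[len(prefix):].strip()


def stranger(prefix):
    """Stub that has not seen the item."""
    return "I do not recognize this text."


print(probe_item(ITEM, memorizer))
print(probe_item(ITEM, stranger))
```

A high ratio across many items suggests memorization; a single hit proves little. The probe also says nothing about whether the model could have reasoned its way to the answer anyway, which is the limit of any contamination receipt.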

The HealthBench critique. They showed a case where OpenEvidence scored about 0.41 against roughly 0.59 for the frontier models, then pointed at the rubric: points for saying "I am managing a patient with…", points for a specific subject heading, and a deduction for a "long disclaimer." OpenEvidence had written a clean referral letter and lost points on phrasing, not on medicine. HealthBench was built by OpenAI, and GPT topping an OpenAI-authored style rubric is precisely the "industry benchmark favors its creator" problem the authors themselves had flagged.

The peer-review record. A reviewer had already raised exactly this. OpenEvidence quotes Reviewer 2 verbatim: with no decontamination analysis and no held-out set, "claims of superiority lack epistemic grounding." That lends weight to their account that the private RCQ benchmark was added later, to shore up a submission whose public benchmarks a reviewer had found insufficient.

Where they are right, and where they stop. On MedQA and HealthBench, OpenEvidence wins the argument; those two numbers should not be used to rank systems. But notice what every screenshot avoids: not one of them touches RCQ, the 100 real and private queries the authors call their primary evidence, and the single benchmark where OpenEvidence tied the free Google AI Overview (3.24 versus 3.27). OpenEvidence cannot refute RCQ on the merits, so it attacks a different axis: the benchmark is not public, so no one can audit it. A fair point, but a different one.

The conflict cuts both ways. The authors' hospital runs its own general-purpose AI, so the paper's conclusion aligns with its own institutional bet, and that deserved a clearer disclosure. But OpenEvidence is a roughly $12B company defending its flagship product, and its specific story (that the authors asked for API access to build a competitor and were refused) is its own account, still unverified.

So where does this leave the piece above? Weaker on the headline numbers, intact on the thesis. Discount the MedQA and HealthBench figures. The clean signal is RCQ, which is real but unauditable. OpenEvidence demonstrated that the paper's public evidence is thin. It did not demonstrate that its product is better. Those are two different claims, and the distance between them is the whole story. The one thing both camps now agree on is the thing the paper argued for in the first place: evaluation that is independent, real-world, and open, whoever it embarrasses.

Reference

Vishwanath K, Alyakin A, Ghosh M, et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nat Med. 2026. doi.org/10.1038/s41591-026-04431-5

© 2026 · Enrique Aguilar Martínez, M.D. · Monterrey, NL · Mexico