Research tool. Not intended for clinical decision-making, diagnosis, or treatment. Outputs surface and synthesize published literature; they do not constitute medical advice. Verify against primary sources before any decision.
Evidence engines

Honest literature synthesis on hard research questions.

Caliper and Discovery surface and synthesize published medical literature with verifiable citations. Both are evaluated against the same honesty bar: zero fabricated PMIDs, zero invented evidence, explicit refusal to claim what the literature does not show.

Two engines, one evidence backbone.

Both engines retrieve from the same indexed corpus of ~29.6M medical documents (the SPECTER2-embedded medical subset of a 48M+ document corpus) using the same hybrid retrieval chain: lexical search, dense-vector search, and a cross-encoder rerank. The difference is in the synthesis layer and the output shape, each built for a distinct audience.
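
The retrieval chain described above can be sketched roughly as follows. This is a toy illustration, not the engines' implementation: token overlap stands in for lexical search, a bag-of-words cosine stands in for SPECTER2 dense retrieval, reciprocal rank fusion is an assumed way of merging the two candidate lists (the page does not say how they are combined), and `score_pair` is a hypothetical stand-in for a cross-encoder. All function names and the three-document corpus are invented for the example.

```python
from collections import Counter
import math

# Toy corpus: IDs and texts are invented for illustration only.
DOCS = {
    "d1": "metformin survival pancreatic cancer retrospective cohort",
    "d2": "apixaban bleeding risk advanced chronic kidney disease",
    "d3": "metformin randomized trial pancreatic cancer no survival benefit",
}

def tokens(text):
    return text.lower().split()

def embed(text):
    """Bag-of-words counts; a stand-in for a real dense embedding."""
    return Counter(tokens(text))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexical_rank(query, docs):
    """Rank by number of shared tokens; a stand-in for a lexical scorer."""
    q = set(tokens(query))
    scores = {d: len(q & set(tokens(t))) for d, t in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

def dense_rank(query, docs):
    """Rank by cosine similarity between query and document vectors."""
    q = embed(query)
    scores = {d: cosine(q, embed(t)) for d, t in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

def rrf(rankings, k=60):
    """Reciprocal rank fusion (assumed fusion step, not from the source)."""
    fused = Counter()
    for ranking in rankings:
        for pos, d in enumerate(ranking, start=1):
            fused[d] += 1.0 / (k + pos)
    return sorted(fused, key=lambda d: (-fused[d], d))

def hybrid_search(query, docs, score_pair, top_k=2):
    """Fuse lexical and dense rankings, then rerank the top candidates."""
    candidates = rrf([lexical_rank(query, docs), dense_rank(query, docs)])[:top_k]
    return sorted(candidates, key=lambda d: -score_pair(query, docs[d]))

def toy_pair_score(query, text):
    """Hypothetical cross-encoder stand-in: scores a (query, document) pair."""
    return cosine(embed(query), embed(text))

print(hybrid_search("metformin pancreatic cancer randomized trial", DOCS, toy_pair_score))
# prints ['d3', 'd1']
```

The point of the two-stage shape is that the cheap rankers narrow the corpus to a short candidate list, and only that list is passed to the more expensive pairwise scorer.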

Caliper

Audit-grade evidence synthesis

Structured findings with primary-venue grounding, confidence statements, and verifiable citations — on hard research questions, in under a minute.

  • Output: Finding · Confidence · Strongest point · Key uncertainty
  • Primary-venue grounding (NEJM, Lancet, JAMA, JACC, Circulation, EHJ, Nature Medicine)
  • Latency: 25 – 55 seconds per question
  • Built for: regulatory-evidence teams, medical-affairs leads, research directors
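
One way to picture the four-part Caliper output listed above is as a small record type. This is a hypothetical sketch: the class name, field names, and `render` method are assumptions made for illustration, and the placeholder strings are not real findings.

```python
from dataclasses import dataclass, field

@dataclass
class CaliperFinding:
    """Hypothetical container mirroring the four output parts named above."""
    finding: str
    confidence: str
    strongest_point: str
    key_uncertainty: str
    pmids: list = field(default_factory=list)  # citations a reader can check

    def render(self):
        lines = [
            f"Finding: {self.finding}",
            f"Confidence: {self.confidence}",
            f"Strongest point: {self.strongest_point}",
            f"Key uncertainty: {self.key_uncertainty}",
        ]
        if self.pmids:
            lines.append("Citations: " + ", ".join(f"PMID {p}" for p in self.pmids))
        return "\n".join(lines)

# Placeholder content only; nothing here is an actual engine output.
example = CaliperFinding(
    finding="<one-sentence finding>",
    confidence="<confidence statement>",
    strongest_point="<strongest supporting point>",
    key_uncertainty="<main open question>",
    pmids=["26067687"],
)
print(example.render())
```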

Discovery

Research-grade literature synthesis

Prose synthesis with mechanism-level context, inline citations, and methodology-pending transparency — for literature reviews and research workflows.

  • Output: prose synthesis with numbered inline citations
  • Methodology badge when validation gates (literature, CI, confounding, FDR, replication) are not yet asserted
  • Latency: 8 – 25 minutes per question (deeper synthesis)
  • Built for: medical-affairs research teams, evidence-review committees, literature reviewers
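
The methodology badge described above could work along these lines. This is a minimal sketch under stated assumptions: the gate names come from the list above, but the function name, the input format (a mapping from gate name to whether it has been asserted), and the badge wording are all hypothetical.

```python
# Gate names as listed above; the order is kept for stable output.
GATES = ("literature", "CI", "confounding", "FDR", "replication")

def methodology_badge(asserted):
    """Return a badge string naming the pending gates, or None if all are asserted.

    `asserted` maps gate name -> bool; a missing gate counts as not asserted.
    """
    pending = [g for g in GATES if not asserted.get(g, False)]
    if not pending:
        return None
    return "Methodology pending: " + ", ".join(pending)

print(methodology_badge({"literature": True, "CI": True}))
# prints Methodology pending: confounding, FDR, replication
```

Treating a missing gate as pending is a deliberately conservative choice in this sketch: the badge disappears only when every gate has been explicitly asserted.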

The evaluation

Five fresh research questions. Zero hallucinations.

Internal validation, May 2026. Each engine was run independently on five pharma-grade questions drawn from distinct medical domains. Every cited PMID was verified against the source corpus.

  • 5 fresh research questions from distinct domains (sepsis HAT protocol; GLP-1 cardiovascular outcomes in non-diabetic obesity; DOAC in advanced CKD; niclosamide repurposing; metformin in pancreatic adenocarcinoma)
  • 70+ unique PMIDs verified real across the evaluation
  • 100% of cited PMIDs verified real in the source corpus
  • 0 hallucinated citations, fabricated trial names, or invented evidence
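
The citation check behind these numbers can be illustrated with a short sketch. It assumes citations appear in the output as "PMID" followed by digits and that the corpus exposes a set of known PMIDs; the function names are hypothetical and this is not the evaluation harness itself. The three corpus PMIDs are the ones cited later on this page, and `00000000` is a deliberately invalid placeholder used to show a failed check.

```python
import re

# Assumed citation format: "PMID" followed by up to eight digits.
PMID_PATTERN = re.compile(r"PMID[:\s]*(\d{1,8})")

def extract_pmids(text):
    """Return cited PMIDs in order of first appearance, without duplicates."""
    seen = []
    for pmid in PMID_PATTERN.findall(text):
        if pmid not in seen:
            seen.append(pmid)
    return seen

def verify_citations(text, corpus_pmids):
    """Check every cited PMID against the set of PMIDs present in the corpus."""
    cited = extract_pmids(text)
    missing = [p for p in cited if p not in corpus_pmids]
    # Note: an output with no citations passes vacuously here; a stricter
    # check would probably flag that case separately.
    return {"cited": cited, "missing": missing, "all_verified": not missing}

corpus = {"26067687", "26474429", "27069086"}
report = verify_citations("See PMID 26067687 and PMID 00000000.", corpus)
print(report["missing"])
# prints ['00000000']
```

A check like this only confirms that a cited identifier exists in the corpus. It does not confirm that the cited paper supports the sentence it is attached to, which still requires reading the paper.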

The questions were chosen to stress-test honest behavior, not to flatter the system: drug-trial contradictions (HAT protocol), recent paradigm shifts (SELECT trial), RCT-real-world gaps (DOAC in eGFR < 30), obscure repurposing signals (niclosamide for MDR-TB), and evidence discrepancies (metformin retrospective vs prospective in pancreatic cancer).

What the engines do not do: they do not generate novel scientific findings. They surface, organize, and explain what is already in the published literature, with citations a reader can verify. They refuse to fill gaps with model knowledge when the literature is silent.

From the evaluation

What the discipline looks like in practice.

Two outputs from the same evaluation: one where the engine surfaced and explained a known confounder, and one — the more important example — where it refused to claim evidence that does not exist.

Companion example · Surfacing a known confounder

The engine correctly explained the confounding pattern in the literature.

In pancreatic ductal adenocarcinoma, does adjunctive metformin therapy improve overall survival, and how do retrospective observational findings compare to prospective interventional trial results?

Discovery synthesis (verbatim):

“The retrospective survival advantage attributed to adjunctive metformin in pancreatic ductal adenocarcinoma is fundamentally distorted by methodological confounders, particularly concomitant statin use and variable glycemic control [1], [3]. While unadjusted observational cohorts consistently link metformin exposure to prolonged overall survival [5], this association vanishes in prospective interventional trials where the drug’s unfavorable pharmacokinetic profile obscures antitumor efficacy [2]. Although pathobiological models propose AMPK/mTOR-mediated antiproliferative effects as a theoretical rationale [3], the systematic divergence between uncontrolled registry data and rigorously monitored clinical trials confirms that metformin does not deliver validated survival benefits for PDAC [4].”

What this shows. The statin co-use confounder is a known but underdiscussed pattern in the literature. The engine surfaced it from the cited papers rather than inventing it, and explained how it accounts for the retrospective–prospective discrepancy. The discipline is in the framing: the engine reads what is in the literature and reports it, and it does not claim new mechanistic discovery.

Cited evidence (all verified): PMID 26067687 Lancet Oncology 2015 (Reni RCT) · PMID 26474429 Pancreas 2016 (statin + metformin) · PMID 27069086 J Clin Oncol 2016 (Cautionary Lesson).

Who these engines are for — and who they are not for.

Honest scope. Honest exclusions.

Built for

  • Pharma research leads evaluating evidence on a target, mechanism, or trial design
  • Medical-affairs teams preparing publications or scientific responses
  • Regulatory-evidence teams compiling supporting literature for submissions
  • Research-grade literature reviewers replacing analyst hours on hard questions

Not for

  • Clinical care decisions or treatment recommendations
  • Diagnostic use of any kind
  • Patient-facing medical advice
  • Replacing expert judgment — the engines support analysts, they do not substitute for them
  • A substitute for reading the primary literature — outputs cite published papers; verify the cited papers themselves before relying on conclusions

Want to try this on your own questions?

We deploy the engines against your evidence corpus. You evaluate the output on questions you already know the right answer to. You tell us what works and what does not. The same honesty discipline applies to the pilot itself — we report what fired and what did not, and we do not soften the result.

Start a pilot conversation →