Surg-R1

A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

7B Parameters
320K CoT Pairs
13 Datasets
64.9% Arena Score
arXiv Paper · Dataset (coming soon) · Code (coming soon)
Surg-R1: Hierarchical Chain-of-Thought Reasoning

Abstract

Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement.

Evaluated on SurgBench, which comprises four public benchmarks and six multi-center external validation datasets from five institutions, Surg-R1 achieves the highest Arena Score on the public benchmarks (57.7%, versus 29.8% for Gemini 3.0 Pro and 28.5% for GPT-5.1). It outperforms both proprietary reasoning models and specialized surgical VLMs on the majority of tasks, spanning triplet recognition, phase recognition, action recognition, and critical view of safety assessment, and improves on the strongest surgical baseline by 15.2 percentage points on external validation.

Authors

An international collaboration across fourteen institutions.

Jian Jiang1,†
Chenxi Lin1,†
Yiming Gu1
Zengyi Qin5
Zhitao Zeng3
Kun Yuan9
Yonghao Long4
Xiang Xia6
Cheng Yuan1
Yuqi Wang1
Zijie Yue8
Kunyi Yang1
Yuting Zhang1
Zhu Zhuo3
Dian Qin14
Xin Wang7
NG Chi Fai13
Brian Anthony5
Daguang Xu12
Guy Rosman10,11
Ozanan Meireles10
Zizhen Zhang6
Nicolas Padoy9
Hesheng Wang2
Qi Dou4
Yueming Jin3
Yutong Ban1,*

† Equal contribution    * Corresponding author

  1. Global College, Shanghai Jiao Tong University, Shanghai, China
  2. School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China
  3. National University of Singapore, Singapore
  4. The Chinese University of Hong Kong, Hong Kong SAR, China
  5. Massachusetts Institute of Technology, Cambridge, MA, USA
  6. Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
  7. West China Hospital of Sichuan University, Chengdu, China
  8. Tongji University, Shanghai, China
  9. ICube, University of Strasbourg, CNRS, IHU Strasbourg, France
  10. Massachusetts General Hospital, Boston, MA, USA
  11. Duke University, Durham, NC, USA
  12. NVIDIA, USA
  13. Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
  14. Chengdu Withai Innovations Technology Company, Chengdu, China

Contact

Corresponding author: Yutong Ban (yban@sjtu.edu.cn)

Method Overview

Level 1 — Perceptual Grounding

Identifies surgical instruments and anatomical structures with their visual characteristics (color, shape, texture).

Level 2 — Relational Understanding

Analyzes tool-tissue-action interactions to interpret what each instrument is doing and to which tissue.

Level 3 — Contextual Reasoning

Synthesizes perceptual and relational evidence for phase recognition, safety assessment, and workflow analysis.
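The three levels above can be read as one structured reasoning trace that a surgeon can audit level by level. A minimal sketch of such a trace follows; the schema, field names, and example content are illustrative assumptions, not the paper's actual output format:

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    """Illustrative container for a three-level surgical reasoning chain."""
    perceptual: list[str]   # Level 1: instruments/anatomy with visual cues
    relational: list[str]   # Level 2: tool-tissue-action interactions
    contextual: str         # Level 3: synthesized phase/safety conclusion


# A hypothetical trace for a laparoscopic cholecystectomy frame:
trace = ReasoningTrace(
    perceptual=[
        "grasper (metallic, jawed) entering from the upper left",
        "gallbladder (yellow-green, distended) in center field",
    ],
    relational=["grasper retracts the gallbladder fundus cephalad"],
    contextual="Retraction exposes the hepatocystic triangle, "
               "consistent with the Calot triangle dissection phase.",
)
```

Structuring the chain this way makes each conclusion traceable to the lower-level evidence it rests on, which is the interpretability property the abstract emphasizes.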

4-Stage Training Pipeline

SFT pretraining → Cold-start CoT supervision → GRPO reinforcement learning → Iterative dual-pathway refinement.
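The GRPO stage optimizes the policy with group-relative advantages: several responses are sampled per prompt, and each response's reward is normalized against the mean and standard deviation of its own group, so no learned value critic is needed. A minimal sketch of that advantage computation, as a generic illustration rather than the paper's implementation:

```python
import statistics


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantage: z-score each reward within its sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against zero variance when all samples score identically.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one surgical question, scored by a rule-based reward:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Advantages sum to zero; above-average samples receive positive weight.
```

Because the baseline is the group's own mean, correct answers are reinforced only relative to the model's current behavior on the same prompt, which keeps the signal well-scaled across easy and hard questions.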

Surg-R1 overview: hierarchical reasoning, evaluation radar charts, and training pipeline
Figure 1. (a) Hierarchical 3-level chain-of-thought reasoning example. (b) Surgical anatomy coverage. (c) Radar charts comparing Surg-R1 with proprietary and open-source models on public benchmarks (left) and external validation (right). (d) Four-stage training pipeline overview.

Experimental Results

Comprehensive evaluation spanning 13 datasets and 6 surgical AI tasks, with comparisons against 20+ state-of-the-art models.

Training data statistics, in-house benchmark results, public benchmark Arena Scores, and CoT data composition
Figure 2. (a) Training data distribution across datasets. (b) In-house dataset benchmark. (c) Public dataset benchmark Arena Scores. (d) Chain-of-thought training data composition: 320K total reasoning pairs.

Qualitative Results

Representative examples of Surg-R1's hierarchical reasoning across five surgical AI tasks.

Qualitative results: instrument localization, phase recognition, action recognition, triplet recognition, and critical view of safety assessment
Figure 3. Qualitative results across five tasks — instrument localization, phase recognition, action recognition, triplet recognition, and critical view of safety assessment — showing Surg-R1's structured question, chain-of-thought thinking process, and final answer for each case.

BibTeX

@article{jiang2026surgr1,
  title={Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation},
  author={Jian Jiang and Chenxi Lin and Yiming Gu and Zengyi Qin and Zhitao Zeng and Kun Yuan and Yonghao Long and Xiang Xia and Cheng Yuan and Yuqi Wang and Zijie Yue and Kunyi Yang and Yuting Zhang and Zhu Zhuo and Dian Qin and Xin Wang and NG Chi Fai and Brian Anthony and Daguang Xu and Guy Rosman and Ozanan Meireles and Zizhen Zhang and Nicolas Padoy and Hesheng Wang and Qi Dou and Yueming Jin and Yutong Ban},
  journal={arXiv preprint arXiv:2603.12430},
  year={2026}
}