Surg-R1

A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

7B Parameters
320K CoT Pairs
13 Datasets
64.9% Arena Score
arXiv Paper · Dataset (coming soon) · Code (coming soon)
Surg-R1: Hierarchical Chain-of-Thought Reasoning

Abstract

Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement.

Evaluated on SurgBench, which comprises four public benchmarks and six multi-center external validation datasets from five institutions, Surg-R1 achieves the highest Arena Score on the public benchmarks (57.7%, versus 29.8% for Gemini 3.0 Pro and 28.5% for GPT-5.1). It outperforms both proprietary reasoning models and specialized surgical VLMs on the majority of tasks, spanning triplet recognition, phase recognition, action recognition, and critical view of safety assessment, and improves on the strongest surgical baseline by 15.2 percentage points on external validation.

Authors

An international collaboration across fourteen institutions.

Jian Jiang1,†
Chenxi Lin1,†
Yiming Gu1
Zengyi Qin5
Zhitao Zeng3
Kun Yuan9
Yonghao Long4
Xiang Xia6
Cheng Yuan1
Yuqi Wang1
Zijie Yue8
Kunyi Yang1
Yuting Zhang1
Zhu Zhuo3
Dian Qin14
Xin Wang7
NG Chi Fai13
Brian Anthony5
Daguang Xu12
Guy Rosman10,11
Ozanan Meireles10
Zizhen Zhang6
Nicolas Padoy9
Hesheng Wang2
Qi Dou4
Yueming Jin3
Yutong Ban1,*

† Equal contribution    * Corresponding author

  1. Global College, Shanghai Jiao Tong University, Shanghai, China
  2. School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China
  3. National University of Singapore, Singapore
  4. The Chinese University of Hong Kong, Hong Kong SAR, China
  5. Massachusetts Institute of Technology, Cambridge, MA, USA
  6. Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
  7. West China Hospital of Sichuan University, Chengdu, China
  8. Tongji University, Shanghai, China
  9. ICube, University of Strasbourg, CNRS, IHU Strasbourg, France
  10. Massachusetts General Hospital, Boston, MA, USA
  11. Duke University, Durham, NC, USA
  12. NVIDIA, USA
  13. Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
  14. Chengdu Withai Innovations Technology Company, Chengdu, China

Contact

Corresponding author: Yutong Ban (yban@sjtu.edu.cn)

Method Overview

Level 1 — Perceptual Grounding

Identifies surgical instruments and anatomical structures with their visual characteristics (color, shape, texture).

Level 2 — Relational Understanding

Analyzes tool-tissue-action interactions to interpret what each instrument is doing and to which tissue.

Level 3 — Contextual Reasoning

Synthesizes perceptual and relational evidence for phase recognition, safety assessment, and workflow analysis.
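The three levels above can be read as one structured reasoning trace that a surgeon can audit level by level. A minimal sketch of such a trace follows; the schema, field names, and example content are illustrative assumptions, not the paper's actual output format:

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    """Illustrative container for a three-level surgical reasoning chain."""
    perceptual: list[str]   # Level 1: instruments/anatomy with visual cues
    relational: list[str]   # Level 2: tool-tissue-action interactions
    contextual: str         # Level 3: synthesized phase/safety conclusion


# A hypothetical trace for a laparoscopic cholecystectomy frame:
trace = ReasoningTrace(
    perceptual=[
        "grasper (metallic, jawed) entering from the upper left",
        "gallbladder (yellow-green, distended) in center field",
    ],
    relational=["grasper retracts the gallbladder fundus cephalad"],
    contextual="Retraction exposes the hepatocystic triangle, "
               "consistent with the Calot triangle dissection phase.",
)
```

Structuring the chain this way makes each conclusion traceable to the lower-level evidence it rests on, which is the interpretability property the abstract emphasizes.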

4-Stage Training Pipeline

SFT pretraining → Cold-start CoT supervision → GRPO reinforcement learning → Iterative dual-pathway refinement.
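The GRPO stage optimizes the policy with group-relative advantages: several responses are sampled per prompt, and each response's reward is normalized against the mean and standard deviation of its own group, so no learned value critic is needed. A minimal sketch of that advantage computation, as a generic illustration rather than the paper's implementation:

```python
import statistics


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantage: z-score each reward within its sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against zero variance when all samples score identically.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one surgical question, scored by a rule-based reward:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Advantages sum to zero; above-average samples receive positive weight.
```

Because the baseline is the group's own mean, correct answers are reinforced only relative to the model's current behavior on the same prompt, which keeps the signal well-scaled across easy and hard questions.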

Surg-R1 overview: hierarchical reasoning, evaluation radar charts, and training pipeline
Figure 1. (a) Hierarchical 3-level chain-of-thought reasoning example. (b) Surgical anatomy coverage. (c) Radar charts comparing Surg-R1 with proprietary and open-source models on public benchmarks (left) and external validation (right). (d) Four-stage training pipeline overview.

Experimental Results

Comprehensive evaluation spanning 13 datasets and 6 surgical AI tasks, with comparisons against 20+ state-of-the-art models.

Training data statistics, in-house benchmark results, public benchmark Arena Scores, and CoT data composition
Figure 2. (a) Training data distribution across datasets. (b) In-house dataset benchmark. (c) Public dataset benchmark Arena Scores. (d) Chain-of-thought training data composition: 320K total reasoning pairs.

Qualitative Results

Representative examples of Surg-R1's hierarchical reasoning across five surgical AI tasks.

Qualitative results: instrument localization, phase recognition, action recognition, triplet recognition, and critical view of safety assessment
Figure 3. Qualitative results across five tasks — instrument localization, phase recognition, action recognition, triplet recognition, and critical view of safety assessment — showing Surg-R1's structured question, chain-of-thought thinking process, and final answer for each case.

BibTeX

@article{jiang2026surgr1,
  title={Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation},
  author={Jian Jiang and Chenxi Lin and Yiming Gu and Zengyi Qin and Zhitao Zeng and Kun Yuan and Yonghao Long and Xiang Xia and Cheng Yuan and Yuqi Wang and Zijie Yue and Kunyi Yang and Yuting Zhang and Zhu Zhuo and Dian Qin and Xin Wang and NG Chi Fai and Brian Anthony and Daguang Xu and Guy Rosman and Ozanan Meireles and Zizhen Zhang and Nicolas Padoy and Hesheng Wang and Qi Dou and Yueming Jin and Yutong Ban},
  journal={arXiv preprint arXiv:2603.12430},
  year={2026}
}