AnatomiX: Anatomy-Aware Grounded Multimodal LLM for Chest X-Ray Interpretation

Anees Hashmi, Numan Saeed, Christoph Lippert

Hasso Plattner Institute, Germany | CVPR 2026 - Findings

Abstract

Multimodal medical LLMs have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. AnatomiX introduces a two-stage approach: first, it identifies anatomical structures and extracts their features; second, a language model uses them to perform downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments show an improvement of more than 25% in anatomy grounding and grounded tasks compared to existing approaches.

Key Contributions

🫁 Anatomy-Aware MLLM: Precisely interprets chest X‑rays with true anatomical grounding.
⚙️ Two-Stage Radiology Workflow: Extracts anatomical features first, then reasons with language intelligence.
📈 Performance Boost: 25%+ improvement on CXR grounding.
🛡️ Robust Reasoning: Maintains anatomical accuracy even under flipped or challenging images.

Method

Fig 1: Anatomy Perception Module (APM) architecture. The encoder outputs image embeddings; the decoder and feature module output bounding boxes and anatomical tokens. A vector database is used for contrastive retrieval during inference.
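The retrieval step mentioned in the caption can be pictured as a nearest-neighbour lookup: an anatomical token embedding is compared against stored reference embeddings, and the closest entry is returned. The following is a minimal sketch only, assuming cosine similarity as the matching score and a tiny in-memory dictionary standing in for the vector database. The names `ANATOMY_DB`, `cosine_similarity`, and `retrieve_anatomy`, the anatomy labels, and the 3-dimensional vectors are all hypothetical illustrations, not the paper's implementation.

```python
import math

# Hypothetical toy "vector database": anatomy name -> reference embedding.
# Assumption: real embeddings would come from the APM and be far larger.
ANATOMY_DB = {
    "left lung": [0.9, 0.1, 0.0],
    "right lung": [0.1, 0.9, 0.0],
    "cardiac silhouette": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_anatomy(query, db=ANATOMY_DB):
    """Return (name, similarity) for the stored embedding closest to query."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in db.items())
    return max(scored, key=lambda pair: pair[1])


# A query embedding that lies close to the "left lung" reference vector.
name, score = retrieve_anatomy([0.8, 0.2, 0.1])
print(name, round(score, 3))
```

At realistic scale, an approximate nearest-neighbour index would likely replace the linear scan used here, but the lookup logic stays the same.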

Results


Fig 2: AnatomiX vs RadVLM in anatomy understanding. Red = model output, Green = ground truth. AnatomiX shows superior anatomical recognition, including on flipped images.

Model	BERTScore (GD / GC)	ROUGE (GD / GC)	METEOR (GD / GC)	RadGraph-F1 (GD / GC)	CheXbert-14-F1 (GD / GC)	Phrase Grounding IoU	Phrase Grounding mAP	Anatomy Grounding IoU	Anatomy Grounding mAP
MAIRA-2	0.01 / 0.08	0.01 / 0.06	0.01 / 0.04	0.00 / 0.02	0.03 / 0.02	0.32	0.24	0.35	0.24
RadVLM	0.15 / 0.27	0.06 / 0.11	0.05 / 0.07	0.00 / 0.12	0.32 / 0.40	0.39	0.30	0.60	0.49
CheXagent	0.49 / 0.56	0.43 / 0.44	0.29 / 0.37	0.40 / 0.39	0.40 / 0.61	0.33	0.24	0.18	0.09
AnatomiX (ours)	0.63 / 0.65	0.60 / 0.56	0.42 / 0.48	0.58 / 0.50	0.54 / 0.78	0.46	0.35	0.73	0.66

Table 1: Performance on four grounding tasks. GD = Grounded Diagnosis, GC = Grounded Captioning.
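The IoU columns in Table 1 measure how well a predicted bounding box overlaps its ground-truth box: the area of their intersection divided by the area of their union. The sketch below illustrates that computation, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. The function name `box_iou` and the box format are assumptions for illustration, and the paper's evaluation code may differ.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate boxes with zero total area have no meaningful overlap.
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(box_iou(predicted, ground_truth))  # overlap 1, union 7, so 1/7
```

A score of 1.0 means the two boxes coincide exactly, and 0.0 means they do not overlap at all.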

BibTeX

@article{hashmi2026anatomix,
  title={AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation},
  author={Hashmi, Anees Ur Rehman and Saeed, Numan and Lippert, Christoph},
  journal={arXiv preprint arXiv:2601.03191},
  year={2026}
}