In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review

Jiménez-Sánchez, Amelia; Avlona, Natalia-Rozalia; de Boer, Sarah; Campello, Víctor M.; Feragen, Aasa; Ferrante, Enzo; Ganz, Melanie; Gichoya, Judy Wawira; González, Camila; Groefsema, Steff; Hering, Alessa; Hulman, Adam; Joskowicz, Leo; Juodelyte, Dovile; Kandemir, Melih; Kooi, Thijs; Lérida, Jorge del Pozo; Li, Livie Yumeng; Pacheco, Andre; Rädsch, Tim; Reyes, Mauricio; Sourget, Théo; van Ginneken, Bram; Wen, David; Weng, Nina; Xu, Jack Junchi; Zając, Hubert Dariusz; Zuluaga, Maria A.; Cheplygina, Veronika

doi:10.1145/3715275.3732035

arXiv:2501.10727 (cs)

[Submitted on 18 Jan 2025 (v1), last revised 2 Jun 2025 (this version, v2)]

Abstract: Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static -- they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings about datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifacts and datasets. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at this http URL.
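To make the "SQL database" idea concrete, here is a minimal sketch of how citation relationships between research artifacts and datasets could be stored and queried. It is not the authors' schema: the table names (`dataset`, `research_artifact`, `artifact_cites_dataset`), the `kind` column, the helper functions, and all row values are hypothetical placeholders chosen for illustration, and SQLite is assumed only because it ships with Python.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE research_artifact (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            kind  TEXT NOT NULL  -- e.g. 'bias', 'shortcut', 'annotation' (assumed labels)
        );
        -- Many-to-many link: one artifact may cite several datasets and vice versa.
        CREATE TABLE artifact_cites_dataset (
            artifact_id INTEGER NOT NULL REFERENCES research_artifact(id),
            dataset_id  INTEGER NOT NULL REFERENCES dataset(id),
            PRIMARY KEY (artifact_id, dataset_id)
        );
        """
    )
    return conn


def artifacts_for(conn: sqlite3.Connection, dataset_name: str) -> list:
    """Return (title, kind) pairs for artifacts citing the named dataset, sorted by title."""
    return conn.execute(
        """
        SELECT a.title, a.kind
        FROM research_artifact AS a
        JOIN artifact_cites_dataset AS c ON c.artifact_id = a.id
        JOIN dataset AS d ON d.id = c.dataset_id
        WHERE d.name = ?
        ORDER BY a.title
        """,
        (dataset_name,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    # Placeholder rows only; these do not describe any real dataset or paper.
    conn.execute("INSERT INTO dataset (id, name) VALUES (1, 'example-dataset')")
    conn.execute(
        "INSERT INTO research_artifact (id, title, kind) VALUES "
        "(1, 'Placeholder shortcut analysis', 'shortcut'), "
        "(2, 'Placeholder additional annotations', 'annotation')"
    )
    conn.execute("INSERT INTO artifact_cites_dataset VALUES (1, 1), (2, 1)")
    for title, kind in artifacts_for(conn, "example-dataset"):
        print(f"{kind}: {title}")
```

A separate link table is used here so that one artifact can point at several datasets; a living review would presumably refresh such links as new citing work appears, but how the authors do that is not stated in this listing.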

Comments:	ACM Conference on Fairness, Accountability, and Transparency - FAccT 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Image and Video Processing (eess.IV)
Cite as:	arXiv:2501.10727 [cs.CV]
	(or arXiv:2501.10727v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.10727
Related DOI:	https://doi.org/10.1145/3715275.3732035

Submission history

From: Amelia Jiménez-Sánchez
[v1] Sat, 18 Jan 2025 11:03:59 UTC (4,531 KB)
[v2] Mon, 2 Jun 2025 12:18:57 UTC (8,976 KB)
