Distilling Large Language Models
for Efficient Clinical Information Extraction

Karthik S. Vedula*, Poolesville High School, Poolesville, MD, USA
Annika Gupta, University of California Santa Cruz, Santa Cruz, CA, USA
Akshay Swaminathan, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
Ivan Lopez, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
Suhana Bedi, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
Nigam H. Shah, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA; Clinical Excellence Research Center, Stanford University School of Medicine, Stanford, CA, USA; Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA
(December 21, 2024)

Equal Contribution Corresponding Author: [email protected]

Abstract

Objective: Large language models (LLMs) excel at clinical information extraction but their computational demands limit practical deployment. Knowledge distillation—the process of transferring knowledge from larger to smaller models—offers a potential solution. We evaluate the performance of distilled BERT models, which are approximately 1,000 times smaller than modern LLMs, for clinical named entity recognition (NER) tasks.

Materials and Methods: We leveraged state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers for medication, disease, and symptom extraction. We applied our approach to over 3,300 clinical notes spanning five publicly available datasets, comparing distilled BERT models against both their teacher labelers and BERT models fine-tuned on human labels. External validation was conducted using clinical notes from the MedAlign dataset.

Results: For disease extraction, F1 scores were 0.82 (teacher model), 0.89 (BioBERT trained on human labels), and 0.84 (BioBERT-distilled). For medication, F1 scores were 0.84 (teacher model), 0.91 (BioBERT-human), and 0.87 (BioBERT-distilled). For symptoms: F1 score of 0.73 (teacher model) and 0.68 (BioBERT-distilled). Distilled BERT models had faster inference (12x, 4x, 8x faster than GPT-4o, o1-mini, and Gemini Flash respectively) and lower costs (85x, 101x, 2x cheaper than GPT-4o, o1-mini, and Gemini Flash respectively). On the external validation dataset, the distilled BERT model achieved F1 scores of 0.883 (medication), 0.726 (disease), and 0.699 (symptom).

Conclusions: Distilled BERT models were up to 101x cheaper and 12x faster than state-of-the-art LLMs while achieving similar performance on NER tasks. Distillation offers a computationally efficient and scalable alternative to large LLMs for clinical information extraction.

Introduction

Clinical notes in electronic health records contain valuable unstructured information that often isn’t captured in structured fields [1]. Converting this free-text information into structured data enables cohort selection[2], observational analysis[3], and question-answering systems that enhance clinician efficiency.[4] However, extracting information from these clinical notes remains challenging.[5][6] Named entity recognition (NER), which classifies key entities in text into predefined categories like diseases, medications, or symptoms, is an important task in this process.[7]

Traditional approaches to clinical NER include rule-based methods using string matching and medical ontologies like the Unified Medical Language System (UMLS).[8][9][10] While these approaches are interpretable and computationally efficient, they often fail to capture the diverse representations of clinical entities, including synonyms, abbreviations, nuanced descriptions, and misspellings.[9]

Machine learning approaches, such as BERT-based models, have demonstrated superior performance.[11][12] Domain-specific BERT variants like BioBERT[13] and ClinicalBERT[14] have been developed to better handle biomedical and clinical terminology. However, current clinical NER models (fine-tuned BERT models) tend to be narrowly focused on specific domains or entity types, such as radiology, limiting their broad applicability.[15] Additionally, fine-tuning requires large amounts of annotated data, which is expensive and time-consuming to produce. Weak supervision using rule-based methods and ontologies offers one solution; TROVE, for example, generates weak labels from UMLS ontologies to train a BERT-based model for NER.[12]

Large language models (LLMs) have demonstrated strong performance in clinical NER tasks through zero-shot or few-shot prompting, reducing the need for extensive labeled data.[16] However, these models require significant computational resources for local deployment and can be costly.[17] Additionally, proprietary LLMs often require HIPAA-compliant endpoints to handle protected health information (PHI), which further complicates their deployment in healthcare settings. These challenges highlight the need for more efficient and compliant solutions in the healthcare domain.

Knowledge distillation offers a promising solution to these challenges. This technique transfers knowledge from larger models to smaller ones, potentially addressing the limitations of both domain-specific BERT models and computationally expensive LLMs.[18] Recent studies have demonstrated successful distillation from large models such as GPT-4 into medium-sized LLMs such as LLaMA[19], and from BERT-based models to even smaller architectures.[20] In the medical domain, distilled models have achieved impressive results — DistilFLERT and distilled PubMedBERT models have shown success in various medical applications.[20][21]

However, existing approaches have several limitations. First, they typically focus on a single note type (e.g., discharge summaries) or a single entity type (e.g., medications only), limiting their practical utility across diverse clinical settings. Second, prior work has not rigorously investigated the generalizability of distilled models through external validation using notes from different health systems and note types. Third, existing approaches rely on a single teacher model rather than exploring the potential benefits of combining multiple teacher labelers that leverage both LLMs and medical ontologies. This gap is particularly significant given that different teacher labelers may capture complementary aspects of clinical entities, potentially improving the robustness and accuracy of the distilled models.

In this paper, we present a novel approach to clinical NER using BERT-based models distilled from multiple teacher labelers, addressing the computational and scalability challenges associated with deploying LLMs in clinical settings. We make three key contributions:

  1. We develop teacher labelers combining state-of-the-art LLMs (Gemini and OpenAI models) with medical ontologies (RxNorm and SNOMED) for clinical NER across various note types, validated against expert-labeled datasets.

  2. We create and release distilled BERT-based models (approximately 1,000 times smaller than modern LLMs) trained on teacher labels from over 2,000 clinical documents, including oncology progress notes, discharge summaries, radiology reports, and scientific abstracts.

  3. We conduct a comprehensive evaluation of our distilled BERT models across five publicly available clinical datasets, including an analysis of model failure modes and an external validation analysis to evaluate the generalizability of our approach across health systems.

Methods

Figure 1: Clinical documents were passed to teacher labelers—LLMs and ontologies—for medication, symptom, and disease entity recognition tasks. We selected the optimal combination of teacher labelers based on F1 score for subsequent experiments. BERT models were distilled from the teacher labels via supervised fine-tuning and performance was measured on in-distribution datasets as well as an external validation dataset.

This study follows the TRIPOD-LLM[22] reporting guidelines for the use of LLMs. All experiments were performed with publicly available, de-identified datasets that did not require IRB protocol approval.

NER Tasks and Datasets

We evaluated our approach on three distinct NER tasks, each utilizing different datasets to ensure comprehensive validation across various clinical contexts:

For the medication extraction task, we used the National NLP Clinical Challenges (n2c2) 2018 Track 2 Medication Extraction dataset.[23] This dataset comprises 505 discharge summaries from MIMIC-III (Medical Information Mart for Intensive Care III)[24], with expert-annotated medication mentions. Following the n2c2 annotation guidelines, we used 202 notes for testing, 303 notes for training, and randomly sampled 25 notes from the training set for development purposes (Table 12).

The disease extraction task utilized the National Center for Biotechnology Information (NCBI) Disease Corpus.[25] This corpus contains 793 PubMed abstracts with expert-annotated disease mentions. We adhered to the official dataset splits for training, development, and testing.

For the symptom extraction task, we used the CORAL dataset[26], which consists of de-identified progress notes from 40 patients (20 with breast cancer and 20 with pancreatic cancer). These notes, collected at the University of California, San Francisco (UCSF) Information Commons between 2012 and 2022, were de-identified using the Philter tool and annotated at the entity level. We focused on symptoms as they were the most frequent entity type in the dataset. Since CORAL does not provide predefined splits, we randomly selected 5 notes for a development set and 35 for testing, while using the unannotated notes for training through teacher labeling.

Teacher Labeling Dataset Construction

Since teacher labeling does not require gold standard annotations, we combined data from all available datasets irrespective of their original annotation status, maximizing data diversity. We leveraged the training splits from our primary datasets (NCBI, n2c2, and CORAL) and augmented them with 1,000 clinical notes sampled from MIMIC-III using a stratified approach to ensure representation across different documentation styles: 250 notes each from progress notes, nursing notes, discharge summaries, and radiology reports. The final teacher labeling dataset used for model fine-tuning consisted of 2,096 documents drawn from NCBI, n2c2, CORAL, and MIMIC-III.

External Validation

To assess the generalizability of our distilled BERT models, we conducted an external validation study on clinical notes from the MedAlign dataset[27], a collection of de-identified electronic health records (EHRs) from Stanford Hospital and Lucile Packard Children’s Hospital. From this dataset of 276 longitudinal patient records, we sampled notes across different types to ensure comprehensive evaluation: 250 progress notes, 129 nursing notes, 117 discharge summaries, and 250 procedure notes. Since MedAlign lacks NER labels, two fourth-year medical students (AS and IL) independently annotated 10 randomly selected notes, with 2 notes doubly annotated to assess inter-rater agreement. Following our model’s output format, annotators labeled each token using the Inside-Outside (IO) scheme: “I-MED” for medications, “I-DIS” for diseases, “I-SYM” for symptoms, or “O” for all other tokens.
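The IO scheme described above can be sketched in a few lines. This is only an illustration: the whitespace tokenization, the `io_tags` helper, and the example sentence are assumptions made for the sketch, not the annotation tooling used in the study.

```python
from typing import List, Tuple


def io_tags(tokens: List[str], entity_spans: List[Tuple[int, int]], inside_tag: str) -> List[str]:
    """Tag tokens inside any [start, end) token span with inside_tag; all others get 'O'."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            tags[i] = inside_tag
    return tags


# Hypothetical sentence, split on whitespace (an assumption; real pipelines
# typically use the model's own tokenizer).
tokens = "Patient started metformin for type 2 diabetes".split()

# One tag sequence per task, mirroring the one-model-per-task setup.
med_tags = io_tags(tokens, [(2, 3)], "I-MED")  # "metformin"
dis_tags = io_tags(tokens, [(4, 7)], "I-DIS")  # "type 2 diabetes"
```

Because IO has no "Begin" tag, adjacent entities of the same type merge into one run of inside tokens, which is acceptable when evaluation is performed at the token level.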

Teacher Labeling Pipeline

LLM-based Labeling

We evaluated four state-of-the-art LLMs as teacher labelers: GPT-4o (version 2024-08-06)[28], GPT-4o-mini (version 2024-07-18)[29], o1-mini (version 2024-09-12)[30], and Gemini 1.5 Flash (gemini-1.5-flash-002).[31] Each model was prompted to perform the NER tasks and return the extracted entities. All models were executed through HIPAA-compliant API endpoints with standardized parameters (temperature=0.01, top-p=0.9) to ensure consistent outputs. The final optimized prompts are provided in the Supplementary Material.

Ontology-based Labeling

We leveraged BioPortal[32] Annotator API for accessing comprehensive biomedical ontologies: RxNorm[33] for medication extraction and SNOMED CT[34] for disease and symptom extraction. For each NER task, we mapped the relevant semantic types to their respective tasks, ensuring that only task-relevant entities were extracted. A complete list of the semantic types assigned to each task is provided in the Supplementary Material.

Optimal Teacher Labeling Regimen

We hypothesized that different teacher labelers would exhibit varying levels of performance and that an optimal combination of labelers could maximize the F1 score for a given NER task. Our experiment evaluated all 31 possible subsets of five teacher labelers, comprising four LLM labelers and the ontology labeler. To combine teacher labelers, we took a union of the entities identified by each teacher labeler. For each task and dataset, the combination achieving the highest F1 score on the development set was selected for subsequent experiments.
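The subset search can be sketched as follows. This is a minimal illustration under stated assumptions: each labeler's output is represented as a set of positive token indices, the toy outputs and gold set are invented, and the helper names (`all_subsets`, `f1`, `best_combination`) are hypothetical rather than taken from the study's code.

```python
from itertools import combinations

LABELERS = ["gpt-4o", "gpt-4o-mini", "o1-mini", "gemini-1.5-flash", "ontology"]


def all_subsets(names):
    """Yield every non-empty subset of the labelers (2^5 - 1 = 31 for five labelers)."""
    for k in range(1, len(names) + 1):
        yield from combinations(names, k)


def f1(pred, gold):
    """F1 between two sets of positive token indices."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_combination(outputs, gold):
    """Union the predictions of each subset and return (best F1, subset)."""
    scored = []
    for subset in all_subsets(list(outputs)):
        union = set().union(*(outputs[name] for name in subset))
        scored.append((f1(union, gold), subset))
    return max(scored, key=lambda item: item[0])


# Toy development-set data (invented for illustration only).
gold = {1, 2, 3, 4}
outputs = {
    "gpt-4o": {1, 2},
    "gpt-4o-mini": {1, 9, 10},
    "o1-mini": {2, 3, 8},
    "gemini-1.5-flash": {3, 4},
    "ontology": {1, 7},
}
best = best_combination(outputs, gold)
```

Taking a union can only add predictions, so recall never decreases as labelers are added, while precision may fall. This is why the largest subsets are not automatically the best by F1.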

Model Distillation Implementation

For each NER task (medication, disease, and symptom extraction), we implemented knowledge distillation using the optimal teacher labeling pipeline to generate training labels. These labels were converted into “Inside-Outside” (IO) format, where words belonging to an entity are labeled as “Inside,” while all other words are labeled as “Outside.” We then fine-tuned separate BERT models for each task using standardized hyperparameters: learning rate = 2 × 10^-5, batch size = 8, and weight decay = 0.01. All models were trained for 10 epochs on four NVIDIA H100 GPUs. The fine-tuned models were used to perform inference on the test sets for each NER task. We report token-level precision, recall, and F1 score, treating the human annotations as the gold standard. To assess the impact of domain-specific pretraining on downstream performance, we fine-tuned and compared three BERT variants:

  • BERT base[11], a general-purpose language model

  • BioBERT[13], pretrained on biomedical literature

  • BioClinBERT[35], specialized for clinical text

We evaluated the quality of teacher labels by comparing the performance of distilled models fine-tuned on teacher labels with those fine-tuned on human labels. For the medication and disease tasks, we used human-labeled data from the n2c2 and NCBI training sets, respectively. Due to limited human-labeled data in the CORAL dataset, we could not perform this comparison for symptom extraction. We directly evaluated the teacher labeling pipelines by measuring their performance without model distillation.
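The token-level metrics reported in this study can be computed from a pair of IO tag sequences, as sketched below. The sketch assumes binary scoring in which any non-“O” tag counts as positive; the `token_metrics` helper and the example sequences are illustrative, not the study's evaluation code.

```python
from typing import Dict, List


def token_metrics(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Binary token-level metrics, treating any non-'O' tag as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        g_pos, p_pos = g != "O", p != "O"
        if g_pos and p_pos:
            tp += 1
        elif p_pos:
            fp += 1
        elif g_pos:
            fn += 1
        else:
            tn += 1

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)  # also called PPV
    recall = ratio(tp, tp + fn)     # also called sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
    }


# Invented example: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
gold = ["O", "I-MED", "I-MED", "O", "O", "I-MED"]
pred = ["O", "I-MED", "O", "I-MED", "O", "I-MED"]
scores = token_metrics(gold, pred)
```

Because most tokens in a clinical note are “O”, specificity and NPV tend to sit near 1.0 regardless of model quality, so precision, recall, and F1 are the more informative columns.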

Error Analysis

To better characterize model failure modes and estimate the prevalence of labeling errors in the test sets, we conducted an error analysis of the best-performing models for each task. For each false positive (model assigns a non-“O” label, ground truth is “O”) and false negative (model assigns “O”, ground truth is a non-“O” label), model labels were compared to ground truth labels by two annotators (AS and IL, fourth-year medical students). Each false positive or false negative was categorized as either “incorrect”, indicating that the model label was truly incorrect; “partially correct”, indicating that the model labels partially overlapped with the ground truth labels for a given entity, but not completely; or “correct”, indicating that the model label was correct and that the ground truth label was incorrect. For each NER task, we annotated a random sample of 170 false negatives and false positives, including 90 instances that were doubly annotated for inter-rater agreement calculation.

Inference Time and Cost Analysis

To quantify the practical benefits of deploying smaller models for clinical NER, we compared inference time per note and cost per note between our distilled BERT models and LLM teacher labelers. To calculate cost per note for LLMs, we used input and output API pricing for OpenAI and Google, and we used the tiktoken[36] Python library to calculate token counts. We estimated cost per note for BERT models by multiplying inference time by $4.74/hour, which is the average cost of a virtual machine with 1xA100 80GB across the cloud vendors listed in Table 6 (pricing published in September 2024).[37]

Results

Performance of Teacher Labeler Combinations

We evaluated all 31 possible combinations of LLM and ontology labelers across our three extraction tasks (Tables 7, 8, and 9). For symptom extraction, the Gemini 1.5 Flash + GPT-4o combination achieved the highest F1 score of 0.801, ahead of GPT-4o alone (F1 = 0.787), Gemini 1.5 Flash + GPT-4o + GPT-4o-mini (F1 = 0.784), and o1-mini + GPT-4o (F1 = 0.778). None of the five top-performing combinations for symptom extraction included the ontology-based labeler (Table 1).

The medication extraction task showed a similar pattern, with Gemini 1.5 Flash + GPT-4o achieving the highest F1 score of 0.881, followed closely by Gemini 1.5 Flash + GPT-4o + GPT-4o-mini (F1 = 0.872) and Gemini 1.5 Flash + ontology + GPT-4o (F1 = 0.870).

For disease extraction, the single o1-mini model achieved the highest F1 score of 0.787, with combinations of o1-mini + ontology (F1 = 0.773) and o1-mini + GPT-4o (F1 = 0.760) performing slightly lower.

Task                   Teacher labeler(s)                        F1-Score  Precision  Recall
Disease Extraction     o1-mini                                   0.787     0.724      0.862
                       o1-mini + ontology                        0.773     0.686      0.885
                       o1-mini + GPT-4o                          0.760     0.652      0.911
                       o1-mini + ontology + GPT-4o               0.748     0.629      0.923
                       GPT-4o                                    0.748     0.717      0.781
Medication Extraction  Gemini-1.5-flash + GPT-4o                 0.881     0.947      0.824
                       Gemini-1.5-flash + GPT-4o + GPT-4o-mini   0.872     0.896      0.849
                       Gemini-1.5-flash + ontology + GPT-4o      0.870     0.865      0.876
                       Gemini-1.5-flash + ontology               0.862     0.876      0.848
                       Gemini-1.5-flash + GPT-4o-mini            0.859     0.912      0.811
Symptom Extraction     Gemini-1.5-flash + GPT-4o                 0.801     0.871      0.741
                       GPT-4o                                    0.787     0.900      0.700
                       Gemini-1.5-flash + GPT-4o + GPT-4o-mini   0.784     0.810      0.759
                       o1-mini + GPT-4o                          0.778     0.752      0.806
                       o1-mini + Gemini-1.5-flash + GPT-4o       0.770     0.734      0.809

Table 1: Top five teacher labeler combinations with highest F1 scores for each NER task.

BERT Model Performance

BioBERT demonstrated superior performance in disease extraction with an F1 score of 0.865, compared to 0.830 for both BaseBERT and BioClinBERT (Table 10). For the medication extraction task, both BioBERT and BioClinBERT achieved an F1 of 0.89, slightly outperforming BaseBERT (F1 = 0.885). Symptom extraction proved more challenging across all models, with BioBERT and BioClinBERT achieving F1 scores of 0.34 and BaseBERT reaching 0.33.

Task                   Model           F1-Score  NPV   PPV   Sensitivity  Specificity
Disease Extraction     Human + BERT    0.89      0.99  0.87  0.92         0.99
                       Teacher + BERT  0.84      0.99  0.78  0.90         0.98
                       Teacher only    0.82      0.99  0.79  0.86         0.98
Medication Extraction  Human + BERT    0.91      1.00  0.89  0.93         1.00
                       Teacher + BERT  0.87      1.00  0.89  0.85         1.00
                       Teacher only    0.84      0.99  0.91  0.79         1.00
Symptom Extraction     Teacher + BERT  0.68      0.99  0.80  0.59         1.00
                       Teacher only    0.73      0.99  0.78  0.69         1.00

Table 2: Performance of BioBERT models fine-tuned on human labels, BioBERT models distilled from teacher labelers, and the teacher labelers themselves.

Comparative Analysis of Label Sources

Our analysis revealed that, for disease and medication extraction, models fine-tuned on human labels outperformed those fine-tuned on teacher labels, which in turn exceeded direct teacher labeler performance. For disease extraction, human-labeled models achieved an F1 of 0.89, compared to 0.84 for teacher-labeled models and 0.82 for direct teacher labelers (Table 2). Similarly, in medication extraction, we observed F1 scores of 0.91, 0.87, and 0.84, respectively. For symptom extraction, where human-labeled comparison was not possible, teacher-labeled models achieved an F1 of 0.68, below the 0.73 reached by the direct teacher labelers.

Error Analysis

Across all three tasks, the majority of false positives were due to incorrect ground truth labels. For symptom extraction, 2.20% of false negatives and 82.05% of false positives had incorrect ground truth labels (Table 3). For medication extraction, 21.05% of false negatives and 62.93% of false positives had incorrect ground truth labels. For disease extraction, 4.08% of false negatives and 73.33% of false positives had incorrect ground truth labels. Further analysis of specific cases for all three tasks is provided in the Supplementary Material.

                    Symptom Extraction  Medication Extraction  Disease Extraction
False Negatives
N                   91                  57                     49
Correct             2.20%               21.05%                 4.08%
Partially Correct   16.49%              14.04%                 38.78%
Incorrect           81.32%              64.91%                 57.14%

Examples (false negatives)
  Symptom Extraction:
    “She is asymptomatic from bone lesions but we can…”
    “…presented with acholic stool and dark urine”
    “…3 years of hot flashes.”
    “…such as fatigue, neuropathy, skin and nail changes…”
  Medication Extraction:
    “…Cardura 2 q.d….”
    “DNR / DNI / no pressors…”
    “NG SL PRN”
    “We generally recommend taking an over the counter stool softener…”
  Disease Extraction:
    “…remarkable propensity to bacterial infections…”
    “Royal National Hospital for Rheumatic Diseases database…”
    “…in three prostate cancer cell lines…”
    “…rescues the gastrulation defect.”

                    Symptom Extraction  Medication Extraction  Disease Extraction
False Positives
N                   78                  116                    120
Correct             82.05%              62.93%                 73.33%
Partially Correct   15.38%              12.93%                 20.00%
Incorrect           2.56%               24.14%                 6.67%

Examples (false positives)
  Symptom Extraction:
    “…found to have sepsis…”
    “…skin and nail changes, myalgias, alopecia, myelosuppression, nausea…”
  Medication Extraction:
    “Bilateral injected sclera…”
    “At initail [sic] deployment the patient…”
  Disease Extraction:
    “…showed the father to be hypohaptoglobinemic…”

Table 3: Error analysis for all NER tasks. A random sample of false positives and false negatives were reviewed and categorized as “Correct” (model label is correct; ground truth label is incorrect), “Partially correct” (model label partially overlaps with ground truth label for the entity), or “Incorrect” (model label is incorrect; ground truth label is correct). Representative examples of false negatives and false positives for each task are shown above. Abbreviations: “NG SL PRN” = nitroglycerin sublingual pro re nata (as needed); “2 q.d.” = two per day.

External Validation

To evaluate the generalizability of our distilled BERT models for clinical NER tasks, we conducted an external validation study on clinical notes sampled from the MedAlign dataset. For disease extraction, the model demonstrated a recall of 89.0% and a precision of 61.3%, leading to an F1 of 0.726 (Table 11). For medication extraction, the distilled BERT model achieved a recall of 96.4%, precision of 81.5%, and F1 of 0.883. Symptom extraction showed the weakest performance, with a recall of 56.0%, a precision of 92.9%, and an F1 of 0.699. These results highlight the strong performance of the model for medication and disease extraction tasks, even when applied to an out-of-distribution dataset.
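As a quick consistency check, the reported F1 scores can be recomputed from the stated precision and recall, since F1 is their harmonic mean. The values below are copied from the paragraph above; the helper name `f1_from_pr` is introduced only for this sketch.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) on the external validation set, from the text above.
reported = {
    "disease": (0.613, 0.890, 0.726),
    "medication": (0.815, 0.964, 0.883),
    "symptom": (0.929, 0.560, 0.699),
}

recomputed = {task: round(f1_from_pr(p, r), 3) for task, (p, r, _) in reported.items()}
```

The recomputed values agree with the reported ones to three decimal places. The symptom result shows the typical profile of a conservative tagger: high precision with low recall.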

Model              Total cost (USD)  Total inference time (s)  Cost per note (USD)  Inference time per note (s)
Distilled BioBERT  0.02              14                        0.000187             0.14
GPT-4o             1.59 (+7850%)     166 (+1086%)              0.0159 (+8402%)      1.66 (+1086%)
o1-mini            1.89 (+9350%)     58 (+314.3%)              0.0189 (+10007%)     0.58 (+314.3%)
Gemini 1.5 Flash   0.05 (+150%)      117 (+735.7%)             0.000460 (+146.0%)   1.17 (+735.7%)

Table 4: On 100 notes sampled from the MedAlign dataset, we report average inference time per note and average cost per note aggregated across medication, disease, and symptom extraction tasks. Parentheses indicate percent difference compared to the Distilled BioBERT model. BioBERT was run on 1xA100 80GB, for which the average cost per hour was estimated at $4.74 (Table 6). For teacher LLMs, we use reported token pricing as of 12/11/2024.

Inference Time and Cost

To quantify efficiency gains from our knowledge distillation approach, we compared the inference time per note and cost per note of our distilled BERT models against the LLM teacher labelers. The distilled BERT model was the most efficient, with an average inference time of 0.14 seconds per note and a cost of $0.000187 per note, calculated based on an estimated $4.74/hour for a 1xA100 80GB virtual machine (Tables 4 and 6). In contrast, teacher LLMs incurred higher inference times and costs: GPT-4o required 1.66 seconds per note and cost $0.0159 per note; o1-mini was faster at 0.58 seconds per note but cost more, at $0.0189 per note; and Gemini 1.5 Flash was the cheapest among the teacher labelers, with 1.17 seconds per note and $0.000460 per note.
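The arithmetic behind these comparisons can be reproduced from the per-note figures in Table 4 and the hourly GPU rate in Table 6. The sketch below assumes cost is simply inference time multiplied by the hourly rate; the helper name `gpu_cost_per_note` is hypothetical.

```python
HOURLY_RATE_USD = 4.74  # Table 6 average for a 1xA100 80GB virtual machine


def gpu_cost_per_note(seconds_per_note: float, hourly_rate_usd: float) -> float:
    """Cost of one note when billed by GPU time."""
    return seconds_per_note * hourly_rate_usd / 3600


# (inference seconds per note, USD per note), copied from Table 4.
per_note = {
    "Distilled BioBERT": (0.14, 0.000187),
    "GPT-4o": (1.66, 0.0159),
    "o1-mini": (0.58, 0.0189),
    "Gemini 1.5 Flash": (1.17, 0.000460),
}

bert_cost = gpu_cost_per_note(0.14, HOURLY_RATE_USD)

# Speed-up and cost ratios of each LLM relative to the distilled model.
base_time, base_cost = per_note["Distilled BioBERT"]
ratios = {
    name: (round(t / base_time), round(c / base_cost))
    for name, (t, c) in per_note.items()
    if name != "Distilled BioBERT"
}
```

With the rounded 0.14 s figure, `bert_cost` comes out near $0.000184, close to the reported $0.000187; the small gap is plausibly due to rounding of the inference time. The ratios match the speed and cost multiples quoted in the abstract (12x, 4x, 8x faster; 85x, 101x, 2x cheaper).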

Discussion

Our study found that distilled BERT models outperformed their teacher labelers on the disease and medication tasks (though not on symptom extraction) and approached the performance of BERT models fine-tuned on human labels, highlighting the effectiveness of knowledge distillation for clinical NER. In external validation, the distilled BERT models demonstrated strong performance on the medication and disease extraction tasks. Importantly, the distilled BERT models were faster (12x, 4x, and 8x faster than GPT-4o, o1-mini, and Gemini Flash, respectively) and cheaper (85x, 101x, and 2x cheaper than GPT-4o, o1-mini, and Gemini Flash, respectively) than their LLM counterparts, making them a practical alternative for real-world clinical applications. Together, these findings highlight the potential of distillation to facilitate efficient and scalable clinical NER while maintaining high performance.

Unlike other studies, which distilled from a single large model, our study assessed 31 different teacher labeler combinations for each medical NER task and used the best combination to distill down to smaller BERT-based models. Additionally, we assessed the effect of including ontology outputs in the distillation process, finding that their inclusion resulted in poorer performance due to increased false positives. We tested these models on discharge summary and medical research publication data, along with an external dataset, demonstrating generalizability.

This study has several limitations. First, the quality of the teacher labels used to fine-tune the distilled BERT models was variable, particularly for symptoms. The inconsistency in symptom labeling, particularly between the development and test sets, likely contributed to the lower F1 scores observed for symptom extraction. Second, we focused on only three types of entities; other entity types such as procedures, social determinants of health, diagnosis dates, lab values, and vital signs also need to be extracted for comprehensive clinical information extraction. Third, our approach did not address more complex NER tasks, such as capturing assertion status (e.g., negations or hypothetical statements) or relation extraction tasks (e.g., drug-dosage relationships). Fourth, we did not tailor prompts to each model and used the same prompts for all LLMs.[38] Finally, the test sets for our three NER tasks contain errors. As confirmed by others, they frequently contained labels that were inconsistent with the annotation guidelines of their respective datasets.[21][25] As a result, model outputs often did not align with the test set labels, lowering measured performance during evaluation.

An error analysis of the model outputs revealed that human-labeled test sets for all three tasks—medication, disease, and symptom extraction—consistently missed several entities that were correctly identified by the models: 63–82% of the model’s false positives were actually correct, suggesting that the reported precision and F1 scores of our models may be lower bounds.

Conclusion

Our work provides a roadmap for leveraging state-of-the-art LLMs to develop efficient, performant, and generalizable clinical NER models through distillation. Ultimately, this study underscores the potential of distilled BERT models as a computationally efficient and scalable alternative to LLMs for clinical NER, paving the way for broader applications in healthcare information extraction.

References

Author Contributions

The study was conceptualized by KSV and AS. Coding and data analysis were performed by KSV, AS, and AG, while data annotation was carried out by AS and IL. All authors contributed to the interpretation of the data and the writing of the manuscript. Supervision was provided by NHS.

Funding

No funding was obtained for this study.

Competing Interests

AS is a paid advisor to Daybreak Health, holds stock options in Cerebral and Daybreak Health, and holds stock in Roche (RHHVF). NHS reported being a cofounder of Prealize Health (a predictive analytics company), Atropos Health (an on-demand evidence generation company) and serving on the Board of the Coalition for Healthcare AI (CHAI), a consensus-building organization providing guidelines for the responsible use of artificial intelligence in health care. NHS serves as a scientific advisor to Opala, Curai Health, Arsenal Capital and JnJ Innovative Medicines.

Code Availability

Supplementary Material

Error Analysis of Selected Entities

Disease Extraction

False negatives like “remarkable propensity to bacterial infections” and “…rescues the gastrulation defect,” suggest difficulty in identifying generic disease categories. False positives, such as “hypohaptoglobinemic,” indicate challenges in distinguishing disease entities from lab findings.

Medication Extraction

False negatives like “Cardura 2 q.d.” may reflect difficulty in identifying uncommon medication names, while missed phrases like “NG SL PRN” suggest difficulty in identifying abbreviations. Descriptions of medication classes or types were also missed, such as “over-the-counter stool softener” or “pressors”. False positives such as “Bilateral injected sclera” reflect confusion of the anatomic sclera with an injected drug, while “At initail deployment” may indicate confusing a misspelling with a drug name.

Symptom Extraction

False negatives like “acholic stool”, “dark urine”, and “hot flashes” suggest challenges in capturing less common symptoms, while partial mentions like “skin and nail changes” indicate difficulty in handling symptoms embedded within lists. False positives such as “sepsis” and “myelosuppression” reflect confusion between symptoms and diagnoses or treatment side effects.

Semantic Type Unique Identifiers

The following semantic type unique identifiers (TUIs) were used for the ontology teacher labelers.[39]

Entity Type: Medications (Ontology: RxNorm)
  T195 - Antibiotic (antb)
  T123 - Biologically Active Substance (bacs)
  T200 - Clinical Drug (clnd)
  T125 - Hormone (horm)
  T121 - Pharmacologic Substance (phsu)

Entity Type: Diseases (Ontology: SNOMED CT)
  T020 - Acquired Abnormality (acab)
  T190 - Anatomical Abnormality (anab)
  T019 - Congenital Abnormality (cgab)
  T047 - Disease or Syndrome (dsyn)
  T050 - Experimental Model of Disease (emod)
  T037 - Injury or Poisoning (inpo)
  T191 - Neoplastic Process (neop)
  T046 - Pathologic Function (patf)

Entity Type: Symptoms (Ontology: SNOMED CT)
  T184 - Sign or Symptom (sosy)

Table 5: TUIs used for each entity type and ontology.

Data Tables

Vendor           1xA100 80GB cost per hour (USD)  Notes
Google Cloud     $5.58                            Available in europe-west4 region.
Amazon AWS       $8.19                            Based on pricing for A100 40GB x8 ($32.77/hr) available in us-east-1 region.
Microsoft Azure  $3.67                            Available in East US region.
OVHcloud         $3.07
Paperspace       $3.18                            Based on pricing for A100 80GB x8 ($25.44/hr).
Average          $4.74

Table 6: GPU cost estimates across cloud vendors, published September 20, 2024 by Data Crunch.


| Model | F1-Score | Precision | Recall |
|---|---|---|---|
| o1-mini | 0.787 | 0.724 | 0.862 |
| o1-mini + ontology | 0.773 | 0.686 | 0.885 |
| o1-mini + gpt-4o | 0.760 | 0.652 | 0.911 |
| o1-mini + ontology + gpt-4o | 0.748 | 0.629 | 0.923 |
| gpt-4o | 0.748 | 0.717 | 0.781 |
| ontology + gpt-4o | 0.738 | 0.682 | 0.803 |
| o1-mini + gemini-1.5-flash | 0.706 | 0.583 | 0.894 |
| o1-mini + gemini-1.5-flash + ontology | 0.693 | 0.561 | 0.905 |
| o1-mini + gemini-1.5-flash + gpt-4o | 0.686 | 0.547 | 0.918 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o | 0.677 | 0.532 | 0.928 |
| gemini-1.5-flash + gpt-4o | 0.664 | 0.562 | 0.811 |
| gemini-1.5-flash + ontology + gpt-4o | 0.658 | 0.546 | 0.830 |
| gemini-1.5-flash + ontology | 0.638 | 0.580 | 0.707 |
| gemini-1.5-flash | 0.634 | 0.605 | 0.665 |
| o1-mini + gpt-4o-mini | 0.632 | 0.487 | 0.901 |
| o1-mini + gpt-4o + gpt-4o-mini | 0.622 | 0.469 | 0.924 |
| o1-mini + ontology + gpt-4o-mini | 0.622 | 0.473 | 0.909 |
| o1-mini + ontology + gpt-4o + gpt-4o-mini | 0.613 | 0.458 | 0.927 |
| o1-mini + gemini-1.5-flash + gpt-4o-mini | 0.613 | 0.461 | 0.914 |
| o1-mini + gemini-1.5-flash + gpt-4o + gpt-4o-mini | 0.603 | 0.446 | 0.928 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o-mini | 0.601 | 0.447 | 0.917 |
| gpt-4o + gpt-4o-mini | 0.598 | 0.466 | 0.831 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o + gpt-4o-mini | 0.594 | 0.436 | 0.931 |
| ontology + gpt-4o + gpt-4o-mini | 0.590 | 0.455 | 0.839 |
| gemini-1.5-flash + gpt-4o + gpt-4o-mini | 0.581 | 0.443 | 0.846 |
| gemini-1.5-flash + ontology + gpt-4o + gpt-4o-mini | 0.575 | 0.433 | 0.854 |
| gemini-1.5-flash + gpt-4o-mini | 0.566 | 0.450 | 0.762 |
| ontology + gpt-4o-mini | 0.564 | 0.458 | 0.732 |
| gemini-1.5-flash + ontology + gpt-4o-mini | 0.561 | 0.438 | 0.779 |
| gpt-4o-mini | 0.557 | 0.466 | 0.694 |
| ontology | 0.499 | 0.761 | 0.371 |

Table 7: Performance metrics (F1-score, precision, and recall) for all model configurations evaluated on the Disease Extraction task, sorted by F1-score from highest to lowest.
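
The F1-scores in these tables follow from the precision and recall columns as their harmonic mean. The short sketch below recomputes three rows of Table 7 from their precision and recall values; the function name is chosen for this example only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 7.
rows = {
    "o1-mini": (0.724, 0.862),
    "gpt-4o": (0.717, 0.781),
    "ontology": (0.761, 0.371),
}

for model, (precision, recall) in rows.items():
    print(f"{model}: F1 = {f1_score(precision, recall):.3f}")
```

The ontology row shows why F1 is the sorting key: its precision is the highest in the table, yet its low recall pulls the harmonic mean to the bottom.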

| Model | F1-Score | Precision | Recall |
|---|---|---|---|
| gemini-1.5-flash + gpt-4o | 0.881 | 0.947 | 0.824 |
| gemini-1.5-flash + gpt-4o + gpt-4o-mini | 0.872 | 0.896 | 0.849 |
| gemini-1.5-flash + ontology + gpt-4o | 0.870 | 0.865 | 0.876 |
| gemini-1.5-flash + ontology | 0.862 | 0.876 | 0.848 |
| gemini-1.5-flash + gpt-4o-mini | 0.859 | 0.912 | 0.811 |
| ontology + gpt-4o | 0.857 | 0.869 | 0.845 |
| gemini-1.5-flash + ontology + gpt-4o + gpt-4o-mini | 0.856 | 0.824 | 0.889 |
| gemini-1.5-flash + ontology + gpt-4o-mini | 0.854 | 0.835 | 0.874 |
| gemini-1.5-flash | 0.852 | 0.969 | 0.760 |
| ontology + gpt-4o + gpt-4o-mini | 0.848 | 0.826 | 0.872 |
| gpt-4o + gpt-4o-mini | 0.848 | 0.897 | 0.804 |
| gpt-4o | 0.838 | 0.955 | 0.747 |
| ontology + gpt-4o-mini | 0.826 | 0.833 | 0.819 |
| ontology | 0.766 | 0.868 | 0.684 |
| gpt-4o-mini | 0.762 | 0.906 | 0.657 |
| o1-mini + gemini-1.5-flash + gpt-4o | 0.707 | 0.597 | 0.867 |
| o1-mini + gemini-1.5-flash + gpt-4o + gpt-4o-mini | 0.702 | 0.582 | 0.883 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o | 0.699 | 0.571 | 0.900 |
| o1-mini + gemini-1.5-flash | 0.696 | 0.595 | 0.839 |
| o1-mini + gemini-1.5-flash + ontology | 0.696 | 0.572 | 0.890 |
| o1-mini + gemini-1.5-flash + gpt-4o-mini | 0.694 | 0.581 | 0.862 |
| o1-mini + ontology + gpt-4o | 0.693 | 0.570 | 0.885 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o + gpt-4o-mini | 0.691 | 0.557 | 0.910 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o-mini | 0.691 | 0.558 | 0.905 |
| o1-mini + gpt-4o + gpt-4o-mini | 0.690 | 0.577 | 0.858 |
| o1-mini + gpt-4o | 0.688 | 0.589 | 0.827 |
| o1-mini + ontology + gpt-4o + gpt-4o-mini | 0.688 | 0.556 | 0.901 |
| o1-mini + ontology + gpt-4o-mini | 0.685 | 0.557 | 0.891 |
| o1-mini + ontology | 0.675 | 0.563 | 0.844 |
| o1-mini + gpt-4o-mini | 0.669 | 0.569 | 0.811 |
| o1-mini | 0.611 | 0.551 | 0.685 |

Table 8: Performance metrics (F1-score, precision, and recall) for all model configurations evaluated on the Medication Extraction task, sorted by F1-score from highest to lowest.

| Model | F1-Score | Precision | Recall |
|---|---|---|---|
| gemini-1.5-flash + gpt-4o | 0.801 | 0.871 | 0.741 |
| gpt-4o | 0.787 | 0.900 | 0.700 |
| gemini-1.5-flash + gpt-4o + gpt-4o-mini | 0.784 | 0.810 | 0.759 |
| o1-mini + gpt-4o | 0.778 | 0.752 | 0.806 |
| o1-mini + gemini-1.5-flash + gpt-4o | 0.770 | 0.734 | 0.809 |
| gemini-1.5-flash + ontology + gpt-4o | 0.768 | 0.787 | 0.750 |
| gpt-4o + gpt-4o-mini | 0.767 | 0.819 | 0.722 |
| o1-mini + gpt-4o + gpt-4o-mini | 0.764 | 0.716 | 0.819 |
| o1-mini + gpt-4o-mini | 0.763 | 0.725 | 0.806 |
| o1-mini | 0.763 | 0.772 | 0.753 |
| o1-mini + gemini-1.5-flash + gpt-4o + gpt-4o-mini | 0.758 | 0.706 | 0.819 |
| ontology + gpt-4o | 0.758 | 0.790 | 0.728 |
| o1-mini + gemini-1.5-flash + gpt-4o-mini | 0.758 | 0.715 | 0.806 |
| gemini-1.5-flash + ontology + gpt-4o + gpt-4o-mini | 0.754 | 0.742 | 0.766 |
| o1-mini + gemini-1.5-flash | 0.754 | 0.742 | 0.766 |
| o1-mini + ontology + gpt-4o | 0.747 | 0.689 | 0.816 |
| gemini-1.5-flash + gpt-4o-mini | 0.746 | 0.828 | 0.678 |
| ontology + gpt-4o + gpt-4o-mini | 0.744 | 0.744 | 0.744 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o | 0.744 | 0.683 | 0.816 |
| o1-mini + ontology + gpt-4o + gpt-4o-mini | 0.738 | 0.668 | 0.825 |
| o1-mini + ontology + gpt-4o-mini | 0.738 | 0.675 | 0.813 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o + gpt-4o-mini | 0.735 | 0.663 | 0.825 |
| o1-mini + gemini-1.5-flash + ontology + gpt-4o-mini | 0.734 | 0.670 | 0.813 |
| o1-mini + ontology | 0.731 | 0.694 | 0.772 |
| gemini-1.5-flash + ontology + gpt-4o-mini | 0.730 | 0.756 | 0.706 |
| o1-mini + gemini-1.5-flash + ontology | 0.728 | 0.688 | 0.772 |
| gemini-1.5-flash | 0.710 | 0.903 | 0.584 |
| gemini-1.5-flash + ontology | 0.697 | 0.798 | 0.619 |
| ontology + gpt-4o-mini | 0.648 | 0.728 | 0.584 |
| gpt-4o-mini | 0.598 | 0.797 | 0.478 |
| ontology | 0.480 | 0.723 | 0.359 |

Table 9: Performance metrics (F1-score, precision, and recall) for all model configurations evaluated on the Symptom Extraction task, sorted by F1-score from highest to lowest.
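
Across Tables 7 to 9, adding a labeler to a configuration tends to raise recall and lower precision, which is what one would expect if the "+" configurations pool the entities found by each labeler. The sketch below illustrates that effect on a toy example. It assumes a simple set union as the combination rule, which this appendix does not spell out, and every entity string in it is made up for illustration.

```python
def union_labels(*label_sets):
    """Pool the entities proposed by several labelers (assumed combination rule)."""
    combined = set()
    for labels in label_sets:
        combined |= set(labels)
    return combined


def precision_recall(predicted, gold):
    """Set-level precision and recall of predicted entities against gold entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, not taken from any dataset in the paper.
gold = {"fever", "cough", "nausea", "pain"}
labeler_a = {"fever", "cough", "sepsis"}
labeler_b = {"nausea", "fever", "anemia"}

print(precision_recall(labeler_a, gold))
print(precision_recall(labeler_b, gold))
print(precision_recall(union_labels(labeler_a, labeler_b), gold))
```

In this toy case each labeler alone reaches a recall of 0.5, while the union reaches 0.75 at a lower precision, mirroring the direction of the trade-off seen in the tables.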
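
| Task | Model | F1-Score | NPV | PPV | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| Disease Extraction | BaseBERT | 0.830 | 0.990 | 0.785 | 0.890 | 0.980 |
| Disease Extraction | BioBERT | 0.865 | 0.990 | 0.825 | 0.910 | 0.985 |
| Disease Extraction | BioClinBERT | 0.830 | 0.990 | 0.780 | 0.890 | 0.975 |
| Medication Extraction | BaseBERT | 0.885 | 1.000 | 0.885 | 0.885 | 1.000 |
| Medication Extraction | BioBERT | 0.890 | 1.000 | 0.890 | 0.890 | 1.000 |
| Medication Extraction | BioClinBERT | 0.890 | 1.000 | 0.895 | 0.890 | 1.000 |
| Symptom Extraction | BaseBERT | 0.330 | 0.985 | 0.365 | 0.295 | 1.000 |
| Symptom Extraction | BioBERT | 0.340 | 0.985 | 0.400 | 0.295 | 1.000 |
| Symptom Extraction | BioClinBERT | 0.340 | 0.985 | 0.385 | 0.305 | 1.000 |

Table 10: Performance of BaseBERT, BioBERT, and BioClinBERT for each NER task, averaged across teacher labels and human labels.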
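
For readers less familiar with the columns of Table 10, the sketch below derives all five metrics from the four cells of a binary confusion matrix. It assumes a token-level evaluation in which "entity" is the positive class and "O" the negative class; the counts are invented for illustration and do not correspond to any row of the table.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute PPV, NPV, sensitivity, specificity, and F1 from confusion counts."""

    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN rather than raising an error.
        return numerator / denominator if denominator else float("nan")

    return {
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # recall
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts: most tokens are non-entity ("O"), as in typical NER data.
metrics = binary_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because non-entity tokens dominate, NPV and specificity stay close to 1 even when PPV and sensitivity are modest, which is the pattern visible in the Symptom Extraction rows.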

| Task | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| Disease | 61.3 | 89.0 | 72.6 |
| Medication | 81.5 | 96.4 | 88.3 |
| Symptom | 92.9 | 56.0 | 69.9 |

Table 11: Performance of the best performing distilled BioBERT model on the external validation dataset.

| Dataset | Task | Split | N | Tokens per document (min) | Tokens per document (median) | Tokens per document (max) | Labels (O) | Labels (entity) |
|---|---|---|---|---|---|---|---|---|
| NCBI | Disease | Train | 593 | 36 | 202 | 555 | 98637 | 7507 |
| NCBI | Disease | Test | 100 | 72 | 203.5 | 487 | 17832 | 1385 |
| NCBI | Disease | Dev | 100 | 83 | 202.5 | 366 | 17724 | 1185 |
| n2c2 | Medication | Train | 303 | 159 | 2479 | 9285 | 642067 | 18099 |
| n2c2 | Medication | Test | 202 | 191 | 2576 | 8705 | 419779 | 11995 |
| n2c2 | Medication | Dev | 25 | 168 | 2560 | 5404 | 54512 | 1553 |
| CORAL | Symptom | Train | 200 | 631 | 2224 | 6818 | - | - |
| CORAL | Symptom | Test | 35 | 1087 | 2374 | 4939 | 78887 | 1402 |
| CORAL | Symptom | Dev | 5 | 1762 | 2509 | 3167 | 10586 | 274 |
| MIMIC-III | - | Train | 1000 | 42 | 315.5 | 3008 | - | - |
| MedAlign | All | Test | 746 | 20 | 378.5 | 3047 | - | - |

Table 12: Summary statistics for all clinical document datasets, including document counts (N), token counts per document, and entity label counts based on ground truth human labels. Development splits were used for prompt engineering, training splits were used for generating teacher labels, and testing splits were used for evaluation. Abbreviations: "O" = "Outside" (indicating non-entity).

| Dataset | Cohen's Kappa |
|---|---|
| NCBI | 0.88 |
| n2c2 | 0.86 |
| CORAL | 0.67 |
| MedAlign | 0.61 |

Table 13: Inter-rater reliability results for error analysis and external validation (MedAlign). Cohen's Kappa for MedAlign is a multiclass metric.
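
Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance given each rater's label frequencies. The following minimal sketch implements the standard formula in plain Python; it handles any number of classes, so it also covers the multiclass case mentioned for MedAlign. The label sequences are toy data, not annotations from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (binary or multiclass)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations from two hypothetical raters.
rater_1 = ["O", "O", "DISEASE", "DISEASE"]
rater_2 = ["O", "O", "DISEASE", "O"]
print(cohens_kappa(rater_1, rater_2))
```

Here the raters agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.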

Prompts

The following prompts yielded the highest F1 score on the development sets for each task. GPT-4o was used for all prompt tuning.

Medication Extraction

List all medications, drugs, and drug classes mentioned in the following clinical note. Illicit drugs and alcohol should not be listed.

1. Make sure to include:
   (a) Specific medication names (both brand and generic). If present in the note, include the full name, like tiotropium bromide or albuterol sulfate, instead of just tiotropium or albuterol.
   (b) Drug class names, both singular and plural, including (but not limited to): NSAID, anticoagulant, pain medication, PPI, steroids, antibiotics, ACE inhibitors, pressors, sedating medications, etc.
   (c) Substances that are injected or infused, such as: Fluids, iron, contrast dye, red blood cells (pRBC), platelets, etc.
   (d) Substances that are inhaled, such as: Oxygen, FIO2, nebulized medications, inhaled bronchodilators, etc.
2. Do not include:
   (a) Medical devices or equipment, such as: Inhaler, nebulizer, BiPAP machine, etc.
   (b) Modes of administration or formulations, such as: IV, drip, gtt (drops), liquid form, transfused, supplement, etc.
   (c) Methods of delivery or routes of administration, unless part of the medication’s name.
   (d) General descriptors or measurements (e.g., units, mg, ml, 80, 30) unless these are part of a medication’s name.
3. Additional Notes: Avoid listing terms like increased, solution (Soln), isotonic, or daycare, which are not medications or drug classes. Include only pharmacologically relevant terms (e.g., antibiotic, anticoagulant, steroid).
4. Examples of Challenging Cases:
   (a) False Positives to Avoid: Oxygen as a standalone word unless explicitly used as a therapy or treatment. Do not list FIO2 unless explicitly described as oxygen therapy. Avoid terms like isotonic or Sodium unless part of a drug name (e.g., heparin sodium).
   (b) False Negatives to Include: Drug classes (e.g., steroids, antibiotics, ACE inhibitors). Medications with brand or generic names (e.g., albuterol, tiotropium bromide, Zithromax). Injectable medications (e.g., ceftriaxone, heparin).

Output Format: Output a string delimited by // to separate the medications. Write them exactly as they were written in the note. Do not output anything else.

Example: Note: "Patient is experiencing muscle pain, secondary to statin therapy for coronary artery disease. The patient suffers from steroid-induced hyperglycemia. Patient prescribed 1 x 20 mg Prednisone tablet daily for 5 days. Patient has been switched to lisinopril tablet 10mg 1 tablet PO QD. Patient received 100 Units/kg IV heparin sodium injection for treatment of deep vein thrombosis. Sulfa (sulfonamide antibiotics). Tylenol (Acetaminophen) B.i.d. (twice a day)."

Output: {{"entities": "statin // steroid // Prednisone // lisinopril // heparin sodium // Sulfa (sulfonamide antibiotics) // Tylenol (Acetaminophen)", "rationale": "All medication names, including drug classes and drug names in parentheses, were extracted. Dosages (eg. 20mg) and other administration information (eg. injection) were not extracted as per the instructions."}}

Here is the note: {note}
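
The doubled braces in the example output and the {note} placeholder suggest that the prompt is a Python format string, although the appendix does not say so. Under that assumption, the sketch below shows how such a template could be filled and how a JSON response with an "entities" key could be split on the // delimiter. The shortened template, the function names, and the sample response are all hypothetical.

```python
import json

# Hypothetical, heavily shortened stand-in for the full medication prompt.
# "{{" and "}}" render as literal braces; "{note}" is the only placeholder.
PROMPT_TEMPLATE = (
    "List all medications mentioned in the following clinical note.\n\n"
    'Output: {{"entities": "statin // steroid"}}\n\n'
    "Here is the note: {note}"
)


def build_prompt(note):
    """Insert the clinical note into the template."""
    return PROMPT_TEMPLATE.format(note=note)


def parse_entities(response_text):
    """Parse a JSON response and split its 'entities' string on '//'."""
    payload = json.loads(response_text)
    return [part.strip() for part in payload["entities"].split("//") if part.strip()]


prompt = build_prompt("Patient started on lisinopril 10mg daily.")
sample_response = '{"entities": "statin // steroid // Prednisone", "rationale": "..."}'
print(prompt)
print(parse_entities(sample_response))
```

A production pipeline would also need to handle responses that are not valid JSON, which this sketch leaves out.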

Symptom Extraction

List all symptoms explicitly mentioned in the following clinical note.

1. Include:
   - All mentions of symptoms or complaints, such as "fatigue," "nausea," "vomiting," or "pain."
   - Negated symptoms (e.g., "denies nausea" should still include "nausea").
   - Minimal entity spans: Use the simplest terms that convey the symptom information (e.g., "nausea" instead of "the patient presents with nausea").
   - Severity modifiers, when relevant (e.g., "minor fatigue," "severe pain").
2. Do not include:
   - Signs or clinical findings that are not symptoms (e.g., "hyperbilirubinemia," "ascites").
   - Information about the location of the symptom (e.g., "low back pain" → Include only "pain").
   - Adjectives or descriptors unrelated to the symptom itself (e.g., "low," "new," "right-sided").
   - Conjunctions, prepositions, or other grammatical words unrelated to the symptoms (e.g., "and," "of," "at").
3. Edge Cases:
   - For combinations of symptoms (e.g., "nausea and vomiting"), list each symptom separately (e.g., "nausea // vomiting").
   - Avoid listing any context or causes of the symptom. For example:
     "pain secondary to surgery" → Include only "pain."
     "headache from dehydration" → Include only "headache."

Make sure to list any symptoms mentioned, even those that the patient is negative for.

Output Format: Output a JSON with two keys:
1. "entities": a single string of symptoms, separated by `//`. Write the symptoms exactly as they appear in the note.
2. "negated_symptoms_included": a string, affirming that all negated symptoms were included

Example: Note: "Patient is experiencing muscle pain, lower back pain, and fatigue, secondary to statin therapy for coronary artery disease. Patient denies nausea and vomiting. The patient suffers from steroid-induced hyperglycemia. Negative for fever, weight loss, and dysuria. Patient prescribed 1 x 20 mg Prednisone tablet daily for 5 days. Patient has been switched to lisinopril tablet 10mg 1 tablet PO QD. Patient received 100 Units/kg IV heparin sodium injection for treatment of deep vein thrombosis. Sulfa (sulfonamide antibiotics). Tylenol (Acetaminophen) B.i.d. (twice a day)."

Output: "entities": "pain // fatigue // nausea // vomiting // fever // weight loss // dysuria", "negated_symptoms_included": "Yes, all negated symptoms were listed (nausea, vomiting, fever, weight loss, dysuria)"

Here is the note: {note}

Disease Extraction

List all diseases, disorders, and clinical conditions mentioned in the following clinical note.

1. Include:
   - Specific diseases and disorders
     Examples: "ataxia-telangiectasia", "Phenylketonuria", "Aniridia"
   - Disease categories or classes
     Examples: "Inherited human disease", "Chromosome abnormalities", "Cancer"
   - Composite mentions indicating diseases or conditions
     Examples: "Combined deficiency of C6 and C7", "Stage II colorectal carcinoma", "Segmental necrotizing glomerulonephritis"
   - Modifiers that describe diseases or conditions
     Examples: "Deficiency of hepatic phenylalanine hydroxylase", "Myocardial lesions", "DCC-negative tumors"
   - Abbreviations or acronyms referring to diseases or conditions
     Examples: "A-T" for ataxia-telangiectasia, "HD" for Huntington’s disease
   - Plural forms and variations of disease terms
     Examples: "Lipomas", "Cancers", "Tumors", "Pleural effusions"
   - Symptoms, signs, and clinical findings
     Examples: "Bradycardia", "Hypotension", "Pleural effusion", "Tonic-clonic seizures", "Dyspnoea", "Chest pain", "Fever", "Neurologic impairment", "Fatiguability", "Dizziness", "Syncopal attacks", "Blindness", "Eosinophilia", "Urinary abnormalities", "Red eyes", "Asystolic", "Overdose", "Toxicity", "Stable angina"
2. Exclude:
   - Genetic mutations or hypotheses without explicit disease mention
     Examples: "Disease-causing mutations", "Hypothesis of a defective gene"
   - Hypothetical or unconfirmed conditions
     Examples: "Hypothesis of a defective C9", "Compound heterozygote for uncharacterized genes"
   - Traits or responses not specifying a disease
     Examples: "Radio-sensitive phenotype", "Defective cell cycle checkpoints"
   - Descriptions of biological processes or impairments not representing a specific disease
     Examples: "Functional impairment", "T-cell-dependent immune responses", "Secretion abnormalities"
   - General observations or modifiers
     Examples: "Reduced immune function", "Impaired secretion"
   - Broad functional or descriptive terms unless tied directly to a disease
     Examples: "Impairment", "Deficiency" (unless part of a recognized condition like "T-cell deficiency")
3. Additional Instructions for Acronyms: Focus on identifying acronyms that represent diseases, disorders, and findings. Ensure no acronyms are omitted from the output.

Output Format: Output a string delimited by `//` to separate the diseases. Write them exactly as they were written in the note. Do not output anything else.

Here is the note: {note}
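
Because the prompts ask for entities written exactly as they appear in the note, a delimited response can be mapped back to character positions by exact string matching. The sketch below shows one way this could be done for the plain // delimited output requested by the disease prompt. It is an assumed post-processing step for illustration, not the authors' documented pipeline, and the note and response are invented.

```python
import re


def split_entities(output):
    """Split a '//'-delimited model response into trimmed entity strings."""
    return [part.strip() for part in output.split("//") if part.strip()]


def find_spans(note, entities):
    """Locate every exact occurrence of each entity as (start, end, text)."""
    spans = []
    for entity in entities:
        # re.escape keeps characters such as '-' or '(' from acting as regex syntax.
        for match in re.finditer(re.escape(entity), note):
            spans.append((match.start(), match.end(), entity))
    return sorted(spans)


note = "History of A-T. Lipomas noted; no cancer."
response = "A-T // Lipomas"
for start, end, text in find_spans(note, split_entities(response)):
    print(start, end, text)
```

Exact matching is case-sensitive and ignores word boundaries, so a real pipeline would likely need extra rules for short acronyms and overlapping mentions.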