1 Introduction

The number of digital texts in the medical field has increased rapidly due to continuous change and development in the internet and technology [1,2,3,4]. Biomedical information and literature, which are critical and priceless in managing global diseases, are accessible in the form of scientific papers, clinical reports, patient health records, and web documents [1, 5, 6]. A wide range of sources provides this information, including electronic biomedical research databases, online medical reports, and electronic health record systems (EHRs) [1, 7]. For example, the number of biomedical papers published in the PubMed database exceeds 30 million, with an annual increase of over 1 million new papers [2]. In the case of COVID-19, the volume of research papers has grown significantly since the pandemic started in November 2019 [8]. In addition, as of November 2019 the Medline database contained over 24 million articles from over 5500 biomedical journals [7]. Navigating this massive volume of information manually is impossible. Researchers and clinicians who search these sources to understand current practices in a specific field, access recent works, capture new ideas, conduct experiments, and obtain results find it very challenging to retrieve and read all the relevant information they seek [4, 9].

Automatic text summarization (ATS) has been proposed to assist clinicians and researchers in extracting relevant information from large collections of biomedical literature [10]. ATS reduces long textual documents into shorter versions while preserving their essential meaning and informational content [11]. ATS provides significant benefits in the biomedical field. It manages information overload by providing concise summaries that enable researchers and healthcare professionals to capture the essential information (such as potential treatments, key trends, gaps, and findings) without reading the entire document, which saves time and improves productivity [7, 8, 12,13,14,15]. The summaries produced by ATS can also be utilized in other fields of study such as information retrieval, text classification, and question answering [2]. Although abstracts are often included in scientific papers, there are several reasons to generate summaries from full-text sources. First, no single ideal summary exists, since users differ in their information needs and domains. Second, a document’s abstract may omit important content present in the full text. Finally, customized summaries can be valuable in question-answering systems, where they provide personalized information [16].

ATS has two main methods: extractive and abstractive summarization. The first identifies and retrieves important sentences from the source material to build a concise summary [17]. In extractive summarization, it is difficult to maintain coherence between sentences and to simplify lengthy, complex sentences [18]. In contrast, abstractive summarization rewrites the essential parts of the source text into new sentences [19]. For summarizing biomedical research papers, the extractive method is more suitable than the abstractive one, as it keeps the original terminology and vocabulary used by researchers, which guarantees the accuracy of the facts represented in the source text. Conversely, clinical documents and EHRs are better suited to abstractive summarization, which helps medical professionals understand patient reports in a more concise and human-readable form [6, 20].

Traditional summarization techniques relied on simple term-frequency approaches and generic attributes such as title relevance, sentence position, sentence length, extracted keywords, and numerical content to extract salient sentences from the source document. However, these generic features were less effective for summarizing biomedical documents [21]. Another critical issue is the duplication of sentences in the produced summary, along with a lack of coherence and semantic accuracy. In addition, summarizing lengthy documents that contain multiple subtopics is one of the most significant challenges for ATS, as it leads to a lack of diversity in the summarized content. Traditional summarization methods often produce biased and partial summaries because they cannot capture all the subtopics represented in the source documents [22, 23]. Many studies address redundancy and coherence using techniques such as attention mechanisms, coverage head models, and secondary encoders. While these eliminate repetition to some extent, they also make summaries excessively biased toward certain subtopics covered in the text, especially for lengthy documents and multi-document summarization. Recent works used topic modeling and clustering to address this problem by picking an equal number of sentences according to the distribution of topics, without ranking the sentences within each topic [24]. Therefore, this work focuses on subtopics by using K-medoid clustering of sentence vectors and scoring sentences within each subtopic.

Based on the above, we introduce an unsupervised methodology that combines topic modeling and clustering with bidirectional encoder representations from transformers (BERT) for extractive summarization of a single document. The first stage of the proposed methodology improves text readability by preprocessing the source document. Next, latent Dirichlet allocation (LDA) is applied to identify hidden topics in the document, with the coherence measure employed to optimize the number of topics. The identified topics are then distributed over sentences, so that every sentence is assigned to a specific topic and sentences related to the same topic are grouped together. BERT is incorporated to transform the text into deep conceptualized embeddings for accurate sentence vectorization. The vectorized sentences are then fed to K-medoid clustering to extract the top representative sentences, which are finally used to construct the summary.

The major contributions of this study are:

  1. A new corpus comprising 200 biomedical research papers on knee osteoarthritis management was collected and introduced for extractive summarization of single documents.
  2. An unsupervised methodology is introduced and effectively evaluated on the new corpus for single-document extractive summarization.
  3. The coherence measure is integrated into the proposed methodology to allocate subtopics.
  4. Different variants of BERT were tested and compared to select the best one.
  5. The use of topic modeling and clustering in conjunction with BERT yielded better results compared to prior efforts in developing topic-modeled ATS systems.
  6. Two types of evaluation (qualitative and quantitative) were utilized.

This paper is structured as follows. Section 2 presents a comprehensive review of relevant studies. The methodology is described in Sect. 3. Results and discussion are provided in Sect. 4. Finally, closing remarks and an outline of potential areas for future work are presented in Sect. 5.

2 Related work

ATS is a branch of natural language processing (NLP) in which a computer creates a summary of one or more documents, ensuring that the summary aligns with the main topic and concept of the source text [25]. Different factors can be used to classify ATS systems, as described in Fig. 1. Based on the input, summarization can be single-document or multi-document: single-document summarization condenses each document individually, while multi-document summarization condenses several documents together into one summary [26]. The summarization method can be extractive, abstractive, or hybrid. The extractive method ranks and identifies the most significant sentences in the document to provide a short, representative summary [27]. In contrast, abstractive summarization redrafts the key ideas of the document instead of selecting individual sentences [28]. Hybrid summarization starts by selecting important sentences from the text to generate an extractive summary, then applies abstractive methods to rewrite and convert the extractive summary into an abstractive one [29]. In addition, the content of the summary can be classified as indicative or informative: an indicative summary gives a brief idea of the topic and issues presented in the document, whereas an informative summary provides more complete coverage of the information in the input document [1, 30]. The purpose of summarization is another classification factor; it can be generic or query-based. Generic summarization summarizes the overall information content of the source text, while query-based summarization answers a specific user query, focusing on information related to that query [31, 32].

Fig. 1 Different factors for classifying ATS systems [33]

Extractive summarization of biomedical documents is usually preferred over abstractive summarization due to its ease of sentence extraction and higher accuracy [1]. Extractive methods are categorized into (1) statistical-based techniques, (2) concept-based techniques, (3) topic-based techniques, (4) graph-based techniques, and (5) machine learning methods [2, 32]. Recent studies in the biomedical domain have used these methods to improve the quality of the produced summaries. An unsupervised method based on semantic similarity and keyword extraction was proposed for single- and multi-document extractive summarization; it combined a concept map with the RAKE approach to produce summaries for 1040 biomedical transcripts [34]. Several studies combined itemset mining with domain knowledge to create a concept model for summarizing biomedical documents [35, 36]. Moradi et al. [7] introduced a Bayesian summarizer that mapped text into unified medical language system (UMLS) concepts and incorporated six different features to identify the critical concepts; this method was evaluated on a medical corpus of 400 biomedical documents. Moradi et al. [37] developed a graph-based method in which the Helmholtz principle was used to identify the essential concepts in the text, and a graph built from these concepts captured the essential sentences for the summary. Davoodijam et al. [2] proposed a multi-layer graph called MultiGBS that incorporates the MultiRank algorithm for selecting sentences from multiple layers to produce an extractive summary of a single biomedical document. Other studies leveraged graph-based methods that combine itemset mining and sentence clustering to enhance biomedical text summarization. For example, Rouane et al. [30] used UMLS to represent biomedical articles as collections of concepts; related sentences were grouped using the K-means clustering algorithm, the Apriori model was then used to identify common itemsets among the grouped sentences, and finally important sentences were picked from each group to generate an extractive summary. Azadani et al. [3] combined minimum spanning tree-based clustering with frequent itemset mining for extractive summarization. The CIBS summarizer, introduced in 2018 for extractive summarization, used itemset mining to extract the main topics and then clustered sentences with the same topic together; the extractive summary includes sentences from each cluster, ensuring that all topics are covered [38].

Recently, many studies have incorporated pre-trained language models (PLMs) for extractive summarization of biomedical documents. For instance, Du et al. [39] proposed BioBERTSum, a PLM encoder optimized and fine-tuned for biomedical extractive summarization tasks. Kanwal et al. [40] fine-tuned BERT on the MIMIC-III dataset for extractive summarization of digital health records. Moradi et al. [21] utilized a hierarchical clustering method to group contextual embeddings of phrases obtained with the BERT encoder; the most informative sentences from each group are selected to construct the final summary. Padmakumar et al. [41] introduced an unsupervised extractive summarization approach that encodes phrases using the GPT-2 model and uses pointwise mutual information (PMI) to determine semantic similarity between texts; this approach was evaluated on a medical journal dataset. A new approach combining graph-based methods with the domain-specific word embeddings of BioBERT for summarizing biomedical articles was proposed by Moradi et al. [6]. Xie et al. [42] proposed the KeBioSum framework for biomedical extractive summarization, which improved the performance of PLMs by incorporating fine-grained domain knowledge (PICO components) and employing sophisticated training approaches. CovSumm is an unsupervised approach that leverages the strengths of both transformer-based models and graph-based methods for summarizing COVID-19 literature [8]. Overall, many PLMs have been pre-trained specifically on biomedical texts, such as BioBERT [43], PubMed BERT [44], SciBERT [45], BlueBERT [46], ClinicalBERT [47], and ALBERT [48]. Meng et al. [49] suggested splitting the knowledge graph into subgraphs and injecting them into several PLMs such as BioBERT, SciBERT, and PubMed BERT.

Topic modeling was first introduced in 2000 by [50]. It plays a crucial role in enhancing text summarization, especially for long, complex documents with topic diversity such as biomedical research papers. Topic modeling identifies the main topics or themes within a document or collection of documents, which guarantees that the generated summaries comprehensively cover the key concepts represented in the source document(s) [51]. To avoid topic bias in summaries generated from long or multiple documents, topic modeling is incorporated to manage topic diversity [22]. Few studies have applied topic modeling to summarizing medical documents. A study by [52] proposed a method for extractive summarization of long documents; the framework leverages topic information to capture dependencies in long documents using a heterogeneous graph neural network and was evaluated on three datasets: PubMed, arXiv, and GovReport. Xie et al. [53] also integrated domain knowledge and graph-based topic modeling into a transformer architecture for biomedical text summarization.

In fields other than biomedicine, Issam et al. [54] combined topic modeling and TextRank for summarizing the WikiHow dataset. Liu et al. [55] combined topic modeling and statistical features for summarizing multiple documents. Srivastava et al. [22] examined a novel method that combines clustering with topic modeling for single-document extractive summarization and evaluated it on three datasets: DUC2002, WikiHow, and CNN-DailyMail.

The following conclusions were drawn from the previous extractive summarization review:

  1. The benefits of topic modeling in summarizing documents with topic diversity are well known.
  2. The benefits of organizing sentences to determine the most significant ones are well established.
  3. PLMs pre-trained on biomedical corpora have proven effective in enhancing the quality of extractive summaries.
  4. Although some studies have employed LDA in topic-modeling summarization, few have tried to tune the model to identify an appropriate number of subtopics.

Based on these conclusions, this paper proposes a new methodology for summarizing a single biomedical document using unsupervised extractive summarization. The proposed methodology employs LDA to find and assign a topic to every sentence in the document. To enhance performance, we used the coherence score to find a suitable number of topics in the source document. Once each sentence is assigned to a specific topic, sentences with the same topic are grouped together. For each topic, we tested variants of BERT (S-BERT, BlueBERT, SciBERT, PubMed BERT) to map the text in each group to its conceptualized embedding; each group is then clustered using K-medoid clustering. Finally, the complete summary is constructed by taking the top n sentences from each group. Section 3 gives more details about the proposed method.

3 Proposed methodology

3.1 Summarization method

The proposed biomedical extractive summarization method consists of four phases: (1) document preprocessing, (2) topic assignment, (3) deep conceptualized embedding, and (4) clustering and sentence selection as shown in Fig. 2. The experiments and all tests were performed on Google Colaboratory with the following computational resources: NVIDIA T4 GPU, 12.7 GB RAM, 2 vCPUs, and approximately 107 GB of disk space.

Fig. 2 Overview of our biomedical extractive text summarization method

3.1.1 Document preprocessing

The proposed summarization method begins with a preparatory step, since the raw content of the documents does not align with the subsequent phases of the methodology. The preprocessing involves the following actions (a minimal code sketch of these steps follows the list):

  (a) Unnecessary sections were removed, such as the references section, title, author information, keywords, acknowledgments, competing interests, headers, sub-headers, and citations.

  (b) Figures and tables were omitted since they are not involved in the summarization process.

  (c) Abbreviations and their expansions were extracted from the abbreviation section, and occurrences of the abbreviations were replaced with their expansions throughout the document.

  (d) The abstract was separated from the document’s body to serve as a reference summary for later evaluation.

  (e) The text of the document’s body was split into sentences using sent_tokenize from NLTK.

  (f) A list of stop words from Medline was used to remove words that do not contribute to the identification of meaningful sentences.
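As referenced above, the following minimal Python sketch illustrates steps (c), (e), and (f). The stop-word set and the abbreviation dictionary are placeholders: the paper uses a Medline stop-word list and abbreviation pairs extracted from each paper's abbreviation section.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

# Placeholder for the Medline stop-word list used in the paper.
MEDLINE_STOPWORDS = {"the", "of", "and", "in", "to", "with", "for", "a", "an"}

def expand_abbreviations(text, abbreviations):
    """Replace each abbreviation with its expansion throughout the text (step c)."""
    for abbr, expansion in abbreviations.items():
        text = text.replace(abbr, expansion)
    return text

def preprocess(body, abbreviations):
    """Split the body into sentences (step e) and build stop-word-free token
    lists for topic modeling (step f); the original sentences are kept so the
    final summary can be assembled from untouched text."""
    body = expand_abbreviations(body, abbreviations)
    sentences = sent_tokenize(body)
    token_lists = [
        [w.lower() for w in s.split() if w.lower() not in MEDLINE_STOPWORDS]
        for s in sentences
    ]
    return sentences, token_lists
```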

3.1.2 Topic assignment using LDA

LDA is used to extract the main topics of the preprocessed document. Introduced by Blei et al. [56], LDA is an unsupervised generative probabilistic model that seeks to discover latent (hidden) topics in unstructured text. It consists of a three-level Bayesian network structure, namely “document-topic-word” [57, 58]. Figure 3 describes the generative process of LDA, which assumes that a document is created by first drawing a distribution over topics (θ) and then selecting words according to each topic; every document is therefore treated as a mixture of topics. The document-topic distribution (θ) and the topic-word distribution (β) are governed by the hyperparameters (α) and (η), respectively. This process is repeated N times for each word in a document and D times for each document in the collection [51, 59, 60].

Fig. 3 Latent Dirichlet allocation [56]

The topic assignment phase includes several steps. First, the preprocessed document from the previous phase is used to create a dictionary and a corpus, which are crucial for deriving the document’s topic distributions. Each word in the document is mapped to a word ID using the dictionary, and all words in the document are represented as a bag-of-words, namely the corpus: a collection of tuples, where each tuple contains a word ID and the corresponding count of that word in the document [54]. Second, the LDA model is initialized and trained on that corpus. One of the most critical steps in initializing the LDA model is determining a suitable number of topics; for that, we used the coherence score. The coherence score is a qualitative measure that scores a single topic by measuring the semantic similarity between high-scoring words in the topic [22]. Because it is based on semantic similarity, it aligns and correlates with human interpretation better than the perplexity measure [61]. For topic modeling, we used the LDA model and the coherence model from the Gensim library, and for the coherence score we utilized the C_V measure due to its ability to effectively integrate various statistical methods such as sliding windows, pointwise mutual information (PMI), and cosine similarity. This measure assesses the degree of word co-occurrence within a specified context, which helps in understanding the relationships between the dominant words of a topic; this makes it well suited to LDA-based models, as it accounts for both polysemy and synonymy. In addition, it is considered one of the most effective coherence measures for topic modeling due to its strong correlation with human assessments of topic coherence. It is computed using the following equations [62]:

$$C_V = \frac{2}{N(N-1)} \sum_{i<j} \mathrm{NPMI}\left(w_i, w_j\right)$$
(1)
$$\mathrm{NPMI}\left(w_i, w_j\right) = \frac{\mathrm{PMI}\left(w_i, w_j\right)}{-\log p\left(w_i, w_j\right)}$$
(2)
$$\mathrm{PMI}\left(w_i, w_j\right) = \log\frac{p\left(w_i, w_j\right)}{p\left(w_i\right)\,p\left(w_j\right)}$$
(3)

where N is the number of top words in the topic, w_i and w_j are words among the top N, NPMI is the normalized pointwise mutual information, PMI is the pointwise mutual information, p(w_i, w_j) is the joint probability of w_i and w_j occurring together, and p(w_i) and p(w_j) are the marginal probabilities of w_i and w_j.

Once the LDA model is fitted on the corpus and the number of topics is determined, every sentence in the document is associated with a dominant topic. Finally, sentences with the same topic are grouped to facilitate the clustering task in the next phase.
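A minimal sketch of this phase with Gensim is given below, treating each preprocessed sentence as a bag-of-words “document”; the function and variable names are ours, and the number of topics is assumed to have been chosen already (its coherence-driven selection is sketched in Sect. 4).

```python
from collections import defaultdict

from gensim.corpora import Dictionary
from gensim.models import LdaModel

def group_by_topic(sentences, token_lists, num_topics):
    """Fit LDA on sentence-level bag-of-words 'documents' and group the
    original sentences by their dominant topic."""
    dictionary = Dictionary(token_lists)                   # word -> word ID
    corpus = [dictionary.doc2bow(t) for t in token_lists]  # (ID, count) tuples

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=20, random_state=0)

    groups = defaultdict(list)
    for sentence, bow in zip(sentences, corpus):
        # Dominant topic = topic with the highest probability for this sentence.
        topics = lda.get_document_topics(bow) or [(0, 0.0)]
        topic_id = max(topics, key=lambda t: t[1])[0]
        groups[topic_id].append(sentence)
    return lda, groups
```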

3.1.3 Deep conceptualized embedding

In this phase, every sentence is mapped to an n-dimensional vector of real values. The summarization methodology employs the BERT language model to generate contextualized embeddings for the sentences. BERT differs from earlier models by being bidirectional, which allows it to evaluate context from both the left and the right concurrently. It is built upon a transformer framework that employs a self-attention mechanism and an encoder to create contextual representations of text [63]. Figure 4 illustrates the BERT input representation. For the proposed method, we tested several variants of BERT for mapping sentences to their conceptualized embeddings: S-BERT, PubMed BERT, BlueBERT, and SciBERT.

Fig. 4 BERT input representation [63]

S-BERT (Sentence-BERT) is optimized for sentence-level comparison tasks by incorporating Siamese and triplet network architectures to learn semantic similarities and differences while processing pairs or triplets of sentences. S-BERT adapts BERT to reduce the required computational time and resources, enabling it to handle large volumes of sentence comparisons quickly [64], which makes it suitable for the proposed methodology. PubMed BERT is a domain-specific language model pre-trained on PubMed abstracts and full-text articles using the BERT architecture [44], making it highly suitable for our biomedical dataset. For scientific text, the SciBERT model was trained on a large corpus of 1.14 million papers from Semantic Scholar [45]. BlueBERT is also a domain-specific BERT model, pre-trained on PubMed abstracts and clinical notes from MIMIC-III [46]. Table 1 summarizes these variants based on different factors.

Table 1 Comparison between BERT variants mentioned
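As a sketch of this phase, one common way to obtain fixed-size sentence vectors from a BERT encoder is mean pooling over the last-layer token embeddings, as below. The checkpoint name is a generic placeholder rather than the exact models evaluated here, and the pooling strategy is our assumption (S-BERT models, for instance, are typically loaded through the sentence-transformers library, which handles pooling internally).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any variant in Table 1 can be substituted
# by its corresponding Hugging Face identifier.
CHECKPOINT = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

def embed_sentences(sentences):
    """Map sentences to fixed-size vectors by mean-pooling the last-layer
    token embeddings, ignoring padding positions."""
    inputs = tokenizer(sentences, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state         # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()
```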

3.1.4 Clustering and sentence selection

After assigning topics to sentences, sentences with the same topic are gathered into a single list for clustering. The objective of our summarization method is to choose representative sentences from each topic to create the summary. To achieve this, we employ K-medoid clustering in conjunction with the Euclidean distance (ED) metric. Compared to K-means clustering, the K-medoid algorithm demonstrates a faster rate of convergence and is less susceptible to noise and outliers [22]. The ED metric evaluates the degree of correspondence between objects in a vector space by accounting for the magnitude of the vectors across dimensions. Consider two vectors X = {x1, x2, …, xN} and Y = {y1, y2, …, yN} representing the contextualized representations of two sentences; the ED between them is computed as follows:

$$\mathrm{ED}\left(X, Y\right) = \sqrt{\sum_{i=1}^{N} \left(x_i - y_i\right)^2}$$
(4)

For sentence selection, the centroid of each cluster is extracted, as it is the most representative sentence of the topic among the other sentences. The centroid sentences from the clusters are then combined to form the final extractive summary.

Algorithm 1 provides the pseudo-code outlining the clustering and sentence selection process employed by the proposed method.

  1. Two empty sets are initialized: C for clusters and TRS for representative sentences. C stores the collection of clusters obtained from K-medoid clustering, while TRS stores the top representative sentence selected from each cluster in C.

  2. For each topic Ni containing a set of sentences, if the number of these sentences is larger than 1, K-medoid is fitted on the sentence vectors V obtained from the BERT variant, with the number of clusters K (lines 3, 4, 5).

  3. If the number of sentences for a topic Ni is equal to 1, the sentence is added directly to the representative sentences TRS (lines 7, 8).

  4. After fitting K-medoid, the created clusters Ci are added to the set C (line 6).

  5. Iterating over the set C, the centroid of each cluster Ci is captured, as this centroid represents the most representative sentence of the topic (lines 11, 12).

  6. The centroid of each cluster Ci, which is itself a sentence, is added to TRS (line 13).

  7. Finally, the sentences in TRS are combined to build the final extractive summary.

Algorithm 1 Sentence clustering and selection algorithm employed by the proposed method
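A possible Python realization of Algorithm 1 is sketched below. The KMedoids implementation from the scikit-learn-extra package is our assumption (the paper does not name a specific implementation), and embed_sentences stands for the BERT-based vectorization of Sect. 3.1.3.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # from the scikit-learn-extra package

def select_representatives(topic_groups, embed_sentences, k):
    """Cluster each topic's sentence vectors with K-medoids and keep the
    medoid sentences as the topic's representatives (cf. Algorithm 1)."""
    representatives = []                       # the TRS set of Algorithm 1
    for sentences in topic_groups.values():
        if len(sentences) <= 1:
            representatives.extend(sentences)  # single sentence: add directly
            continue
        vectors = np.asarray(embed_sentences(sentences))
        km = KMedoids(n_clusters=min(k, len(sentences)), metric="euclidean",
                      random_state=0).fit(vectors)
        # Medoids are actual data points, so each one is a real sentence.
        representatives.extend(sentences[i] for i in km.medoid_indices_)
    return representatives
```

Because K-medoids restricts cluster centers to actual data points, each selected "centroid" is a verbatim sentence from the document, which is exactly what an extractive summary requires.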

3.2 Evaluation method

3.2.1 Evaluation corpus

There is currently no standardized, manually annotated corpus that can be used to assess this form of biomedical summarization. We followed the same approach used in previous studies [7, 30] and created an evaluation corpus by randomly retrieving 200 biomedical documents from the BioMed Central (BioMed) repository using the search keyword “Knee Osteoarthritis Management” to gather relevant articles and research papers. Each of the 200 documents was preprocessed to remove irrelevant content as described in Sect. 3.1.1. Then, each document was split into two parts (abstract and full text). The abstracts were saved in a folder named “Abstracts”, each as a separate txt file; similarly, the full texts were saved in a folder named “Full-Texts”, each as an individual txt file. The size of this corpus is sufficient to guarantee that the results are significant according to [65]. Table 2 describes the statistical characteristics of our corpus.

Table 2 Detailed statistical analysis of the introduced corpus characteristics

3.2.2 Quantitative analysis method

In this study, we used the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate the quality of the extractive summaries generated by the introduced methodology. ROUGE evaluates a generated summary by quantifying the shared terms (ROUGE-N) or the longest common subsequence (ROUGE-L) between the generated and reference summaries [26]. In our experiments, we utilized three variants of ROUGE: ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L). R-1 counts the unigrams (individual words) that appear in both the generated and reference summaries, while R-2 counts the bigrams (pairs of consecutive words) that appear in both [35]. R-L uses the longest common subsequence (LCS), the longest sequence of words that appears in both summaries in the same order, though not necessarily contiguously [66]. For each variant, we calculated recall, precision, and F-score, using the python rouge library to compute the mean F-score, precision, and recall for R-1, R-2, and R-L.
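A minimal sketch of how these scores can be computed with the python rouge library mentioned above is shown below; the example texts are illustrative only (in our experiments, the reference is the paper’s abstract and the hypothesis is the generated summary).

```python
from rouge import Rouge  # the python `rouge` package

# Illustrative texts, not taken from the corpus.
generated = "knee osteoarthritis can be managed with exercise and weight loss"
reference = "exercise and weight loss are recommended to manage knee osteoarthritis"

scores = Rouge().get_scores(generated, reference, avg=True)
for metric in ("rouge-1", "rouge-2", "rouge-l"):
    r, p, f = scores[metric]["r"], scores[metric]["p"], scores[metric]["f"]
    print(f"{metric}: recall={r:.3f} precision={p:.3f} f-score={f:.3f}")
```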

3.2.3 Qualitative analysis method

In addition to ROUGE metrics, we assessed the generated summaries using human evaluations. We asked eight orthopedic surgeons in the orthopedic surgery department (Faculty of Medicine, Egypt) to review and evaluate the generated summaries. For evaluation, 50 summaries generated by BlueBERT were randomly selected. First, we collected criteria from previous studies for evaluating the quality of generated summaries, as explained in Table 3. These criteria were provided to the eight reviewers, who selected the most suitable ones based on their experience. The following criteria were chosen to evaluate the generated summaries of the medical documents: completeness, relevance, conciseness, informativity, and readability. After defining these criteria, all reviewers independently assessed the 50 summaries and their corresponding reference summaries, rating each summary on a 5-point Likert scale (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent) for each of the five criteria.

Table 3 Different criteria used for human evaluation

To ensure accuracy and consistency in the evaluation process, the reviewers were given a questionnaire containing the mentioned criteria and detailed guidelines explaining each criterion and what factors determine the various rating levels (Table 11 in Appendix). They were also instructed to read both generated and reference summaries carefully before providing their ratings. Once all reviewers completed the questionnaire, we calculated the mean scores for each criterion across the eight reviewers as shown in Fig. 6.

Also, we used Fleiss’ Kappa to measure the inter-rater agreement among reviewers across multiple criteria, as it is suitable when multiple raters are involved. The values of Fleiss’ Kappa range from − 1 to 1, where higher values indicate stronger agreement. A Kappa value of 0 suggests no agreement beyond what could be expected by chance, while values above 0 indicate varying levels of agreement. Table 10 presents the inter-rater agreement among reviewers across multiple criteria.
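As an illustration of this computation, the sketch below uses the fleiss_kappa function from statsmodels on randomly generated placeholder ratings; in our setting, one such table is built per criterion from the reviewers’ questionnaire answers.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Illustrative data: 50 summaries rated by 8 reviewers on a 1-5 Likert scale
# for one criterion (e.g., relevance); real ratings come from the questionnaire.
ratings = np.random.default_rng(0).integers(1, 6, size=(50, 8))

# aggregate_raters turns raw labels into a subjects x categories count table.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.3f}")
```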

4 Results and discussion

The proposed methodology was evaluated on the medical corpus presented in Sect. 3.2.1 to ensure its efficiency. For topic modeling, the Gensim LDA implementation was used (as described in Sect. 3.1.2), and sentences were clustered with K-medoid. All data points were used simultaneously, without any splits for training, testing, or validation, since the suggested methodology is unsupervised.

To optimize performance, we conducted experiments on the parameters of our algorithm. The LDA hyperparameters were optimized by evaluating different combinations of values: we experimented with (symmetric, asymmetric, auto) for α and used (auto) for β. We also examined numbers of topics from 2 to 10, with the number of passes fixed at 20 to balance model complexity and training time during the optimization process. For each combination, we trained an LDA model and calculated the coherence score using the Coherence Model from the Gensim library. The combination of hyperparameters yielding the highest coherence score was considered the best combination for LDA.
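The search just described can be sketched as follows; this is a hedged illustration with Gensim, in which tune_lda and its arguments are our own naming, with corpus, dictionary, and token_lists built as in the sketch of Sect. 3.1.2.

```python
from itertools import product

from gensim.models import CoherenceModel, LdaModel

def tune_lda(corpus, dictionary, token_lists):
    """Grid-search alpha and the number of topics, keeping the combination
    with the highest C_V coherence (passes fixed at 20)."""
    best = {"coherence": -1.0}
    for alpha, k in product(["symmetric", "asymmetric", "auto"], range(2, 11)):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       alpha=alpha, eta="auto", passes=20, random_state=0)
        cv = CoherenceModel(model=lda, texts=token_lists, dictionary=dictionary,
                            coherence="c_v").get_coherence()
        if cv > best["coherence"]:
            best = {"coherence": cv, "alpha": alpha, "num_topics": k, "model": lda}
    return best
```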

For the K parameter, which represents the number of clusters required for sentence clustering, we tested several values ranging from 2 to 5 to select the appropriate number. For summary evaluation, we used the three ROUGE metrics mentioned before: R-1, R-2, and R-L; for each, the recall, precision, and F-score were calculated. By varying the parameter K and employing different BERT variants for mapping sentences to their conceptualized embeddings, as described in Sect. 3.1.3, the results obtained by our extractive method on the medical corpus are presented in Tables 4, 5, 6, 7 and 8. For the BERT variant models, Tables 4, 5, and 6 report the recall, precision, and F-score values of R-1, R-2, and R-L.

Table 4 Recall scores achieved using various variants of BERT model and different values of K on the proposed corpus
Table 5 Precision scores achieved using various variants of BERT model and different values of K on the proposed corpus
Table 6 F-scores achieved using various variants of the BERT model and different values of K on the proposed corpus
Table 7 The ROUGE values achieved by the proposed methodology and other methods. The highest score in each column is shown in bold type.
Table 8 The average time in seconds taken by each method for summarizing 200 biomedical documents

As observed from Tables 4 and 5, as the number of clusters K increases, the recall scores for all variant models increase, indicating that the generated summaries capture the essential content of the reference summaries. At K = 5, SciBERT gave the best recall scores compared to the other models. On the other hand, precision scores decrease as the number of clusters increases; lower precision indicates that the generated summary contains more information that is irrelevant relative to the reference summary. For precision, S-BERT achieved the highest scores at K = 2.

We considered the F-score values to choose the number of clusters K and the BERT model, leading to a robust evaluation of the summarization system. The F-score combines recall and precision, providing a balanced metric that reflects both the completeness and the accuracy of the generated summaries. As shown in Table 6, the optimal number of clusters K varies across the BERT-based models. Generally, K = 3 provides the best scores for most models; in both the R-1 and R-2 metrics, S-BERT and SciBERT show strong F-scores, while BlueBERT outperforms the other models at K = 3. However, increasing K beyond 3 tends to introduce less relevant information. This suggests that a moderate number of clusters enables the models to capture relevant information while maintaining summary coherence.

Additionally, we compared our extractive summarization method with two similar methods: (1) the method proposed by Srivastava et al. [22], which applied word2vec vector representations in conjunction with LDA topic modeling and K-medoid clustering for extractive summarization, evaluated on the WikiHow, CNN-DailyMail, and DUC 2002 datasets; and (2) the method developed by Issam et al. [54], which applied LDA for topic modeling but used TextRank instead of clustering for summary generation. To assess the relevance of latent Dirichlet allocation (LDA) topic modeling within the proposed methodology, we also eliminated LDA from the framework and performed summarization based solely on K-medoid clustering and BlueBERT; the rationale behind this ablation is to isolate the impact of LDA on the performance and effectiveness of producing relevant summaries. Table 7 presents the ROUGE scores achieved by the proposed method and the methods mentioned above. As shown, the proposed model achieved an R-1 score of 0.4838, significantly higher than the word2vec method (0.4215), the TextRank method (0.3866), and clustering with BlueBERT (0.3933), suggesting that BlueBERT with LDA is more effective at capturing relevant unigrams than the other techniques. For R-2, which measures bigram overlap, the proposed method also achieved the best score of 0.2174; this improvement indicates that BlueBERT with LDA better captures the contextual relationships between pairs of words, leading to more coherent summaries. The proposed method likewise gave the highest R-L score (0.2206), indicating that it better captures the LCS between the generated and reference summaries. Overall, the proposed method achieves the highest R-1, R-2, and R-L scores.

Also, we calculated the average time taken by the suggested methodology and the comparison methods to summarize our biomedical corpus of 200 documents. Table 8 reports the average time taken by each method. As illustrated, the proposed method took the longest (55.82 s) compared to the word2vec, TextRank, and clustering-with-BlueBERT methods. This is due to the complexity of the BlueBERT model, whose deep learning and advanced language processing capabilities require more computational resources and time, and to the iterative LDA process, which needs multiple passes through the data to assign topics and find the best hyperparameters for the LDA model.

Moreover, we analyzed the number of topics captured across various document lengths to explore the impact of document length (number of words) on topic selection. This clarifies how the number of words influences the diversity of the identified topics. As illustrated in Table 9, we observed that as the number of words increased, the number of topics decreased. Figure 5 shows this inverse relationship between document length and the number of topics, with short documents capturing more topics than longer ones. The R² value (0.699) suggests a moderately strong correlation between document length and the number of topics.

Table 9 The impact of document length (number of words) on topic selection
Fig. 5 Relationship between number of words and best number of topics

The ROUGE metric suffers from four main limitations: (1) it relies on n-gram overlap and discards the semantic meaning of the summary, (2) it does not assess the coherence or readability of the summary, (3) it requires human-written reference summaries for evaluation, and (4) it cannot determine whether the information in the summary is correct [72,73,74]. Due to these limitations, the eight orthopedic surgeons were asked to evaluate 50 random summaries generated by the best-performing configuration (BlueBERT) based on the five criteria described in Sect. 3.2.3: completeness, relevance, conciseness, informativity, and readability. The average scores for each criterion were calculated and are shown in Fig. 6.

Fig. 6 Qualitative analysis results

From the results shown in Fig. 6, the summaries generated by our method meet the five criteria we established for evaluation. Comparing reference and generated summaries, we notice strong performance across all criteria. The average completeness score of 4.10 indicates that the generated summaries successfully cover the main points and important information in the reference summaries. The relevance score of 4.06 shows how well the salient features of the reference summaries are captured in the generated summaries. Informativity and readability both received a score of 4.0, indicating that the generated summaries encapsulate important information and are easy to read and understand. Conciseness received a moderately high score of 3.55, indicating that the generated summaries are reasonably, though not maximally, concise. Overall, the results indicate that our summarization method performs effectively across all criteria.

The results presented in Table 10 provide confidence that, for most criteria, the evaluations were conducted reliably, reducing the likelihood that the outcomes were influenced by chance or bias. Relevance (0.682), completeness (0.680), readability (0.690), and conciseness (0.703) all demonstrated substantial agreement, meaning that the reviewers generally aligned in their assessments, while informativity (0.58) indicates moderate agreement. An example of a summary generated by the proposed methodology and its associated reference summary (the paper’s abstract) is shown in Figs. 7 and 8 in the Appendix.

5 Conclusion

This study proposed a new methodology for summarizing biomedical research papers that combines topic modeling, clustering, and BERT. The methodology utilizes the online variant of LDA and K-medoid clustering. For vector representation, we tested and compared four variants of BERT (S-BERT, PubMed BERT, BlueBERT, SciBERT). Additionally, this study applied the coherence measure to find the most suitable number of topics and allocated topics at the sentence level rather than the word level. The effectiveness of the proposed method was evaluated in two ways: (1) quantitative evaluation using the R-1, R-2, and R-L metrics, and (2) qualitative evaluation, in which eight orthopedic surgeons assessed the generated summaries on five criteria: completeness, relevance, conciseness, informativity, and readability. The evaluation was conducted on 200 biomedical research papers about knee osteoarthritis management collected randomly from BioMed Central. The results showed that the proposed method outperformed the other models with F-scores of 0.4838 (R-1), 0.2174 (R-2), and 0.2206 (R-L). It also met the five criteria effectively, with average scores of 4.10 (completeness), 4.06 (relevance), 4.0 (informativity and readability), and 3.55 (conciseness). We applied Fleiss' Kappa to measure inter-rater agreement among reviewers, and the results demonstrated substantial agreement, meaning that the reviewers generally aligned in their assessments, as illustrated in Table 10. Furthermore, we conducted additional experiments to ascertain the significance of LDA topic modeling within the proposed methodology by omitting LDA and relying only on K-medoid clustering and BERT-based models. The results presented in Table 7 indicate that the proposed methodology surpasses the other methods, demonstrating the efficacy of topic modeling and K-medoid clustering in conjunction with BERT vectorizations over the ranking algorithm (TextRank), static embeddings (word2vec), and clustering with BERT-based embeddings alone. Moreover, we explored the impact of document length (number of words) on topic selection by analyzing the number of topics captured across various document lengths; the observations showed that as the number of words increased, the number of topics decreased, an inverse relationship illustrated in Table 9 and Fig. 5. The encouraging results of this study provide a strong basis for further research. In the future, the proposed method can be applied to multi-document summarization. Given the relatively low R-2 scores, future research may improve the bigram overlap in the generated summaries. Finally, to improve the sentence selection phase, variations in performance using different clustering and embedding algorithms can be investigated, as can alternative criteria for selecting the top sentences from each cluster to build the final summary.

Table 10 Inter-rater agreement among reviewers across multiple criteria