May 2024: Top 10 Cited Articles in Natural Language Computing
International Journal on Natural Language
Computing (IJNLC)
https://airccse.org/journal/ijnlc/index.html
ISSN: 2278-1307 [Online]; 2319-4111 [Print]
Google Scholar
https://scholar.google.com/citations?user=A5tqIdoAAAAJ&hl=en
RAG-Fusion: A New Take on Retrieval-Augmented Generation
Zackary Rackauckas, Infineon Technologies, California
Abstract
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain
product information. This problem is traditionally addressed with retrieval-augmented generation
(RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion
method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple
queries, reranking them with reciprocal scores, and fusing the documents and scores. Through manual evaluation of answers for accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers because the generated queries contextualized the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query was insufficient. This research
marks significant progress in artificial intelligence (AI) and natural language processing (NLP)
applications and demonstrates transformations in a global and multi-industry context.
Keywords
Chatbot, Retrieval-augmented Generation, Reciprocal Rank Fusion, Natural Language
Processing
Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc03.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
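The reciprocal rank fusion step described in the abstract can be sketched as follows: each generated query retrieves its own ranked list, every document earns a score of 1/(k + rank) from each list it appears in, and the summed scores give the fused ranking. The function below is an illustrative reconstruction, not the paper's code; the constant k = 60 is the value commonly used in the RRF literature, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document receives 1 / (k + rank) from every list that
    contains it; per-document scores are summed, and documents are
    re-sorted by total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One ranked retrieval result per generated query variant:
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d5"],
])
```

Here "d2" ranks first overall because two of the three query variants retrieved it at rank 1, which is exactly how RRF rewards cross-query agreement.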
Performance, Energy Consumption and Costs: A Comparative Analysis of Automatic Text
Classification Approaches in the Legal Domain
Leonardo Rigutini1, Achille Globo1, Marco Stefanelli2, Andrea Zugarini1, Sinan Gultekin1,
Marco Ernandes1, 1expert.ai spa, Italy, 2University of Siena, Italy
Abstract
The common practice in Machine Learning research is to evaluate the top-performing models
based on their performance. However, this often leads to overlooking other crucial aspects that
should be given careful consideration. In some cases, the performance differences between
various approaches may be insignificant, whereas factors like production costs, energy
consumption, and carbon footprint should be taken into account. Large Language Models
(LLMs) are widely used in academia and industry to address NLP problems. In this study, we
present a comprehensive quantitative comparison between traditional approaches (SVM-based)
and more recent approaches such as LLMs (BERT-family models) and generative models (GPT-2 and LLaMA 2), using the LexGLUE benchmark. Our evaluation takes into account not only
performance parameters (standard indices), but also alternative measures such as timing, energy
consumption and costs, which collectively contribute to the carbon footprint. To ensure a
complete analysis, we separately considered the prototyping phase (which involves model
selection through training-validation-test iterations) and the in-production phases. These phases
follow distinct implementation procedures and require different resources. The results indicate
that simpler algorithms often achieve performance levels similar to those of complex models
(LLM and generative models), consuming much less energy and requiring fewer resources.
These findings suggest that companies should weigh factors beyond raw performance when choosing machine learning (ML) solutions. The analysis also demonstrates that the scientific community increasingly needs to account for energy consumption in model evaluations, in order to give real meaning to results obtained with standard metrics (Precision, Recall, F1, and so on).
Keywords
NLP, text mining, green AI, green NLP, carbon footprint, energy consumption, evaluation.
Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
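The kind of timing measurement the study combines with energy and cost can be sketched as follows. This is a generic measurement harness, not the paper's methodology, and the 250 W average power draw is a placeholder assumption, not a figure from the paper.

```python
import time

def estimate_energy(fn, *args, runs=100, avg_power_watts=250.0):
    """Estimate the energy cost of repeatedly calling `fn`.

    Wall-clock time is measured with a monotonic timer; energy is
    approximated as elapsed time times an assumed average device
    power draw, reported in joules and watt-hours.
    """
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    elapsed = time.perf_counter() - start
    joules = elapsed * avg_power_watts
    return {"seconds": elapsed, "joules": joules, "watt_hours": joules / 3600.0}

# Toy workload standing in for model inference:
stats = estimate_energy(lambda xs: sorted(xs), list(range(1000)), runs=50)
```

Real measurements would replace the assumed wattage with readings from a power meter or tools such as GPU telemetry, since constant-power estimates can be far off for bursty workloads.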
A Study on the Appropriate Size of the Mongolian General Corpus
Choi Sun Soo1 and Ganbat Tsend2, 1University of the Humanities, Mongolia, 2Otgontenger
University, Mongolia
Abstract
This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps’ function and the Type-Token Ratio (TTR). The study’s sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, the economy, society, culture, sports, and world affairs; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we
estimated the Heaps’ function with this sample corpus. Next, we observed changes in the number
of types and TTR values while increasing the number of tokens by one million using the
estimated Heaps’ function. We observed that the TTR value hardly changed once the number of tokens exceeded 39-42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39-42 million tokens.
Keywords
Mongolian general corpus, Appropriate size of corpus, Sample corpus, Heaps’ function, TTR,
Type, Token.
Full Text: https://aircconline.com/ijnlc/V12N3/12323ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
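The procedure described above (estimate Heaps' function, then step through token counts watching the TTR) can be sketched as follows. The parameters K and beta below are illustrative placeholders, not the values the paper estimated from its 906,064-token sample corpus, and the convergence threshold is likewise an assumption.

```python
def heaps_types(n_tokens, k, beta):
    """Heaps' function: predicted number of types V = K * N**beta."""
    return k * n_tokens ** beta

def ttr_plateau(k, beta, step=1_000_000, max_tokens=100_000_000, eps=1e-4):
    """Return the first token count (stepping by `step`) at which the
    TTR (types/tokens) drops by less than `eps` per additional step."""
    prev_ttr = None
    for n in range(step, max_tokens + step, step):
        ttr = heaps_types(n, k, beta) / n
        if prev_ttr is not None and prev_ttr - ttr < eps:
            return n
        prev_ttr = ttr
    return None

# Placeholder Heaps' parameters; the paper fits these to its sample corpus.
plateau = ttr_plateau(k=40.0, beta=0.5)
```

With these toy parameters the plateau lands in the tens of millions of tokens, which is the same order of magnitude as the paper's 39-42 million conclusion.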
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
Ali Mehrban1 and Pegah Ahadian2, 1Newcastle University, UK, 2Kent State University,
USA
Abstract
This paper discusses the impact of the Internet on modern trading and the importance of data
generated from these transactions for organizations to improve their marketing efforts. The paper
uses the example of Divar, an online marketplace for buying and selling products and services in
Iran, and presents a competition to predict the percentage of a car sales ad that would be
published on the Divar website. Since the dataset provides a rich source of Persian text data, the
authors use the Hazm library, a Python library designed for processing Persian text, and two
state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary
objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The
authors provide some background on data mining, Persian language, and the two language
models, examine the dataset's composition and statistical features, and provide details on their
fine-tuning and training configurations for both approaches. They present the results of their
analysis and highlight the strengths and weaknesses of the two language models when applied to
Persian text data. The paper offers valuable insights into the challenges and opportunities of
working with low-resource languages such as Persian and the potential of advanced language
models like BERT for analyzing such data. The paper also explains the data mining process,
including steps such as data cleaning and normalization. Finally, the paper discusses the types of machine learning problems (supervised, unsupervised, and reinforcement learning) and pattern evaluation techniques such as the confusion matrix. Overall, the paper
provides an informative overview of the use of language models and data mining techniques for
analyzing text data in low-resource languages, using the example of the Divar dataset.
Keywords
Text Recognition, Persian text, NLP, mBERT, ParsBERT
Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
Understanding Chinese Moral Stories with Further Pre-Training
Jing Qian1, Yong Yue1, Katie Atkinson2 and Gangmin Li3, 1Xi’an Jiaotong Liverpool
University, China, 2University of Liverpool, UK, 3University of Bedfordshire, UK
Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by
delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is
compacted into a single statement without involving any characters within the original text,
necessitating a more astute language model that can comprehend connotative morality and
exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced
in neural language models. In this paper, we propose an intermediary phase to establish an
improved paradigm of “pre-training + further pre-training + fine-tuning”. Further pre-training
generally refers to continual learning on task-specific or domain-relevant corpora before being
applied to target tasks, which aims at bridging the gap in data distribution between the phases of
pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that comprises 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step relies on a newly collected Chinese dataset of Confucian moral culture, and the second builds on the Chinese version of a frequently used commonsense knowledge graph (ATOMIC) to enrich the backbone model with inferential knowledge besides morality. By comparison with several
advanced models including BERT-base, RoBERTa-base and T5-base, experimental results on
two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.
Keywords
Moral Understanding, Further Pre-training, Knowledge Graph, Pre-trained Language Model
Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
Location-based Sentiment Analysis of 2019 Nigeria Presidential Election using a Voting Ensemble Approach
Ikechukwu Onyenwe1, Samuel N.C. Nwagbo2, Ebele Onyedinma1, Onyedika Ikechukwu-Onyenwe1, Chidinma A. Nwafor3 and Obinna Agbata1, 1Computer Science Department, Nnamdi Azikiwe University, Awka, Anambra, Nigeria, 2Political Science Department, Nnamdi Azikiwe University, Awka, Anambra, Nigeria, 3Computer Science Department, Nigerian Army College of Environmental Science and Technology, Makurdi, Benue, Nigeria
Abstract
Nigerian president Buhari defeated his closest rival Atiku Abubakar by over 3 million votes. He was issued a Certificate of Return and was sworn in on 29 May 2019. However, the opposition claimed widespread fraud. Sentiment analysis captures the opinions of the masses on social media about global events. In this paper, we use 2019 Nigeria presidential
election tweets to perform sentiment analysis through the application of a voting ensemble
approach (VEA) in which the predictions from multiple techniques are combined to find the best
polarity of a tweet (sentence). This is to determine public views on the 2019 Nigeria Presidential
elections and compare them with actual election results. Our sentiment analysis experiment is
focused on location-based viewpoints where we used Twitter location data. For this experiment,
we live-streamed Nigeria 2019 election tweets via the Twitter API to create a dataset of 583,816 tweets, pre-processed the data, and applied VEA using three different sentiment classifiers to obtain the best polarity of a given tweet. Furthermore, we segmented our tweets dataset
into Nigerian states and geopolitical zones, then plotted state-wise and zone-wise user sentiment towards Buhari and Atiku and their political parties. The overall objective of the use
of states/geopolitical zones is to evaluate the similarity between the sentiment of location-based
tweets compared to actual election results. The results reveal that, although election outcomes coincide with the sentiment expressed on Twitter in most cases, as shown by the polarity scores of different locations, there are also some election results where our location-based similarity test failed.
Keywords
Nigeria, Election, Sentiment Analysis, Politics, Tweets, Exploratory Data Analysis, Location Data
Full Text: https://aircconline.com/ijnlc/V12N1/12123ijnlc01.pdf
Volume URL: https://airccse.org/journal/ijnlc/vol12.html
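The voting idea in the abstract can be sketched minimally: each classifier labels the tweet, and the majority label wins. The function below is illustrative; in particular, the tie-breaking rule (fall back to "neutral") is an assumption, not necessarily the paper's.

```python
from collections import Counter

def vote_polarity(predictions):
    """Majority vote over polarity labels from several classifiers.

    `predictions` is one label per classifier, e.g.
    ["positive", "positive", "negative"]. Ties fall back to "neutral".
    """
    counts = Counter(predictions)
    top = counts.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "neutral"
    return top[0][0]

# Three classifiers' labels for the same tweet:
label = vote_polarity(["positive", "positive", "negative"])
```

Combining classifiers this way matters because individual sentiment classifiers often disagree on noisy, code-mixed election tweets; the vote smooths out single-model errors.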
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context
for Continuous Speech Recognition
Piyush Behre, Sharman Tan, Padma Varadharajan and Shuangyu Chang, Microsoft
Corporation
Abstract
While speech recognition Word Error Rate (WER) has reached human parity for English,
continuous speech recognition scenarios such as voice typing and meeting transcriptions still
suffer from segmentation and punctuation problems, resulting from irregular pausing patterns or
slow speakers. Transformer sequence tagging models are effective at capturing long bi-
directional context, which is crucial for automatic punctuation. Automatic Speech Recognition
(ASR) production systems, however, are constrained by real-time requirements, making it hard
to incorporate the right context when making punctuation decisions. Context within the segments
produced by ASR decoders can be helpful but limiting in overall punctuation performance for a
continuous speech session. In this paper, we propose a streaming approach for punctuation or re-
punctuation of ASR output using dynamic decoding windows and measure its impact on
punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation also achieves an average BLEU score improvement of 0.66 for the downstream task of Machine Translation (MT).
Keywords
automatic punctuation, automatic speech recognition, re-punctuation, speech segmentation.
Full Text: https://aircconline.com/ijnlc/V11N6/11622ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
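The segmentation F0.5-score reported above is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall (spurious segment breaks are penalized more than missed ones). A small sketch, with illustrative precision/recall numbers rather than the paper's:

```python
def f_beta(precision, recall, beta=0.5):
    """General F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta < 1 weights precision more heavily than recall; beta = 1
    recovers the familiar F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative segmentation quality numbers:
score = f_beta(precision=0.9, recall=0.6)
```

With these numbers the score sits much closer to the precision value than F1 would, which is the point of choosing beta = 0.5 for segmentation evaluation.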
A Robust Three-Stage Hybrid Framework for English to Bangla Transliteration
Redwan Ahmed Rizvee, Asif Mahmood, Shakur Shams Mullick and Sajjadul Hakim, Tiger
IT Bangladesh Limited, Dhaka, Bangladesh
Abstract
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common. Existing transliteration tools display poor
performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration
(THT) framework that can transliterate both English words and phonetic typed Bangla words
satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based
techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
Keywords
Transliteration framework, phonetic typing, English to Bangla, hybrid framework, THT.
Full Text: https://aircconline.com/ijnlc/V11N1/11122ijnlc04.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
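The dictionary-then-rules hybrid described above can be sketched as follows: look the word up in a transliteration dictionary first, and fall back to greedy character-level rules only on a miss. The mini-dictionary and the character rules below are toy placeholders, not the paper's actual resources or rule set.

```python
# Hypothetical mini-dictionary of phonetically typed Bangla words.
DICTIONARY = {"ami": "আমি", "bhalo": "ভালো"}

# Illustrative character rules, tried in order (multi-letter rules first).
RULES = [("bh", "ভ"), ("a", "া"), ("m", "ম"), ("i", "ি"), ("l", "ল"), ("o", "ো")]

def transliterate_word(word):
    """Dictionary lookup first; fall back to greedy rule-based mapping."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    out, i = [], 0
    while i < len(word):
        for src, dst in RULES:
            if word.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(word[i])  # pass through unmapped characters
            i += 1
    return "".join(out)

result = transliterate_word("ami")
```

The dictionary stage handles common words exactly, while the rule stage keeps the system robust on out-of-vocabulary phonetic spellings, which is the motivation for hybridizing the two.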
Analyzing Architectures for Neural Machine Translation using Low Computational
Resources
Aditya Mandke, Onkar Litake, and Dipali Kadam, SCTR’s Pune Institute of Computer
Technology, India
Abstract
With the recent developments in the field of Natural Language Processing, there has been a rise
in the use of different architectures for Neural Machine Translation. Transformer architectures
are used to achieve state-of-the-art accuracy, but they are very computationally expensive to
train. Not everyone has access to setups consisting of high-end GPUs and other such resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet achieved lower BLEU scores. LSTM performed well in the experiment and took comparatively less time to train than transformers, making it suitable for situations with time constraints.
Keywords
Machine Translation, Indic Languages, Natural Language Processing.
Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
Developing Products Update-Alert System for E-Commerce Websites Users using Html
Data and Web Scraping Technique
Ikechukwu Onyenwe, Ebele Onyedinma, Chidinma Nwafor and Obinna Agbata, Nnamdi
Azikiwe University, Nigeria
Abstract
Websites are regarded as domains of limitless information that anyone can access. Technology trends have shaped the way we run and manage our businesses. Today, advancements in Internet technology have given rise to a proliferation of e-commerce websites. This, in turn, has made life easier for marketers/vendors, retailers, and consumers (collectively regarded as users in this paper) by providing convenient platforms to sell and order items over the Internet. Unfortunately, these benefits come with drawbacks, as such platforms require users to spend considerable time and effort searching for the best product deals, product updates, and offers. Furthermore, users need to filter and compare search results themselves, which takes a lot of time and risks ambiguous results. In this paper, we applied web crawling and
scraping methods on an e-commerce website to obtain HTML data for identifying products
updates based on the current time. These HTML data are preprocessed to extract details of the
products such as name, price, post date and time, etc. to serve as useful information for users.
Keywords
Natural Language Processing (NLP), E-commerce, E-retail, HTML Data, Web, Web Scraping
Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
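The extraction step described above (parse fetched HTML and pull out product details such as name, price, and post time) can be sketched with Python's standard-library parser. The markup and the class names ("product", "name", "price", "posted") are hypothetical; a real e-commerce site needs its own site-specific selectors.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect product name, price, and post time from listing markup."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per product block
        self._field = None   # field name the next text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "product" in classes:
            self.products.append({})      # start a new product record
        elif classes in ("name", "price", "posted"):
            self._field = classes         # remember which field to fill

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()
            self._field = None

# Hypothetical listing snippet standing in for a fetched page:
html = """
<div class="product">
  <span class="name">USB-C Cable</span>
  <span class="price">1500</span>
  <span class="posted">2021-09-01 10:15</span>
</div>
"""
parser = ProductParser()
parser.feed(html)
```

After parsing, `parser.products` holds one dictionary per product, ready to be compared against the current time to flag fresh updates as the abstract describes.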

More Related Content

PDF
January 2024: Top 10 Downloaded Articles in Natural Language Computing
PDF
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
PDF
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
PDF
UNDERSTANDING CHINESE MORAL STORIES WITH FURTHER PRE-TRAINING
PDF
Understanding Chinese Moral Stories with Further Pre-Training
PDF
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
PDF
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
PDF
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
January 2024: Top 10 Downloaded Articles in Natural Language Computing
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
UNDERSTANDING CHINESE MORAL STORIES WITH FURTHER PRE-TRAINING
Understanding Chinese Moral Stories with Further Pre-Training
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS

Similar to May 2024 - Top10 Cited Articles in Natural Language Computing (20)

PPTX
How machines learn to talk. Machine Learning for Conversational AI
PDF
Supreme court dialogue classification using machine learning models
PDF
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
PDF
ENHANCING EDUCATIONAL QA SYSTEMS INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANGU...
PDF
Analysis of the evolution of advanced transformer-based language models: Expe...
PDF
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
PDF
The Value and Benefits of Data-to-Text Technologies
PDF
A multilingual semantic search chatbot framework
PDF
Language and Offensive Word Detection
PDF
A Vietnamese Text-based Conversational Agent.pdf
PDF
A Comparative Study of Text Comprehension in IELTS Reading Exam using GPT-3
PDF
Xử lý ngôn ngữ tự nhiên dựa trên học sâu
PDF
Evaluation Challenges in Using Generative AI for Science & Technical Content
PDF
A Comprehensive Analytical Study Of Traditional And Recent Development In Nat...
PPTX
NLP in 2020
PDF
An in-depth review on News Classification through NLP
PDF
Evaluating the machine learning models based on natural language processing t...
PDF
Challenges with Securing Digital Identity
PDF
Decoding AI and Human Authorship: Nuances Revealed through NLP and Statistica...
PDF
IRJET- Factoid Question and Answering System
How machines learn to talk. Machine Learning for Conversational AI
Supreme court dialogue classification using machine learning models
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
ENHANCING EDUCATIONAL QA SYSTEMS INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANGU...
Analysis of the evolution of advanced transformer-based language models: Expe...
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
The Value and Benefits of Data-to-Text Technologies
A multilingual semantic search chatbot framework
Language and Offensive Word Detection
A Vietnamese Text-based Conversational Agent.pdf
A Comparative Study of Text Comprehension in IELTS Reading Exam using GPT-3
Xử lý ngôn ngữ tự nhiên dựa trên học sâu
Evaluation Challenges in Using Generative AI for Science & Technical Content
A Comprehensive Analytical Study Of Traditional And Recent Development In Nat...
NLP in 2020
An in-depth review on News Classification through NLP
Evaluating the machine learning models based on natural language processing t...
Challenges with Securing Digital Identity
Decoding AI and Human Authorship: Nuances Revealed through NLP and Statistica...
IRJET- Factoid Question and Answering System
Ad

More from kevig (20)

PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
PDF
Call For Papers -International Journal on Natural Language Computing (IJNLC)
PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
Call For Papers - 3rd International Conference on NLP & Signal Processing (NL...
PDF
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMM...
PDF
Call For Papers- 14th International Conference on Natural Language Processing...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
Call For Papers - 6th International Conference on Natural Language Processing...
PDF
July 2025 Top 10 Download Article in Natural Language Computing.pdf
PDF
Orchestrating Multi-Agent Systems for Multi-Source Information Retrieval and ...
PDF
Call For Papers - 6th International Conference On NLP Trends & Technologies (...
PDF
Call For Papers - 6th International Conference on Natural Language Computing ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)...
PDF
Call For Papers - 4th International Conference on NLP and Machine Learning Tr...
PDF
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Call For Papers -International Journal on Natural Language Computing (IJNLC)
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
Call For Papers - 3rd International Conference on NLP & Signal Processing (NL...
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMM...
Call For Papers- 14th International Conference on Natural Language Processing...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
Call For Papers - 6th International Conference on Natural Language Processing...
July 2025 Top 10 Download Article in Natural Language Computing.pdf
Orchestrating Multi-Agent Systems for Multi-Source Information Retrieval and ...
Call For Papers - 6th International Conference On NLP Trends & Technologies (...
Call For Papers - 6th International Conference on Natural Language Computing ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)...
Call For Papers - 4th International Conference on NLP and Machine Learning Tr...
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Call For Papers - International Journal on Natural Language Computing (IJNLC)
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Ad

Recently uploaded (20)

PPT
Total quality management ppt for engineering students
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PPTX
Fundamentals of Mechanical Engineering.pptx
PDF
737-MAX_SRG.pdf student reference guides
PDF
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
PPTX
Software Engineering and software moduleing
PPT
INTRODUCTION -Data Warehousing and Mining-M.Tech- VTU.ppt
PPTX
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
PDF
Abrasive, erosive and cavitation wear.pdf
PDF
August -2025_Top10 Read_Articles_ijait.pdf
PDF
Categorization of Factors Affecting Classification Algorithms Selection
PDF
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PPTX
"Array and Linked List in Data Structures with Types, Operations, Implementat...
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PDF
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
Total quality management ppt for engineering students
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
Fundamentals of Mechanical Engineering.pptx
737-MAX_SRG.pdf student reference guides
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
Software Engineering and software moduleing
INTRODUCTION -Data Warehousing and Mining-M.Tech- VTU.ppt
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
Abrasive, erosive and cavitation wear.pdf
August -2025_Top10 Read_Articles_ijait.pdf
Categorization of Factors Affecting Classification Algorithms Selection
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
Fundamentals of safety and accident prevention -final (1).pptx
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
distributed database system" (DDBS) is often used to refer to both the distri...
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
"Array and Linked List in Data Structures with Types, Operations, Implementat...
Exploratory_Data_Analysis_Fundamentals.pdf
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...

May 2024 - Top10 Cited Articles in Natural Language Computing

  • 1. May 2024: Top10 Cited Articles in Natural Language Computing International Journal on Natural Language Computing (IJNLC) https://airccse.org/journal/ijnlc/index.html ISSN: 2278 - 1307 [Online]; 2319 - 4111 [Print] Google Scholar https://scholar.google.com/citations?user=A5tqIdoAAAAJ&hl=en
  • 2. Rag-Fusion: A New Take on Retrieval Augmented Generation Zackary Rackauckas, Infineon Technologies, California Abstract Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG- Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context. Keywords Chatbot, Retrieval-augmented Generation, Reciprocal Rank Fusion, Natural Language Processing Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc03.pdf Volume URL: http://airccse.org/journal/ijnlc/vol13.html
Performance, Energy Consumption and Costs: A Comparative Analysis of Automatic Text Classification Approaches in the Legal Domain
Leonardo Rigutini1, Achille Globo1, Marco Stefanelli2, Andrea Zugarini1, Sinan Gultekin1, Marco Ernandes1, 1expert.ai spa, Italy, 2University of Siena, Italy

Abstract
The common practice in machine learning research is to evaluate models based on their performance alone. However, this often overlooks other crucial aspects. In some cases, the performance differences between approaches are insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLMs (BERT-family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only standard performance indices, but also alternative measures such as timing, energy consumption, and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phase, since these follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLMs and generative models) while consuming much less energy and requiring fewer resources. These findings suggest that companies should weigh these additional factors when choosing machine learning (ML) solutions. The analysis also shows that the scientific community increasingly needs to include energy consumption in model evaluations, so that results obtained with standard metrics (Precision, Recall, F1, and so on) can be given real meaning.

Keywords
NLP, text mining, green AI, green NLP, carbon footprint, energy consumption, evaluation.

Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
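A back-of-the-envelope version of the kind of accounting the study argues for: turning measured runtime and average power draw into energy and CO2e figures. All numbers below are illustrative placeholders, not the paper's measurements, and 0.475 kg CO2e/kWh is only a commonly cited global-average grid intensity.

```python
def training_footprint(runtime_hours, avg_power_watts, carbon_intensity=0.475):
    """Estimate energy (kWh) and CO2-equivalent (kg) for one training run.

    carbon_intensity is kg CO2e per kWh of electricity; the default is
    an illustrative global-average figure, not a measured value.
    """
    energy_kwh = avg_power_watts * runtime_hours / 1000.0
    co2_kg = energy_kwh * carbon_intensity
    return energy_kwh, co2_kg

# Hypothetical comparison: a short SVM prototyping run vs. a long
# LLM fine-tuning run on a GPU with higher average draw.
svm_energy, svm_co2 = training_footprint(runtime_hours=0.5, avg_power_watts=150)
llm_energy, llm_co2 = training_footprint(runtime_hours=24, avg_power_watts=300)
```

Even with made-up inputs, the orders-of-magnitude gap illustrates why near-equal F1 scores can hide very unequal footprints.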
A Study on the Appropriate Size of the Mongolian General Corpus
Choi Sun Soo1 and Ganbat Tsend2, 1University of the Humanities, Mongolia, 2Otgontenger University, Mongolia

Abstract
This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps' function and the Type-Token Ratio (TTR). The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, economy, society, culture, sports, and world affairs; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps' function from this sample corpus. Next, using the estimated function, we observed changes in the number of types and in TTR values while increasing the number of tokens in increments of one million. We found that the TTR value hardly changed once the number of tokens exceeded 39-42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39-42 million tokens.

Keywords
Mongolian general corpus, Appropriate size of corpus, Sample corpus, Heaps' function, TTR, Type, Token.

Full Text: https://aircconline.com/ijnlc/V12N3/12323ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
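The Heaps'-function procedure above can be illustrated as follows. The parameters K and beta are invented placeholders; the study estimates them from its 906,064-token sample corpus.

```python
def heaps_types(n_tokens, K=45.0, beta=0.52):
    """Heaps' function V(N) = K * N**beta: expected number of distinct
    word types in a corpus of N tokens. K and beta are illustrative
    placeholders, not the paper's estimated parameters."""
    return K * n_tokens ** beta

def ttr(n_tokens, **params):
    """Type-Token Ratio: estimated number of types divided by tokens."""
    return heaps_types(n_tokens, **params) / n_tokens

# TTR keeps falling as the corpus grows, but the change per additional
# million tokens shrinks until it is negligible -- the plateau the
# study uses to pick a corpus size.
delta_small = ttr(1_000_000) - ttr(2_000_000)
delta_large = ttr(40_000_000) - ttr(41_000_000)
```

Because beta < 1, TTR = K * N**(beta - 1) decreases ever more slowly, which is why the observed change flattens out somewhere in the tens of millions of tokens.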
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
Ali Mehrban1 and Pegah Ahadian2, 1Newcastle University, UK, 2Kent State University, USA

Abstract
This paper discusses the impact of the Internet on modern trading and the importance of data generated from these transactions for organizations to improve their marketing efforts. The paper uses the example of Divar, an online marketplace for buying and selling products and services in Iran, and presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The authors provide some background on data mining, the Persian language, and the two language models, examine the dataset's composition and statistical features, and provide details on their fine-tuning and training configurations for both approaches. They present the results of their analysis and highlight the strengths and weaknesses of the two language models when applied to Persian text data. The paper offers valuable insights into the challenges and opportunities of working with low-resource languages such as Persian and the potential of advanced language models like BERT for analyzing such data. The paper also explains the data mining process, including steps such as data cleaning and normalization techniques. Finally, the paper discusses the types of machine learning problems, such as supervised, unsupervised, and reinforcement learning, and pattern evaluation techniques, such as the confusion matrix. Overall, the paper provides an informative overview of the use of language models and data mining techniques for analyzing text data in low-resource languages, using the example of the Divar dataset.

Keywords
Text Recognition, Persian text, NLP, mBERT, ParsBERT

Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
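Among the pattern evaluation techniques the abstract mentions is the confusion matrix; a minimal version for categorical labels could look like this. The labels and predictions below are made up for illustration, not taken from the Divar dataset.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples whose true label is labels[i]
    and whose predicted label is labels[j]."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix

# Hypothetical binary outcome: is the ad published or rejected?
labels = ["published", "rejected"]
y_true = ["published", "published", "rejected", "rejected", "published"]
y_pred = ["published", "rejected", "rejected", "published", "published"]
cm = confusion_matrix(y_true, y_pred, labels)
```

The diagonal holds correct predictions, so per-class precision and recall fall straight out of the row and column sums.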
Understanding Chinese Moral Stories with Further Pre-Training
Jing Qian1, Yong Yue1, Katie Atkinson2 and Gangmin Li3, 1Xi'an Jiaotong-Liverpool University, China, 2University of Liverpool, UK, 3University of Bedfordshire, UK

Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters from the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The "pre-training + fine-tuning" paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of "pre-training + further pre-training + fine-tuning". Further pre-training generally refers to continual learning on task-specific or domain-relevant corpora before application to target tasks, which aims at bridging the gap in data distribution between the pre-training and fine-tuning phases. Our work is based on a Chinese dataset named STORAL-ZH that comprises 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step depends on a newly collected Chinese dataset of Confucian moral culture, and the second step is based on the Chinese version of a frequently used commonsense knowledge graph (i.e., ATOMIC) to enrich the backbone model with inferential knowledge in addition to morality. By comparison with several advanced models, including BERT-base, RoBERTa-base, and T5-base, experimental results on two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.

Keywords
Moral Understanding, Further Pre-training, Knowledge Graph, Pre-trained Language Model

Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
Location-Based Sentiment Analysis of the 2019 Nigeria Presidential Election Using a Voting Ensemble Approach
Ikechukwu Onyenwe1, Samuel N.C. Nwagbo2, Ebele Onyedinma1, Onyedika Ikechukwu-Onyenwe1, Chidinma A. Nwafor3 and Obinna Agbata1, 1Computer Science Department, Nnamdi Azikiwe University, Onitsha-Enugu Expressway, Awka, PMB 5025, Anambra, Nigeria, 2Political Science Department, Nnamdi Azikiwe University, Onitsha-Enugu Expressway, Awka, PMB 5025, Anambra, Nigeria, 3Computer Science Department, Nigerian Army College of Environmental Science and Technology, North-Bank, Makurdi, PMB 102272, Benue, Nigeria

Abstract
Nigerian president Buhari defeated his closest rival, Atiku Abubakar, by over 3 million votes. He was issued a Certificate of Return and was sworn in on 29 May 2019. However, there were claims of widespread fraud by the opposition. Sentiment analysis captures the opinions of the masses expressed on social media about global events. In this paper, we use 2019 Nigeria presidential election tweets to perform sentiment analysis through a voting ensemble approach (VEA), in which the predictions from multiple techniques are combined to find the best polarity of a tweet (sentence). The aim is to determine public views on the 2019 Nigeria presidential election and compare them with the actual election results. Our experiment focuses on location-based viewpoints using Twitter location data. We live-streamed Nigeria 2019 election tweets via the Twitter API to create a dataset of 583,816 tweets, pre-processed the data, and applied the VEA, utilizing three different sentiment classifiers to obtain the best polarity of a given tweet. Furthermore, we segmented our dataset by Nigerian state and geopolitical zone, then plotted state-wise and zone-wise user sentiment towards Buhari and Atiku and their political parties. The overall objective of using states and geopolitical zones is to evaluate the similarity between the sentiment of location-based tweets and the actual election results. The results reveal that while the election outcomes coincide with the sentiment expressed on Twitter in most locations, as shown by the polarity scores, there are also some results where our location-based similarity test failed.

Keywords
Nigeria, Election, Sentiment Analysis, Politics, Tweets, Exploratory Data Analysis, Location Data

Full Text: https://aircconline.com/ijnlc/V12N1/12123ijnlc01.pdf
Volume URL: https://airccse.org/journal/ijnlc/vol12.html
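A voting ensemble over per-classifier polarity labels can be sketched as below. The classifier outputs are invented, and the tie-breaking rule (fall back to "neutral") is an illustrative choice, not necessarily the paper's.

```python
from collections import Counter

def vote_polarity(predictions):
    """Majority vote over the polarity labels that several sentiment
    classifiers assigned to one tweet. Ties fall back to 'neutral',
    an illustrative tie-breaking choice."""
    counts = Counter(predictions)
    label, top = counts.most_common(1)[0]
    if list(counts.values()).count(top) > 1:  # two or more labels tied
        return "neutral"
    return label

# Hypothetical outputs of three sentiment classifiers for two tweets.
tweet_votes = [
    ["positive", "positive", "negative"],
    ["negative", "neutral", "positive"],
]
polarities = [vote_polarity(votes) for votes in tweet_votes]
```

Grouping such per-tweet polarities by the tweet's state or geopolitical zone then gives the location-wise sentiment totals the study plots.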
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition
Piyush Behre, Sharman Tan, Padma Varadharajan and Shuangyu Chang, Microsoft Corporation

Abstract
While speech recognition Word Error Rate (WER) has reached human parity for English, continuous speech recognition scenarios such as voice typing and meeting transcription still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence-tagging models are effective at capturing long bidirectional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. Context within the segments produced by ASR decoders can be helpful but limits overall punctuation performance for a continuous speech session. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving the segmentation F0.5-score by 13.9%. Streaming punctuation also achieves an average BLEU score improvement of 0.66 on the downstream task of Machine Translation (MT).

Keywords
automatic punctuation, automatic speech recognition, re-punctuation, speech segmentation.

Full Text: https://aircconline.com/ijnlc/V11N6/11622ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
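One way to picture the dynamic-decoding-window idea: re-run a punctuator over a rolling buffer of recent tokens and commit text only up to the last sentence-final mark, so later context can still revise the uncommitted tail. The toy punctuator, window size, and token stream below are invented for illustration and merely stand in for the transformer tagger; the actual mechanism in the paper may differ.

```python
def stream_punctuate(tokens, punctuate, window=4):
    """Toy streaming re-punctuation over a rolling buffer.

    Once the buffer reaches `window` tokens, the punctuator is re-run
    over the whole buffer and everything up to the last sentence-final
    mark is emitted; the tail stays buffered for later revision.
    """
    buffer, sentences = [], []
    for token in tokens:
        buffer.append(token)
        if len(buffer) < window:
            continue
        marks = punctuate(buffer)  # one mark ('' or '.') per token
        if "." not in marks:
            continue               # a real system would also bound this latency
        cut = len(marks) - 1 - marks[::-1].index(".")
        sentences.append(" ".join(buffer[: cut + 1]) + ".")
        buffer = buffer[cut + 1:]
    if buffer:                     # flush whatever remains at end of stream
        sentences.append(" ".join(buffer))
    return sentences

# Stand-in punctuator: treats the word "stop" as sentence-final.
def toy_punctuate(buffer):
    return ["." if token == "stop" else "" for token in buffer]

stream = "hello world stop how are you stop fine".split()
result = stream_punctuate(stream, toy_punctuate)
```

The point of the buffered tail is exactly the paper's motivation: punctuation decisions improve when the model sees context to the right of a candidate boundary, not just the segment the ASR decoder happened to emit.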
A Robust Three-Stage Hybrid Framework for English to Bangla Transliteration
Redwan Ahmed Rizvee, Asif Mahmood, Shakur Shams Mullick and Sajjadul Hakim, Tiger IT Bangladesh Limited, Dhaka, Bangladesh

Abstract
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common, and existing transliteration tools display poor performance on them. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetically typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.

Keywords
Transliteration framework, phonetic typing, English to Bangla, hybrid framework, THT.

Full Text: https://aircconline.com/ijnlc/V11N1/11122ijnlc04.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
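A dictionary-then-rules fallback of the kind the abstract describes might be organized roughly as follows. The mapping tables are ASCII placeholders, not THT's actual dictionary or rules, and the real framework's three stages may differ in detail.

```python
def transliterate(word, dictionary, rules):
    """Stage 1: exact dictionary lookup. Stage 2: longest-match
    rule rewriting, left to right. Stage 3: copy any character
    no rule covers. Tables are caller-supplied placeholders."""
    if word in dictionary:                 # stage 1: dictionary hit
        return dictionary[word]
    ordered = sorted(rules, key=lambda rule: -len(rule[0]))
    out, i = "", 0
    while i < len(word):
        for src, dst in ordered:           # stage 2: longest rule first
            if word.startswith(src, i):
                out += dst
                i += len(src)
                break
        else:                              # stage 3: pass-through fallback
            out += word[i]
            i += 1
    return out

# Placeholder tables for illustration only.
dictionary = {"cat": "KAT"}
rules = [("ch", "C"), ("a", "A")]
```

Trying longer rules before shorter ones is what lets a digraph like "ch" win over the single-letter rule that shares its first character.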
Analyzing Architectures for Neural Machine Translation using Low Computational Resources
Aditya Mandke, Onkar Litake, and Dipali Kadam, SCTR's Pune Institute of Computer Technology, India

Abstract
With recent developments in the field of Natural Language Processing, there has been a rise in the use of different architectures for Neural Machine Translation. Transformer architectures achieve state-of-the-art accuracy, but they are very computationally expensive to train, and not everyone has access to setups with high-end GPUs and other resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet achieved lower BLEU scores. LSTMs performed well in the experiment and took comparatively less time to train than transformers, making them suitable in situations with time constraints.

Keywords
Machine Translation, Indic Languages, Natural Language Processing.

Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
Developing Products Update-Alert System for E-Commerce Websites Users using HTML Data and Web Scraping Technique
Ikechukwu Onyenwe, Ebele Onyedinma, Chidinma Nwafor and Obinna Agbata, Nnamdi Azikiwe University, Nigeria

Abstract
Websites are regarded as domains of limitless information that anyone can access. New trends in technology have shaped the way we do and manage our businesses, and advancements in Internet technology have given rise to a proliferation of e-commerce websites. This, in turn, has made the activities of marketers/vendors, retailers, and consumers (collectively regarded as users in this paper) easier by providing convenient platforms to sell or order items through the internet. Unfortunately, these benefits are not without drawbacks, as these platforms require users to spend a lot of time and effort searching for the best product deals, product updates, and offers on e-commerce websites. Furthermore, users need to filter and compare search results by themselves, which takes a lot of time and risks ambiguous results. In this paper, we applied web crawling and scraping methods to an e-commerce website to obtain HTML data for identifying product updates based on the current time. The HTML data are preprocessed to extract details of the products, such as name, price, and post date and time, to serve as useful information for users.

Keywords
Natural Language Preprocessing (NLP), E-commerce, E-retail, HTML Data, Web, Web Scraping

Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
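The HTML-extraction step could be prototyped with the standard library's `html.parser`; the markup and CSS class names below are hypothetical, and a real deployment would first fetch the pages with a crawler rather than parse an inline string.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect product name/price pairs from spans tagged with
    hypothetical 'product-name' / 'product-price' classes."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("product-name", "product-price"):
            self._field = cls  # remember which field the next text fills

    def handle_data(self, data):
        if self._field == "product-name":
            self.products.append({"name": data.strip()})
            self._field = None
        elif self._field == "product-price":
            self.products[-1]["price"] = data.strip()
            self._field = None

# Hypothetical listing-page fragment standing in for crawled HTML.
page = """
<li><span class="product-name">USB cable</span>
    <span class="product-price">$4.99</span></li>
<li><span class="product-name">Keyboard</span>
    <span class="product-price">$19.50</span></li>
"""
parser = ProductParser()
parser.feed(page)
```

Comparing each run's extracted records against the previous run (e.g. by post date and time) is what turns this extraction into the update-alert behavior the paper describes.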