May 2024: Top 10 Cited Articles in Natural Language Computing
International Journal on Natural Language
Computing (IJNLC)
https://airccse.org/journal/ijnlc/index.html
ISSN: 2278-1307 [Online]; 2319-4111 [Print]
Google Scholar
https://scholar.google.com/citations?user=A5tqIdoAAAAJ&hl=en
RAG-Fusion: A New Take on Retrieval-Augmented Generation
Zackary Rackauckas, Infineon Technologies, California
Abstract
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain
product information. This problem is traditionally addressed with retrieval-augmented generation
(RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion
method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple
queries, reranking them with reciprocal scores, and fusing the documents and scores. Through manual evaluation of answers for accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers because the generated queries contextualized the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query was insufficient. This research
marks significant progress in artificial intelligence (AI) and natural language processing (NLP)
applications and demonstrates transformations in a global and multi-industry context.
Keywords
Chatbot, Retrieval-augmented Generation, Reciprocal Rank Fusion, Natural Language
Processing
Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc03.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
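The reciprocal rank fusion step described in the abstract can be sketched as follows: each generated query retrieves its own ranked list, every document earns a score of 1/(k + rank) from each list it appears in, and the summed scores give the fused ranking. The function below is an illustrative reconstruction, not the paper's code; the constant k = 60 is the value commonly used in the RRF literature, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document receives 1 / (k + rank) from every list that
    contains it; per-document scores are summed, and documents are
    re-sorted by total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One ranked retrieval result per generated query variant:
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d5"],
])
```

Here "d2" ranks first overall because two of the three query variants retrieved it at rank 1, which is exactly how RRF rewards cross-query agreement.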
Performance, Energy Consumption and Costs: A Comparative Analysis of Automatic Text
Classification Approaches in the Legal Domain
Leonardo Rigutini1, Achille Globo1, Marco Stefanelli2, Andrea Zugarini1, Sinan Gultekin1,
Marco Ernandes1, 1expert.ai spa, Italy, 2University of Siena, Italy
Abstract
The common practice in Machine Learning research is to evaluate the top-performing models
based on their performance. However, this often leads to overlooking other crucial aspects that
should be given careful consideration. In some cases, the performance differences between
various approaches may be insignificant, whereas factors like production costs, energy
consumption, and carbon footprint should be taken into account. Large Language Models
(LLMs) are widely used in academia and industry to address NLP problems. In this study, we
present a comprehensive quantitative comparison between traditional approaches (SVM-based)
and more recent approaches such as LLMs (BERT-family models) and generative models (GPT-2 and LLaMA 2), using the LexGLUE benchmark. Our evaluation takes into account not only
performance parameters (standard indices), but also alternative measures such as timing, energy
consumption and costs, which collectively contribute to the carbon footprint. To ensure a
complete analysis, we separately considered the prototyping phase (which involves model
selection through training-validation-test iterations) and the in-production phases. These phases
follow distinct implementation procedures and require different resources. The results indicate
that simpler algorithms often achieve performance levels similar to those of complex models
(LLM and generative models), consuming much less energy and requiring fewer resources.
These findings suggest that companies should weigh factors beyond raw performance when choosing machine learning (ML) solutions. The analysis also demonstrates that the scientific community increasingly needs to account for energy consumption in model evaluations, in order to give real meaning to results obtained with standard metrics (Precision, Recall, F1, and so on).
Keywords
NLP, text mining, green AI, green NLP, carbon footprint, energy consumption, evaluation.
Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
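The kind of timing measurement the study combines with energy and cost can be sketched as follows. This is a generic measurement harness, not the paper's methodology, and the 250 W average power draw is a placeholder assumption, not a figure from the paper.

```python
import time

def estimate_energy(fn, *args, runs=100, avg_power_watts=250.0):
    """Estimate the energy cost of repeatedly calling `fn`.

    Wall-clock time is measured with a monotonic timer; energy is
    approximated as elapsed time times an assumed average device
    power draw, reported in joules and watt-hours.
    """
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    elapsed = time.perf_counter() - start
    joules = elapsed * avg_power_watts
    return {"seconds": elapsed, "joules": joules, "watt_hours": joules / 3600.0}

# Toy workload standing in for model inference:
stats = estimate_energy(lambda xs: sorted(xs), list(range(1000)), runs=50)
```

Real measurements would replace the assumed wattage with readings from a power meter or tools such as GPU telemetry, since constant-power estimates can be far off for bursty workloads.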
A Study on the Appropriate Size of the Mongolian General Corpus
Choi Sun Soo1 and Ganbat Tsend2, 1University of the Humanities, Mongolia, 2Otgontenger
University, Mongolia
Abstract
This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps’ function and the Type-Token Ratio (TTR). The study’s sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, the economy, society, culture, sports, and world affairs; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we
estimated the Heaps’ function with this sample corpus. Next, we observed changes in the number
of types and TTR values while increasing the number of tokens by one million using the
estimated Heaps’ function. We observed that the TTR value hardly changed once the number of tokens exceeded 39-42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39-42 million tokens.
Keywords
Mongolian general corpus, Appropriate size of corpus, Sample corpus, Heaps’ function, TTR,
Type, Token.
Full Text: https://aircconline.com/ijnlc/V12N3/12323ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
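The procedure described above (estimate Heaps' function, then step through token counts watching the TTR) can be sketched as follows. The parameters K and beta below are illustrative placeholders, not the values the paper estimated from its 906,064-token sample corpus, and the convergence threshold is likewise an assumption.

```python
def heaps_types(n_tokens, k, beta):
    """Heaps' function: predicted number of types V = K * N**beta."""
    return k * n_tokens ** beta

def ttr_plateau(k, beta, step=1_000_000, max_tokens=100_000_000, eps=1e-4):
    """Return the first token count (stepping by `step`) at which the
    TTR (types/tokens) drops by less than `eps` per additional step."""
    prev_ttr = None
    for n in range(step, max_tokens + step, step):
        ttr = heaps_types(n, k, beta) / n
        if prev_ttr is not None and prev_ttr - ttr < eps:
            return n
        prev_ttr = ttr
    return None

# Placeholder Heaps' parameters; the paper fits these to its sample corpus.
plateau = ttr_plateau(k=40.0, beta=0.5)
```

With these toy parameters the plateau lands in the tens of millions of tokens, which is the same order of magnitude as the paper's 39-42 million conclusion.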
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
Ali Mehrban1 and Pegah Ahadian2, 1Newcastle University, UK, 2Kent State University,
USA
Abstract
This paper discusses the impact of the Internet on modern trading and the importance of data
generated from these transactions for organizations to improve their marketing efforts. The paper
uses the example of Divar, an online marketplace for buying and selling products and services in
Iran, and presents a competition to predict the percentage of a car sales ad that would be
published on the Divar website. Since the dataset provides a rich source of Persian text data, the
authors use the Hazm library, a Python library designed for processing Persian text, and two
state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary
objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The
authors provide some background on data mining, Persian language, and the two language
models, examine the dataset's composition and statistical features, and provide details on their
fine-tuning and training configurations for both approaches. They present the results of their
analysis and highlight the strengths and weaknesses of the two language models when applied to
Persian text data. The paper offers valuable insights into the challenges and opportunities of
working with low-resource languages such as Persian and the potential of advanced language
models like BERT for analyzing such data. The paper also explains the data mining process,
including steps such as data cleaning and normalization. Finally, the paper discusses the types of machine learning problems (supervised, unsupervised, and reinforcement learning) and pattern evaluation techniques such as the confusion matrix. Overall, the paper
provides an informative overview of the use of language models and data mining techniques for
analyzing text data in low-resource languages, using the example of the Divar dataset.
Keywords
Text Recognition, Persian text, NLP, mBERT, ParsBERT
Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
Understanding Chinese Moral Stories with Further Pre-Training
Jing Qian1, Yong Yue1, Katie Atkinson2 and Gangmin Li3, 1Xi’an Jiaotong Liverpool
University, China, 2University of Liverpool, UK, 3University of Bedfordshire, UK
Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by
delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is
compacted into a single statement without involving any characters within the original text,
necessitating a more astute language model that can comprehend connotative morality and
exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced
in neural language models. In this paper, we propose an intermediary phase to establish an
improved paradigm of “pre-training + further pre-training + fine-tuning”. Further pre-training
generally refers to continual learning on task-specific or domain-relevant corpora before being
applied to target tasks, which aims at bridging the gap in data distribution between the phases of
pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that comprises 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step relies on a newly collected Chinese dataset of Confucian moral culture, and the second builds on the Chinese version of a frequently used commonsense knowledge graph (ATOMIC) to enrich the backbone model with inferential knowledge besides morality. By comparison with several
advanced models including BERT-base, RoBERTa-base and T5-base, experimental results on
two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.
Keywords
Moral Understanding, Further Pre-training, Knowledge Graph, Pre-trained Language Model
Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
Location-based Sentiment Analysis of 2019 Nigeria Presidential Election using a Voting Ensemble Approach
Ikechukwu Onyenwe1, Samuel N.C. Nwagbo2, Ebele Onyedinma1, Onyedika Ikechukwu-Onyenwe1, Chidinma A. Nwafor3 and Obinna Agbata1, 1Computer Science Department, Nnamdi Azikiwe University, Awka, Anambra, Nigeria, 2Political Science Department, Nnamdi Azikiwe University, Awka, Anambra, Nigeria, 3Computer Science Department, Nigerian Army College of Environmental Science and Technology, Makurdi, Benue, Nigeria
Abstract
Nigerian president Buhari defeated his closest rival Atiku Abubakar by over 3 million votes. He was issued a Certificate of Return and was sworn in on 29 May 2019. However, the opposition claimed widespread fraud. Sentiment analysis captures the opinions of the masses on social media about global events. In this paper, we use 2019 Nigeria presidential
election tweets to perform sentiment analysis through the application of a voting ensemble
approach (VEA) in which the predictions from multiple techniques are combined to find the best
polarity of a tweet (sentence). This is to determine public views on the 2019 Nigeria Presidential
elections and compare them with actual election results. Our sentiment analysis experiment is
focused on location-based viewpoints where we used Twitter location data. For this experiment,
we live-streamed Nigeria 2019 election tweets via the Twitter API to create a dataset of 583,816 tweets, pre-processed the data, and applied VEA using three different sentiment classifiers to obtain the best polarity of a given tweet. Furthermore, we segmented our tweets dataset
into Nigerian states and geopolitical zones, then plotted state-wise and zone-wise user sentiment towards Buhari and Atiku and their political parties. The overall objective of the use
of states/geopolitical zones is to evaluate the similarity between the sentiment of location-based
tweets compared to actual election results. The results reveal that, although election outcomes coincide with the sentiment expressed on Twitter in most cases, as shown by the polarity scores of different locations, there are also some election results where our location-based similarity test failed.
Keywords
Nigeria, Election, Sentiment Analysis, Politics, Tweets, Exploratory Data Analysis, Location Data
Full Text: https://aircconline.com/ijnlc/V12N1/12123ijnlc01.pdf
Volume URL: https://airccse.org/journal/ijnlc/vol12.html
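The voting idea in the abstract can be sketched minimally: each classifier labels the tweet, and the majority label wins. The function below is illustrative; in particular, the tie-breaking rule (fall back to "neutral") is an assumption, not necessarily the paper's.

```python
from collections import Counter

def vote_polarity(predictions):
    """Majority vote over polarity labels from several classifiers.

    `predictions` is one label per classifier, e.g.
    ["positive", "positive", "negative"]. Ties fall back to "neutral".
    """
    counts = Counter(predictions)
    top = counts.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "neutral"
    return top[0][0]

# Three classifiers' labels for the same tweet:
label = vote_polarity(["positive", "positive", "negative"])
```

Combining classifiers this way matters because individual sentiment classifiers often disagree on noisy, code-mixed election tweets; the vote smooths out single-model errors.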
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context
for Continuous Speech Recognition
Piyush Behre, Sharman Tan, Padma Varadharajan and Shuangyu Chang, Microsoft
Corporation
Abstract
While speech recognition Word Error Rate (WER) has reached human parity for English,
continuous speech recognition scenarios such as voice typing and meeting transcriptions still
suffer from segmentation and punctuation problems, resulting from irregular pausing patterns or
slow speakers. Transformer sequence tagging models are effective at capturing long bi-
directional context, which is crucial for automatic punctuation. Automatic Speech Recognition
(ASR) production systems, however, are constrained by real-time requirements, making it hard
to incorporate the right context when making punctuation decisions. Context within the segments
produced by ASR decoders can be helpful but limiting in overall punctuation performance for a
continuous speech session. In this paper, we propose a streaming approach for punctuation or re-
punctuation of ASR output using dynamic decoding windows and measure its impact on
punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation also achieves an average BLEU score improvement of 0.66 for the downstream task of Machine Translation (MT).
Keywords
automatic punctuation, automatic speech recognition, re-punctuation, speech segmentation.
Full Text: https://aircconline.com/ijnlc/V11N6/11622ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
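The segmentation F0.5-score reported above is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall (spurious segment breaks are penalized more than missed ones). A small sketch, with illustrative precision/recall numbers rather than the paper's:

```python
def f_beta(precision, recall, beta=0.5):
    """General F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta < 1 weights precision more heavily than recall; beta = 1
    recovers the familiar F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative segmentation quality numbers:
score = f_beta(precision=0.9, recall=0.6)
```

With these numbers the score sits much closer to the precision value than F1 would, which is the point of choosing beta = 0.5 for segmentation evaluation.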
A Robust Three-Stage Hybrid Framework for English to Bangla Transliteration
Redwan Ahmed Rizvee, Asif Mahmood, Shakur Shams Mullick and Sajjadul Hakim, Tiger
IT Bangladesh Limited, Dhaka, Bangladesh
Abstract
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common. Existing transliteration tools display poor
performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration
(THT) framework that can transliterate both English words and phonetic typed Bangla words
satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based
techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
Keywords
Transliteration framework, phonetic typing, English to Bangla, hybrid framework, THT.
Full Text: https://aircconline.com/ijnlc/V11N1/11122ijnlc04.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
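The dictionary-then-rules hybrid described above can be sketched as follows: look the word up in a transliteration dictionary first, and fall back to greedy character-level rules only on a miss. The mini-dictionary and the character rules below are toy placeholders, not the paper's actual resources or rule set.

```python
# Hypothetical mini-dictionary of phonetically typed Bangla words.
DICTIONARY = {"ami": "আমি", "bhalo": "ভালো"}

# Illustrative character rules, tried in order (multi-letter rules first).
RULES = [("bh", "ভ"), ("a", "া"), ("m", "ম"), ("i", "ি"), ("l", "ল"), ("o", "ো")]

def transliterate_word(word):
    """Dictionary lookup first; fall back to greedy rule-based mapping."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    out, i = [], 0
    while i < len(word):
        for src, dst in RULES:
            if word.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(word[i])  # pass through unmapped characters
            i += 1
    return "".join(out)

result = transliterate_word("ami")
```

The dictionary stage handles common words exactly, while the rule stage keeps the system robust on out-of-vocabulary phonetic spellings, which is the motivation for hybridizing the two.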
Analyzing Architectures for Neural Machine Translation using Low Computational
Resources
Aditya Mandke, Onkar Litake, and Dipali Kadam, SCTR’s Pune Institute of Computer
Technology, India
Abstract
With the recent developments in the field of Natural Language Processing, there has been a rise
in the use of different architectures for Neural Machine Translation. Transformer architectures
are used to achieve state-of-the-art accuracy, but they are very computationally expensive to
train. Not everyone has access to setups consisting of high-end GPUs and other such resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet achieved lower BLEU scores. LSTM performed well in the experiment and took comparatively less time to train than transformers, making it suitable for situations with time constraints.
Keywords
Machine Translation, Indic Languages, Natural Language Processing.
Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
Developing Products Update-Alert System for E-Commerce Websites Users using Html
Data and Web Scraping Technique
Ikechukwu Onyenwe, Ebele Onyedinma, Chidinma Nwafor and Obinna Agbata, Nnamdi
Azikiwe University, Nigeria
Abstract
Websites are regarded as domains of limitless information that anyone can access. Technology trends have shaped the way we run and manage our businesses. Today, advancements in Internet technology have given rise to a proliferation of e-commerce websites. This, in turn, has made life easier for marketers/vendors, retailers, and consumers (collectively regarded as users in this paper) by providing convenient platforms to sell and order items over the Internet. Unfortunately, these benefits come with drawbacks, as such platforms require users to spend considerable time and effort searching for the best product deals, product updates, and offers. Furthermore, users need to filter and compare search results themselves, which takes a lot of time and risks ambiguous results. In this paper, we applied web crawling and
scraping methods on an e-commerce website to obtain HTML data for identifying products
updates based on the current time. These HTML data are preprocessed to extract details of the
products such as name, price, post date and time, etc. to serve as useful information for users.
Keywords
Natural Language Processing (NLP), E-commerce, E-retail, HTML Data, Web, Web Scraping
Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
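The extraction step described above (parse fetched HTML and pull out product details such as name, price, and post time) can be sketched with Python's standard-library parser. The markup and the class names ("product", "name", "price", "posted") are hypothetical; a real e-commerce site needs its own site-specific selectors.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect product name, price, and post time from listing markup."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per product block
        self._field = None   # field name the next text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "product" in classes:
            self.products.append({})      # start a new product record
        elif classes in ("name", "price", "posted"):
            self._field = classes         # remember which field to fill

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()
            self._field = None

# Hypothetical listing snippet standing in for a fetched page:
html = """
<div class="product">
  <span class="name">USB-C Cable</span>
  <span class="price">1500</span>
  <span class="posted">2021-09-01 10:15</span>
</div>
"""
parser = ProductParser()
parser.feed(html)
```

After parsing, `parser.products` holds one dictionary per product, ready to be compared against the current time to flag fresh updates as the abstract describes.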

More Related Content

PDF
January 2024: Top 10 Downloaded Articles in Natural Language Computing
PDF
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
PDF
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
PDF
UNDERSTANDING CHINESE MORAL STORIES WITH FURTHER PRE-TRAINING
PDF
Understanding Chinese Moral Stories with Further Pre-Training
PDF
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
PDF
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
PDF
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
January 2024: Top 10 Downloaded Articles in Natural Language Computing
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
UNDERSTANDING CHINESE MORAL STORIES WITH FURTHER PRE-TRAINING
Understanding Chinese Moral Stories with Further Pre-Training
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS
QUESTION ANSWERING MODULE LEVERAGING HETEROGENEOUS DATASETS

Similar to May 2024 - Top10 Cited Articles in Natural Language Computing (20)

PPTX
How machines learn to talk. Machine Learning for Conversational AI
PDF
Supreme court dialogue classification using machine learning models
PDF
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
PDF
ENHANCING EDUCATIONAL QA SYSTEMS INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANGU...
PDF
Analysis of the evolution of advanced transformer-based language models: Expe...
PDF
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
PDF
The Value and Benefits of Data-to-Text Technologies
PDF
A multilingual semantic search chatbot framework
PDF
Language and Offensive Word Detection
PDF
A Vietnamese Text-based Conversational Agent.pdf
PDF
A Comparative Study of Text Comprehension in IELTS Reading Exam using GPT-3
PDF
Xử lý ngôn ngữ tự nhiên dựa trên học sâu
PDF
Evaluation Challenges in Using Generative AI for Science & Technical Content
PDF
A Comprehensive Analytical Study Of Traditional And Recent Development In Nat...
PPTX
NLP in 2020
PDF
An in-depth review on News Classification through NLP
PDF
Evaluating the machine learning models based on natural language processing t...
PDF
Challenges with Securing Digital Identity
PDF
Decoding AI and Human Authorship: Nuances Revealed through NLP and Statistica...
PDF
IRJET- Factoid Question and Answering System
How machines learn to talk. Machine Learning for Conversational AI
Supreme court dialogue classification using machine learning models
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
ENHANCING EDUCATIONAL QA SYSTEMS INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANGU...
Analysis of the evolution of advanced transformer-based language models: Expe...
ENHANCING EDUCATIONAL QA SYSTEMS: INTEGRATING KNOWLEDGE GRAPHS AND LARGE LANG...
The Value and Benefits of Data-to-Text Technologies
A multilingual semantic search chatbot framework
Language and Offensive Word Detection
A Vietnamese Text-based Conversational Agent.pdf
A Comparative Study of Text Comprehension in IELTS Reading Exam using GPT-3
Xử lý ngôn ngữ tự nhiên dựa trên học sâu
Evaluation Challenges in Using Generative AI for Science & Technical Content
A Comprehensive Analytical Study Of Traditional And Recent Development In Nat...
NLP in 2020
An in-depth review on News Classification through NLP
Evaluating the machine learning models based on natural language processing t...
Challenges with Securing Digital Identity
Decoding AI and Human Authorship: Nuances Revealed through NLP and Statistica...
IRJET- Factoid Question and Answering System
Ad

More from kevig (20)

PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
PDF
Call For Papers -International Journal on Natural Language Computing (IJNLC)
PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
Call For Papers - 3rd International Conference on NLP & Signal Processing (NL...
PDF
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMM...
PDF
Call For Papers- 14th International Conference on Natural Language Processing...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
Call For Papers - 6th International Conference on Natural Language Processing...
PDF
July 2025 Top 10 Download Article in Natural Language Computing.pdf
PDF
Orchestrating Multi-Agent Systems for Multi-Source Information Retrieval and ...
PDF
Call For Papers - 6th International Conference On NLP Trends & Technologies (...
PDF
Call For Papers - 6th International Conference on Natural Language Computing ...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)...
PDF
Call For Papers - 4th International Conference on NLP and Machine Learning Tr...
PDF
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...
PDF
Call For Papers - International Journal on Natural Language Computing (IJNLC)
PDF
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Call For Papers -International Journal on Natural Language Computing (IJNLC)
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
Call For Papers - 3rd International Conference on NLP & Signal Processing (NL...
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMM...
Call For Papers- 14th International Conference on Natural Language Processing...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
Call For Papers - 6th International Conference on Natural Language Processing...
July 2025 Top 10 Download Article in Natural Language Computing.pdf
Orchestrating Multi-Agent Systems for Multi-Source Information Retrieval and ...
Call For Papers - 6th International Conference On NLP Trends & Technologies (...
Call For Papers - 6th International Conference on Natural Language Computing ...
Call For Papers - International Journal on Natural Language Computing (IJNLC)...
Call For Papers - 4th International Conference on NLP and Machine Learning Tr...
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Call For Papers - International Journal on Natural Language Computing (IJNLC)
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...
Call For Papers - International Journal on Natural Language Computing (IJNLC)
INTERLINGUAL SYNTACTIC PARSING: AN OPTIMIZED HEAD-DRIVEN PARSING FOR ENGLISH ...
Ad

Recently uploaded (20)

PPT
Total quality management ppt for engineering students
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PPTX
Fundamentals of Mechanical Engineering.pptx
PDF
737-MAX_SRG.pdf student reference guides
PDF
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
PPTX
Software Engineering and software moduleing
PPT
INTRODUCTION -Data Warehousing and Mining-M.Tech- VTU.ppt
PPTX
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
PDF
Abrasive, erosive and cavitation wear.pdf
PDF
August -2025_Top10 Read_Articles_ijait.pdf
PDF
Categorization of Factors Affecting Classification Algorithms Selection
PDF
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PPTX
"Array and Linked List in Data Structures with Types, Operations, Implementat...
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PDF
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
Total quality management ppt for engineering students
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
Fundamentals of Mechanical Engineering.pptx
737-MAX_SRG.pdf student reference guides
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
Software Engineering and software moduleing
INTRODUCTION -Data Warehousing and Mining-M.Tech- VTU.ppt
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
Abrasive, erosive and cavitation wear.pdf
August -2025_Top10 Read_Articles_ijait.pdf
Categorization of Factors Affecting Classification Algorithms Selection
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
Fundamentals of safety and accident prevention -final (1).pptx
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
distributed database system" (DDBS) is often used to refer to both the distri...
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
"Array and Linked List in Data Structures with Types, Operations, Implementat...
Exploratory_Data_Analysis_Fundamentals.pdf
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...

May 2024 - Top10 Cited Articles in Natural Language Computing

  • 1. May 2024: Top10 Cited Articles in Natural Language Computing International Journal on Natural Language Computing (IJNLC) https://airccse.org/journal/ijnlc/index.html ISSN: 2278 - 1307 [Online]; 2319 - 4111 [Print] Google Scholar https://scholar.google.com/citations?user=A5tqIdoAAAAJ&hl=en
  • 2. Rag-Fusion: A New Take on Retrieval Augmented Generation Zackary Rackauckas, Infineon Technologies, California Abstract Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG- Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context. Keywords Chatbot, Retrieval-augmented Generation, Reciprocal Rank Fusion, Natural Language Processing Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc03.pdf Volume URL: http://airccse.org/journal/ijnlc/vol13.html
Performance, Energy Consumption and Costs: A Comparative Analysis of Automatic Text Classification Approaches in the Legal Domain
Leonardo Rigutini1, Achille Globo1, Marco Stefanelli2, Andrea Zugarini1, Sinan Gultekin1, Marco Ernandes1, 1expert.ai spa, Italy, 2University of Siena, Italy

Abstract
The common practice in machine learning research is to evaluate models based on their performance alone. However, this often overlooks other crucial aspects. In some cases, the performance differences between approaches are insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLMs (BERT-family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only standard performance indices, but also alternative measures such as timing, energy consumption, and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phase, since these follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLMs and generative models) while consuming much less energy and requiring fewer resources. These findings suggest that companies should weigh these additional factors when choosing machine learning (ML) solutions. The analysis also shows that the scientific community increasingly needs to include energy consumption in model evaluations, so that results obtained with standard metrics (Precision, Recall, F1, and so on) can be given real meaning.

Keywords
NLP, text mining, green AI, green NLP, carbon footprint, energy consumption, evaluation.

Full Text: https://aircconline.com/ijnlc/V13N1/13124ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol13.html
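A back-of-the-envelope version of the kind of accounting the study argues for: turning measured runtime and average power draw into energy and CO2e figures. All numbers below are illustrative placeholders, not the paper's measurements, and 0.475 kg CO2e/kWh is only a commonly cited global-average grid intensity.

```python
def training_footprint(runtime_hours, avg_power_watts, carbon_intensity=0.475):
    """Estimate energy (kWh) and CO2-equivalent (kg) for one training run.

    carbon_intensity is kg CO2e per kWh of electricity; the default is
    an illustrative global-average figure, not a measured value.
    """
    energy_kwh = avg_power_watts * runtime_hours / 1000.0
    co2_kg = energy_kwh * carbon_intensity
    return energy_kwh, co2_kg

# Hypothetical comparison: a short SVM prototyping run vs. a long
# LLM fine-tuning run on a GPU with higher average draw.
svm_energy, svm_co2 = training_footprint(runtime_hours=0.5, avg_power_watts=150)
llm_energy, llm_co2 = training_footprint(runtime_hours=24, avg_power_watts=300)
```

Even with made-up inputs, the orders-of-magnitude gap illustrates why near-equal F1 scores can hide very unequal footprints.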
A Study on the Appropriate Size of the Mongolian General Corpus
Choi Sun Soo1 and Ganbat Tsend2, 1University of the Humanities, Mongolia, 2Otgontenger University, Mongolia

Abstract
This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps' function and the Type-Token Ratio (TTR). The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, economy, society, culture, sports, and world affairs; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps' function from this sample corpus. Next, using the estimated function, we observed changes in the number of types and in TTR values while increasing the number of tokens in increments of one million. We found that the TTR value hardly changed once the number of tokens exceeded 39-42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39-42 million tokens.

Keywords
Mongolian general corpus, Appropriate size of corpus, Sample corpus, Heaps' function, TTR, Type, Token.

Full Text: https://aircconline.com/ijnlc/V12N3/12323ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
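The Heaps'-function procedure above can be illustrated as follows. The parameters K and beta are invented placeholders; the study estimates them from its 906,064-token sample corpus.

```python
def heaps_types(n_tokens, K=45.0, beta=0.52):
    """Heaps' function V(N) = K * N**beta: expected number of distinct
    word types in a corpus of N tokens. K and beta are illustrative
    placeholders, not the paper's estimated parameters."""
    return K * n_tokens ** beta

def ttr(n_tokens, **params):
    """Type-Token Ratio: estimated number of types divided by tokens."""
    return heaps_types(n_tokens, **params) / n_tokens

# TTR keeps falling as the corpus grows, but the change per additional
# million tokens shrinks until it is negligible -- the plateau the
# study uses to pick a corpus size.
delta_small = ttr(1_000_000) - ttr(2_000_000)
delta_large = ttr(40_000_000) - ttr(41_000_000)
```

Because beta < 1, TTR = K * N**(beta - 1) decreases ever more slowly, which is why the observed change flattens out somewhere in the tens of millions of tokens.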
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
Ali Mehrban1 and Pegah Ahadian2, 1Newcastle University, UK, 2Kent State University, USA

Abstract
This paper discusses the impact of the Internet on modern trading and the importance of data generated from these transactions for organizations to improve their marketing efforts. The paper uses the example of Divar, an online marketplace for buying and selling products and services in Iran, and presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The authors provide some background on data mining, the Persian language, and the two language models, examine the dataset's composition and statistical features, and provide details on their fine-tuning and training configurations for both approaches. They present the results of their analysis and highlight the strengths and weaknesses of the two language models when applied to Persian text data. The paper offers valuable insights into the challenges and opportunities of working with low-resource languages such as Persian and the potential of advanced language models like BERT for analyzing such data. The paper also explains the data mining process, including steps such as data cleaning and normalization techniques. Finally, the paper discusses the types of machine learning problems, such as supervised, unsupervised, and reinforcement learning, and pattern evaluation techniques, such as the confusion matrix. Overall, the paper provides an informative overview of the use of language models and data mining techniques for analyzing text data in low-resource languages, using the example of the Divar dataset.

Keywords
Text Recognition, Persian text, NLP, mBERT, ParsBERT

Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
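Among the pattern evaluation techniques the abstract mentions is the confusion matrix; a minimal version for categorical labels could look like this. The labels and predictions below are made up for illustration, not taken from the Divar dataset.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples whose true label is labels[i]
    and whose predicted label is labels[j]."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix

# Hypothetical binary outcome: is the ad published or rejected?
labels = ["published", "rejected"]
y_true = ["published", "published", "rejected", "rejected", "published"]
y_pred = ["published", "rejected", "rejected", "published", "published"]
cm = confusion_matrix(y_true, y_pred, labels)
```

The diagonal holds correct predictions, so per-class precision and recall fall straight out of the row and column sums.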
Understanding Chinese Moral Stories with Further Pre-Training
Jing Qian1, Yong Yue1, Katie Atkinson2 and Gangmin Li3, 1Xi'an Jiaotong-Liverpool University, China, 2University of Liverpool, UK, 3University of Bedfordshire, UK

Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters from the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The "pre-training + fine-tuning" paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of "pre-training + further pre-training + fine-tuning". Further pre-training generally refers to continual learning on task-specific or domain-relevant corpora before application to target tasks, which aims at bridging the gap in data distribution between the pre-training and fine-tuning phases. Our work is based on a Chinese dataset named STORAL-ZH that comprises 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step depends on a newly collected Chinese dataset of Confucian moral culture, and the second step is based on the Chinese version of a frequently used commonsense knowledge graph (i.e., ATOMIC) to enrich the backbone model with inferential knowledge in addition to morality. By comparison with several advanced models, including BERT-base, RoBERTa-base, and T5-base, experimental results on two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.

Keywords
Moral Understanding, Further Pre-training, Knowledge Graph, Pre-trained Language Model

Full Text: https://aircconline.com/ijnlc/V12N2/12223ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol12.html
Location-Based Sentiment Analysis of the 2019 Nigeria Presidential Election Using a Voting Ensemble Approach
Ikechukwu Onyenwe1, Samuel N.C. Nwagbo2, Ebele Onyedinma1, Onyedika Ikechukwu-Onyenwe1, Chidinma A. Nwafor3 and Obinna Agbata1, 1Computer Science Department, Nnamdi Azikiwe University, Onitsha-Enugu Expressway, Awka, PMB 5025, Anambra, Nigeria, 2Political Science Department, Nnamdi Azikiwe University, Onitsha-Enugu Expressway, Awka, PMB 5025, Anambra, Nigeria, 3Computer Science Department, Nigerian Army College of Environmental Science and Technology, North-Bank, Makurdi, PMB 102272, Benue, Nigeria

Abstract
Nigerian president Buhari defeated his closest rival, Atiku Abubakar, by over 3 million votes. He was issued a Certificate of Return and was sworn in on 29 May 2019. However, there were claims of widespread fraud by the opposition. Sentiment analysis captures the opinions of the masses expressed on social media about global events. In this paper, we use 2019 Nigeria presidential election tweets to perform sentiment analysis through a voting ensemble approach (VEA), in which the predictions from multiple techniques are combined to find the best polarity of a tweet (sentence). The aim is to determine public views on the 2019 Nigeria presidential election and compare them with the actual election results. Our experiment focuses on location-based viewpoints using Twitter location data. We live-streamed Nigeria 2019 election tweets via the Twitter API to create a dataset of 583,816 tweets, pre-processed the data, and applied the VEA, utilizing three different sentiment classifiers to obtain the best polarity of a given tweet. Furthermore, we segmented our dataset by Nigerian state and geopolitical zone, then plotted state-wise and zone-wise user sentiment towards Buhari and Atiku and their political parties. The overall objective of using states and geopolitical zones is to evaluate the similarity between the sentiment of location-based tweets and the actual election results. The results reveal that while the election outcomes coincide with the sentiment expressed on Twitter in most locations, as shown by the polarity scores, there are also some results where our location-based similarity test failed.

Keywords
Nigeria, Election, Sentiment Analysis, Politics, Tweets, Exploratory Data Analysis, Location Data

Full Text: https://aircconline.com/ijnlc/V12N1/12123ijnlc01.pdf
Volume URL: https://airccse.org/journal/ijnlc/vol12.html
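A voting ensemble over per-classifier polarity labels can be sketched as below. The classifier outputs are invented, and the tie-breaking rule (fall back to "neutral") is an illustrative choice, not necessarily the paper's.

```python
from collections import Counter

def vote_polarity(predictions):
    """Majority vote over the polarity labels that several sentiment
    classifiers assigned to one tweet. Ties fall back to 'neutral',
    an illustrative tie-breaking choice."""
    counts = Counter(predictions)
    label, top = counts.most_common(1)[0]
    if list(counts.values()).count(top) > 1:  # two or more labels tied
        return "neutral"
    return label

# Hypothetical outputs of three sentiment classifiers for two tweets.
tweet_votes = [
    ["positive", "positive", "negative"],
    ["negative", "neutral", "positive"],
]
polarities = [vote_polarity(votes) for votes in tweet_votes]
```

Grouping such per-tweet polarities by the tweet's state or geopolitical zone then gives the location-wise sentiment totals the study plots.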
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition
Piyush Behre, Sharman Tan, Padma Varadharajan and Shuangyu Chang, Microsoft Corporation

Abstract
While speech recognition Word Error Rate (WER) has reached human parity for English, continuous speech recognition scenarios such as voice typing and meeting transcription still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence-tagging models are effective at capturing long bidirectional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. Context within the segments produced by ASR decoders can be helpful but limits overall punctuation performance for a continuous speech session. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving the segmentation F0.5-score by 13.9%. Streaming punctuation also achieves an average BLEU score improvement of 0.66 on the downstream task of Machine Translation (MT).

Keywords
automatic punctuation, automatic speech recognition, re-punctuation, speech segmentation.

Full Text: https://aircconline.com/ijnlc/V11N6/11622ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
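One way to picture the dynamic-decoding-window idea: re-run a punctuator over a rolling buffer of recent tokens and commit text only up to the last sentence-final mark, so later context can still revise the uncommitted tail. The toy punctuator, window size, and token stream below are invented for illustration and merely stand in for the transformer tagger; the actual mechanism in the paper may differ.

```python
def stream_punctuate(tokens, punctuate, window=4):
    """Toy streaming re-punctuation over a rolling buffer.

    Once the buffer reaches `window` tokens, the punctuator is re-run
    over the whole buffer and everything up to the last sentence-final
    mark is emitted; the tail stays buffered for later revision.
    """
    buffer, sentences = [], []
    for token in tokens:
        buffer.append(token)
        if len(buffer) < window:
            continue
        marks = punctuate(buffer)  # one mark ('' or '.') per token
        if "." not in marks:
            continue               # a real system would also bound this latency
        cut = len(marks) - 1 - marks[::-1].index(".")
        sentences.append(" ".join(buffer[: cut + 1]) + ".")
        buffer = buffer[cut + 1:]
    if buffer:                     # flush whatever remains at end of stream
        sentences.append(" ".join(buffer))
    return sentences

# Stand-in punctuator: treats the word "stop" as sentence-final.
def toy_punctuate(buffer):
    return ["." if token == "stop" else "" for token in buffer]

stream = "hello world stop how are you stop fine".split()
result = stream_punctuate(stream, toy_punctuate)
```

The point of the buffered tail is exactly the paper's motivation: punctuation decisions improve when the model sees context to the right of a candidate boundary, not just the segment the ASR decoder happened to emit.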
A Robust Three-Stage Hybrid Framework for English to Bangla Transliteration
Redwan Ahmed Rizvee, Asif Mahmood, Shakur Shams Mullick and Sajjadul Hakim, Tiger IT Bangladesh Limited, Dhaka, Bangladesh

Abstract
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common, and existing transliteration tools display poor performance on them. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetically typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.

Keywords
Transliteration framework, phonetic typing, English to Bangla, hybrid framework, THT.

Full Text: https://aircconline.com/ijnlc/V11N1/11122ijnlc04.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol11.html
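A dictionary-then-rules fallback of the kind the abstract describes might be organized roughly as follows. The mapping tables are ASCII placeholders, not THT's actual dictionary or rules, and the real framework's three stages may differ in detail.

```python
def transliterate(word, dictionary, rules):
    """Stage 1: exact dictionary lookup. Stage 2: longest-match
    rule rewriting, left to right. Stage 3: copy any character
    no rule covers. Tables are caller-supplied placeholders."""
    if word in dictionary:                 # stage 1: dictionary hit
        return dictionary[word]
    ordered = sorted(rules, key=lambda rule: -len(rule[0]))
    out, i = "", 0
    while i < len(word):
        for src, dst in ordered:           # stage 2: longest rule first
            if word.startswith(src, i):
                out += dst
                i += len(src)
                break
        else:                              # stage 3: pass-through fallback
            out += word[i]
            i += 1
    return out

# Placeholder tables for illustration only.
dictionary = {"cat": "KAT"}
rules = [("ch", "C"), ("a", "A")]
```

Trying longer rules before shorter ones is what lets a digraph like "ch" win over the single-letter rule that shares its first character.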
Analyzing Architectures for Neural Machine Translation using Low Computational Resources
Aditya Mandke, Onkar Litake, and Dipali Kadam, SCTR's Pune Institute of Computer Technology, India

Abstract
With recent developments in the field of Natural Language Processing, there has been a rise in the use of different architectures for Neural Machine Translation. Transformer architectures achieve state-of-the-art accuracy, but they are very computationally expensive to train, and not everyone has access to setups with high-end GPUs and other resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet achieved lower BLEU scores. LSTMs performed well in the experiment and took comparatively less time to train than transformers, making them suitable in situations with time constraints.

Keywords
Machine Translation, Indic Languages, Natural Language Processing.

Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc02.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
Developing Products Update-Alert System for E-Commerce Websites Users using HTML Data and Web Scraping Technique
Ikechukwu Onyenwe, Ebele Onyedinma, Chidinma Nwafor and Obinna Agbata, Nnamdi Azikiwe University, Nigeria

Abstract
Websites are regarded as domains of limitless information that anyone can access. New trends in technology have shaped the way we do and manage our businesses, and advancements in Internet technology have given rise to a proliferation of e-commerce websites. This, in turn, has made the activities of marketers/vendors, retailers, and consumers (collectively regarded as users in this paper) easier by providing convenient platforms to sell or order items through the internet. Unfortunately, these benefits are not without drawbacks, as these platforms require users to spend a lot of time and effort searching for the best product deals, product updates, and offers on e-commerce websites. Furthermore, users need to filter and compare search results by themselves, which takes a lot of time and risks ambiguous results. In this paper, we applied web crawling and scraping methods to an e-commerce website to obtain HTML data for identifying product updates based on the current time. The HTML data are preprocessed to extract details of the products, such as name, price, and post date and time, to serve as useful information for users.

Keywords
Natural Language Preprocessing (NLP), E-commerce, E-retail, HTML Data, Web, Web Scraping

Full Text: https://aircconline.com/ijnlc/V10N5/10521ijnlc01.pdf
Volume URL: http://airccse.org/journal/ijnlc/vol10.html
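The HTML-extraction step could be prototyped with the standard library's `html.parser`; the markup and CSS class names below are hypothetical, and a real deployment would first fetch the pages with a crawler rather than parse an inline string.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect product name/price pairs from spans tagged with
    hypothetical 'product-name' / 'product-price' classes."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("product-name", "product-price"):
            self._field = cls  # remember which field the next text fills

    def handle_data(self, data):
        if self._field == "product-name":
            self.products.append({"name": data.strip()})
            self._field = None
        elif self._field == "product-price":
            self.products[-1]["price"] = data.strip()
            self._field = None

# Hypothetical listing-page fragment standing in for crawled HTML.
page = """
<li><span class="product-name">USB cable</span>
    <span class="product-price">$4.99</span></li>
<li><span class="product-name">Keyboard</span>
    <span class="product-price">$19.50</span></li>
"""
parser = ProductParser()
parser.feed(page)
```

Comparing each run's extracted records against the previous run (e.g. by post date and time) is what turns this extraction into the update-alert behavior the paper describes.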