Sentiment Analysis with Transformers: A Complete Deep Learning Project — PT. I

Master fine-tuning Transformers, comparing deep learning architectures, and deploying sentiment analysis models

This project provides a detailed, step-by-step guide to fine-tuning a Transformer model for sentiment classification while taking you through the entire Machine Learning pipeline.

Curious for more? The COMPLETE project repository awaits you in the Bibliography section at the end of this tutorial, where you can explore every detail hands-on.

We begin by defining the problem and preparing data, then progress through building, training, and evaluating models.

The focus is on fine-tuning a Transformer model, but we also compare its performance with two traditional deep learning architectures, ensuring a well-rounded understanding of the methodologies involved.

Key concepts like deep learning architecture, metric interpretation, and deployment are highlighted throughout the project.

This is a comprehensive learning experience, designed to deepen your understanding of modern Machine Learning techniques.

The dataset used was sourced from carblacac/twitter-sentiment-analysis (Apache 2.0 license) on Hugging Face — a platform advancing artificial intelligence through open source and open science.

2. Installing and Loading Packages

Let’s start by installing and loading the Python packages we will use throughout this project.

First, we need to install three packages that are not included with Anaconda Python:

!pip install -q -U watermark

The watermark package is used to display a watermark with the versions of the packages used in the project.

!pip install -q spacy

The spacy package, an excellent tool for text data processing and natural language processing (NLP), which we will use here for preprocessing the model’s data.

!pip install -q transformers

And finally, the transformers package, which allows us to access the Hugging Face platform, retrieve pre-trained models, and perform fine-tuning with our own data.

Make sure the transformers package is properly installed. Once it is, you will have all three packages available, in addition to the standard packages provided with Anaconda Python.

Next, we will load all the packages we will use throughout the project:

1. Imports

import math
import nltk
import spacy
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import transformers
from tokenizers import BertWordPieceTokenizer
from tqdm import tqdm
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import load_model
from keras_preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.metrics import Precision, Recall, AUC
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.callbacks import EarlyStopping, LearningRateScheduler, CallbackList, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1_l2
from keras.saving import register_keras_serializable
from tensorflow.keras.layers import Layer, Dense
from transformers import TFDistilBertModel, DistilBertConfig
from tensorflow.keras.metrics import Precision, Recall, AUC
import warnings
warnings.filterwarnings('ignore')

Here, for the sake of organization, I always list all the packages at the beginning of the project. This approach makes it easier in terms of documentation, allowing you to review which packages were used later on.

However, many packages and functions only reveal their necessity as the project progresses. When that happens, what do I do? I return to this section and add the required package or function.

While it’s not mandatory to pre-load all packages at once, and you can load them as needed, I prefer to load all packages at the start. This ensures that all packages are loaded into your computer’s memory upfront.

As we progress through the project, I will explain the purpose and use of each package.

3. Loading Text Data

Now, let’s load our text dataset. The files are provided at the end of this tutorial in the GitHub link, located in the Bibliography and Useful Links section.

We are working with real, publicly available data. These datasets will also be provided alongside the project’s files.

We have two files: training_data.txt and test_data.txt. Let’s load both using the read_csv function from the pandas library.

2.a Loading Training Data

training_data = pd.read_csv('training_data.txt', header=None, delimiter=';')

2.b Loading Test Data

test_data = pd.read_csv('test_data.txt', header=None, delimiter=';')

The files do not have headers, so I specify header=None. The column separator is a semicolon (;), which acts as the delimiter.

When loading the training and test data without a header, pandas automatically assigns indices as column titles. Since Python indexing starts at 0, the first column will be labeled as 0, the second as 1, and so on.

Working with this format is inconvenient, so we will rename the columns to make them more meaningful.

3. Adjusting Column Names

training_data = training_data.rename(columns={0: 'text', 1: 'sentiment'})
test_data = test_data.rename(columns={0: 'text', 1: 'sentiment'})

The column with index 0 will be renamed to **text**, and the column with index 1 will be renamed to **sentiment**.

Since the dataset contains only two columns, we will apply this renaming to both the training and test datasets.

After renaming, we will check the shape of the datasets to verify the number of rows and columns.

4. Checking Dataset Shape

training_data.shape

-----> (16000, 2)

5. Checking Test Dataset Shape

test_data.shape

-----> (2000, 2)

We have 16,000 rows in the training dataset with two columns and 2,000 rows in the test dataset with two columns.

Let’s take a look at a sample of the data by displaying the first few rows of the training dataset.

6. Training Data Sample

training_data.head()

Observe that all we need here is a dataset with just two columns.

On one side, we have the **text**, and on the other, we have the associated **sentiment**.

If you want to reproduce this project with your own data, all you need is to gather historical data. For instance, you can extract text in the form of sentences or paragraphs, as you prefer, and then perform the process called labeling.

What is labeling? A human annotator reviews each sentence and determines the sentiment it represents — whether it’s fear, joy, anger, sadness, etc. The results are then recorded in the second column, **sentiment**.

In this project:

  • The **text** column will serve as the input variable.
  • The **sentiment** column will serve as the output variable.

Now, let’s check how many records we have for each sentiment.

7. Sentiments in Training Data

training_data['sentiment'].value_counts()

You can see that the training dataset contains the following sentiments: joy, sadness, anger, fear, love, and surprise.

For the test dataset, we need to ensure it contains the same set of sentiments.

8. Sentiments in Test Data

test_data['sentiment'].value_counts()

4. Can We Have Different Classes in Training and Test Data?

Observe above that in our dataset, the classes are the same in both training and test sets.

Each sentiment is present in both datasets; although the texts differ, the classes remain the same. And here comes a very common question:

Can training and test datasets have different classes? Yes or no? Technically, they can — it’s not prohibited. However, this will cause problems.

Imagine, for instance, that in the training data, there is no sentiment **Love**. You have no phrases with this sentiment in the training set. But in the test set, you do have phrases with the sentiment **Love**. What would happen in this case?

When training the model, you use the training data to teach it the mathematical relationship between **text** and its associated sentiment, which are the classes or categories. Will the model learn about **Love**? No, because you didn’t provide it with any examples of this sentiment. So, if **Love** is absent from the training data, the model won’t understand what this sentiment is, because you didn’t teach it.

However, in the test set, you do have this sentiment. And now, what happens? Well, the model will attempt to classify it as something similar to Love. It will analyze the phrases, right? It will look for patterns based on what it has learned.

What does Love resemble from what’s left? Maybe Joy, Sadness—but it depends on the context of love. It could be Joy, Sadness, or even Anger, why not?

In other words, the model won’t be able to classify any phrases as **Love**, because it didn’t learn about it. It will only classify what it has learned, the categories you taught it.

In the test data, I expect the exact same categories, so I can evaluate the model. When I later provide new phrases to the trained model, it will only classify them into the sentiments it learned during training.

Nothing beyond that. A Machine Learning model is created for a specific problem. Remember that. So, it will only do what you have taught it.

And what about the opposite scenario? If, for example, I have all the sentiments in the training data, but, in the test data, there is no Love? The impact of this is that you won’t be able to evaluate that category.

When evaluating, you’ll use the test data to assess the model, right? If there are no examples of Love, you won’t be able to know if the model correctly classifies this sentiment or not.

That’s why it’s ideal to have the same categories — the same classes — in both training and test datasets, so you can train and later evaluate the model effectively.
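
If you want to verify this on your own data, a quick check like the one below (not part of the original notebook) compares the label sets of the two files:

# Compare the sets of sentiment labels in both datasets
train_classes = set(training_data['sentiment'].unique())
test_classes = set(test_data['sentiment'].unique())

print(train_classes == test_classes)   # True when both splits share the same classes
print(train_classes - test_classes)    # classes present only in the training data
print(test_classes - train_classes)    # classes present only in the test data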

5. Preprocessing Text Data with SpaCy

The next step is the preprocessing of text data using SpaCy, an excellent Python library for working with text data and preparing it for further analysis.

[spaCy · Industrial-strength Natural Language Processing in Python](https://spacy.io/)

Here is the official website, and I recommend you visit it later. After that, you need to download the language model (dictionary):

!python -m spacy download en_core_web_md -q

I then need to load the dictionary. I will assign it to an object called nlp:

9. Load SpaCy Model

nlp = spacy.load('en_core_web_md')

Below, we define a Python function that applies preprocessing with SpaCy, taking the text data as input. This data is passed through the dictionary. Why? For example, if I want to change the verb form of a word, I need the dictionary to handle this transformation.

This is exactly what we do here — passing the text through the dictionary. Then, I will extract the lemma. What is the lemma? It’s essentially the root form of a word. The nlp object processes the text and breaks it down into tokens. In other words, it converts the text into smaller components (tokens).

After this, based on the dictionary, I extract the lemma for each token, which is its root form. I then convert it to lowercase (lower), and apply strip to clean up any unnecessary spaces.

This entire process is done inside a list comprehension, which acts as a loop in Python. For each token in my list of tokens (referred to as **doc**), I check whether it is a stop word. If it isn’t, I extract its lemma, convert it to lowercase, and apply strip to remove any leftover whitespace.

The result is fully processed data. This processed text is then saved back into the column, ensuring that I have the cleaned and simplified version of the text for further analysis.

10. Definition of the ‘preprocess_text’ Function, Which Takes a Text as a Parameter

def preprocess_text(text):

    # 10.a Process the text using the SpaCy model
    doc = nlp(text)

    # 10.b Create a list of lemmatized tokens, converted to lowercase, stripped of whitespace, excluding stopwords
    tokens = [token.lemma_.lower().strip() for token in doc if not token.is_stop]

    # 10.c Return the processed tokens as a single string, joined with spaces
    return ' '.join(tokens)

The next step is to take your DataFrame, use the apply function, and apply the preprocessing function to the text column.

The result will be saved in a new column named processed_text:

11. Apply the Preprocessing Function to Training Data

training_data['processed_text'] = training_data['text'].apply(preprocess_text)

12. Apply the Preprocessing Function to Test Data

test_data['processed_text'] = test_data['text'].apply(preprocess_text)

And I will display the first few rows as the result:

13. Data Sample

training_data.head()

Notice what we have here. On the left, we have the original text — sentences. Next, we have the sentiment, and finally, the processed text.

What did we accomplish here? We simplified the text. Why? Shortly, I will need to convert this text data into numerical representations. I will create matrices with numerical values because everything in this process ultimately boils down to mathematics. We cannot perform mathematical computations on raw text.

During model training, this text will be converted into numerical representations. Without simplification, the resulting matrix might become unnecessarily large and filled with irrelevant data.

For example, does it make sense to include the pronoun **I** in English? Almost every sentence starts with **I**—“I do something,” “I feel something,” “I experienced something.” Since it appears in every sentence, why not remove it? This is a clear example of simplification.

Simplification Steps:

  1. Lemmatization: Consider the first line, “Feeling completely overwhelmed.” As humans, we know that “feeling” is the gerund form of the verb feel. However, the computer cannot distinguish between gerund and verb forms. Instead of using “feeling”, we use just the root form: “feel”. This eliminates the need for word forms and focuses only on their roots.
  2. Removing Stopwords: Words like pronouns, adverbs, and common connectors (e.g., “and,” “but,” “the”) are often unnecessary for analysis. These words are stopwords — commonly occurring words with little contextual value in most cases.
  3. Reducing Redundancy: Removing verbs’ full forms and keeping just their roots reduces unnecessary repetition, simplifies text data, and avoids creating excessively large matrices.

Why Simplify?

  • Efficiency: Simplification reduces the size and complexity of the data, making it easier to convert into numerical matrices.
  • Focus on Meaning: For the machine, only the core meaning of each phrase matters to detect sentiment.
  • Avoid Irrelevance: Including unnecessary words like “I” or fully-formed verbs adds no real value in most cases.

By applying natural language processing (NLP), we simplify the text data, focusing on essential parts of each sentence. This was the general preprocessing step, and additional steps will follow based on the specific model we build.
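
To make the effect concrete, here is a small illustrative call to the preprocess_text function defined above. The exact output depends on the spaCy model version, so treat the result shown in the comment as approximate:

# Illustrative example; output may vary slightly with the spaCy model version
print(preprocess_text("I am feeling completely overwhelmed today"))
# Something along the lines of: "feel completely overwhelm today"
# ("I" and "am" are stopwords and are dropped; "feeling" is reduced to its lemma)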

6. Model (Version 1) — Vectorization with TF-IDF

Let’s build the first version of our model. The first step is vectorization.

Since the data is in text format, and we cannot perform mathematical operations on text, we need to convert it into a numerical representation.

There are numerous strategies for this, and one of them is TF-IDF (Term Frequency-Inverse Document Frequency):

14. Create the Vectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

TF-IDF (Term Frequency-Inverse Document Frequency) is a technique that has been used for a long time in natural language processing.

Nowadays, we have a more advanced strategy: word embeddings, which are used in Transformers. When we get to the Transformer model, this foundational work will make it much easier to see how the latest advancements build on these earlier techniques.

Can you use TF-IDF? Yes, absolutely, as I am using it here and expect to achieve good performance. For simpler text problems, this might even be the ideal solution.

Currently, everyone is focused on LLMs (Large Language Models) and Transformers because they represent cutting-edge, advanced technology.

However, remember that sometimes a Transformer can be like using a cannon to kill a mosquito.

Can you solve this sentiment classification problem using a fully connected neural network? Yes, perfectly.

Advantages of Simpler Models:

  • Simpler Interpretation: A model that is easier to understand.
  • Faster Training: This version 1 model can be trained in just a few minutes. In contrast, our Transformer model (version 3) will require at least 45 minutes of training.

Performance Comparison:

Interestingly, the performance of the Transformer will not be significantly better than that of version 1. My goal is not to endlessly create complex architectures but to solve business problems effectively.

Thus, I will apply the vectorization strategy to convert the input data (from the processed_text column) into a numerical representation.

While TF-IDF is excellent and works well, it has one limitation: it cannot capture context. Context is important, but consider the simplicity of the sentences in our dataset:

These are very simple sentences. For this case, context may not be relevant. So why use Word Embedding?

Can you use Word Embedding? Yes, absolutely, and I will use it in version 3. But is it truly necessary? Will there be a significant performance gain?

As a professional, you must always analyze this and avoid being swayed by the latest or most advanced technique.

Focus on Business Objectives:

If we can solve the problem with TF-IDF, then great, we solve it.

Do you want to use more advanced technology? Keep in mind that it will require:

  • More hardware
  • Greater computational capacity
  • More time

However, the performance gain may not always justify these additional costs.

Ultimately, we need to analyze the situation in the context of the problem.

Regarding vectorization, I have provided the complete description of all parameters in the notebook for you:

Essentially, TF-IDF calculates the frequency of words. It checks which words occur more or less frequently and assigns them a score.

This score is what defines TF-IDF.

Using these scores, TF-IDF builds a vector that represents the text numerically. However, it does not capture the context of each sentence, as Word Embedding does.

Instead, it provides a numerical representation based on word frequency, which works well for simpler texts, such as the ones in our dataset.

In the notebook, I have included a detailed description of the parameters, which you can review later.
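
As a small standalone illustration of the idea (separate from the project’s vectorizer, using a made-up toy corpus and assuming a reasonably recent scikit-learn for get_feature_names_out), you can see how TF-IDF turns a few sentences into a matrix of scores:

# Toy example: three short "documents" turned into TF-IDF scores
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["feel happy today", "feel sad today", "happy happy day"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

print(toy_vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(toy_matrix.toarray().round(2))            # one row of scores per sentence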

Next, I will apply the TF-IDF function:

15. Apply the Vectorizer

train_tfidf = tfidf_vectorizer.fit_transform(training_data['processed_text'])
test_tfidf = tfidf_vectorizer.transform(test_data['processed_text'])

Pay attention to this detail, okay? fit_transform is used on the training data. transform is used on the test data because fit_transform trains the object—in this case, the vectorizer.

When you hear about training, it is always exclusively with the training data. It must only be performed on this sample. After fitting, you then transform the training data itself.

For the test data, there is no training; only transform is applied.

And here’s the important part, okay? Any transformation applied to the training data must also be applied to the test data and to new data.

When we deploy the model, we will need to follow the same process, applying the vectorizer in exactly the same way.
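
One practical way to guarantee that, shown here as a sketch rather than part of the original notebook (it assumes the joblib package and an illustrative file name), is to persist the fitted vectorizer and reload it at deployment time:

import joblib

# Save the fitted vectorizer alongside the model artifacts (file name is illustrative)
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer_v1.joblib')

# At deployment time, reload it and call only transform() on the new text
loaded_vectorizer = joblib.load('tfidf_vectorizer_v1.joblib')
new_vector = loaded_vectorizer.transform(["i feel great about this project"])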

16. Check Shape of Training Data TF-IDF Matrix

train_tfidf.shape

-----> (16000, 5586)

Here we have the resulting shape and the number of columns. What does this mean? These columns represent the attributes, specifically the occurrence of words across all the text in the training dataset.

Taking all this into account, TF-IDF constructs a large dictionary that forms the vector — a numerical matrix.

When it comes to word embeddings, as used in Transformers, the process is similar. However, embeddings capture the context during calculations.

This is why word embeddings generally perform better when working with more complex text.

17. Check Type of Training Data TF-IDF Matrix

type(train_tfidf)

Notice that the result of applying the vectorizer is the creation of a sparse matrix in scipy.

Therefore, in this case, I will convert the matrix into an array object:

18. Convert Input Data (Text) to Array

X_train_array = train_tfidf.toarray()
X_test_array = test_tfidf.toarray()

And done! We now have the input data fully processed.

What is the input data? To make it clear, it’s the processed text.

What is the output data? It’s the sentiment.

The processed text has been transformed into a matrix using TF-IDF vectorization.

Next, I will apply an encoding strategy to numerically represent the sentiment, which is the output variable.

7. Data Preparation — Encoding the Target Variable

Was step 1 clear to you? What is the purpose of step 1? To convert the input data into a numerical representation. In this case, we created a matrix with TF-IDF scores.

To achieve this, we applied the concept of vectorization, setting up the necessary parameters to create the object.

We then applied fit and transform on the training data and transform on the test data. Now, the input data is fully vectorized and represented numerically.

With the issue of the processed_text column resolved, the next step is to address the sentiment column.

How many sentiments do we have? There are 6 sentiment classes, making this a multi-class classification problem.

We cannot leave the sentiment data in text format — it must be converted into a numerical representation.

And I already know your question: “Why not use vectorization here?”

The answer is simple: It doesn’t make sense in this case. Here, we don’t have a set of words for each row — there’s just one word per row.

Vectorization only makes sense when dealing with a large number of words, as it creates a large vector.

16. Check Shape of Training Data TF-IDF Matrix

train_tfidf.shape

-----> (16000, 5586)

Observe the dimension here: 5,586 columns, essentially representing scores indicating whether a particular word appears in each situation. Vectorization only makes sense when dealing with a large volume of data, which is true for the input data.

For the output data, however, I have only 6 words. That’s it. In this case, vectorization doesn’t make sense.

So, I will use a different strategy: encoding.

For this, I’ll use a label encoder. It will examine the sentiment column and identify how many unique words it contains. Six, right?

Then, it will assign a number to each of these words, from 0 to 5. So, instead of “love,” for instance, I will have a number, another word will get a different number, and so on.

In other words, it remains a categorical variable with six categories. The information isn’t changed — only the representation is. Instead of words, I’ll use numbers. And now it’s ready. Ready for what? Mathematics. That’s what we do in machine learning.

Now, let’s create the encoder:

19. Create the Label Encoder

label_encoder_v1 = LabelEncoder()

Perform the fit and transform:

20. Fit and Transform the Target Variable in Training Data

y_train_le = label_encoder_v1.fit_transform(training_data['sentiment'])

21. Transform the Target Variable in Test Data

y_test_le = label_encoder_v1.transform(test_data['sentiment'])

Pay attention here. fit_transform is used only on the training data. transform is applied only on the test data and on new data.

Always keep this in mind. This will have a very interesting detail when we get to the deployment stage, but I’ll explain that later.

For now, all the data is ready in the appropriate format, which is numerical.
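
If you are curious about which number was assigned to each sentiment, the fitted encoder exposes the mapping (LabelEncoder orders the classes alphabetically). A quick illustrative check:

# classes_ lists the labels in the order of their integer codes (0, 1, 2, ...)
print(label_encoder_v1.classes_)

# To see the pairs explicitly
for code, label in enumerate(label_encoder_v1.classes_):
    print(code, label)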

But we have a problem. Take a look at the training data — the classes.

Do all classes have the same number of records? Clearly not. For example, joy appears in 5,362 sentences, while sadness appears in 4,666, anger in 2,159, and so on.

Here, we have a problem: the classes are imbalanced.

What does this mean?

It means the model will end up learning more about the sentiment joy and less about, for instance, the sentiment surprise.

This is quite obvious. If you study one subject more than another, which one will you know better? The one you studied more, right? The same logic applies here.

The model will learn more about the relationships in sentences associated with the sentiment joy because there are many more examples. However, for sentiments like surprise, love, etc., there are fewer examples, so it will learn less.

This is a problem.

How do we solve this?

We have numerous alternatives…

8. Addressing Class Imbalance

Now, let’s look at how to handle class imbalance.

When working with classification tasks, this is an extremely common problem. Is it normal to have more sentences representing one sentiment than others? Absolutely.

The ideal scenario would be to collect the same number of sentences for each sentiment. But this ideal world is just a dream, a utopia, something far out of reach.

In the real world, you will encounter situations like this.

In other words, the data you receive will have more records in one class than in others.

How do we address this?

If we feed the data into the model as-is, without any kind of adjustment, what will happen?

The model will learn more about one class than the others, leading to an imbalanced model.

We don’t want that. We want a balanced model, right?

Possible Strategies:

Oversampling

  • This involves creating synthetic data using statistical strategies to increase the number of records for the minority classes.
  • By doing this, we generate synthetic records based on the real data.

Undersampling

  • This involves reducing the number of records in the majority classes (e.g., joy and sadness).
  • However, this comes at the cost of losing data.

Trade-offs:

  • In oversampling, generating synthetic data can make the model biased in some way.
  • In undersampling, reducing records might weaken the model’s ability to learn.

A Third Option:

Instead of modifying the data, I will tell the model to assign greater weight to the minority classes, giving them more attention during training.

To achieve this, I’ll use the compute_class_weight function.

22. Compute Class Weights

class_weights = compute_class_weight('balanced', classes=np.unique(y_train_le), y=y_train_le)

The full description of the parameters is available in the notebook — check it out on GitHub.

The compute_class_weight function from sklearn calculates the weights for the classes. These weights can be applied in classification models to give more importance to classes that are underrepresented in the dataset.

So, what I’m telling the model is: “Hey, please pay closer attention to the minority classes.”

If I don’t use this, the model will learn based solely on the data distribution. By applying class weights, the model will automatically balance its learning.

This is an extremely valuable strategy, especially for multi-class problems.

Since I have many classes here — six categories of sentiments — using oversampling or undersampling could introduce problems, as it would require balancing or unbalancing multiple classes.

With class weights, I don’t modify the data. The data remains intact; I simply instruct the model to pay extra attention to the minority classes.

We will use these class weights later when training the model. Got it?
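
For reference, compute_class_weight returns a plain array. If you preferred to pass the weights through the class_weight argument of Keras’s fit method, instead of the bias adjustment used later in this project, a minimal sketch would look like this:

# Map each encoded class index to its computed weight
class_weight_dict = dict(zip(np.unique(y_train_le), class_weights))
print(class_weight_dict)

# It could then be passed to training, e.g.:
# model_v1.fit(..., class_weight=class_weight_dict)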

24. Split Data into Training and Validation Sets

X_train, X_val, y_train, y_val = train_test_split(
X_train_array,
y_train_le,
test_size=0.2,
random_state=42,
stratify=y_train_le
)

Now, I will split the training data further, setting aside a portion specifically as validation data.

Here’s an important point — pay attention:

For vectorization, this step must be performed after splitting the data into training and test sets.

14. Create the Vectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

Hold on, stay with me and follow the reasoning as I explain.

Vectorization must be performed after you split the data into training and test sets.

Do I already have the split? Yes, I already had the split beforehand. I provided you with two files, right? One for training and one for testing.

So, it’s all set — just apply the steps.

The same goes for Label Encoding. You apply it to both training and test data, but you must first have the data divided into training and test sets. Since I already provided two files, the split is already done.

Now, what I’m doing is something different: I’m splitting the training set into two subsets. Take a look:

24. Split Data into Training and Validation Sets

X_train, X_val, y_train, y_val = train_test_split(
X_train_array,
y_train_le,
test_size=0.2,
random_state=42,
stratify=y_train_le
)

One subset, slightly smaller, will remain for training, and the other will serve as validation data, which I’ll use to evaluate the model’s performance during training.

Now, I can do this without any issues — this is the correct order to follow.

If I had only provided you with one dataset, the process would have been:

  1. Split into training and test sets.
  2. Apply vectorization and encoding.
  3. Split the training set further into training and validation sets.

But since I already provided two files (training and test), I’ve already applied vectorization and encoding to each of them.

Now, I’m splitting the training set into two subsets: one for training and one for validation.

Why did I call it training and test in the code? Because that’s the name of the function, train_test_split. It’s just a convention, isn’t it?

In practice, I’m splitting the training data into one subset solely for validation, which I’ll use to evaluate performance during training.

The test data itself will be used for evaluation after training is complete.

This approach is very common in Deep Learning. Since training a model can take a long time (some models in this first version will train quickly, but others could take much longer), it’s useful to evaluate performance during training. This is why we use callbacks to monitor and control the training process.

Remember:

Vectorization and encoding must be applied separately to training and test data.

Now that we’ve done this, I’m splitting the training data into two subsets to train and validate the model during training.

Lastly, I’ll adjust the output variable (y) to categorical format.

25. Convert Target Variable to Categorical Type

y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test_le)
y_val_encoded = to_categorical(y_val)

And that’s it!

Now, I have the three subsets ready: training, validation, and test.

26. Check Shape of Encoded Target Variables

y_train_encoded.shape, y_test_encoded.shape, y_val_encoded.shape

-----> ((12800, 6), (2000, 6), (3200, 6))
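
To make the categorical format concrete, here is a tiny standalone illustration of what to_categorical does with a few encoded labels:

from tensorflow.keras.utils import to_categorical

# Each integer label becomes a one-hot vector with 6 positions
print(to_categorical([0, 2, 5], num_classes=6))
# [[1. 0. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1.]]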

We can now proceed to build the model. Let’s wrap it up.

9. Model Construction

Check out the Notebook. We have already completed steps 1 and 2. In step 1, we prepared the input data, and in step 2, we prepared the output data.

Then, we divided the data into training and validation subsets, which we will use shortly to train and evaluate the model during the training process. The test data will be used to evaluate the model after training.

Now, let’s move to step 3, which is the model construction.

This is where you define your architecture. Who is responsible for building this architecture? Guess what? You are.

Yes, it’s entirely your job. In fact, all of this is your responsibility. Deciding on the architecture is the key task for the model developer.

Here, I am presenting an example of a fully connected architecture. That is, this model has no special modules.

What do I mean by “special modules”?

  • A convolutional module, which is commonly used in computer vision.
  • A bidirectional module, often utilized in natural language processing.
  • A transformer module, which can be applied to various types of tasks.

There are no such special modules here. This would be a standard neural network with multiple internal layers, which is precisely what defines the concept of Deep Learning.

This would be the simplest possible Deep Learning model that you can build. Note the emphasis on simple — put that in quotation marks, because it’s not necessarily simple.

If we compare it to more advanced architectures, however, this one is indeed the simplest and entirely fully connected. These fully connected architectures are often used as the final component of more complex architectures.

Nowadays, almost every complex architecture is composed of several interconnected parts. Among these parts, a fully connected network is usually included.

A good example is the transformer. One of its components is a fully connected network, as it simplifies the learning process and delivers the final classification.

However, a fully connected network on its own tends to lack robust learning capabilities for more complex data. For this reason, advanced architectures incorporate special modules, such as:

  • Convolutional modules
  • Bidirectional modules
  • The attention module inherent to transformers, and others.

27. Create the Model

27.a Initialize a sequential model. Sequential models are a linear stack of layers.

model_v1 = Sequential()

27.b Add the first dense (fully-connected) layer to the model

model_v1.add(Dense(
4096,
activation='selu', # Use the SELU (Scaled Exponential Linear Unit) activation function
kernel_initializer='lecun_normal', # Initialize weights with LeCun normal distribution
input_shape=(X_train.shape[1],), # Define input shape based on the number of features in X_train
kernel_regularizer=tf.keras.regularizers.l2(0.01) # Apply L2 regularization to reduce overfitting
))

27.c Add the second dense layer

model_v1.add(Dense(
2048,
activation='selu',
kernel_initializer='lecun_normal',
kernel_regularizer=tf.keras.regularizers.l2(0.01)
))

27.d Add the third dense layer

model_v1.add(Dense(
1024,
activation='selu',
kernel_initializer='lecun_normal',
kernel_regularizer=tf.keras.regularizers.l2(0.1)
))

27.e Add the fourth dense layer

Layer with 64 neurons and SELU activation

model_v1.add(Dense(64, activation='selu'))

27.f Add the output layer

Output layer with 6 neurons and softmax activation for multi-class classification

model_v1.add(Dense(6, activation='softmax'))

What do we have here?

First, I created a sequence of layers and defined the object model_v1, using Keras as the API for TensorFlow (our main framework).

Then, I added the following layers:

  1. First dense layer: A fully connected layer that initializes the model architecture.
  2. Additional dense layers: The second and third layers deepen the architecture.
  3. Final dense layer: Uses the Softmax activation function to generate class probabilities.

Key details of the dense layers:

  • Number of neurons: Defines how many units, and therefore how many learnable parameters, each layer contains.
  • Activation function: Determines the output behavior of the layer.
  • Weight initialization: At the start of training, weights are unknown and must be initialized. Here, we use kernel_initializer='lecun_normal'.
  • Input shape: The input format is based on the number of columns in the training data:
  • X_train.shape[1] refers to the number of columns in the vectorized dataset.

Thus, the first dense layer is configured with these fundamental components, establishing the foundation of the fully connected architecture.

27. Create the Model

27.a Initialize a sequential model. Sequential models are a linear stack of layers.

model_v1 = Sequential()

27.b Add the first dense (fully-connected) layer to the model

model_v1.add(Dense(
4096,
activation='selu', # Use the SELU (Scaled Exponential Linear Unit) activation function
kernel_initializer='lecun_normal', # Initialize weights with LeCun normal distribution
input_shape=(X_train.shape[1],), # Define input shape based on the number of features in X_train
kernel_regularizer=tf.keras.regularizers.l2(0.01) # Apply L2 regularization to reduce overfitting
))

Normally, we start with a higher number of neurons and then gradually decrease it.

In the second dense layer, notice that I reduce the number of neurons while keeping the other elements consistent.

This strategy is common in deep learning, as reducing neurons layer by layer helps the model focus on more abstract and higher-level features as it progresses through the network.

27.c Add the second dense layer

model_v1.add(Dense(
2048,
activation='selu',
kernel_initializer='lecun_normal',
kernel_regularizer=tf.keras.regularizers.l2(0.01)
))

One of these elements is kernel regularization, which is useful to prevent overfitting.

Why is that? When working with simpler data — like in our case — this type of architecture tends to overlearn. Overlearning can be problematic.

I know it sounds paradoxical, but that’s how it works. We don’t want the model to learn every detail of the data (overfitting); instead, we aim for the generalization of mathematical patterns.

To achieve this, I apply regularization to the network. There are various strategies for this, and one of the most common is L2 regularization.

After setting this up, I move on to the next layer.

27.d Add the third dense layer

model_v1.add(Dense(
1024,
activation='selu',
kernel_initializer='lecun_normal',
kernel_regularizer=tf.keras.regularizers.l2(0.1)
))

Dense layer: I reduce the number of neurons while keeping the other elements unchanged.

27.e Add the fourth dense layer

Layer with 64 neurons and SELU activation

model_v1.add(Dense(64, activation='selu'))

After that, I proceed to the fourth layer: I further reduce the number of neurons.

At this point, I no longer need weight initialization or regularization.

27.f Add the output layer

Output layer with 6 neurons and softmax activation for multi-class classification

model_v1.add(Dense(6, activation='softmax'))

Why does the final Softmax layer have only six neurons?

The answer lies in the number of classes in our problem. In this case, we have six classes representing six different sentiments.

The final layer must deliver the probabilities for each class. If there were only two classes, the last layer would have two neurons, and so on. This step isn’t experimental; it’s dictated by the problem itself.

Why 4,096 neurons in the first layer, 2,048 in the second, and so on?

These numbers are determined through experimentation. There’s no universal rule for the number of neurons to use.

Factors like the size of your dataset, the complexity of the problem, and the input dimensions influence these decisions. Here, I tested several configurations, and this particular setup yielded excellent results.

27.b Add the first dense (fully-connected) layer to the model

model_v1.add(Dense(
4096,
activation='selu', # Use the SELU (Scaled Exponential Linear Unit) activation function
kernel_initializer='lecun_normal', # Initialize weights with LeCun normal distribution
input_shape=(X_train.shape[1],), # Define input shape based on the number of features in X_train
kernel_regularizer=tf.keras.regularizers.l2(0.01) # Apply L2 regularization to reduce overfitting
))

Key Takeaways:

  1. Activation Function: The choice of activation function (selu in this case) is experimental. Explore different options like relu, tanh, etc., to determine the best fit for your data.
  2. Weight Initialization: Strategies like lecun_normal or he_normal help the model converge faster. Again, the choice depends on experimentation.
  3. Regularization: Here, we used L2 regularization to address overfitting in this relatively simple architecture. In more complex architectures, strategies like dropout might be better.

Why Experimentation is Key:

The effectiveness of activation functions, weight initialization, or regularization techniques varies based on the data and architecture.

For example:

  • L2 Regularization: Works well for simple architectures like this one.
  • Dropout: Preferred for more advanced architectures (e.g., LSTMs or transformers); a short sketch follows below.

With a different dataset or problem, the strategies you select may yield different results. The essence of data science lies in experimentation — trying different combinations and observing their impact on model performance.
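
For reference only, here is a hedged sketch of what a dropout-based variant of the first layers could look like. It is not the architecture used in this project, just an illustration of the alternative technique:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Illustrative variant: Dropout instead of L2 regularization
model_alt = Sequential()
model_alt.add(Dense(4096, activation='selu', kernel_initializer='lecun_normal',
                    input_shape=(X_train.shape[1],)))
model_alt.add(Dropout(0.3))   # randomly zeroes 30% of activations during training
model_alt.add(Dense(2048, activation='selu', kernel_initializer='lecun_normal'))
model_alt.add(Dropout(0.3))
model_alt.add(Dense(6, activation='softmax'))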

Suggestions for Further Exploration:

  • Visit the Keras documentation to explore other activation functions, regularization methods, and weight initialization strategies.
  • Test alternative configurations. If you achieve better performance with a different setup, that’s excellent — it’s part of the iterative nature of machine learning development.

In this case, the architecture I shared provided excellent results and serves as a robust foundation for further experimentation.

Stay tuned for Version 2, where we’ll incorporate more advanced strategies and compare outcomes.

10. Model Compilation and Summary

The next step is the model compilation, and then we’ll print the summary.

What is compilation, exactly? It’s the final organization of the model.

Up to this point, we’ve defined the general architecture, the main structure of the model — in this case, a fully connected architecture.

Now, I’ll adjust the bias of the last layer to incorporate the class_weights:

28. Assign Specific Weights to the Bias Vector of the Model’s Last Layer

model_v1.layers[-1].bias.assign(class_weights)

In other words, during training, our model will assign slightly more weight to the minority classes, compensating for class imbalance without modifying the data itself.

Could we modify the data?

Yes, we could. Strategies like oversampling and undersampling are available. However, another approach is working with class_weights, which is becoming increasingly popular to avoid introducing bias or tendencies into the data.

When you use oversampling or undersampling, you either generate synthetic data or remove existing records. Both approaches can lead to a slightly biased model.

The idea here is simple:

For minority classes, the model assigns a slightly higher weight to compensate for the majority classes, which naturally have more records. We incorporate this adjustment directly into our model as part of its design.

29. Compile the Model

29.a Define the optimizer as ‘Adam’.

Adam is an optimization algorithm used as an alternative to the classic stochastic gradient descent procedure, updating network weights iteratively based on training data.

29.b Set the loss function to ‘categorical_crossentropy’.

This is suitable for multi-class classification problems where labels are provided in one-hot encoded format.

29.c Specify the evaluation metrics for the model as ‘accuracy’, along with Precision, Recall, and AUC.

Accuracy is a common metric to evaluate classification model performance.

model_v1.compile(
optimizer='Adam',
loss=tf.losses.categorical_crossentropy,
metrics=['accuracy', Precision(), Recall(), AUC()]
)

Finally, we compile the model using the Adam optimizer, which is the most commonly used optimizer in Deep Learning.

For the loss function, we specify categorical_crossentropy, which is suitable for multi-class problems like the one we’re working on.

The evaluation metrics chosen are accuracy, precision, recall, and AUC, which will help us track the model’s performance throughout training.

This setup organizes the model to perform the forward pass, allowing the input data to flow through the layers while the loss function computes the error, guiding the adjustments during training.

27. Create the Model

27.a Initialize a sequential model. Sequential models are a linear stack of layers.

model_v1 = Sequential()

27.b Add the first dense (fully-connected) layer to the model

model_v1.add(Dense(
4096,
activation='selu', # Use the SELU (Scaled Exponential Linear Unit) activation function
kernel_initializer='lecun_normal', # Initialize weights with LeCun normal distribution
input_shape=(X_train.shape[1],), # Define input shape based on the number of features in X_train
kernel_regularizer=tf.keras.regularizers.l2(0.01) # Apply L2 regularization to reduce overfitting
))

27.c Add the second dense layer

model_v1.add(Dense(
2048,
activation='selu',
kernel_initializer='lecun_normal',
kernel_regularizer=tf.keras.regularizers.l2(0.01)
))

27.d Add the third dense layer

model_v1.add(Dense(
1024,
activation='selu',
kernel_initializer='lecun_normal',
kernel_regularizer=tf.keras.regularizers.l2(0.1)
))

27.e Add the fourth dense layer

Layer with 64 neurons and SELU activation

model_v1.add(Dense(64, activation='selu'))

27.f Add the output layer

Output layer with 6 neurons and softmax activation for multi-class classification

model_v1.add(Dense(6, activation='softmax'))

So, the data flows through each layer, performing mathematical operations and generating predictions.

These predictions provide a probability for each class. This process is known as the forward pass, and it does not involve learning — it’s simply the model making predictions.

For the model to learn, what happens?

We compare the model’s prediction with the actual data.

Do we have the true values?

Yes, they are part of the training data (y_train).

The model predicts a class, and we calculate the error using the loss function loss=tf.losses.categorical_crossentropy.

This function calculates the error for each class. This error guides the weight adjustments during the next pass.

How does this happen?

Through the Adam optimizer, which is an alternative to gradient descent. Adam operates similarly but adds improvements based on derivatives.

Then, the process repeats as described in command #27.

The model makes another prediction, the error is calculated again using the loss function, and the weights are adjusted.

This iterative process continues until the model begins to effectively learn from the data.

30. Display Model Summary

model_v1.summary()

And so, we have the final architecture of the model, encompassing all the dense layers and the total number of parameters that this model will learn.

It’s a relatively small model, especially when compared to transformers, which are significantly larger. However, it’s a model that will likely achieve good predictive performance because the data is not particularly complex.

If the data were much more complex, it would be unlikely to achieve strong performance with this architecture.

11. Callbacks and Early Stopping

The next step is to set up the callbacks, which include the Learning Rate Scheduler and Early Stopping.

What is the purpose of these?

To control the training process. For instance, do you know the ideal values for the learning rate? No, you don’t — and there’s no way to know beforehand.

This is something that always requires experimentation. However, experimenting manually can take a significant amount of time.

To address this, we create a Learning Rate Scheduler, allowing TensorFlow to manage this process for us.

Over time, TensorFlow will adjust the learning rate while continuously monitoring the model’s performance — this is where the validation data plays a key role.

Based on the performance, TensorFlow will dynamically update the learning rate.

31. Learning Rate Scheduler Function

def step_decay(epoch):

    # 31.a Initial learning rate
    initial_lrate = 0.001

    # 31.b Drop factor for learning rate decay
    drop = 0.5

    # 31.c Number of epochs after which learning rate is reduced
    epochs_drop = 10.0

    # 31.d Calculate the updated learning rate
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

    return lrate

Initially, the value will be set to 0.001. Is this the best value? I have no idea — there’s no way to know upfront. It requires experimentation.

However, I can start with an initial value. Generally, values close to 0.001 work well, give or take. These values are commonly effective, based on extensive practice and experimentation.

I’ll begin with the structure outlined in block #31, instructing the model to reduce the learning rate as epochs progress.

This approach allows for a much more balanced training process. If I were to use a single learning rate throughout, it might not yield optimal results, potentially requiring me to build another model version with a different learning rate later.

So why not make these adjustments during training itself? That’s the core idea behind a callback, particularly with the learning rate scheduler:

32. Learning Rate Scheduler

lr_scheduler = LearningRateScheduler(step_decay)
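
To see the schedule these defaults produce, you can simply print the rate per epoch; the values follow directly from the formula in block #31:

# Inspect the learning rate generated for each epoch
for epoch in range(20):
    print(epoch, step_decay(epoch))
# Roughly: 0.001 for epochs 0-8, 0.0005 for epochs 9-18, 0.00025 from epoch 19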

The idea behind early stopping is as follows:

33. Early Stopping

early_stopping = EarlyStopping(
monitor='val_loss', # Monitor the validation loss
restore_best_weights=True, # Restore the model weights from the epoch with the best validation loss
patience=3 # Stop training after 3 epochs with no improvement
)

When the model stops learning, the training continues, right? If the model reaches a plateau, meaning it’s no longer learning anything, there’s a risk involved.

That risk is the model over-learning, leading to overfitting. So, here’s what I’ll do: I’ll monitor with a patience parameter of three epochs — hence the name of the parameter, patience. We’ll wait for three epochs.

This is the patience threshold. If the model doesn’t improve its performance during these three epochs, I’ll stop the training.

Because continuing the training doesn’t make sense. Beyond the risk of overfitting, there’s also the issue of computational resources.

12. Model Training

Training the model is the easiest part of the process. You execute a single line of code and wait for the training to complete.

Of course, getting to this point requires a lot of work. Once the training is done, the process continues with model evaluation, possible optimization, and then the deployment procedure.

First, I defined two hyperparameters:

34. Hyperparameters

num_epochs = 20 # Number of epochs
batch_size = 256 # Batch size

This also helps us control the training process. I will train the model for 20 epochs (iterations over the entire dataset).

You can slightly increase or decrease this value. I experimented with various epoch numbers, and 20 provided good performance.

The batch size is 256. What does this mean?

You cannot load the entire dataset into memory at once, as it would exceed your computer’s memory capacity — a physical limitation.

By “exceed,” I mean surpassing the limit; don’t worry, it won’t cause physical damage to your machine. It simply won’t have enough memory to handle the data.

So, what do we do?

We train in batches, or mini-batches. During each epoch, multiple batches of data are processed by the algorithm.

Each batch is loaded into memory, processed for learning, and then replaced by the next batch within the same epoch.

An epoch is a full pass through the dataset, but in smaller pieces — batches.

A batch size of 256 also yielded good results. Finding the balance between training time, performance, and hardware constraints is crucial.
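
As a quick worked example of how these two values interact: with 12,800 training rows and a batch size of 256, each epoch performs 50 weight-update steps.

import math

# Number of batches (weight updates) per epoch
steps_per_epoch = math.ceil(12800 / 256)
print(steps_per_epoch)   # 50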

35. Model Training

%%time
history = model_v1.fit(
X_train, # Training data
y_train_encoded, # Encoded training labels
validation_data=(X_val, y_val_encoded), # Validation data and labels
epochs=num_epochs, # Number of epochs
batch_size=batch_size, # Batch size
callbacks=[early_stopping, lr_scheduler] # Callback functions: early stopping and learning rate scheduler
)

After that, I use model_v1 and call the fit method. I pass the training data in X_train and Y_train.

However, for the target variable, I must use y_train_encoded, meaning the properly encoded target variable in categorical format.

I’ve already prepared all of this during the data preparation phase.

I also specify the validation data by passing X_val and y_val_encoded. I include the number of epochs, batch size, and callbacks.

I used %%time to measure the execution time of this cell. The training results are stored in the history variable.

This is useful because I can later access this variable to extract information, such as the model’s error metrics.

You’ll notice that the accuracy starts at around 0.50. Each time you run the model, the results may differ slightly — this is completely normal.

There’s an inherent randomness in parts of this process. However, the trend will generally follow this pattern: accuracy starts relatively low and gradually increases, while the error decreases.

What does this indicate?

It means the model is learning — excellent! Initially, the accuracy is around 50%, but after about 10 epochs, it reaches approximately 94%.

You’ll also observe that the training error starts high and decreases over time. Similarly, the validation error, which is initially high, also decreases, though at a slower pace.

The validation loss, shown toward the end of the training, is a bit less stable and doesn’t decrease as significantly.

This behavior suggests that the model is maintaining some level of stability during validation. To further refine the results, you could return to fine-tune the hyperparameters or optimize the training process.

However, for this example, the performance is sufficient. If desired, further adjustments could be made to the training settings to achieve even better results.

13. Model Evaluation

Let’s move to the penultimate step, which is the evaluation of the model after training. The evaluation shown previously is during training, correct?

Because we used validation data. Now, the model has been trained and is ready, so I will evaluate it using the test data.

This step helps us get a good idea of whether the model can actually be used or not.

36. Extract Training and Validation Loss

loss, val_loss = history.history['loss'], history.history['val_loss']

I will access the History object and from it retrieve the history attribute.

I know it might seem a bit odd, since we named the variable history, and this object also has an attribute called history.

So, what will I do with this?

I’ll use it to return the error values. The term loss refers to the training error, while val_loss represents the validation error.

I will extract these, store them in two variables, and then create the plot:

37. Plot Training and Validation Loss

plt.plot(loss, label='Training Loss') # Plot training loss
plt.plot(val_loss, label='Validation Loss') # Plot validation loss
plt.legend() # Add a legend
plt.show() # Display the plot

What do you observe here?

The error starts off quite high at the beginning, both for training and validation.

This is normal since the model is still learning at the start of the training process, so the error begins at a high level.

Then, the error drops significantly during the first epochs and stabilizes with only minor reductions.

Why is this a good sign?

It shows that the model is learning in a relatively stable manner. What you don’t want to see in this graph is a set of spikes — for instance, the error suddenly increases, then decreases again, and keeps fluctuating.

That would indicate an imbalanced training process, likely caused by incorrectly tuned hyperparameters.

How do you address such issues?

You would revisit and adjust factors like the learning rate or batch size to smooth the training process. A perfectly flat line is not required (as seen here, which is excellent), but large discrepancies should be avoided.

Erratic patterns would indicate that the model can learn well in one epoch but struggles in the next, which is undesirable.

Ideal behavior:

The graph should show a descending trend. If that’s the case, as seen here, additional epochs might improve the model’s performance further.

However, this comes down to time and resource constraints. Training time is a precious resource and a restriction in any project.

You’ll rarely have unlimited time to develop a project, and you need to deliver results within deadlines.

Practical considerations:

If more time is available, you can extend training to refine the model further.

If not, define a performance threshold that meets the project requirements and stop training once it’s reached.

For this example, the training is sufficient. Now, let’s move on to making predictions using the test data:

38. Predictions on Test Data

predictions_v1 = model_v1.predict(X_test_array)

I will use the X_test_array we prepared in the earlier steps. When you make these predictions, you actually get one prediction for each class.

The model will always deliver that. In our case, it returns six values, one for each class.

Where do these come from? The softmax function in the output layer produces one probability per class, so we get six probabilities in total.

I will then take the highest probability and consider it as the class prediction. This is how almost any classification model works.

For that, I will use the ArgMax function:

39. Extract Predicted Labels

predictions_v1_labels = predictions_v1.argmax(axis=1)

Taking the index of the highest probability for each test record gives me the predicted class for that record.

After that, I will generate the reports:

40. Print Classification Report

print(classification_report(y_test_le, predictions_v1_labels))

Classification Report

41. Print Confusion Matrix

print(confusion_matrix(y_test_le, predictions_v1_labels))

Confusion Matrix

Finally, I will calculate the Accuracy Score:

42. Print Accuracy Score

print(accuracy_score(y_test_le, predictions_v1_labels))

----> 0.843

And then, I will save the model to disk. Let’s take a look here.

First, the Classification Report:

Classification Report

You can see that there are two groups of metrics. At the bottom, we have the global metrics, which represent the performance of the model as a whole. These include the overall accuracy and the macro and weighted averages of precision, recall, and F1-score.

Here, we’re mainly looking at an accuracy of 0.84. This value ranges from 0 to 1, where the higher, the better.

When multiplied by 100, it equals 84%, which is an excellent result—especially for the first version of the model. These are the global metrics.

In addition to that, we also have the local metrics, which focus on each individual class. In this case, we have six classes, numbered from 0 to 5.

For each class, we calculate the Precision, Recall, F1 Score, and Support, which indicate the model’s performance for each specific class.
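If you want to see exactly where those per-class numbers come from, they can be reproduced from the confusion matrix. Below is a minimal, purely illustrative sketch; it reuses the y_test_le and predictions_v1_labels variables from this notebook and the confusion_matrix function we already imported.

cm = confusion_matrix(y_test_le, predictions_v1_labels)

for c in range(cm.shape[0]):
    tp = cm[c, c]                 # true positives: class c predicted as c
    fp = cm[:, c].sum() - tp      # false positives: other classes predicted as c
    fn = cm[c, :].sum() - tp      # false negatives: class c predicted as something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f'class {c}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')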

When you look at the global accuracy of 0.84, that’s the overall metric. However, the specific metrics tell a deeper story.

For example, in the case of class 0, the model has a precision of 0.85, whereas for class 5, the precision drops to 0.67. This means that, while the global accuracy is 84%, the model performs better for some classes than others.

This discrepancy is due to class imbalance. We attempted to address this by applying weights to each class. However, we might improve it further by trying oversampling, undersampling, or adjusting the weighting strategy.

This is just the first version of the model, so don't expect perfection right away. In fact, you won't achieve a perfect model even in the final version, let alone the first.

So, you need to carefully analyze the numbers. Does the current performance meet your needs? If not, go back and adjust. Modify the class weighting strategy we applied earlier, or consider using oversampling, undersampling, or even adjusting the hyperparameters.
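As a concrete example of that kind of iteration, here is a minimal random-oversampling sketch. It assumes the training_data DataFrame with its 'sentiment' column, as used earlier in the notebook; the balanced DataFrame would then go through the same encoding and vectorization steps before retraining.

from sklearn.utils import resample

# Upsample every class to the size of the largest class (simple random oversampling)
max_size = training_data['sentiment'].value_counts().max()

balanced_parts = [
    resample(group, replace=True, n_samples=max_size, random_state=42)
    for _, group in training_data.groupby('sentiment')
]

training_data_balanced = pd.concat(balanced_parts).sample(frac=1, random_state=42)
print(training_data_balanced['sentiment'].value_counts())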

Keep iterating until you approach the ideal model — that’s our job.

For this first version, let’s save it.

43. Save the Model

model_v1.save('model_v1.keras')

14. Deployment of Version 1 of the Model

Now, let’s move on to the final step for Version 1 of the model.

Keep in mind that we still have two more versions to develop in this project. I’ll bring those to you in the next tutorial.

For now, what will I do? I will load the model from disk.

44. Load the Model

loaded_model = load_model('model_v1.keras')

Deployment is a separate activity, isn’t it? You don’t necessarily deploy the model immediately after training it.

For instance, you could train the model one day and deploy it the next — they are distinct activities.

I could save the model, close the notebook, and then reload it later to continue working. However, I’ve included everything in a single Jupyter Notebook because, didactically, it’s easier for you to study.

So, I will now load the model from disk. At this point, I want to perform sentiment classification for this phrase:

45. New Sentence (sentiment = fear)

sentence = "i even feel a little shaky"

I already know the sentiment of the phrase. I’m including it here in parentheses so we can check if the model gets it right.

Of course, in real-world usage, when you’re working with the model, you won’t know the sentiment in advance — that’s what the model is supposed to predict.

However, before deploying the model to production, it’s always a good idea to test it. This ensures the model is delivering the results you expect.

Here’s the phrase I want the model to classify: “i even feel a little shaky”

What sentiment does this phrase represent? What does it mean? In this case, the sentiment is fear. Let’s see what our model predicts.

What do I need to do now to use the machine learning model we just created? First, I will take this phrase and convert it into a Pandas DataFrame:

46. Create a DataFrame with the Sentence

df_new = pd.DataFrame({'Sentence': [sentence]})

So, here it is for you in cell 46. Next, pay attention to what I'm going to do: I will call the preprocess_text function.

47. Apply the Preprocessing Function

df_new['Processed_Sentence'] = df_new['Sentence'].apply(preprocess_text)

Where does this come from? From the beginning of our work: text preprocessing using spaCy.

10. Definition of the 'preprocess_text' Function, Which Takes a Text as a Parameter

def preprocess_text(text):

    # 10.a Process the text using the spaCy model
    doc = nlp(text)

    # 10.b Create a list of lemmatized tokens, converted to lowercase, stripped of whitespace, excluding stopwords
    tokens = [token.lemma_.lower().strip() for token in doc if not token.is_stop]

    # 10.c Return the processed tokens as a single string, joined with spaces
    return ' '.join(tokens)

What does this function do?

It essentially cleans the data, removing stopwords and replacing each word with its lemma (base form).

Didn’t we apply this preprocessing step to prepare the data and train the model? So, I must apply it again. Any transformation applied to the training data must also be applied to the test data and any new data.

You can only provide the model with the same format it was trained on.

Is having a trained sentiment classifier enough to solve the problem? Not quite: I trained it on data in a specific format, so now that the model is ready, I must follow exactly the same preparation steps for any new data.

Therefore, I will proceed with the preprocessing:

47. Apply the Preprocessing Function

df_new['Processed_Sentence'] = df_new['Sentence'].apply(preprocess_text)

And observe what happens:

48. Display the DataFrame

df_new

Sentence is the original phrase, and Processed_Sentence is the processed version: simpler, with the unnecessary words removed.

What’s the next step?

Apply vectorization. Can a Machine Learning model directly process text data? No, because it only works with numerical representations.

So, I need to convert the text data into a numerical format by applying the transform method:

49. Apply Vectorization

df_new_tfidf = tfidf_vectorizer.transform(df_new['Processed_Sentence'])

Be careful! The fit_transform method is applied only to the training data. For test or new data, we use the transform method instead.
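A quick sanity check makes the reason clear (just an illustrative snippet using the objects already defined above): the vocabulary is fixed when the vectorizer is fitted on the training data, and transform simply projects new text onto that same feature space, so the new sample has exactly the width the model expects.

print(len(tfidf_vectorizer.vocabulary_))   # number of features learned at fit time
print(df_new_tfidf.shape)                  # (1, same number of features)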

Now, I’m applying it to new data. After that, I will convert the result into an array format:

50. Convert to Array

df_new_array = df_new_tfidf.toarray()

That’s because this is how I fed the data to the model during training. However, if I were to deploy this model in a web application, the end user wouldn’t know any of this, right? Exactly — they don’t need to know.

The end user will simply type in a phrase, and your application will need to handle everything internally: apply preprocessing, vectorization, convert to an array, and then extract the prediction from the model. That’s how it works — there’s no magic involved.

Now that I have the array, I can use the model I just loaded. I’ll call the predict method and pass the array as an argument:

51. Predictions

predictions = loaded_model.predict(df_new_array)

I extract the predictions:

52. Display Predictions

predictions

What does this predictions object contain?

It holds the probability for each class. This is what the model provides. You can see that there are six values, each representing the probability of one class.

But what now? I don’t want all the classes — I want just one.

So, I’ll use the argmax function:

53. Select the Class with the Highest Probability

highest_prob_class = np.argmax(predictions, axis=1)

54. Display the Class with the Highest Probability

highest_prob_class

-----> array([1])

What does argmax do? It takes the highest probability and returns the corresponding class index.

Now, I already know your question: What does this number mean? Here’s the thing — Machine Learning models work with numbers.

So, it gives you the number 1, but you don’t know what class 1 represents.

Can we retrieve that information? Absolutely!

15. Conclusion and Lessons Learned

Let’s now see how to convert this number 1 into text—the class name.

As humans, we work better with text, but machines work better with numbers. So, we need to translate this numerical output back into text.

Each class was assigned a number when we performed the encoding. Now, we’ll perform the reverse process: decoding.

55. Get the Class Name

class_name = label_encoder_v1.inverse_transform(highest_prob_class)

I will call the label_encoder_v1. Where does this come from? It originates from step 2, specifically in command #19 of our notebook.

19. Create the Label Encoder

label_encoder_v1 = LabelEncoder()

At that point, I applied the Label Encoder. I trained it with the training data and then applied it to the test data as well.

20. Fit and Transform the Target Variable in Training Data

y_train_le = label_encoder_v1.fit_transform(training_data['sentiment'])

21. Transform the Target Variable in Test Data

y_test_le = label_encoder_v1.transform(test_data['sentiment'])

And this step performed the encoding, converting the text data into numbers, right? This is what I fed into the model.

So, when the model returns a result, it also returns a number.

Now, I need to decode it. To do that, I'll call my label_encoder_v1 object again and use the inverse_transform method. This will effectively reverse the transformation.

I’ll take the class with the highest probability, which we called highest_prob_class, and use it to retrieve its meaning—the name of the class. I execute it, and there it is for you:

55. Get the Class Name

class_name = label_encoder_v1.inverse_transform(highest_prob_class)

56. Predicted Class

class_name

-----> array(['fear'], dtype=object)

It’s the sentiment Fear.

And if you go back to command #45, I had already noted what this sentiment represents to demonstrate that the model is working correctly.

45. New Sentence (sentiment = fear)

sentence = "i even feel a little shaky"
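Before we wrap up, here is how the whole deployment flow we just walked through (cells 46 to 56) could be consolidated into a single helper, the kind of wrapper a web application would call internally. This is only a sketch: it reuses the preprocess_text, tfidf_vectorizer, loaded_model, and label_encoder_v1 objects from this notebook, and the function name itself is just illustrative.

def predict_sentiment(text):
    # 1. Clean the text exactly as we did for the training data
    processed = preprocess_text(text)

    # 2. Vectorize with the TF-IDF vectorizer fitted on the training data (transform only)
    features = tfidf_vectorizer.transform([processed]).toarray()

    # 3. Get the class probabilities and keep the index of the highest one
    probabilities = loaded_model.predict(features)
    class_index = np.argmax(probabilities, axis=1)

    # 4. Decode the numeric class back into its text label
    return label_encoder_v1.inverse_transform(class_index)[0]

print(predict_sentiment("i even feel a little shaky"))   # expected: fear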

And with that, we conclude the deployment of Version 1 of our model. We’ll continue in the next tutorial. Phew! What a journey, right?

I’d like to draw your attention to something important — though I believe you’ve already noticed. Look at everything we accomplished in Version 1 of the model. We navigated through the entire Machine Learning process from start to finish, didn’t we?

Here’s what we covered:

  1. We defined the problem.
  2. We prepared both the input and output data.
  3. We divided the data into samples.
  4. We built the model, compiled it, and created a summary.
  5. We added callbacks to control the training process.
  6. We trained the model and performed an evaluation.
  7. Finally, we completed the deployment.

This was a comprehensive walkthrough of the Machine Learning process, with specific stops to explain key components like the architecture of Deep Learning, the interpretation of metrics, and the deployment process itself — topics that often raise many questions.

Now, you have the foundation of a complete project, even though it’s just the first version of our model.
