AI Hallucinations : Fear Not — It’s A Solved Problem — Here’s How (With Examples!)


There has been a lot of press lately about hallucinations and ChatGPT “making things up”. Most of the focus has been on how to modify ChatGPT or Gemini to somehow make them more accurate.

In my opinion, general-purpose LLMs and systems like ChatGPT will NEVER be able to fully control hallucinations. That is NOT their job. Just like Google Search, their job is to absorb the vast knowledge on the Internet and answer questions based on it.

And expecting ChatGPT to not hallucinate is the same as expecting Google to ONLY show truthful articles (that’s not going to happen!).

However, there is a better way: Using Retrieval Augmented Generation (RAG) along with “ground truth” knowledge to control the responses generated by ChatGPT.


In this post, I will cover some anti-hallucination lessons learned in the field. (PS: I’ve already covered the Basics of Anti-Hallucination in a previous article, so if this is new to you, please read that first)

But first…

How were these lessons learned?

Over the last nine months, we have battle-tested these solutions for controlling hallucinations in generative AI across thousands of customers, and made 100+ system upgrades to our RAG SaaS platform to get hallucinations under control.

I’ve written a previous article about how to stop hallucinations in ChatGPT. If you want the non-technical details and basics, that is a good place to start.

But in this blog post, I want to go through some of the advanced technical details, along with the rules we have followed based on those experiences.

Ninja tip: Almost all these lessons were learnt while handling hallucinations with real paying business customers. Think of it as hand-to-hand combat dealing with customers who know their data best.

And just to confirm, some of these experiences have been with large companies and extremely stringent use cases, like health therapy or banking. So the system has been battle-tested against a very low tolerance threshold for hallucination.

Because you know: when a customer sees a hallucination, they will tell you about it!

Without further ado, here are some lessons:

Lesson 1: 97% Effective Is As Good As 100% Useless

So hallucinations are like “security” or “uptime”. If you are doing some basic anti-hallucination and get to 97%, unfortunately that is not good enough.

That would be like a DevOps person saying “my machine is up 97% of the time” or a SecOps person saying, “Oh, we are 99 percent secure”. It is just not how business works.

Lesson 2: Every Part Of The RAG Pipeline Needs Anti-Hallucination

So one of the things that makes me cringe is when people just put one line of anti-hallucination into the prompt and expect it to work. It’s usually something like this:

prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"
Context:
<CONTEXT FROM KNOWLEDGEBASE>
Q: <USER QUESTION>
A:"""

Unfortunately, a line like that, where you ask the model to “please stay within the context”, is just not good enough. The root cause of a hallucination can lie in any of five other components of your RAG pipeline.

That is the sort of thing that gets you to 95%.

Why?

Because in a RAG pipeline, there are multiple components, and each of them can introduce hallucinations.

It starts with data chunking and query intent, moves through the chunks retrieved from the vector database and the prompt engineering, then through the type of LLM used and how that LLM is called, and ends with checking the AI response to see if it is a hallucinated answer.

C-Suite execs from top companies call this “Fully Integrated Horizontal Pipeline” (sweet!)


As you can see, there are multiple components that matter in a RAG pipeline. And any one of them could be the root cause of your hallucination.
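
To make this concrete, here is a toy Python sketch of where those components sit and where anti-hallucination work has to happen. Every function below is an illustrative stand-in, not a real API and not our actual implementation:

# A toy sketch of the stages where hallucinations can creep in.
# Every function below is an illustrative stand-in, not a real API.

def resolve_intent(query: str, history: list[str]) -> str:
    # Stage 1: query pre-processing -- expand terse follow-ups using the conversation.
    return query if len(query.split()) > 2 else f"{history[-1]} ... {query}"

def retrieve(query: str) -> list[str]:
    # Stage 2: chunking + vector search + re-ranking (stubbed with a static chunk).
    return ["<chunk retrieved from the knowledge base>"]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Stage 3: prompt engineering -- fence the model inside the retrieved context.
    context = "\n".join(chunks)
    return (
        "Answer only from the context below. If the answer is not in the "
        'context, say "I don\'t know".\n\n'
        f"Context:\n{context}\n\nQ: {query}\nA:"
    )

def is_grounded(answer: str, chunks: list[str]) -> bool:
    # Stage 5: response checking -- here, a crude word-overlap test.
    context_words = set(" ".join(chunks).lower().split())
    return bool(context_words & set(answer.lower().split()))

history = ["How much does the Pro plan cost?"]
prompt = build_prompt(resolve_intent("okay, and the limits?", history),
                      retrieve("pro plan limits"))
# Stage 4 (the actual LLM call) is omitted; `prompt` is what would be sent to the
# model, and is_grounded() would then gate the answer before it reaches the user.
print(prompt)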

On a side note: When I was talking to an old-timer from the manufacturing world, he said:

Yeah — this is basically like a manufacturing pipeline and what you call “hallucination”, we call “defect”.

Lesson 3: Build It Versus Buy It

So far, we have spent literally thousands of hours dealing with hallucination cases, across hundreds of cases, millions of queries and thousands of customers.

When any one of our customers identifies a hallucination case, we identify the root cause and then upgrade the entire system.

So when we do this, all the thousands of customers get the benefit.

This is the value of economies of scale. When you try to control hallucination by wiring up LangChain (which, by the way, is a pain!) and implementing anti-hallucination with it yourself, it’s just you by yourself.

The testing that gets done and the anti-hallucination measures you put in place are limited to your one account and your own experience.

There is no network effect where thousands of customers help each other by surfacing root causes that a purpose-built SaaS system then solves for everyone.

Due to this, when you consider a “Buy It” strategy, you get the benefit of all of those thousands of hours of work at a very, very minimal cost.

It’s the same reason we all use the OpenAI API, right? Do we really want to sit around and reinvent the wheel and start building our own LLMs like Bloomberg?

Lesson 4: Query Intent Decides Anti-Hallucination

A big problem with basic RAG pipelines is that the user query keeps changing under various conditions. People have:

  • long conversations
  • short conversations
  • sudden turns in conversations
  • bad prompts

It turns out that normal users are really terrible at prompting. They talk to chatbots like they are talking to friends.

We see a lot of queries like: “okay”, “Yes”, “Haha. That’s right”, “No, that one”, “2”, etc.

Handling the true intent of such queries matters, and every component that touches the user query has to keep anti-hallucination in mind.

This is especially critical when retrieving the right context from your knowledge base (i.e., vector DB search and chunk re-ranking) and including it in the LLM call.

Ninja tip: Query intent (sometimes called query pre-processing) is so important that we built our own proprietary intent algorithm and chunk re-ranking algorithm for this.
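
To give a flavor of the idea (this is a generic sketch, not that proprietary algorithm), one common approach to query pre-processing is to have an LLM rewrite a terse follow-up into a standalone question before it hits the vector DB. The example below assumes the OpenAI Python client and an OPENAI_API_KEY in the environment:

# A generic sketch of query pre-processing (NOT the proprietary algorithm above):
# rewrite a terse follow-up into a standalone question before retrieval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(history: list[dict], latest_user_message: str) -> str:
    rewrite_instruction = (
        "Rewrite the user's latest message as a single, standalone question that "
        "captures their true intent, using the conversation for context. "
        "Return only the rewritten question."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "system", "content": rewrite_instruction}]
        + history
        + [{"role": "user", "content": latest_user_message}],
    )
    return response.choices[0].message.content.strip()

history = [
    {"role": "user", "content": "Do you offer monthly and annual billing?"},
    {"role": "assistant", "content": "Yes, we offer both monthly and annual plans."},
]
standalone = rewrite_query(history, "okay, the second one")
# `standalone` (something like "Tell me about the annual billing plan") is what
# goes to the vector DB search, instead of the raw "okay, the second one".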

Lesson 5: Test, Test, Test!

As you can imagine, you need to test the heck out of the RAG pipeline. And this is probably the most painful part.

How do you create a test suite of conversations and context to see if the response is being hallucinated?

You have to consider a large number of real-world scenarios. In particular:

  • Long conversations: These might be a thread with hundreds of previous messages and responses.
  • Short conversations: These might be quick conversations like those coming in over SMS to a car dealer.
  • Big fat prompts: Where people are typing thousand word prompts.
  • Big fat responses: Where the response itself is like 4,000 words and that leads to follow-on questions.
  • Sudden turns in query intent: This one is the trickiest, where the flow of the conversation takes a sudden turn.

These are the types of things that will need to be tested and confirmed because this is what happens in the real world.

To this day, I see devs trying to run hallucination evals with basic Q&A datasets. That works for a while, but when you take your RAG pipeline to production, things get real and the cases mentioned above come into play.
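
There is no single right way to build such a suite, but as a rough sketch, a scenario-based eval harness can be as simple as the following. Here run_rag_pipeline and is_hallucinated are placeholders for your own pipeline and your own grading logic:

# A rough sketch of a scenario-based eval harness. `run_rag_pipeline` and
# `is_hallucinated` are placeholders for your own pipeline and grading logic.

SCENARIOS = [
    {
        "name": "long_conversation",
        "history": [f"message {i}" for i in range(200)],   # hundreds of prior turns
        "query": "so what was the final price we agreed on?",
    },
    {
        "name": "short_sms_style",
        "history": [],
        "query": "u open sat?",
    },
    {
        "name": "big_fat_prompt",
        "history": [],
        "query": "Here is my situation... " * 250,          # roughly a 1,000-word prompt
    },
    {
        "name": "sudden_turn",
        "history": ["Tell me about your refund policy.", "Sure, refunds are..."],
        "query": "actually, forget that. Do you support SSO?",
    },
]

def run_eval(run_rag_pipeline, is_hallucinated) -> list[str]:
    failures = []
    for scenario in SCENARIOS:
        answer = run_rag_pipeline(scenario["history"], scenario["query"])
        if is_hallucinated(answer, scenario):
            failures.append(scenario["name"])
    return failures   # an empty list means every scenario passed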

Lesson 6: Quantitative Testing

There are some new ways in which your system can be tested quantitatively to see if hallucination is occurring and the degree of hallucination that is present in the AI responses.

You can use an LLM-agent to validate the context, prompt and AI response and calculate validation metrics from it.

Think of it almost like a quality score that is computed for each response. So by measuring and logging these metrics in real time (just like you would with say “response time” or “error rate”), you can optimize your RAG pipeline over time.
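
As a rough sketch of what that LLM-based scoring can look like: the judge below rates how well each answer is supported by the retrieved context. The 0-10 rubric and JSON format are my own illustration, not a standard metric, and it assumes the OpenAI Python client:

# A sketch of LLM-as-judge scoring. The 0-10 rubric and JSON format are
# illustrative; it assumes the OpenAI Python client and OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator. Given a CONTEXT, a QUESTION and an
ANSWER, rate from 0 to 10 how well the ANSWER is supported by the CONTEXT
(10 = fully supported, 0 = entirely unsupported). Respond with JSON only:
{"groundedness": <0-10>, "reason": "<one sentence>"}"""

def groundedness_score(context: str, question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"},
        ],
    )
    # Assumes the judge returns bare JSON; production code should handle malformed output.
    return json.loads(response.choices[0].message.content)

# Log the score alongside "response time" and "error rate" for every response:
# score = groundedness_score(retrieved_context, user_question, ai_response)
# logger.info("groundedness=%s reason=%s", score["groundedness"], score["reason"])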

Update: In recent evaluations, our platform came in #1 for anti-hallucination benchmarks.

Lesson 7: Keeping RAG in Sync With Content

Possibly the biggest source of hallucinations that nobody talks about: the RAG falling out of sync with your ground-truth content.

Just think about it : You build a nice RAG pipeline and show it off to your boss. A few days later, the Pricing page on your website changes and you have new pricing. Has your RAG chatbot kept up-to-date?

Or is your RAG answering based on old outdated content?

What is needed is for the RAG to stay in sync with content changes. Whether it is a website or files ingested from your Google Drive, the RAG needs an “auto sync” feature that makes sure its information stays consistent with the “ground truth”.

If your RAG doesn’t have auto sync, then it will hallucinate like crazy in production as content in the vector database gets stale.
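
As an illustration, a minimal “auto sync” loop can be as simple as fingerprinting each source and re-ingesting only what changed. This is a hedged sketch (using the requests library, with reingest as a placeholder for your own chunk-embed-upsert step), not the actual implementation:

# A hedged sketch of an "auto sync" loop: fingerprint each source and re-ingest
# only what changed. `reingest` is a placeholder for your chunk/embed/upsert step.
import hashlib
import requests

seen_fingerprints: dict[str, str] = {}   # source URL -> hash of last ingested content

def sync_source(url: str, reingest) -> bool:
    """Re-ingest `url` into the vector DB if its content changed. Returns True if re-ingested."""
    content = requests.get(url, timeout=30).text
    fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if seen_fingerprints.get(url) == fingerprint:
        return False                      # unchanged: the vector DB is still in sync
    reingest(url, content)                # re-chunk, re-embed and upsert the new content
    seen_fingerprints[url] = fingerprint
    return True

# Run this on a schedule (cron, Celery beat, etc.) over every source, e.g. the
# pricing page, so stale chunks never linger in the vector database:
# sync_source("https://example.com/pricing", reingest=my_reingest_function)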

Side note: this “auto sync” feature is by far the most requested Enterprise feature in our platform after we implemented it.

Frequently Asked Questions

But wait — you can never be 100% sure with hallucinations, correct?

Correct — that is why I often refer to hallucinations like DevOps people refer to “uptime”. For some people, 98% is good enough — for others, they need 99.999% accuracy.

Hallucination is like “uptime” or “security”. There is no 100%. Over time, we will come to expect “Five 9s” with hallucinations too.

Technically, what are the major anti-hallucination methods?

While we had to put in anti-hallucination at ALL parts of the RAG pipeline, here are some that had the most effect:

1. Query Pre-processing: Using an agent (called “InterpreterAgent”) to understand the user intent gave the most improvement. It not only helped retrieve better CONTEXT from the vector DB search, but also helped create a better prompt for the LLM API call, resulting in better AI response quality.

2. Anti-hallucination prompt engineering: We created the concept of a dynamic “context boundary wall” that is added to each prompt. This helped reaffirm anti-hallucination in the final prompt that the LLM (like ChatGPT) operates on (a rough sketch of the idea follows after this list).

3. LLM Model: We used ONLY GPT-4. The higher the model’s ability to reason, the better the anti-hallucination score. And yes — while this might make it a lot more expensive, it’s the price to pay for the added anti-hallucination points.

Going from 97% to 99.999% for anti-hallucination is a long and hard process.
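
To give a flavor of point 2 above, here is a rough sketch of what a “context boundary” can look like in practice. This is my own illustration of the idea, not the actual prompt we use:

# A rough sketch of a "context boundary": wrap the retrieved chunks in explicit
# delimiters and restate the grounding rule right next to the question.
def build_bounded_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)
    return f"""You may use ONLY the material between <context> and </context>.
If the answer is not fully contained in that material, reply exactly: "I don't know."

<context>
{context}
</context>

Remember: do not use any knowledge outside the <context> block above.
Q: {question}
A:"""

# Example:
# print(build_bounded_prompt("How much does the Pro plan cost?", ["<chunk 1>", "<chunk 2>"]))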

But wait — can’t GPT-4 or Llama-3 just fix this for us?

A general-purpose LLM like ChatGPT (or Llama-2) can NEVER fully control hallucinations.

Think about it: These models are by definition designed to hallucinate or be “creative” (in the words of Sam Altman).


Expecting ChatGPT to not hallucinate is like asking Google to ONLY show truthful articles. What would that even mean?

Hmmmf — I don’t believe you. I want to see this in action.

I get this a lot. Here are some live chatbots. If you can get them to hallucinate, please drop a comment below and if your concern is real, I’ll send a reward your way :-)

CustomGPT’s Customer Service: Consolidated chatbot with all of CustomGPT’s knowledge.

MIT’s ChatMTC: Multiple knowledge bases with MIT’s expertise on Entrepreneurship.

Tufts University Biotech Research Lab: Decades of biotech lab research documents and videos.

Dent’s Disease Foundation: Consolidated knowledge from PubMed and articles about a rare disease.

Abraham Lincoln: Public articles and websites about Honest Abe.

Side note: Hallucination should NOT be confused with jailbreaking. Jailbreaking is where you are trying to break your own session using prompt injection. It’s like lighting your house on fire and blaming the fire department.

Dude, why not just create a verification agent to check the AI response?

This is a great idea — IF you have the luxury of doing it.

The idea: Take the AI response, and then confirm it using an LLM against the context and the user query. So think of a prompt like this:

Act like a verification agent to confirm whether the assistant is staying within the context for the given conversation.

The problem: Most real-world use cases use response streaming. This means that as soon as the first word is available from the LLM (like ChatGPT), it is sent to the user. In such cases, by the time the full response has been dished out to the user, doing verification is too late.

But yes — if this was a non-streamed or offline use case, then adding a verification agent would definitely help.
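
For that offline case, a minimal sketch of such a verification gate might look like this (assuming the OpenAI Python client; the PASS/FAIL protocol is just an illustration):

# A minimal sketch of a verification gate for the non-streamed case
# (assumes the OpenAI Python client; the PASS/FAIL protocol is illustrative).
from openai import OpenAI

client = OpenAI()

def response_is_grounded(context: str, question: str, answer: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are a verification agent. Reply with exactly PASS if the "
                "assistant's answer stays within the given context, otherwise FAIL."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"},
        ],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("PASS")

# Gate the answer before it reaches the user:
# if not response_is_grounded(retrieved_context, user_question, draft_answer):
#     draft_answer = "I don't know."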

Conclusion

So these are some basic lessons and rules learned from the school of hard knocks.

I hope you will share your own hallucination experiences in the comments below, so that we can all tackle what is effectively a defect in AI and solve it to create safer and more reliable AI systems.

The author is CEO @ CustomGPT.ai, a no-code/low-code cloud RAG SaaS platform that lets any business build RAG chatbots with their own content. This blog post is based upon experiences working with thousands of business customers over the last 9 months (since the ChatGPT API was introduced).
