Why does AI still hallucinate?

Transcript

Keith Shaw: Generative AI has come a long way in helping us write emails, summarize documents, and even generate code. But it still has a bad habit we can't ignore — hallucinations.

Whether it's making up citations, getting facts wrong, or confidently describing software features that don't exist, AI systems are producing content that simply isn’t true. On this episode of Today in Tech, we're going to dive in and discuss why AI is still hallucinating. Hi, everybody.

Welcome to Today in Tech. I'm Keith Shaw. Joining me in the studio today is Byron Cook. He is a Distinguished Scientist and Vice President at AWS. Welcome to the show, Byron.

Byron Cook: Thank you very much for having me.

Keith: I’m a little bit intimidated — your résumé is incredibly impressive. So, I’m honored you're here to talk with us. For the purposes of this discussion, do we need to come up with a definition of what it means when AI is “hallucinating”?

Is it more than just giving a wrong answer? When you hear the term “hallucination,” what comes to mind?

Byron: I'm of two minds. In the press, hallucination is seen as a bad thing.

But actually, hallucination can be a good thing — it’s the creativity we seek when using a transformer-based model. I think the misconception is that it’s inherently negative. But what you want is to use tools in composition that ensure the answers you’re getting are appropriate for your context.

Keith: When generative AI first came onto the scene, a lot of businesses expected it to give the correct answer to whatever question was asked.

I think they failed to recognize that there’s also a creative aspect to it — people got excited about drawing pictures, creating videos, all sorts of things. So when the models started spitting out information that wasn’t true, it freaked people out.

Byron: People hallucinate too.

You hire someone, and they might do the wrong thing — type the wrong command — sometimes by accident, sometimes maliciously. So it’s not a surprise that a model trained on human content might also make mistakes.

If it’s told to produce a result no matter what, and it can’t predict the right thing, it won’t say, “I don’t know.” That’s why it makes up citations, for example.

AI, broadly speaking, encompasses a number of techniques — Bayes' theorem, cognitive models, symbolic logic. But what really excited the public was the launch of transformer-based models, which are quite pure. Early on, they literally just predicted what the next token should be.

So it’s like saying a few words, then guessing what should come next. There's no moral reasoning or thought — just probability. And what shocked us, even those of us who’ve worked in this field for 30 years, was just how effective it turned out to be.
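To see how bare next-token prediction plays out, here is a toy sketch in Python. The bigram table is made up for illustration and is nothing like a real transformer, but the loop captures the point Byron is making: the program only asks what usually comes next, never whether the result is true.

```python
import random

# Toy "model": for each word, the words that followed it in our (made-up)
# training text, with counts. A real transformer learns a far richer
# version of this from billions of tokens.
bigram_counts = {
    "the":   {"cat": 3, "dog": 2, "paper": 1},
    "cat":   {"sat": 4, "cited": 1},
    "dog":   {"sat": 2, "barked": 3},
    "paper": {"cited": 2, "claims": 3},
    "sat":   {"quietly": 2, "down": 3},
}

def next_token(word):
    """Sample the next word in proportion to how often it followed `word`."""
    options = bigram_counts.get(word)
    if options is None:
        return None  # nothing learned: the toy model simply stops
    words = list(options.keys())
    weights = list(options.values())
    return random.choices(words, weights=weights, k=1)[0]

def generate(start, max_len=6):
    """Keep predicting the next token until the model runs out or hits max_len."""
    out = [start]
    while len(out) < max_len:
        nxt = next_token(out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

print(generate("the"))  # e.g. "the paper claims" -- fluent-sounding, never fact-checked
```

Nothing in that loop checks facts, which is exactly why a confident-sounding continuation can still be wrong.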

But it’s also not surprising that a system making predictions based on token sequences would sometimes generate inaccurate or nonsensical responses. We've known about this problem for quite a while — longer than the general public has.

Keith: So the big question is: Why are hallucinations still happening?

Is it because hallucinations can be useful, or should we fix them? A lot of businesses would argue we should fix them — so we’re not giving away $1 cars by mistake, for example.

Byron: I work in what’s now called automated reasoning — it used to be known as symbolic logic. It’s a branch of AI.

If you look at the origins of the term “AI,” it goes back to the 1950s and the Dartmouth workshop, where different disciplines came together to explore intelligent systems. My discipline focuses on getting the details exactly right. But even defining what’s “right” can be surprisingly challenging.

For example, if you look at rules around immigration, HR, or medical leave — when you start breaking them down, you find all sorts of weird edge cases. Even domain experts will disagree on those.

So if you're willing to do the work ahead of time — defining your domain and rules — then there are tools that can help you achieve 100% accuracy. But just throwing a PDF into a transformer model and expecting perfect results? That’s unrealistic.

Keith: And sometimes the training data has errors, right?

Byron: That’s part of it. Also, rules change. Pricing for public transit changes over time, so models trained on old data might return outdated info. And then there's how the model is trained — it's all matrix multiplication.

It’s synthesizing data based on prior experience, just like we do. Sometimes that means creative problem-solving, but the solution might not work.

Keith: What about overconfidence in the output? Is that a human failing or an AI issue?

Byron: I’m not a cognitive scientist, but I’ll give you my take. Our brains are very receptive to conversational formats.

That’s why we enjoy talking this way — and it’s also why we're so easily fooled. These models are designed to sound fluent, which makes them seem trustworthy. But that doesn't mean they’re always right.

If the model is just making things up, people who don’t understand what’s going on under the hood will believe it. That’s the problem.

Keith: Can AI even say, “I don’t know”? In media training, I was told not to BS my way through an answer.

Just say “I don’t know,” and move on. Can AI do that?

Byron: In tools I work on, we can configure the AI to ask, “Why are you asking this?” or to clarify the question.

But early on, model providers leaned into “aggressive answering” because otherwise the tools wouldn’t have succeeded with users.

When I first joined Amazon, I learned the importance of asking why people are asking a question. Often they’re actually asking something else. That kind of active listening and clarification is critical.

Keith: And if a chatbot gives me an answer I don’t like, I’ll rephrase or narrow the prompt.

That often helps — but people might expect the first answer to be right.

Byron: We're still learning. Different AI capabilities — formal reasoning tools, for example — can be combined in different ways. Industry is still figuring out the best combinations that provide both business value and accuracy.

Keith: And that’s why companies turned to retrieval-augmented generation (RAG) — so the AI only uses verified, company-specific data.

Byron: Exactly. That helped. But to go further, we need automated reasoning. I’m biased, of course, because that’s my field.
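For readers curious what the RAG pattern looks like in practice, here is a minimal sketch, assuming a toy keyword-overlap retriever and a placeholder call_llm function (both invented for this example). Production systems use vector embeddings and a real model endpoint, but the shape is the same: retrieve trusted documents first, then constrain the model to answer from them.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# `company_docs` and `call_llm` are placeholders invented for illustration.

company_docs = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(question, docs, top_k=2):
    """Score docs by naive keyword overlap and return the best matches."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def build_prompt(question, docs):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def call_llm(prompt):
    """Placeholder for a real model call via your provider's SDK."""
    raise NotImplementedError

question = "What is the refund window?"
prompt = build_prompt(question, retrieve(question, company_docs))
print(prompt)  # grounding the model in retrieved text narrows what it can make up
```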

If you’re writing poetry, you don’t need formal correctness. But if you're dealing with cryptography, cloud infrastructure, or security policies — those require provable correctness. That’s where automated reasoning really shines.

Keith: So you’re combining automated reasoning with generative AI?

Byron: Yes.

What’s exciting is that these models can make formal tools more accessible. And on the flip side, those reasoning tools can increase the reliability of generative systems — especially in high-stakes scenarios where outputs have legal, financial, or safety consequences.

Byron: For thousands of years — going back to Plato, Socrates, Aristotle — we’ve had a notion of truth. You can make a true argument, or a false one.

The foundations of truth involve things like modus ponens — if A is true, and A implies B, then B must also be true. Automated reasoning is the algorithmic search for proofs using formal logic. You write code that finds those arguments.
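To make the modus ponens idea concrete, here is a small example using the Python bindings of the open-source Z3 solver (chosen here purely for illustration; it is one of many automated reasoning tools). We assert A and A implies B, then ask whether B could be false; the solver's unsat answer is a machine-checked proof that it cannot.

```python
# pip install z3-solver
from z3 import Bools, Implies, Not, Solver, unsat

A, B = Bools("A B")

s = Solver()
s.add(A)              # A is true
s.add(Implies(A, B))  # A implies B
s.add(Not(B))         # suppose, for contradiction, that B is false

# If no assignment can satisfy all three constraints, B is forced to be true
# whenever the premises hold -- modus ponens, checked by machine.
print("B is proven" if s.check() == unsat else "B is not entailed")
```

Automated reasoning tools scale this same refutation idea to much larger rule sets, searching automatically for proofs or counterexamples.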

Once found, you can check them with an open-source, widely accepted proof checker. This has been used for years in safety-critical systems. Now, we’re applying that to chatbots. At Amazon, we’ve launched something called automated reasoning checks as part of Bedrock Guardrails.

We map natural language inputs to logic, and then prove or disprove the correctness of statements. If needed, we can even build a supporting argument.

Keith: You lost me a little bit in the middle there — but it sounds like, if you write enough rules, the system can hold together?

Byron: Yes.

Ahead of time, before deploying these tools, you define the rules — whether it's for medical leave, airline ticketing, or whatever the context is. We help write those rules in software, translate them from natural language, and uncover the consequences or edge cases.

Then the model — using very natural language — communicates those decisions to non-technical users or agentic systems.

Keith: So automated reasoning is kind of like a “logic cop”? It stands between the input and output, checking if something makes logical sense?

Byron: Exactly.

If an AI says, “This pizza is half cheese and half pepperoni,” it needs to know that the whole pizza is cheese-based and pepperoni is a topping. That kind of commonsense reasoning doesn’t yet exist in most systems, but we can use automated reasoning to enforce it.

The term in academic literature for combining these approaches is neuro-symbolic AI — “neuro” for the transformer-based cognitive models, and “symbolic” for the formal logic. You can run them side-by-side, filtering the true from the untrue.
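A rough sketch of that side-by-side arrangement, with a stand-in generator and a hand-written symbolic rule, both invented for illustration: the statistical side proposes candidate answers, and only the ones that pass the logical check are allowed through.

```python
# Neuro-symbolic filtering sketch: a (fake) generative model proposes answers,
# a symbolic rule check accepts or rejects them. Names and rules are illustrative.

def fake_generate_candidates(question):
    """Stand-in for a transformer: returns plausible-sounding candidate answers."""
    return [
        {"item": "pepperoni", "role": "topping"},
        {"item": "pepperoni", "role": "base"},      # fluent but nonsensical
        {"item": "cheese",    "role": "base"},
    ]

def symbolic_check(answer):
    """Encoded domain rule: cheese can be the base of the pizza; pepperoni is only a topping."""
    allowed_roles = {"cheese": {"base", "topping"}, "pepperoni": {"topping"}}
    return answer["role"] in allowed_roles.get(answer["item"], set())

candidates = fake_generate_candidates("Describe a half-cheese, half-pepperoni pizza.")
accepted = [c for c in candidates if symbolic_check(c)]
print(accepted)  # the "pepperoni as base" candidate is filtered out
```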

Keith: And the AI itself doesn’t know whether something is true — it’s just predicting based on what looks like past text?

Byron: Exactly. It says, “This looks like what someone would say next,” without knowing if it’s correct. And honestly, we do that too.

When we speak, we often don’t know where a sentence is going — we’re emitting words, then adjusting in real-time. There's a "slow brain" and a "fast brain," and we teach ourselves to engage the slow one when accuracy matters. But fundamentally, we improvise a lot.

Keith: So where is automated reasoning being used to reduce hallucinations?

Byron: We’ve launched automated reasoning checks as part of Bedrock Guardrails for chatbots, but we’ve also applied it internally for 10 years at AWS.

We’ve proven the correctness of core infrastructure components — virtual networks, cryptography, virtualization, storage, networking, etc. When we use generative AI to optimize or generate code, we can re-check its correctness. Customers can start to do this too by combining generative models with open-source reasoning tools.

Keith: But isn't there a difference between math-based correctness and subjective areas like humor?

Byron: Absolutely. There’s a continuum — some answers are Boolean (yes/no), others are continuous. “Is this funny?” falls into that gray area. One person’s joke is another person’s eye roll.

But I wouldn’t say AI can’t be funny. I’ve used AI to help with naming projects, and it’s actually been pretty clever.

Keith: We’re also moving into the world of agentic AI. Are you comfortable with agents making decisions based on potentially false information?

Byron: We need infrastructure to safely deploy agentic systems in high-stakes environments — data loss, financial risk, safety, etc.

Agentic AI allows non-programmers to build and operate distributed systems. If you say, “Go change my investment portfolio,” the agent is talking to databases, moving money, triggering actions. That’s essentially a distributed system, and it needs guardrails.

To help non-experts safely deploy these tools, we need to define clear rules and safeguards. Automated reasoning is one way to do that.

Keith: I think you mentioned a milk example before the show?

Byron: Right. I don’t need a proof of correctness for a bot that brings me milk.

Either it brings the milk or it doesn’t — and if it doesn’t, I leave a bad review. But if the bot is managing investments or performing a task that could have legal or financial consequences, then yes, correctness matters.

You can’t just mail any product to any customer. Some items are restricted by location or regulation. So if an agentic AI starts mailing restricted goods to California when it’s only legal in Nevada, that’s a serious problem.
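A sketch of what one encoded shipping rule could look like before an agent is allowed to act; the product name, states, and rule table below are hypothetical, and real restrictions would come from the relevant regulations.

```python
# Hypothetical guardrail: check an encoded shipping rule before an agent acts.
# The product name and state rules below are invented for illustration only.

RESTRICTED = {
    "product-x": {"allowed_states": {"NV"}},   # hypothetically legal to ship only to Nevada
}

def can_ship(product, state):
    """Return True if the encoded rules permit shipping this product to this state."""
    rule = RESTRICTED.get(product)
    if rule is None:
        return True                     # unrestricted product
    return state in rule["allowed_states"]

def agent_ship(product, state):
    """Agent action gated by the encoded rule; blocked actions are surfaced, not executed."""
    if not can_ship(product, state):
        return f"BLOCKED: {product} cannot be shipped to {state}"
    return f"OK: shipping {product} to {state}"

print(agent_ship("product-x", "CA"))  # BLOCKED
print(agent_ship("product-x", "NV"))  # OK
```

Because the rules live in one place, updating them when a jurisdiction changes its regulations is a data change rather than a retraining job.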

We need encoded rules for that, and those rules change frequently by jurisdiction.

Keith: So as emperor of the universe — just hypothetically — how would you solve the hallucination problem?

Byron: Well, we’ve been dealing with it for decades.

You go to the DMV and ask a slightly off-the-wall question — you’re not sure if the person at the counter is right, so you ask for a manager, or get a lawyer. That’s how our socio-technical systems work: inconsistent truths resolved through process.

Now we have the opportunity to encode more of those rules, automate consistency checks, and make that system more democratic and efficient. If something is wrong, it can be flagged. Experts can revise the rules, and those changes can be replayed across previous queries to ensure consistency.

That lowers the friction to truth and improves decision-making for everyone.

Keith: Do you think we’re reducing hallucinations overall — or will they increase as the systems become more complex?

Byron: We’ve always had hallucinations — just not from AI.

Now, because we're trying to reduce friction when accessing information, hallucinations have become more visible. Over time, as society learns how to define truth and apply the right tools, I think we’ll improve.

Keith: If you were advising a CEO or CIO who’s concerned about hallucinations in a project, what would you say?

Byron: It depends on their business. Some organizations are built on their understanding of truth — like those who master the U.S. tax code.

That knowledge becomes more valuable in the AI era. By codifying their understanding, they can remove hallucinations while enabling broader access to their expertise.

Keith: Final question: Have you ever seen a hallucination that surprised or amused you?

Byron: Well, not funny exactly — I’m a logician — but fascinating, yes. Terence Tao, the Fields Medal-winning mathematician, live-streams his search for mathematical proofs. He combines generative AI with formal reasoning tools, like the Lean theorem prover, whose lead developer now works at Amazon.

It’s incredible to watch even the greatest living mathematician rely on AI to discover new connections.

Keith: That’s amazing. So if I asked ChatGPT, “What’s the last digit of pi?” and it said “four,” would that be surprising?

Byron: [Laughs] There is no last digit of pi, so that would definitely be a hallucination. But it’s also a great example of the kind of thing people still test AI with.

Keith: Final thought: Will AI get better if we simply tell people it’s getting better? What about trust?

Byron: Trust will improve with reinforcement learning.

Early ML thinking said, “Train on everything.” But now, the shift is toward reinforcement learning with objective functions that aren’t based on what humans already know. That’s how AI exceeds human capability.

Remember move 37 in AlphaGo — the move no expert expected, but it was brilliant. In some areas, AI is already better than humans. We’ve always had mechanisms for dealing with misinformation or confusion; now we just need similar ones for AI.

Keith: Byron Cook, thank you again for this fascinating discussion.

Byron: Thanks — it was fun.

Keith: That’s all the time we’ve got for today’s show. Be sure to like the video, subscribe to the channel, and drop your thoughts in the comments. Join us every week for new episodes of Today in Tech. I'm Keith Shaw — thanks for watching.  

Overview

Generative AI has revolutionized how we write, code, and create—but it still hallucinates. In this episode of Today in Tech, host Keith Shaw sits down with Byron Cook, Distinguished Scientist and VP at AWS, to break down why AI continues to make things up—and what we can do to stop it.

They explore:
* What AI hallucinations really are—and why they’re not always bad
* How automated reasoning can serve as a “logic cop” for generative AI
* The challenge of defining truth in business contexts
* Agentic AI: Why it needs stronger safeguards before widespread adoption
* How Amazon is using formal methods to improve accuracy and reduce risk
* When hallucinations are harmless—and when they can cost you your job, money, or reputation

If you’re an enterprise leader, developer, or tech enthusiast trying to understand the next phase of trustworthy AI, this is the episode for you.

Don’t forget to like, comment, and subscribe for more tech insights every week!
