Are Large Language Models Financially Literate? An Experiment with the "Big Five" Questions
In an era where artificial intelligence increasingly influences our daily decisions, I conducted a short experiment to test the financial literacy of three leading Large Language Models (LLMs): Claude, DeepSeek, and ChatGPT.
The Test
I tested each LLM using Lusardi's "Big Five" questions, a standardized set that has been used globally to assess financial literacy. The five questions cover compound interest, the effect of inflation on purchasing power, risk diversification, the relationship between interest rates and bond prices, and mortgage payments.
Each LLM was presented with the questions individually, and its responses were recorded.
The Results
Surprisingly, or perhaps unsurprisingly, all three LLMs answered all five questions correctly. Even when challenged with a follow-up question asking them to confirm their certainty, they stood firmly by their correct answers. As an additional challenge, I amended the wording of the test questions slightly to invert their logic. For example, I changed question two so that the interest rate was larger than the inflation rate. The answers were still correct, suggesting that the models' accuracy goes beyond simply "remembering" the training data verbatim.
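The inverted variant of question two can be checked with simple arithmetic. The sketch below is my own illustration, not part of the original experiment; it verifies the expected answers to the compound-interest question and to both the original and inverted inflation questions:

```python
# Arithmetic check of two of the "Big Five" questions, including the
# inverted variant of question two described above.

def compound_balance(principal: float, rate: float, years: int) -> float:
    """Balance after leaving money to grow at a fixed annual interest rate."""
    return principal * (1 + rate) ** years

def purchasing_power_change(interest: float, inflation: float) -> str:
    """After one year, can you buy more, the same, or less than today?"""
    real_growth = (1 + interest) / (1 + inflation) - 1
    if real_growth > 0:
        return "more"
    if real_growth < 0:
        return "less"
    return "the same"

# Question 1: $100 at 2% for 5 years grows to more than $102 (compounding).
assert compound_balance(100, 0.02, 5) > 102

# Question 2 (standard): 1% interest, 2% inflation -> you can buy less.
assert purchasing_power_change(0.01, 0.02) == "less"

# Question 2 (inverted, as in the experiment): interest above inflation
# -> you can buy more, so the correct answer flips.
assert purchasing_power_change(0.02, 0.01) == "more"
```

A model that merely memorized the canonical wording would keep answering "less" on the inverted variant; getting "more" is weak evidence of something beyond rote recall.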
Why the Experiment Matters
Lusardi and Mitchell's paper "The Importance of Financial Literacy: Opening a New Field" (2023) documents concerningly low levels of financial literacy among humans.
As more people turn to LLMs for anything and everything, including financial guidance - whether through direct questions or as part of broader discussions - it's important to understand how well LLMs handle basic financial concepts. The experiment suggests that fundamental financial principles are correctly encoded across multiple leading LLMs, which provides some reassurance given the growing appetite for using these AI systems for information and advice.
However, this reassurance should be tempered with caution.
While it's encouraging that these models can correctly answer standardized financial literacy questions, we must remember that LLMs provide probabilistic responses based on their training data, not deterministic calculations or certified financial advice. The accuracy on these basic questions, while promising, doesn't guarantee reliable answers to more complex, context-dependent financial queries.
Looking Forward: Three Concrete Research Directions
This morning's experiment, while limited in scope, points to several promising avenues for more rigorous research.
A Note on Methodology
While these results are intriguing, it's important to acknowledge the limitations of this experiment. As someone who isn't an AI or LLM expert, my testing approach may not follow standard practices for evaluating AI systems. The questions, while standardized for human financial literacy testing, might not be the optimal way to assess an LLM's true understanding of financial concepts. Future research by AI experts could employ more rigorous methodologies to validate these preliminary findings and explore how LLMs actually process and "understand" financial information.
Professor of Practice in Financial Literacy and Wellbeing
Thanks for tagging me on this, Daniel LIEBAU. I'm still in training mode, so I won't provide substantive feedback, but I am very keen to know more and keep learning!
Product Owner at Revolut
Interesting experiment! A couple of thoughts come to mind:
- Quite interested in how you structured the follow-up questions, challenged the results, and measured success there.
- It could be interesting to run a similar experiment with more open-ended questions. If we want to use LLMs to solve financial illiteracy, we probably need models that can interact more freely with the audience to make it fun and engaging while consistently giving the right answers.
- Although all three models performed quite well here, we are not doing them justice if we haven't fine-tuned them yet :)
Computer Scientist Bridging Disciplines to Drive Innovation | Blockchain & Web3 Leader
This opinion article appeared in Communications of the ACM last December. The author coined an interesting term, "prompt-hacking". It may be useful when considering how to document your follow-up experiments: https://cacm.acm.org/opinion/prompting-considered-harmful/#:~:text=First%2C%20prompt%2Dbased%20interfaces%20are,shaky%20foundation%20of%20prompt%20engineering.
120+ Books FREE w/ #Amazon #KindleUnlimited Link Below TEDx: Philosophy In Action: The Asheboro Trials Theme: Augmented Humans Supervising Ari and D.A.T.A. I at Gemach DAO #gemachdao #Ari
Daniel LIEBAU, very well done. Did you use Deep Research for ChatGPT? If not, that result is even more mind-blowing. 🤯 Gemach DAO
Associate Professor (Finance) | Top-50 QS ranked University graduate| Top 12% Global Economist
Daniel LIEBAU, amazing!