The Zero-Shot Crisis: Lessons Learned in the AI/ML Community
In early-to-mid 2023, my team and I, along with many others in the hardworking AI/ML community, experienced what I now refer to as the "zero-shot crisis." So, what exactly is the zero-shot crisis? The Jackie Chan meme in the title picture of this article captures the essence of our collective astonishment. For the last decade, working with AI or machine learning has been a complex endeavor that typically unfolded in the following manner:
1. Define the business problem and collect a large amount of domain data.
2. Label or annotate that data, often manually and at considerable cost.
3. Train or fine-tune a model on the labeled data.
4. Evaluate the model against a controlled, held-out test set.
5. Deploy the model, monitor it, and retrain as the data changes.
From these steps, it is clear that a successful AI/ML project that generates real business impact requires significant resources. One such project was the European Patent Office's (EPO) Auto-Classification AI implementation, which we rolled out shortly before ChatGPT was released. Naturally, we tested ChatGPT, and later other models, by copy-pasting a published patent and, voilà, we got a CPC classification. Just like that. If you're unfamiliar with zero-shot classification, it means that the model was never specifically trained to classify patents in the Cooperative Patent Classification (CPC) scheme, yet it managed to perform the task. It was astounding.
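For readers who have never tried it, here is a minimal sketch of what such a zero-shot call looks like with the OpenAI Python client. The model name, prompt wording, and answer format are illustrative assumptions, not the exact setup we used back then.

```python
# Minimal zero-shot CPC classification call (illustrative sketch).
# The model name and prompt are assumptions, not our original setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

patent_text = "..."  # paste the abstract or claims of a published patent

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat model will do
    messages=[
        {
            "role": "system",
            "content": (
                "You are a patent classification assistant. Return the "
                "single most likely CPC symbol, e.g. 'G06N 20/00', for "
                "the given patent text. Answer with the symbol only."
            ),
        },
        {"role": "user", "content": patent_text},
    ],
)

print(response.choices[0].message.content)
```

No training, no labeled data, no pipeline: one prompt and the model produces a classification. That is what stunned us.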
Jackie Chan a second time, just for the effect:
Data Science strikes back
Unfortunately, or perhaps fortunately, the story has a part two: "The Return of Data Science." Soon after we overcame the initial shock, we noticed a few critical issues. For example, some of the CPC symbols the model produced didn't exist: they were simply made up. Today this behaviour is called hallucination, or, as one recent paper put it more bluntly, "ChatGPT is bullshit."
We now know that zero-shotting a large language model does not necessarily yield better-quality results for this specific case. It is just more expensive and more difficult to evaluate, as the generative answer can be hard for a machine to parse. What we learned is that we still need an evaluation framework. In reality, all of points 1-5 are still very much needed; they just look a bit different now. I can only recommend always building a proper evaluation framework on a controlled test data set for every problem you want to solve with generative AI and LLMs. Where genAI can really help is in generating plausible training data, but that is yet another project whose performance you need to understand: it simply shifts points 1-5 to another task!
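As a concrete sketch of what such a framework can look like for generative CPC classification: parse the free-text answer into candidate symbols, check them against the official scheme (which also catches made-up symbols like the ones mentioned above), and score them against gold labels. The symbol regex is a simplified assumption and the metrics are illustrative, not the framework we actually built.

```python
# Sketch of a controlled evaluation for generative CPC classification.
# The symbol regex is simplified and the metrics are illustrative
# assumptions; adapt both to your own CPC scheme export and needs.
import re

# Loose pattern for CPC symbols such as "G06N 20/00" (simplified).
CPC_PATTERN = re.compile(r"\b[A-HY]\d{2}[A-Z]\s?\d{1,4}/\d{2,6}\b")

def extract_symbols(answer: str) -> set[str]:
    """Pull candidate CPC symbols out of a free-text model answer."""
    return {s.replace(" ", "") for s in CPC_PATTERN.findall(answer)}

def evaluate(test_set: list[tuple[str, set[str]]], valid_symbols: set[str]) -> dict:
    """test_set holds (model_answer, gold_symbols) pairs; valid_symbols
    is the normalized set of all symbols in the official CPC scheme."""
    hallucinated = correct = total = 0
    for answer, gold in test_set:
        predicted = extract_symbols(answer)
        gold = {g.replace(" ", "") for g in gold}
        hallucinated += len(predicted - valid_symbols)
        correct += len(predicted & gold)
        total += len(predicted)
    return {
        "precision": correct / total if total else 0.0,
        "hallucination_rate": hallucinated / total if total else 0.0,
    }
```

The particular metric matters less than the principle: every model and prompt variant is scored against the same trusted gold labels.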
The same holds true for anything involving Retrieval-Augmented Generation (RAG): do not trust that random chunks combined with random embeddings will deliver the output you want your users to have. Create or collect real question-answer pairs and evaluate which combination of prompt, model, embedding, and chunk size (to mention just a few of the free parameters) works best, so that you can make an informed business decision. Do not become a genAI zombie who dumps everything together, closes their eyes, and hopes for the best.
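In practice this boils down to a small grid evaluation over the free parameters, run against the same collected question-answer pairs. The sketch below uses toy stand-ins for the pipeline and the scoring metric; in a real setup you would plug in your retriever, your LLM call, and a proper metric such as ROUGE, exact match, or an LLM judge.

```python
# Grid evaluation over the free parameters of a RAG setup (sketch).
# build_pipeline and score are toy stand-ins; replace them with your
# actual retriever, LLM call, and metric (ROUGE, exact match, judge).
from itertools import product

def build_pipeline(chunk_size: int, embedding: str, prompt: str):
    """Hypothetical factory returning a question-answering callable."""
    def answer(question: str) -> str:
        return f"[{embedding}/{chunk_size}/{prompt}] answer to: {question}"
    return answer

def score(predicted: str, gold: str) -> float:
    """Toy token-overlap metric; swap in something stronger."""
    pred, ref = set(predicted.lower().split()), set(gold.lower().split())
    return len(pred & ref) / len(ref) if ref else 0.0

# Real collected question-answer pairs (two toy examples here).
qa_pairs = [
    ("What does CPC stand for?", "Cooperative Patent Classification"),
    ("Who grants European patents?", "The European Patent Office"),
]

chunk_sizes = [256, 512, 1024]                 # tokens per chunk
embedding_models = ["emb-small", "emb-large"]  # hypothetical names
prompt_variants = ["terse", "with-citations"]

results = {}
for chunk, emb, prompt in product(chunk_sizes, embedding_models, prompt_variants):
    pipeline = build_pipeline(chunk, emb, prompt)
    scores = [score(pipeline(q), gold) for q, gold in qa_pairs]
    results[(chunk, emb, prompt)] = sum(scores) / len(scores)

best = max(results, key=results.get)
print("best configuration:", best, "average score:", results[best])
```

A loop like this turns "which setup is best?" from a matter of taste into a number you can defend in front of your business clients.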
This article was also partly inspired by the fact that I have personally been contacted several times to advise on "which LLM is best for working with patents." This question is almost impossible to answer in such general terms. More importantly, it shows that many colleagues who have recently entered the AI field still think it is just a matter of choosing the correct LLM, after which all problems are solved auto-magically. In our experience, this is not the case at all. While LLMs offer fantastic capabilities and have fundamentally changed the way we work, they have not eliminated the need for a robust methodology to deliver high and consistent quality to our business clients.
Proper testing and evaluation remain crucial. This also means we still need high-quality data to compare against. A purely qualitative analysis is not a replacement for a robust quantitative evaluation.