The Zero-Shot Crisis: Lessons Learned in the AI/ML Community
In early-to-mid 2023, my team and I, along with many others in the hardworking AI/ML community, experienced what I now refer to as the "zero-shot crisis." So, what exactly is the zero-shot crisis? The Jackie Chan meme in the title picture of this article captures the essence of our collective astonishment. For the last decade, working with AI or machine learning has been a complex endeavor that typically unfolded in the following manner:
1. Define the business problem and collect a large amount of domain data.
2. Label or annotate that data, often manually and at considerable cost.
3. Train or fine-tune a model on the labeled data.
4. Evaluate the model against a controlled, held-out test set.
5. Deploy the model, monitor it, and retrain as the data changes.
From these steps, it is clear that a successful AI/ML project that generates real business impact requires significant resources. One such project was the European Patent Office's (EPO) Auto-Classification AI implementation, which we rolled out shortly before ChatGPT was released. Naturally, we tested ChatGPT, and later other models, by copy-pasting a published patent and, voilà, we got a CPC classification. Just like that. If you're unfamiliar with zero-shot classification, it means that the model was never specifically trained to classify patents in the Cooperative Patent Classification (CPC) scheme, yet it managed to perform the task. It was astounding.
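For readers who have never tried it, here is a minimal sketch of what such a zero-shot call looks like with the OpenAI Python client. The model name, prompt wording, and answer format are illustrative assumptions, not the exact setup we used back then.

```python
# Minimal zero-shot CPC classification call (illustrative sketch).
# The model name and prompt are assumptions, not our original setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

patent_text = "..."  # paste the abstract or claims of a published patent

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat model will do
    messages=[
        {
            "role": "system",
            "content": (
                "You are a patent classification assistant. Return the "
                "single most likely CPC symbol, e.g. 'G06N 20/00', for "
                "the given patent text. Answer with the symbol only."
            ),
        },
        {"role": "user", "content": patent_text},
    ],
)

print(response.choices[0].message.content)
```

No training, no labeled data, no pipeline: one prompt and the model produces a classification. That is what stunned us.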
Jackie Chan a second time, just for the effect:
Data Science strikes back
Unfortunately, or perhaps fortunately, the story has a part two: "The Return of Data Science." Soon after we overcame the initial shock, we noticed a few critical issues. For example, some of the CPC symbols the model produced didn't exist: they were simply made up. Today this behaviour is called hallucination, or, as one recent paper put it more bluntly, "ChatGPT is bullshit."
We now know that zero-shotting a large language model does not necessarily yield better-quality results for this specific case. It is just more expensive and more difficult to evaluate, as the generative answer can be hard for a machine to parse. What we learned is that we still need an evaluation framework. In reality, all of points 1-5 are still very much needed; they just look a bit different now. I can only recommend always building a proper evaluation framework on a controlled test data set for every problem you want to solve with generative AI and LLMs. Where genAI can really help is in generating plausible training data, but that is yet another project whose performance you need to understand: it simply shifts points 1-5 to another task!
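As a concrete sketch of what such a framework can look like for generative CPC classification: parse the free-text answer into candidate symbols, check them against the official scheme (which also catches made-up symbols like the ones mentioned above), and score them against gold labels. The symbol regex is a simplified assumption and the metrics are illustrative, not the framework we actually built.

```python
# Sketch of a controlled evaluation for generative CPC classification.
# The symbol regex is simplified and the metrics are illustrative
# assumptions; adapt both to your own CPC scheme export and needs.
import re

# Loose pattern for CPC symbols such as "G06N 20/00" (simplified).
CPC_PATTERN = re.compile(r"\b[A-HY]\d{2}[A-Z]\s?\d{1,4}/\d{2,6}\b")

def extract_symbols(answer: str) -> set[str]:
    """Pull candidate CPC symbols out of a free-text model answer."""
    return {s.replace(" ", "") for s in CPC_PATTERN.findall(answer)}

def evaluate(test_set: list[tuple[str, set[str]]], valid_symbols: set[str]) -> dict:
    """test_set holds (model_answer, gold_symbols) pairs; valid_symbols
    is the normalized set of all symbols in the official CPC scheme."""
    hallucinated = correct = total = 0
    for answer, gold in test_set:
        predicted = extract_symbols(answer)
        gold = {g.replace(" ", "") for g in gold}
        hallucinated += len(predicted - valid_symbols)
        correct += len(predicted & gold)
        total += len(predicted)
    return {
        "precision": correct / total if total else 0.0,
        "hallucination_rate": hallucinated / total if total else 0.0,
    }
```

The particular metric matters less than the principle: every model and prompt variant is scored against the same trusted gold labels.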
The same holds true for anything involving Retrieval-Augmented Generation (RAG): do not trust that random chunks combined with random embeddings will deliver the output you want your users to have. Create or collect real question-answer pairs and evaluate which combination of prompt, model, embedding, and chunk size (to mention just a few of the free parameters) works best, so that you can make an informed business decision. Do not become a genAI zombie who dumps everything together, closes their eyes, and hopes for the best.
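In practice this boils down to a small grid evaluation over the free parameters, run against the same collected question-answer pairs. The sketch below uses toy stand-ins for the pipeline and the scoring metric; in a real setup you would plug in your retriever, your LLM call, and a proper metric such as ROUGE, exact match, or an LLM judge.

```python
# Grid evaluation over the free parameters of a RAG setup (sketch).
# build_pipeline and score are toy stand-ins; replace them with your
# actual retriever, LLM call, and metric (ROUGE, exact match, judge).
from itertools import product

def build_pipeline(chunk_size: int, embedding: str, prompt: str):
    """Hypothetical factory returning a question-answering callable."""
    def answer(question: str) -> str:
        return f"[{embedding}/{chunk_size}/{prompt}] answer to: {question}"
    return answer

def score(predicted: str, gold: str) -> float:
    """Toy token-overlap metric; swap in something stronger."""
    pred, ref = set(predicted.lower().split()), set(gold.lower().split())
    return len(pred & ref) / len(ref) if ref else 0.0

# Real collected question-answer pairs (two toy examples here).
qa_pairs = [
    ("What does CPC stand for?", "Cooperative Patent Classification"),
    ("Who grants European patents?", "The European Patent Office"),
]

chunk_sizes = [256, 512, 1024]                 # tokens per chunk
embedding_models = ["emb-small", "emb-large"]  # hypothetical names
prompt_variants = ["terse", "with-citations"]

results = {}
for chunk, emb, prompt in product(chunk_sizes, embedding_models, prompt_variants):
    pipeline = build_pipeline(chunk, emb, prompt)
    scores = [score(pipeline(q), gold) for q, gold in qa_pairs]
    results[(chunk, emb, prompt)] = sum(scores) / len(scores)

best = max(results, key=results.get)
print("best configuration:", best, "average score:", results[best])
```

A loop like this turns "which setup is best?" from a matter of taste into a number you can defend in front of your business clients.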
This article was also partly inspired by the fact that I have personally been contacted several times to advise on "which LLM is best for working with patents." This question is almost impossible to answer in such general terms. More importantly, it shows that many colleagues who have recently entered the AI field still think it is just a matter of choosing the correct LLM, after which all problems are solved auto-magically. In our experience, this is not the case at all. While LLMs offer fantastic capabilities and have fundamentally changed the way we work, they have not eliminated the need for a robust methodology to deliver high and consistent quality to our business clients.
Proper testing and evaluation remain crucial. This also means we still need high-quality data to compare against. A purely qualitative analysis is not a replacement for a robust quantitative evaluation.