Evaluating AI in Health Economics: Beyond Replication

Haidong Feng, Associate Principal Scientist (Associate Director) at Merck

Replicating an Economic Model ≠ Evaluating an AI Model

Many early GenAI use cases in health economics (HE) focus on replicating existing models. But replication isn't evaluation. Some GenAI tools can replicate existing health economic models with results that align perfectly with human-built models on outputs like ICERs. Sounds impressive, but is it enough? Is this really the direction we should take when evaluating AI-assisted HE models? Is it feasible or realistic to build two models (AI vs. human) for every real HTA submission just to compare them? (A minimal sketch of what such a replication check actually tests appears at the end of this post.)

✅ Replication may test a single model's performance, but it doesn't advance the evaluation framework needed to assess trustworthy, transparent AI-assisted modelling.

Let me ask my HEOR and HTA colleagues a simple question: if an AI tool replicates a model in disease area X and produces exactly the same outputs (cost, QALYs, ICER) as an existing human-built model, does that alone remove your concerns about its transparency and reliability? And if you then use the same AI tool to build a de novo model from scratch, how can you guarantee it will still be 100% aligned with a hypothetical human-built model (if one even exists)?

✅ Also, let's be fair: humans make errors too. Models are built under uncertainty, and decisions are probabilistic and shaped by evolving evidence. Deviation between an AI-built and a human-built model does not necessarily mean the AI is wrong. Maybe new data is emerging. Maybe the AI applies statistical methods that are actually better than what is feasible in Excel. Maybe it uses updated assumptions. As the classic quote says: "All models are wrong, but some are useful."

✅ When we talk about "evaluating AI-built HE models", we're really talking about two layers of evaluation:
1. Evaluating the HE model itself: calculations, parameters, assumptions, settings, as highlighted in the NICE HTA Lab report.
2. Evaluating the AI tool that builds the model: its trustworthiness, transparency, and reproducibility.

This second layer is foundational. Every pharma company, biotech, and consultancy can build its own AI tool; there is no single "universal" AI tool for HE modelling, nor should there be. For trust to be earned, we must evaluate the AI itself:
- What AI architecture is being used?
- If it's RAG, what knowledge base is feeding it?
- If it's an agentic model, what supporting evidence does it output (e.g., R/Python survival analysis code, documentation of decision steps)? See the illustrative sketch at the end of this post.

Replication and comparison are useful starting points. But it's time for the HEOR community to move beyond replication and build robust evaluation frameworks tailored to GenAI's strengths and risks, especially around transparency, documentation, and human oversight.

👉 NICE HTA Lab project: https://lnkd.in/e7rXkcVH

Curious how others in #HEOR and #HTA are approaching this. Let's discuss.

#GenAI #HTA #Transparency #AgenticAI #HTAinnovation
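Two illustrative sketches, as referenced above.

First, what an output-level replication check actually tests. This is a minimal Python sketch with hypothetical numbers and a hypothetical tolerance, not a description of any real submission or tool: it compares an AI-built and a human-built model on incremental cost, incremental QALYs, and the resulting ICER.

```python
# Minimal sketch of a replication check: it compares headline outputs only.
# All numbers are hypothetical; real models report far richer structure.

def icer(delta_cost: float, delta_qaly: float) -> float:
    """Incremental cost-effectiveness ratio = incremental cost / incremental QALYs."""
    return delta_cost / delta_qaly

human_model = {"delta_cost": 25_000.0, "delta_qaly": 0.62}  # hypothetical
ai_model    = {"delta_cost": 25_100.0, "delta_qaly": 0.62}  # hypothetical

human_icer = icer(**human_model)
ai_icer = icer(**ai_model)

tolerance = 0.01  # 1% relative difference, an arbitrary illustrative threshold
relative_gap = abs(ai_icer - human_icer) / human_icer

print(f"Human-built ICER: {human_icer:,.0f} per QALY")
print(f"AI-built ICER:    {ai_icer:,.0f} per QALY")
print(f"Replicated within tolerance: {relative_gap <= tolerance}")
```

Passing such a check shows agreement on headline outputs for one disease area and one evidence base. It says nothing about the knowledge base, assumptions, or reasoning steps the AI tool used to get there, which is exactly the second evaluation layer described above.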
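Second, the kind of supporting evidence an agentic tool might emit alongside its model, per the R/Python survival analysis example above. The sketch below uses the open-source Python lifelines package as an illustrative choice; the data, the distribution choice, and the decision-log fields are all hypothetical.

```python
# Illustrative only: the kind of auditable artefact an agentic tool might emit
# alongside its model - a parametric survival fit plus a record of the choice made.
import pandas as pd
from lifelines import WeibullFitter

# Hypothetical trial extract: time-to-event in months and an event indicator
df = pd.DataFrame({
    "months": [3.1, 7.4, 12.0, 12.0, 18.5, 24.0, 24.0, 30.2],
    "event":  [1,   1,   1,    0,    1,    0,    0,    1],
})

# Fit a Weibull distribution to the observed (and censored) times
wf = WeibullFitter()
wf.fit(df["months"], event_observed=df["event"])

# A simple decision log a reviewer could audit (fields are hypothetical)
decision_log = {
    "step": "extrapolate overall survival beyond trial follow-up",
    "distribution": "Weibull",
    "rationale": "lowest AIC among candidate parametric fits (hypothetical)",
    "aic": wf.AIC_,
}
print(decision_log)
print(wf.summary)  # parameter estimates a reviewer can inspect and rerun
```

The point is not the particular fit but that the artefact is auditable: a reviewer can rerun the code, inspect the parameters, and see why a given extrapolation was chosen.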

