Note: this piece initially appeared as Helen King’s interview with Tim on PubTech Radar
As AI systems become serious consumers of research, scholarly publishing must rethink its formats, business models, and quality controls – because machine readers don’t want PDFs; they want structured, reliable science.
Large language models (LLMs) have already been trained on huge volumes of the written word; to an order of magnitude, everything humans have ever published online. As these systems mature, we will need to find new forms of input. If we want these tools to contribute meaningfully to scientific discovery, we need to rethink how research content is structured for their consumption and – necessarily – how it’s paid for.
The core limitation of today’s LLMs is that they operate by averaging. Their predictions are generated by assimilating vast numbers of tokens from existing texts and inferring what likely comes next. This approach may be useful in many domains, but it is ill-suited to scientific inquiry, where progress often stems from outliers – individual studies that take a new approach and reshape the consensus.
Scientific literature, moreover, is not of uniform quality. Some articles are foundational; many are derivative, flawed, or outright misleading. Yet because high-impact research is typically paywalled, and less rigorous work is often openly accessible, LLMs are disproportionately trained on lower-quality material. We’re asking AI systems to model the state of scientific knowledge by digesting the most available – not the most accurate – content.

To fix this problem, we need to provide AI with the underlying raw materials of science. But what are those raw materials? At base, each is an individual claim or conjecture: a set of conditions, a hypothesis that’s tested, and then a result in the form of both a dataset and a conclusion drawn from that dataset.
For the majority of research articles, the only accessible information comes as tables and figures that summarize the results, plus text describing the authors’ interpretation of those results. Authors can – and often do – overstate the robustness of their conclusions and the significance of their work. Unless you’re very skilled at filtering out these bloviations, it’s hard to determine the value and validity of the work from the words alone.
To enable AI systems to assist in research, we need to provide them with structured, reliable input. Not just the prose of published articles, but the underlying claims, data, and reasoning. A truly AI-ready version of a research paper might look very different from the human-readable PDF. It could be decomposed into modular units: individual hypotheses, linked datasets, methodological summaries, and direct mappings between evidence and conclusions. This kind of formatting would allow AI to assess the strength of specific claims and build a more nuanced understanding of the field.
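As a purely illustrative sketch, one of these modular units might be represented as a small structured record like the one below. The field names and schema are hypothetical – no standard for such “claim units” exists yet – and the example values are invented:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of one "claim unit" from a decomposed research article.
# No standard schema exists; every field name here is illustrative only.

@dataclass
class ClaimUnit:
    hypothesis: str            # the conjecture being tested
    conditions: List[str]      # experimental conditions or population studied
    methods_summary: str       # brief structured description of the method
    dataset_doi: str           # link to the underlying data in a repository
    conclusion: str            # what the authors infer from the data
    evidence_strength: str     # e.g. effect size, p-value, or a qualitative grade
    supports_conclusion: bool  # explicit mapping: does the evidence back the claim?

example = ClaimUnit(
    hypothesis="Compound X slows dopaminergic neuron loss in a mouse model",
    conditions=["6-OHDA lesioned mice", "n=24", "8-week treatment"],
    methods_summary="Randomised, blinded histological cell counts",
    dataset_doi="10.5061/dryad.example",  # hypothetical identifier
    conclusion="Compound X reduced neuron loss by ~30% vs control",
    evidence_strength="p = 0.02, moderate effect size",
    supports_conclusion=True,
)
```

The point of a record like this is not the particular fields, but that each claim carries its own evidence and its own link to data, so an AI reader can weigh it independently of the surrounding prose.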
The current publishing infrastructure is optimized for human readers. Research articles are delivered in a format that presumes a shared understanding (between humans!) of how to communicate findings – most commonly, via a PDF. This static medium persists because it enables a predictable exchange between sender and receiver. But AI readers operate under entirely different assumptions. They aren’t concerned with narrative flow or rhetorical nuance; they want structured input that enables reasoning.
This divergence opens the door to a new business model: offering AI-ready versions of research articles via a subscription. Human-readable content could remain open access, while publishers monetize machine-optimized content streams for institutional clients, particularly those running proprietary, in-house AI systems.
To give a concrete example of how this would work, imagine a pharmaceutical company subscribing to an AI feed that continuously ingests structured research about Parkinson’s disease, helping it identify promising leads in real time. In this scenario, access to high-quality, up-to-date content becomes essential, not optional.
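A minimal sketch of how such a client might consume that feed is below. The endpoint, parameters, and response fields are all invented for illustration – no such publisher API exists today:

```python
import requests

# Hypothetical sketch of an institutional client polling a machine-readable feed.
# The endpoint and response schema are invented; only the general pattern matters.

FEED_URL = "https://api.examplepublisher.org/v1/claims"  # hypothetical endpoint

def fetch_new_claims(topic: str, since: str) -> list[dict]:
    """Pull structured claim units about `topic` published after `since`."""
    resp = requests.get(
        FEED_URL,
        params={"topic": topic, "published_after": since},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["claims"]  # assumed response shape

# e.g. a pharma company scanning for promising Parkinson's leads
for claim in fetch_new_claims(topic="parkinsons-disease", since="2025-01-01"):
    if claim.get("supports_conclusion") and claim.get("dataset_doi"):
        print(claim["hypothesis"], "->", claim["conclusion"])
```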
The AI subscription model forces a renewed focus on quality control because the value of your journal depends on your ability to keep the fraudulent work out – nobody wants to train their AI on results that are fabricated or misleading.
The missing piece here is the data that underpins these articles. How can we prompt authors to provide their datasets, and make them sufficiently well curated that they can be understood by an AI? It comes back to the business model behind open data – and in this case, the AI subscriptions really do provide a financial incentive for publishers to work with authors to ensure that their datasets are shared alongside the manuscript. Moreover, if publishers are going to be earning subscription revenue, then maybe the author should see some part of that in return for the extra curation work on their datasets. Perhaps they could get a reduced APC for the Open Access (= human) version of the article, or there could be some form of revenue sharing.
AI models are only going to get better and better. The new generation of reasoning models is already so impressive, but what they need to keep improving is knowledge about how the world works, not just more information about the most likely next word in a sentence. AIs are not going to get that from the ‘words only’ version of human-readable articles that academic publishing currently produces, and that means there’s a huge opportunity for a new way to disseminate research.
Discussion
10 Thoughts on "Rise of the Machine Readers: What They Really Want to Read"
Certainly would take some involved coding. Articles that do have supporting data will likely publish it in a stand-alone repository, such as Dryad or figshare. So that’s a step: tell the machine-learning system to look for a data availability statement and follow it. But depending on the field, the data structure and metadata will be nonstandard. So it remains a challenge to achieve reliable, full AI data ingestion and analysis.
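As a purely illustrative sketch of that first step – scanning an article’s text for a data availability statement and pulling out repository identifiers – something like the following could work, though the section detection and DOI pattern here are far cruder than a real pipeline would need:

```python
import re

# Illustrative only: pull repository identifiers out of a data availability
# statement. Real articles vary enormously, so a production pipeline would need
# far more robust section detection and identifier resolution than this.

DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def find_dataset_links(full_text: str) -> list[str]:
    """Return DOIs mentioned in a 'Data Availability' section, if one exists."""
    match = re.search(
        r"data availability(.*?)(?:\n\n|\Z)",  # crude section boundary
        full_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return []
    return DOI_PATTERN.findall(match.group(1))
```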
Instead of framing this as yet another new output publishers should offer (most likely in addition to PDF, which is still in demand), why not ask universities and funders to adjust the outputs they expect from researchers? Until researchers are required to provide “individual hypotheses, linked datasets, methodological summaries, and direct mappings between evidence and conclusions” for promotion/tenure and grant applications, it will be difficult for publishers to get them from already pressured researchers. It seems a change like this should start much farther upstream.
Thank you very much for sharing these thoughts Tim!
It reminded me that I’m looking forward to the outcomes of the PLOS project to investigate a redesign of the scientific paper into what they called a “knowledge stack” in the RWJF grant application: https://plos.org/redefining-publishing/
Are machine-friendly materials closer to what researchers create before they massage them into article formats? Could the former be less work than the latter?
In the age of data, I hope that funders and tenure and promotion frameworks create meaningful incentives for researchers.
What some may hope for as more equitable knowledge generation may also favor consolidation because of increased complexity, scale, and infrastructure requirements. Some information service providers will become richer, but I wonder if this will provide more meaningful incentives to researchers, or if these incentives will remain twisted, just on a greater scale.
We’re gonna need bigger ontologies. AND standardization in the semantics across all the differing standards (e.g. SNOMED and FHIR). And that’s just English!
AND we’re gonna need better informed consent, privacy protection, prevention of algorithmic biases, + computational capacity that won’t destroy the planet and/or use up all our potable water! Research data is barely shared because it is usually pretty human-centric and ergo, sensitive. Sharing it is problematic from ethics standpoints. Besides, why do humans need to pander to these hungry AI overlords with “special sauce” anyway?
100%
Hi Michelle! Thanks for this comment. An LLM’s underlying weights (a colossal matrix recording which words are most likely to follow a particular sequence of other words) are – in some ways – an ontology of language. Training an LLM re-forms that ontology around the topic of interest, all without direct human intervention.
This is a very apt observation, and an area in which libraries and scholarly publishers are uniquely positioned to lead. I also imagine that there could be an opportunity for third parties to generate this sort of data model for published works. This is a big chance at a growth area – either in terms of public access, or in terms of revenue.
I think you are spot on with your analysis here, Tim. I’ve spent the last few days working with stakeholders involved in global biodata, exploring sustainability models for the massive (and exponentially growing) amounts of data they are now capturing. There’ll be a report to follow, but it’s fair to say we landed in much the same place as you – AI’s demand for data opens up huge new opportunities. That said, I can also see a couple of unintended consequences arising from this. One is that in future LLMs will go straight to the underlying data in some fields, cutting out publishers altogether, for all the reasons you note in the article. Another is that as open data starts to become monetised in this way it ceases to be open.
Hi Rob – thanks for the comment. I don’t think the underlying data will be quite as useful as data that’s woven into the narrative of the article, as the latter is explicitly presented in its context. The results of each small experiment can then become part of skeins of evidence showing that the world works in a particular way – which is how science ‘knows’ anything. An AI will be much better at assembling these skeins across many disparate topics than a forgetful and time-bound human.
Moreover, the insights an AI can get from direct access to a large database are going to be quite different from the understanding it can get from a ‘reading’ of the literature. I think the latter would be a necessary part of the AI’s education before it can be expected to find anything genuinely useful in a large database.