Plain text wins again: A story of the web, RDF, and LLMs

Tony Seale

The Knowledge Graph Guy

Plain text has outlived every hype cycle. It has survived the GUI revolution, the database boom, and the reign of the API. Now, in the age of large language models, it’s quietly winning again.

When I write or code these days, I keep my notes, rules, and small knowledge models in text. It isn’t nostalgia. It’s because both humans and machines can read them. No translation layers. No black boxes. Just meaning, expressed directly.

This practice traces back to the original web. Tim Berners-Lee’s idea was simple: publish information as plain text that anyone - or anything - could inspect, link, and process. Hyperlinks stitched those fragments together, forming a web of knowledge. Today’s foundational models are trained on that corpus. In a sense, they are statistical compressions of its structure and regularities.

Back in the day, the web’s text wasn’t quite enough for machines to understand. So came RDF, which reduced meaning to a minimal grammar: subject, predicate, object. It was a linguistic insight turned into data infrastructure. With that triple pattern you could describe the world in a format that was both legible and formal. A web of meaning, written in text.

The next evolution came with JSON-LD and schema.org. Sites began including small application/ld+json blocks beside their prose. JSON remained human-readable; the @context mapped words to globally defined IRIs with unambiguous semantics. Search engines could suddenly read both the story and its structure - the words and the graph beneath them - and index the web not just by text, but by meaning.

Now, enter large language models. Trained on that same web, they generate fluent prose that feels like understanding. Yet they highlight an old truth: natural language is ambiguous. A model can speak confidently about “Paris” without knowing whether it’s in France or Texas. RDF solved that decades ago - one plain-text IRI like https://lnkd.in/eBZjiTGt can anchor the word to a specific thing.

So the challenge isn’t to replace structure with language; it’s to coordinate the two. We want to feed models text they can reason about, but with hooks that tether words to verifiable meaning. That’s why the plain-text pattern is resurfacing: prose for humans, identifiers and relations for machines, both living side by side. This, more than any breakthrough in model size or token length, may define the next phase of AI.

The web taught us that linked plain text could scale to the world. The semantic web taught us how to describe that world formally. LLMs bring the narrative power. The future lies in fusing them - a web where every sentence can be both read and reasoned with.

Triples were the first step on that journey. They began as a sparse language for meaning, evolved to sit quietly next to our words, and now return as the hidden scaffolding beneath machine intelligence. After thirty years, the hero of plain text has come home.
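To make that concrete, here is a minimal sketch of the kind of application/ld+json block described above. The Wikidata IRIs are an illustrative choice of identifier (Q90 is Paris, Q142 is France); any stable IRI would do. The @context maps the plain words to schema.org terms, and the @id anchors “Paris” to the city in France rather than the one in Texas:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@id": "http://www.wikidata.org/entity/Q90",
      "@type": "City",
      "name": "Paris",
      "containedInPlace": { "@id": "http://www.wikidata.org/entity/Q142" }
    }
    </script>

A human reads the prose around the block; a machine reads the block itself and knows exactly which Paris the page means, however ambiguous the surrounding words are.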

Jack Jansonius

Developer Decision Intelligence\SQL\Python

2w

No translation layers. No black boxes. Just meaning, expressed directly. And then you're back to hermeneutics again. https://www.linkedin.com/pulse/hermeneutics-age-ai-jack-jansonius-4r5we/

Louis Hendriks

Founder & Initiator of Global Value Web

2w

Tony Seale, again I applaud you for making it so transparent and easy to understand for the masses. Quote: “The web taught us that linked plain text could scale to the world. The semantic web taught us how to describe that world formally. LLMs bring the narrative power. The future lies in fusing them - a web where every sentence can be both read and reasoned with.”

I do the same thing! :-) However, I'm still searching for a robust and easy-to-use way to embed RDF "mapping" and simple CSS declarations in markdown, without flooding it with too many "instructions" (that would complicate something that should be easy). The best would be a static site generator introducing its own "markdown+rdf" :-)
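One pragmatic interim pattern, sketched on the assumption that the site generator's markdown processor passes raw HTML through (as CommonMark-based ones generally do), is to keep a small JSON-LD island beside the prose rather than inventing new markdown syntax:

    # Plain text wins again

    Prose for humans goes here, in ordinary markdown.

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "BlogPosting",
      "headline": "Plain text wins again"
    }
    </script>

The block survives untouched into the rendered HTML, where crawlers already look for it, and the markdown itself stays clean.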

Kingsley Uyi Idehen

Founder & CEO at OpenLink Software | Driving GenAI-Based AI Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)

2w

Yep! Which is why RDF-Turtle exists: reduce sentences to their core essence and turbocharge them using standardized identifiers:

1. References - hyperlinks
2. Typed Literals - dates, decimals, floats, booleans, etc.
3. Untyped Literals - optionally language-tagged

RDF elegantly unveils natural language as a system of signs, syntax, and semantics for encoding and decoding information. It solves a complex problem in a “deceptively simple” way; its tortured journey was the only path to escape velocity and eventual mass adoption.

TimBL realized early that freeing the world from application silos required a pathway back to natural language text and the file create, save, and share pattern. Without this work, a generation of computer users would have been severely hampered by lacking any knowledge of files and folders, an unimaginable disaster for literacy as a whole.

See also: https://www.linkedin.com/pulse/file-create-save-share-paradigm-revisited-kingsley-uyi-idehen-phxze - The File Create, Save, and Share Paradigm - Revisited
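For readers who haven't seen Turtle, here is a minimal sketch of those three ingredients in one description; the subject IRI and the date are hypothetical, chosen for illustration:

    @prefix schema: <https://schema.org/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <https://example.org/posts/plain-text>                      # subject (hypothetical IRI)
        schema:headline "Plain text wins again"@en ;            # 3. untyped literal, language-tagged
        schema:datePublished "2024-11-01"^^xsd:date ;           # 2. typed literal
        schema:about <http://dbpedia.org/resource/Plain_text> .  # 1. reference - a hyperlink

Each line is a plain-text triple: subject, predicate, object. Both a person and a parser can read it straight off the page.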

Rareș I.

Product Expert, focused on AI, Software, Analytics, Engineering. Bottleneck Remover.

2w

Plain text is the cockroach of computing - every shiny new format tries to kill it, yet it keeps crawling back, still perfectly parseable with cat. 🤓

Johan Wilhelm Klüwer

Principal Specialist at Det Norske Veritas

1w

Let me grab the opportunity to plug my plain-text (org-mode) ontology authoring tool! Super convenient if you are an Emacs user. GPL license. Works well with an AI assistant for text editing. https://github.com/johanwk/elot

Dan Brickley

Data Standards Engineering

1w

I don’t think we quite solved the Paris problem with RDF, although it is an important foundation. Sometimes we want to be ultra precise, e.g. factchecking media claims about knife crime or air quality. Other times a more expansive merging of levels of detail might make sense, e.g. capturing “Roman London” as a notion of London. Or, in bibliography, being able to switch fluidly between a FRBR view of a situation (entities for works, expressions, manifestations, items…) and a simpler, flatter version that may be closer to everyday terminology but awkward when trying to be precise. My hope for the new technologies is that they help us navigate between these various legitimate levels of detail in ways our old “everything has a URI” rhetoric tended to optimistically gloss over.

Dan Brickley

Data Standards Engineering

1w

Tim Bray used to have "Intelligence is a text-based application." as a blog tagline, which I always quite liked

Niklas Emegård

AI & Knowledge Architect, Länsförsäkringar AB

1w

Poetry.
