Extracting Knowledge Triples with an LLM (qwen) and Building a Visual Knowledge Graph: A Complete Implementation from Text to Graph

Original article · last modified 2025-07-17 13:57:20

License: CC 4.0 BY-SA

Tags: #KnowledgeGraph #AI #RAG #LLM

First published 2025-07-17 13:52:46

Part of the column "Detailed Tutorial on Building RAG Agents with langchain"

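Introduction

Knowledge graphs are a structured way of representing knowledge and are widely used in question answering, recommender systems, and data analysis. Extracting useful knowledge from unstructured text and presenting it in structured form is a core NLP task. The basic unit of a knowledge graph is the knowledge triple (subject, relation, object). Because large language models understand text well at the semantic level, they can extract these triples automatically. This article walks through an LLM-based tool that extracts triples from text and renders them as an interactive knowledge graph.

The figure below shows the result of a sample run:

[Figure: screenshot of the generated knowledge graph]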

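I. Core Dependencies

Before starting, make sure the following libraries are installed:

pip install networkx pyvis  # graph construction and visualization
# Standard-library modules also used: json, re, os (shipped with Python)
# The complete code below additionally imports langchain_openai (LLM client) and matplotlib (PNG fallback)

Setting up the LLM serving environment is not covered here.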

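II. Code Structure

The project consists of four core modules that form a complete pipeline: text input → triple extraction → graph construction → visualization output.

# How the core modules connect
text input → extract_triples() → knowledge triples → build_knowledge_graph() → graph data → visualize_knowledge_graph() → HTML visualization

The sections below walk through each module in turn.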

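1. LLM Invocation and Triple Extraction (extract_triples)

This function is the heart of the pipeline: it calls the LLM to extract knowledge triples from the text. The key ideas are as follows.

Prompt design

To get the model to output triples in exactly the required shape, we use a strict system prompt:

system_prompt = """You are a professional knowledge-triple extractor. Follow these rules strictly:
1. Extract only (subject, relation, object) triples from the text; ignore irrelevant information.
2. Return a JSON array in which every element has the fields "subject", "relation", and "object".
3. Output the JSON array only, with no explanation, commentary, or code-fence markers.
4. Make sure the JSON is valid: double quotes, comma-separated items, no trailing commas.
"""

The prompt pins down the output format, which everything downstream relies on when parsing the triples.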
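
Even with a strict prompt, a model can return elements that lack a field or carry the wrong type. The following is a minimal sketch of a schema check that could run right after parsing. It assumes only the three field names from the prompt; `validate_triples` and `REQUIRED_FIELDS` are hypothetical names and are not part of the original code.

```python
REQUIRED_FIELDS = ("subject", "relation", "object")


def validate_triples(items):
    """Keep only well-formed triples (hypothetical helper, not in the original code).

    A triple is kept when it is a dict whose three required fields are
    non-empty strings; everything else is silently dropped.
    """
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        values = [item.get(field) for field in REQUIRED_FIELDS]
        if all(isinstance(v, str) and v.strip() for v in values):
            # Normalize surrounding whitespace so "A" and " A " become one entity
            valid.append({f: v.strip() for f, v in zip(REQUIRED_FIELDS, values)})
    return valid


if __name__ == "__main__":
    raw = [
        {"subject": "Einstein", "relation": "proposed", "object": "relativity"},
        {"subject": "Einstein", "relation": "born in"},  # missing "object"
        "not a dict",
    ]
    print(validate_triples(raw))
```

Dropping malformed elements instead of raising keeps the pipeline running on partially usable output; whether that is the right trade-off depends on how much you trust the model.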

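Streaming response handling

LLMs usually return their output as a stream, so the response has to be received and concatenated piece by piece. The stream_invoke helper from the complete code below does this internally: it prints each chunk as it arrives and returns the full text as a single string, so the caller does not need a loop of its own:

# stream_invoke echoes each chunk and returns the concatenated response
full_response = stream_invoke(ll_model, messages)
print(f"\nReceived {len(full_response)} characters...")

Echoing chunks as they arrive gives the user live feedback on progress.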
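
The accumulation pattern itself can be tried without a model. The sketch below uses a hypothetical `fake_stream` generator in place of a real LLM stream, and a hypothetical `collect_stream` function for the concatenation; with langchain chat models each chunk is an object whose text lives in `.content` (as stream_invoke in the complete code shows), so a real caller would pass those strings instead.

```python
from typing import Callable, Iterable, Iterator, Optional


def fake_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: yields the text in small chunks."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def collect_stream(chunks: Iterable[str],
                   report: Optional[Callable[[int], None]] = None) -> str:
    """Concatenate streamed chunks, optionally reporting the running length."""
    parts = []
    received = 0
    for chunk in chunks:
        parts.append(chunk)
        received += len(chunk)
        if report is not None:
            report(received)
    # Joining once at the end avoids repeated string copies
    return "".join(parts)


if __name__ == "__main__":
    payload = '[{"subject": "Einstein", "relation": "born in", "object": "Germany"}]'
    result = collect_stream(
        fake_stream(payload),
        report=lambda n: print(f"\rReceived {n} characters...", end=""),
    )
    print()
    assert result == payload
```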

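Format repair

The model's output may have formatting problems (non-standard quotes, trailing commas, and so on), so parsing is wrapped in exception handling with a repair step:

try:
    return json.loads(full_response)
except json.JSONDecodeError:
    # Try to locate the JSON array and repair it
    json_match = re.search(r'\[.*\]', full_response, re.DOTALL)
    if json_match:
        cleaned_response = json_match.group()
        # Single quotes → double quotes (note: this also rewrites apostrophes inside values)
        cleaned_response = cleaned_response.replace("'", '"')
        cleaned_response = re.sub(r',\s*]', ']', cleaned_response)  # drop a trailing comma before ]
        try:
            return json.loads(cleaned_response)
        except json.JSONDecodeError as e:
            print(f"Parsing still failed after repair: {e}")
            return []
    return []  # no JSON array found

This makes the code considerably more robust: output with minor formatting flaws can still be recovered in many cases.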
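
As a variation on the repair step, the sketch below also strips Markdown code fences and removes trailing commas before a closing brace as well as a closing bracket. It deliberately skips the single-quote replacement, since that would corrupt apostrophes inside values. `parse_json_array` is a hypothetical helper, and the repair regexes are heuristics that can misfire on values containing the same character sequences.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the listing fence-safe


def parse_json_array(raw: str):
    """Best-effort parse of a JSON array from model output (hypothetical helper).

    Returns the parsed list, or [] when nothing parseable is found.
    """
    text = raw.strip()
    # 1. Drop code-fence markers, including an optional language tag such as "json"
    text = re.sub(re.escape(FENCE) + r"[a-zA-Z]*", "", text).strip()
    # 2. Fast path: the text is already valid JSON
    try:
        data = json.loads(text)
        return data if isinstance(data, list) else []
    except json.JSONDecodeError:
        pass
    # 3. Isolate the outermost [...] span and remove trailing commas
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if not match:
        return []
    candidate = re.sub(r",\s*([\]}])", r"\1", match.group())
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


if __name__ == "__main__":
    messy = 'Here you go: [{"subject": "A", "relation": "r", "object": "B",},] thanks'
    print(parse_json_array(messy))
```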

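2. Knowledge Graph Construction (build_knowledge_graph)

Once the triples are extracted, they are converted into a structured graph representation:

def build_knowledge_graph(triples):
    if not triples:
        return None  # nothing to build from an empty triple list
    
    entities = set()
    # Collect all entities (both subjects and objects are entities)
    for triple in triples:
        entities.add(triple["subject"])
        entities.add(triple["object"])
    
    # Build the entity attribute dictionary
    entity_attributes = {entity: {"name": entity} for entity in entities}
    # Build the relation list
    relations = [
        {
            "source": triple["subject"],
            "target": triple["object"],
            "type": triple["relation"]
        } for triple in triples
    ]
    
    return {
        "entities": [{"id": entity, **attrs} for entity, attrs in entity_attributes.items()],
        "relations": relations
    }

The core logic of this function is:

Collect every unique entity from the triples (deduplicated through a set).
Create basic attributes for each entity (currently just the name).
Convert each triple into an edge of the form source node, target node, relation type.
Return a dictionary containing the entities and the relations.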
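
One side effect of deduplicating through a set is that the entity order is not stable: Python randomizes string hashing per process by default, so iterating a set of strings can give a different order from one run to the next. If a reproducible node order matters (for example when diffing outputs), an order-preserving variant is easy to sketch. `collect_entities` below is a hypothetical alternative to the set-based step, not the original implementation.

```python
def collect_entities(triples):
    """Return unique entities in first-seen order (hypothetical variant of the set-based step)."""
    seen = {}
    for triple in triples:
        # dict keys are unique and keep insertion order (Python 3.7+)
        seen.setdefault(triple["subject"], None)
        seen.setdefault(triple["object"], None)
    return list(seen)


if __name__ == "__main__":
    triples = [
        {"subject": "Einstein", "relation": "is", "object": "physicist"},
        {"subject": "Einstein", "relation": "born in", "object": "Germany"},
    ]
    print(collect_entities(triples))
```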

3. Knowledge Graph Visualization (visualize_knowledge_graph)

Visualization is how the graph is presented to the user. This project uses the pyvis library to generate an interactive HTML graph.

Visualization setup, nodes, and edges

# Initialize a directed graph
net = Network(
    directed=True, 
    height="700px", 
    width="100%", 
    bgcolor="#f5f5f5", 
    font_color="black",
    notebook=False  # key setting: not running in a notebook
)

# Add nodes
for entity in graph["entities"]:
    net.add_node(
        entity["id"],
        label=entity["name"],
        title=f"Entity: {entity['name']}",
        color="#4CAF50"  # green nodes
    )

# Add edges (relations)
for relation in graph["relations"]:
    net.add_edge(
        relation["source"],
        relation["target"],
        label=relation["type"],
        title=relation["type"],
        color="#FF9800"  # orange edges
    )

The key setting here is notebook=False, which avoids the template-rendering error that occurs outside Jupyter.

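Layout and interaction settings

The visual style and interaction behavior are defined through a JSON configuration. pyvis parses this string as JSON, so it must not contain comments; the three interaction flags enable node dragging, zooming, and panning of the view respectively:

net.set_options("""
{
  "nodes": {
    "size": 30,
    "font": {"size": 14}
  },
  "edges": {
    "font": {"size": 12},
    "length": 200
  },
  "interaction": {
    "dragNodes": true,
    "zoomView": true,
    "dragView": true
  }
}
""")

These settings keep the generated graph readable and interactive.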
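
Hand-written JSON inside a Python string is easy to break, for instance with a stray comment or trailing comma. One way to sidestep this, sketched below, is to write the options as a Python dict and serialize it with `json.dumps`. The values mirror the configuration above; the variable names `options` and `options_json` are choices made for this sketch.

```python
import json

# Same settings as the JSON configuration above, written as a Python dict so that
# comments are allowed and the serialized result is always valid JSON.
options = {
    "nodes": {"size": 30, "font": {"size": 14}},
    "edges": {"font": {"size": 12}, "length": 200},
    "interaction": {
        "dragNodes": True,  # allow dragging nodes
        "zoomView": True,   # allow zooming
        "dragView": True,   # allow panning the view
    },
}

# json.dumps converts Python True/False into JSON true/false
options_json = json.dumps(options)

if __name__ == "__main__":
    print(options_json)
```

The resulting string can then be passed along as `net.set_options(options_json)`.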

Error handling and fallback

In case HTML generation fails, the code includes a fallback visualization:

try:
    net.write_html(output_file, open_browser=False)
except Exception as e:
    # Fallback: render a static PNG with matplotlib
    import matplotlib.pyplot as plt
    G = nx.DiGraph([(r["source"], r["target"]) for r in graph["relations"]])
    plt.figure(figsize=(12, 8))
    pos = nx.spring_layout(G)
    # The networkx drawing functions take the graph as their first argument
    nx.draw_networkx_nodes(G, pos, node_size=3000, node_color="#4CAF50")
    nx.draw_networkx_labels(G, pos, labels={e["id"]: e["name"] for e in graph["entities"]})
    nx.draw_networkx_edges(G, pos, arrowstyle="->")
    nx.draw_networkx_edge_labels(G, pos, edge_labels={(r["source"], r["target"]): r["type"] for r in graph["relations"]})
    plt.savefig(output_file.replace(".html", ".png"))

With this second path in place, a basic visualization is still produced even if pyvis fails.

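4. Main Flow Control (process_text_to_graph)

This function ties the previous modules together into the complete text → triples → graph → visualization flow:

def process_text_to_graph(text):
    print("Extracting knowledge triples from the text...")
    triples = extract_triples(text)
    
    if not triples:
        print("No knowledge triples could be extracted")
        return None
    
    print(f"Extracted {len(triples)} knowledge triples:")
    for i, triple in enumerate(triples, 1):
        print(f"{i}. ({triple['subject']}, {triple['relation']}, {triple['object']})")
    
    print("\nBuilding the knowledge graph...")
    graph = build_knowledge_graph(triples)
    
    if not graph:
        print("Failed to build the knowledge graph")
        return None
    
    print("\nGenerating the knowledge graph visualization...")
    output_file = visualize_knowledge_graph(graph)
    
    return output_file

The flow is linear, logs each stage, and returns early when a stage produces nothing, which makes it easy to follow progress and locate problems.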
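
Because the flow is a plain chain of three stages with early exits, it can be exercised without a model or pyvis by passing the stages in as arguments. The sketch below illustrates that idea; `run_pipeline` and the three stub functions are hypothetical and only mimic the shape of the real stages.

```python
def run_pipeline(text, extract, build, visualize):
    """Chain the three stages, stopping at the first one that yields nothing.

    The stage functions are passed in as arguments (an assumption of this sketch;
    the original calls module-level functions directly).
    """
    triples = extract(text)
    if not triples:
        return None
    graph = build(triples)
    if not graph:
        return None
    return visualize(graph)


# Stub stages that stand in for the LLM call and for pyvis
def stub_extract(text):
    if "Einstein" not in text:
        return []
    return [{"subject": "Einstein", "relation": "born in", "object": "Germany"}]


def stub_build(triples):
    return {"relations": triples}


def stub_visualize(graph):
    return f"graph with {len(graph['relations'])} edge(s)"


if __name__ == "__main__":
    print(run_pipeline("Einstein was born in Germany.", stub_extract, stub_build, stub_visualize))
    print(run_pipeline("Nothing relevant.", stub_extract, stub_build, stub_visualize))
```

Injecting the stages this way makes the control flow testable in isolation, at the cost of a slightly longer call signature.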

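III. Usage and Example

Running the example

if __name__ == "__main__":
    sample_text = """
    Einstein was a famous physicist who was born in Germany. In 1905, Einstein proposed the theory of relativity.
    The theory of relativity fundamentally changed how people understand time and space. Einstein won the 1921 Nobel Prize in Physics for the photoelectric effect.
    He later moved to the United States and worked at Princeton University. Einstein and Bohr had a famous debate over the interpretation of quantum mechanics.
    """
    
    process_text_to_graph(sample_text)

Output

Running the script produces output like the following:

Extracting knowledge triples from the text...
Receiving streamed LLM response...
Received 236 characters...
Streaming complete, parsing...
Extracted 6 knowledge triples:
1. (Einstein, is, physicist)
2. (Einstein, born in, Germany)
3. (Einstein, proposed, theory of relativity)
4. (theory of relativity, changed, how people understand time and space)
5. (Einstein, won, 1921 Nobel Prize in Physics)
6. (Einstein, worked at, Princeton University)

Building the knowledge graph...
Generating the knowledge graph visualization...
Knowledge graph saved to /path/to/knowledge_graph.html

Open the generated knowledge_graph.html file to see the interactive knowledge graph, which supports dragging nodes, zooming, and panning.

Screenshot of the code running:
[Figure: screenshot of the run]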

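IV. Complete Code

The complete knowledge-graph code depends on the LLM wrapper code, which is listed first. The prompt-based approach here is given only as an example; you could instead extract the graph with graphrag or another more specialized method.

Complete LLM invocation code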

from langchain_openai import ChatOpenAI
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

# Default parameters for each supported LLM.
# Replace api_url and api_key with the values for your own deployment.
llm_config = {
    "deepseek_1.5b": {
        "model_name": "deepseek-r1:1.5b",
        "api_url": "http://162.130.245.26:9542/v1",
        "api_key": "sk-RJaJE4fXaktHAI2MB295F6Ad58004feBcE25B83CdD6F0",
        "embedding_ctx_length": 8191,
        "chunk_size": 1000,
        "max_retries": 2,
        "timeout": None,  # request timeout; None by default
        "default_headers": None,  # default request headers
        "default_query": None,  # default query parameters
        "retry_min_seconds": 4,
        "retry_max_seconds": 20,
    },

    "deepseek_14b": {
        "model_name": "deepseek-r1:14b",
        "api_url": "http://162.130.245.26:9542/v1",
        "api_key": "sk-RJaJE4fXaktHAI2MB295F6Ad58004feBcE25B83CdD6F0",
        "embedding_ctx_length": 8191,
        "chunk_size": 1000,
        "max_retries": 2,
        "timeout": None,  # request timeout; None by default
        "default_headers": None,  # default request headers
        "default_query": None,  # default query parameters
        "retry_min_seconds": 4,
        "retry_max_seconds": 20,
    },

    "deepseek_32b": {
        "model_name": "deepseek-r1:32b",
        "api_url": "http://162.130.245.26:9542/v1",
        "api_key": "sk-RJaJE4fXaktHAI2MB295F6Ad58004feBcE25B83CdD6F0",
        "embedding_ctx_length": 8191,
        "chunk_size": 1000,
        "max_retries": 2,
        "timeout": None,  # request timeout; None by default
        "default_headers": None,  # default request headers
        "default_query": None,  # default query parameters
        "retry_min_seconds": 4,
        "retry_max_seconds": 20,
    },

    "qwen3_14b": {
        "model_name": "qwen3:14b",
        "api_url": "http://192.145.216.20:7542/v1",
        "api_key": "sk-RJaJE4fXaktHAI2M295F6Ad58004f7eBcE25B863CdD6F0",
        "embedding_ctx_length": 8191,
        "chunk_size": 1000,
        "max_retries": 2,
        "timeout": None,  # request timeout; None by default
        "default_headers": None,  # default request headers
        "default_query": None,  # default query parameters
        "retry_min_seconds": 4,
        "retry_max_seconds": 20,
    },

    "qwen3_32b": {
        "model_name": "qwen3:32b",
        "api_url": "http://192.145.216.20:7542/v1",
        "api_key": "sk-RJaJE4fXaktHAI2MB295F6d58004f7eBcE255B863CdD6F0",
        "embedding_ctx_length": 8191,
        "chunk_size": 1000,
        "max_retries": 2,
        "timeout": 60,  # request timeout (set to 60 for this model)
        "default_headers": None,  # default request headers
        "default_query": None,  # default query parameters
        "retry_min_seconds": 4,
        "retry_max_seconds": 20,
    },
}



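def stream_invoke(llm_model, prompt):
    """
    Stream a response from the model, echo it chunk by chunk, and return the full text.
    The prompt can be built in either of two ways. Option 1:
    from langchain.schema import HumanMessage
    messages = [HumanMessage(content=prompt)]
    Option 2 (a list of role/content dicts):
    [{"role": "user", "content": question}]
    """
    full_response = ""
    results = llm_model.stream(prompt)
    for chunk in results:
        print(chunk.content, end="", flush=True)  # echo each chunk as it arrives
        full_response += chunk.content
    return full_response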


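def invoke(llm_model, prompt):
    """
    Call the model and return its response.
    :param llm_model: the chat model returned by build_model
    :param prompt: the input prompt
    :return: the content of the model's response
    """
    response = llm_model.invoke(prompt)
    print(response)
    return response.content


def build_model(mode="deepseek_32b"):
    # Only model_name, api_key and api_url are read here;
    # the remaining fields in llm_config are currently unused.
    config = llm_config[mode]
    model_name = config["model_name"]
    api_key = config["api_key"]
    api_url = config["api_url"]
    LLM = ChatOpenAI(
        model=model_name,
        openai_api_key=api_key,
        openai_api_base=api_url
    )
    return LLM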



def remove_think(answer, split_token='</think>'):
    """
    Strip the model's think section and keep only the actual answer.
    :param answer: the model's complete response
    :param split_token: the separator, </think> by default
    :return: the actual answer (the text after the last split_token)
    """
    parts = answer.split(split_token)
    content = parts[-1].lstrip("\n")
    return content


if __name__ == "__main__":
    llm_model = build_model(mode="qwen3_14b")
    # print(llm_model)
    stream_invoke(llm_model, "Explain large language models (LLMs)")
Complete knowledge-graph extraction code

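from Models.LLM_Models import build_model, stream_invoke, remove_think
import networkx as nx
from pyvis.network import Network
import json
import re
import os  # used for handling file paths

# Initialize the LLM
ll_model = build_model()

def extract_triples(text):
    """Extract knowledge triples from text via stream_invoke"""
    system_prompt = """You are a professional knowledge-triple extractor. Follow these rules strictly:
    1. Extract only (subject, relation, object) triples from the text; ignore irrelevant information.
    2. Return a JSON array in which every element has the fields "subject", "relation", and "object".
    3. Output the JSON array only, ** with no explanation, commentary, or code-fence markers (such as ```json) **.
    4. Make sure the JSON is valid: double quotes, comma-separated items, no trailing commas.
    Example output:
    [{"subject":"Einstein","relation":"is","object":"physicist"},{"subject":"Einstein","relation":"proposed","object":"theory of relativity"}]
    """
    
    user_input = f"Extract triples from the following text, strictly following the example format:\n{text}"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input}
    ]
    
    # Receive the streamed response
    print("Receiving streamed LLM response...")
    # stream_invoke echoes each chunk and returns the full response as one string,
    # so no extra loop is needed here
    full_response = stream_invoke(ll_model, messages)
    print(f"\nReceived {len(full_response)} characters...")
    
    print("Streaming complete, parsing...")
    # Reasoning models (deepseek-r1, qwen3) may prepend <think>...</think>; drop it before parsing
    full_response = remove_think(full_response).strip()
    
    # Parse, with format repair as a fallback
    try:
        return json.loads(full_response)
    except json.JSONDecodeError:
        print("First parse failed, attempting to repair the format...")
        json_match = re.search(r'\[.*\]', full_response, re.DOTALL)
        if json_match:
            cleaned_response = json_match.group()
            # Single quotes → double quotes (also rewrites apostrophes inside values)
            cleaned_response = cleaned_response.replace("'", '"')
            # Drop a trailing comma before the closing bracket
            cleaned_response = re.sub(r',\s*]', ']', cleaned_response)
            try:
                return json.loads(cleaned_response)
            except json.JSONDecodeError as e:
                print(f"Parsing still failed after repair: {e}")
                return []
        else:
            print("No valid JSON structure found")
            return []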

def build_knowledge_graph(triples):
    """Build the knowledge-graph data structure"""
    if not triples:
        return None  # nothing to build from an empty triple list
    
    # Collect all unique entities (subjects and objects)
    entities = set()
    for triple in triples:
        entities.add(triple["subject"])
        entities.add(triple["object"])
    
    entity_attributes = {entity: {"name": entity} for entity in entities}
    relations = [
        {
            "source": triple["subject"],
            "target": triple["object"],
            "type": triple["relation"]
        } for triple in triples
    ]
    
    return {
        "entities": [{"id": entity, **attrs} for entity, attrs in entity_attributes.items()],
        "relations": relations
    }

def visualize_knowledge_graph(graph, output_file="knowledge_graph.html"):
    """Render the graph as interactive HTML, with a static PNG as fallback"""
    if not graph:
        print("Cannot visualize an empty graph")
        return None
    
    # Make sure the output directory exists
    output_dir = os.path.dirname(output_file)
    if output_dir and not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
    
    # Pass notebook=False when creating the network (the key fix)
    net = Network(
        directed=True, 
        height="700px", 
        width="100%", 
        bgcolor="#f5f5f5", 
        font_color="black",
        notebook=False  # explicitly not a notebook environment
    )
    
    # Add nodes and edges
    for entity in graph["entities"]:
        net.add_node(
            entity["id"],
            label=entity["name"],
            title=f"Entity: {entity['name']}",
            color="#4CAF50"
        )
    
    for relation in graph["relations"]:
        net.add_edge(
            relation["source"],
            relation["target"],
            label=relation["type"],
            title=relation["type"],
            color="#FF9800"
        )
    
    # Keep the options simple to avoid JSON parsing problems
    net.set_options("""
    {
      "nodes": {
        "size": 30,
        "font": {"size": 14}
      },
      "edges": {
        "font": {"size": 12},
        "length": 200
      },
      "interaction": {
        "dragNodes": true,
        "zoomView": true,
        "dragView": true
      }
    }
    """)
    
    # Call write_html directly rather than going through show()
    try:
        net.write_html(output_file, open_browser=False)
        print(f"Knowledge graph saved to {os.path.abspath(output_file)}")
        return output_file
    except Exception as e:
        print(f"Error while generating HTML: {e}")
        # Fallback: basic static rendering with networkx and matplotlib
        import matplotlib.pyplot as plt
        png_file = output_file.replace(".html", ".png")
        G = nx.DiGraph([(r["source"], r["target"]) for r in graph["relations"]])
        plt.figure(figsize=(12, 8))
        pos = nx.spring_layout(G)
        # The networkx drawing functions take the graph as their first argument
        nx.draw_networkx_nodes(G, pos, node_size=3000, node_color="#4CAF50")
        nx.draw_networkx_labels(G, pos, labels={e["id"]: e["name"] for e in graph["entities"]})
        nx.draw_networkx_edges(G, pos, arrowstyle="->")
        nx.draw_networkx_edge_labels(G, pos, edge_labels={(r["source"], r["target"]): r["type"] for r in graph["relations"]})
        plt.savefig(png_file)
        plt.close()
        print(f"Generated fallback PNG visualization: {png_file}")
        return png_file

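def process_text_to_graph(text):
    """End-to-end processing flow"""
    print("Extracting knowledge triples from the text...")
    triples = extract_triples(text)
    
    if not triples:
        print("No knowledge triples could be extracted")
        return None
    
    print(f"Extracted {len(triples)} knowledge triples:")
    for i, triple in enumerate(triples, 1):
        print(f"{i}. ({triple['subject']}, {triple['relation']}, {triple['object']})")
    
    print("\nBuilding the knowledge graph...")
    graph = build_knowledge_graph(triples)
    
    if not graph:
        print("Failed to build the knowledge graph")
        return None
    
    print("\nGenerating the knowledge graph visualization...")
    output_file = visualize_knowledge_graph(graph)
    
    return output_file

# Example usage
if __name__ == "__main__":
    sample_text = """
    Einstein was a famous physicist who was born in Germany. In 1905, Einstein proposed the theory of relativity.
    The theory of relativity fundamentally changed how people understand time and space. Einstein won the 1921 Nobel Prize in Physics for the photoelectric effect.
    He later moved to the United States and worked at Princeton University. Einstein and Bohr had a famous debate over the interpretation of quantum mechanics.
    """
    
    process_text_to_graph(sample_text)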