Extracting Text, Tables, and Table Contents from Word Documents with Python (Win11)

Original article · Last modified 2024-11-20 16:41:24

License: CC 4.0 BY-SA

Tags: #python #word

First published 2024-11-15 16:30:41

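I went through many blog posts and ran a lot of code found online, but still could not extract all of the text and tables from a Word document: some tables yielded their text and others did not. So I tried writing my own.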

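After installing the python-docx package, run the code below (the `get_doc_content` function). It extracts the text and tables from a Word document, inserts each table's text into the paragraph text at the position where the table appears in the original, and returns everything as a single string.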

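The Word documents were created with the current version of Word on Win11. Uploading the same documents to Ubuntu and running this code there also gives correct results. Only .docx files are supported; .doc files are not.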
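Because only .docx is supported, it can help to check a file's actual format rather than its name before parsing it. The following is a minimal sketch and not part of the original code: `looks_like_docx` is a hypothetical helper that relies on a .docx file being a ZIP archive, and it assumes the main document part sits at the conventional path `word/document.xml`.

```python
import zipfile


def looks_like_docx(file_path):
    """Hypothetical helper: True if file_path is a ZIP archive
    that contains word/document.xml (assumed conventional path)."""
    if not zipfile.is_zipfile(file_path):
        # Legacy .doc files are OLE binary files, not ZIP archives.
        return False
    with zipfile.ZipFile(file_path) as archive:
        return "word/document.xml" in archive.namelist()
```

Under these assumptions, a legacy .doc file that was merely renamed to .docx fails the check, since it is not a ZIP archive.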

from docx import Document     # pip install python-docx

def get_doc_content(file_path):
    """
    Get content in .docx, including text, table.
    """
    if file_path.lower().endswith('.docx'):  # also accept upper-case extensions such as .DOCX
        doc = Document(file_path)
        content = ""
        cnt = 0   # index in doc.tables of the next table to read
        for element in doc.element.body:
            if element.tag.endswith("}p"):  # paragraph element (w:p)
                paragraph = element.xpath(".//w:t")
                if paragraph:
                    text = ''.join([node.text for node in paragraph if node.text])
                    content += text + "\n"
            elif element.tag.endswith("}tbl"):  # table element (w:tbl)
                for row in doc.tables[cnt].rows:
                    row_text = '\t'.join(cell.text.strip() for cell in row.cells if cell.text)
                    content += row_text + "\n"
                cnt += 1
        return content
    else:
        print("Only .docx files are supported; .doc and other formats cannot be processed.")
        return ""  # keep the return type consistent with the .docx branch
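To make the ordering logic above easier to see without python-docx, here is a rough standard-library sketch of the same idea: walk the direct children of the document body in order, and emit paragraph text or tab-separated table rows as they occur. It is an illustration, not the method used above. The name `extract_in_order` is hypothetical, and the sketch assumes the main part is stored at `word/document.xml`. It also differs in details: empty cells are kept, multiple paragraphs inside one cell are joined without a separator, and text from a nested table is merged into its enclosing cell.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def _text_of(element):
    """Concatenate the text of every w:t node below element."""
    return "".join(node.text or "" for node in element.iter(W + "t"))


def extract_in_order(docx_file):
    """Hypothetical helper: paragraph and table text in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    lines = []
    for child in body:  # direct children only, so order follows the document
        if child.tag == W + "p":
            text = _text_of(child)
            if text:
                lines.append(text)
        elif child.tag == W + "tbl":
            for row in child.findall(W + "tr"):
                cells = [_text_of(cell).strip() for cell in row.findall(W + "tc")]
                lines.append("\t".join(cells))
    return "\n".join(lines)
```

It could be called as `extract_in_order("example.docx")`, where the path is a placeholder. Comparing on `child.tag` against the fully qualified name avoids accidentally matching other elements whose names happen to end in the same letters.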