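A power tool: MinerU
MinerU is a one-stop tool for high-quality data extraction. It extracts content from PDFs, web pages, and e-books and converts it to Markdown. It consists of two core modules: Magic-PDF and Magic-Doc.
Installation: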
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
pip show magic-pdf
## Download the models
pip install modelscope
## Change into the Scripts directory
python download_models.py
Then edit magic-pdf.json. After that you can run demo.py in the demo directory to verify the setup; if everything works, it converts demo1.pdf into a Markdown file.
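The config edit can also be scripted. The following is a minimal sketch, not MinerU's own tooling: it loads a JSON config, overrides some settings, and writes the file back. The helper name `update_config` is hypothetical, and the key names used in the usage note (`device-mode`, `models-dir`) are assumptions taken from the template config; check them against your own magic-pdf.json.

```python
import json
from pathlib import Path


def update_config(config_path: Path, updates: dict) -> dict:
    """Merge `updates` into the JSON config at `config_path` and save it.

    Hypothetical helper for illustration; it only touches top-level keys.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config.update(updates)
    config_path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False),
        encoding="utf-8",
    )
    return config
```

For example, `update_config(Path.home() / "magic-pdf.json", {"device-mode": "cpu"})` would switch inference to CPU, assuming the file lives in your home directory and uses that key.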
Node.js PDF libraries
- cheerio
- PDFKit
Python PDF libraries
- mupdf
- PyMuPDF: PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
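To give a feel for the data-extraction side of PyMuPDF, here is a minimal sketch that returns the plain text of every page of a PDF held in memory. It assumes a recent PyMuPDF release that can be imported as `pymupdf` (older releases use `import fitz` instead); the function name `extract_page_texts` is made up for this example.

```python
def extract_page_texts(pdf_bytes: bytes) -> list[str]:
    """Return the plain text of each page of an in-memory PDF."""
    # Third-party dependency (pip install pymupdf). Imported lazily so this
    # module can still be imported on machines that do not have it.
    import pymupdf

    # Open the document from bytes rather than from a file path.
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        return [page.get_text() for page in doc]
```

Calling `extract_page_texts(Path("demo1.pdf").read_bytes())` should give one string per page. Scanned pages without a text layer come back empty, because this call does no OCR.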
Go PDF libraries
Java PDF libraries
iText
PDFBox
The Apache PDFBox® library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command-line utilities. Apache PDFBox is published under the Apache License v2.0.
HTML to PDF
Flying Saucer
Flying Saucer is a Java library that allows us to render well-formed XML (or XHTML) with CSS 2.1 for style and formatting, generating output to PDF, pictures, and swing panels.
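Because Flying Saucer expects well-formed XML or XHTML, it can help to check the markup before handing it to the renderer. The sketch below does that check with Python's standard XML parser; it is a pre-flight test under that assumption, not part of Flying Saucer, and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Typical HTML that browsers accept but an XML parser rejects:
# the <br> element is never closed.
print(is_well_formed("<p>Hello<br></p>"))   # False
print(is_well_formed("<p>Hello<br/></p>"))  # True
```

This check may be stricter than the renderer itself. For instance, named HTML entities such as `&nbsp;` may be rejected here because the standard parser does not load the XHTML DTD.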
pdfHTML
pdfHTML is an iText Core add-on for Java and C# (.NET) that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.
This is an add-on provided by iText 7. In my trials it turned out slightly less capable than Flying Saucer. It supports both inline CSS and external CSS.
openhtmltopdf
An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
wkhtmltopdf
Convert HTML to PDF using Webkit (QtWebKit)
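wkhtmltopdf is a command-line program, so from a script it is usually driven as a subprocess. A minimal sketch follows, assuming the `wkhtmltopdf` binary is installed and on the PATH; the helper names `build_command` and `html_to_pdf` are made up for this example, and only the basic `wkhtmltopdf <input> <output>` form is used.

```python
import shutil
import subprocess


def build_command(source: str, output_pdf: str,
                  executable: str = "wkhtmltopdf") -> list[str]:
    """Build the argument list: wkhtmltopdf <input> <output>."""
    return [executable, source, output_pdf]


def html_to_pdf(source: str, output_pdf: str) -> None:
    """Convert an HTML file or URL to a PDF by running wkhtmltopdf."""
    executable = shutil.which("wkhtmltopdf")
    if executable is None:
        raise FileNotFoundError("wkhtmltopdf was not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_command(source, output_pdf, executable), check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names that contain spaces.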
Evaluation
A comparison from an English-language article:
After introducing each of the libraries, we need to decide which one suits our application. First, note that Flying Saucer is based on iText, so there are only minor differences between them, whereas openhtmltopdf is based on a different library, PDFBox. PDFBox is a well-maintained open-source library published under the Apache License 2.0, while iText is AGPL-licensed. openhtmltopdf is also considered faster than Flying Saucer.
iText can be considered much more resource-efficient than PDFBox, as it processes text chunk by chunk and has an event-oriented architecture.
On the other hand, openhtmltopdf provides a built-in plugin for SVG and MathML and better support for CSS3 transforms. One drawback of openhtmltopdf is that it does not support OpenType fonts.
Tools and versions:
- mPDF: v8.0.6
- typeset.sh: 0.11.0
- PDFreactor: 10.1.10722.15
- wkhtmltopdf: 0.12.5 (with patched qt)
- WeasyPrint: 51
- Prince: 13.5
- Puppeteer: 3.3.0
- openhtmltopdf: 1.0.3
- pdfHTML (iText 7 add-on): 3.0.0
- Flying Saucer (with flying-saucer-pdf-itext5): 9.1.20
Other libraries
- https://pdfkit.org/
- wkhtmltopdf
Common issues
WebP image support
Reference links
- https://www.baeldung.com/java-html-to-pdf
- HTML-to-PDF tool evaluation
- openhtmltopdf
- Stirling-PDF:Locally hosted web application that allows you to perform various operations on PDF files
- mineru
- MinerU installation and run walkthrough