A look at commonly used libraries for working with PDFs

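This article covers Java libraries for processing PDF documents, such as iText and PDFBox, as well as tools for converting HTML to PDF, including Flying Saucer, pdfHTML, and openhtmltopdf. openhtmltopdf is built on PDFBox and supports SVG, while iText's pdfHTML add-on offers good HTML and CSS support, and iText is described as more resource-efficient than PDFBox. The article also mentions other tools, such as wkhtmltopdf, and includes a brief comparison of performance and features.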

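A powerful tool: MinerU

MinerU is a one-stop, high-quality data extraction tool. Its main job is to extract data from PDFs, web pages, and e-books and convert it to Markdown. It consists of two core modules: Magic-PDF and Magic-Doc.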

Installation:

pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
pip show magic-pdf

## Download the models
pip install modelscope
## Change into the Scripts directory
python download_models.py

Then edit magic-pdf.json. After that you can run demo.py in the demo directory to verify the setup; if everything works, it converts demo1.pdf into a Markdown file.
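The edit to magic-pdf.json can also be scripted. The following is a minimal sketch, not MinerU's own tooling: the helper name `update_magic_pdf_config` is hypothetical, and the key names `models-dir` and `device-mode` are assumptions about the config layout, so check them against the magic-pdf.json that your installation generated.

```python
import json
from pathlib import Path


def update_magic_pdf_config(config_path, models_dir=None, device_mode=None):
    """Load a JSON config file, override selected keys, and write it back.

    Keys that are not overridden are preserved unchanged.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    if models_dir is not None:
        # Assumed key name: directory containing the downloaded models.
        config["models-dir"] = str(models_dir)
    if device_mode is not None:
        # Assumed key name: for example "cpu" or "cuda".
        config["device-mode"] = device_mode

    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

For example, `update_magic_pdf_config("magic-pdf.json", device_mode="cpu")` would switch the device setting while leaving every other entry as it was.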

Node.js PDF libraries

Python PDF libraries

  • mupdf
  • PyMuPDF: PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
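As a quick illustration of the data-extraction use case, here is a minimal sketch that pulls the plain text out of each page with PyMuPDF. It assumes PyMuPDF is installed (`pip install pymupdf`); the function name `extract_text` is made up for this example, and the import is placed inside the function only so that the snippet loads even where the library is missing.

```python
def extract_text(pdf_path):
    """Return the plain text of every page in the PDF, one string per page."""
    try:
        import pymupdf  # module name in recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # older releases expose the same API as "fitz"

    # Document works as a context manager and closes the file on exit.
    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Calling `extract_text("demo1.pdf")` would return a list with one text string per page, which is usually enough to check whether a PDF has a real text layer before reaching for heavier tools.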

Go PDF libraries

  • go-fitz: can render pages to images
  • pdfcpu: A PDF processor written in Go.

Java PDF libraries

iText

PDFBox

The Apache PDFBox® library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command-line utilities. Apache PDFBox is published under the Apache License v2.0.

HTML to PDF

Flying Saucer

Flying Saucer is a Java library that allows us to render well-formed XML (or XHTML) with CSS 2.1 for style and formatting, generating output to PDF, pictures, and swing panels.

pdfHTML

pdfHTML is an iText Core add-on for Java and C# (.NET) that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.

This is an add-on provided by iText 7. In my trials it was slightly less capable than Flying Saucer. It supports both inline CSS and external CSS.

openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!

wkhtmltopdf

Convert HTML to PDF using Webkit (QtWebKit)
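Because wkhtmltopdf is a command-line program, using it from code mostly means building an argument list and running it. The sketch below shows one way to do that from Python; the helper names are hypothetical, it assumes the `wkhtmltopdf` binary is on `PATH`, and it uses only the `--quiet` and `--page-size` options.

```python
import shutil
import subprocess


def build_wkhtmltopdf_command(source, output, page_size="A4", quiet=True):
    """Build the argument list for: wkhtmltopdf [options] <source> <output>."""
    cmd = ["wkhtmltopdf"]
    if quiet:
        cmd.append("--quiet")  # suppress progress output
    cmd += ["--page-size", page_size, source, output]
    return cmd


def html_to_pdf(source, output, **options):
    """Convert an HTML file or URL to a PDF by invoking wkhtmltopdf."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf was not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_wkhtmltopdf_command(source, output, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when paths contain spaces.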

Evaluation

A comparison from an English-language source:

Having introduced each of the libraries, we need to know which one suits our application. First, note that Flying Saucer is based on iText, so the differences between the two are minor. openhtmltopdf, by contrast, is based on another library, PDFBox. PDFBox is a well-maintained open-source library published under the Apache License v2.0 (openhtmltopdf itself is LGPL-licensed), while iText is AGPL-licensed. openhtmltopdf is also considered faster than Flying Saucer.

iText can be considered much more resource-efficient than PDFBox because it processes text chunk by chunk and has an event-oriented architecture.

On the other hand, openhtmltopdf provides built-in plugins for SVG and MathML and better support for CSS3 transforms. One of its drawbacks is that it does not support OpenType fonts.

Tools compared, with the versions tested:

  • mPDF: v8.0.6
  • typeset.sh: 0.11.0
  • PDFreactor: 10.1.10722.15
  • wkhtmltopdf: 0.12.5 (with patched qt)
  • WeasyPrint: 51
  • Prince: 13.5
  • Puppeteer: 3.3.0
  • openhtmltopdf: 1.0.3
  • pdfHTML (iText 7 add-on): 3.0.0
  • Flying Saucer (with flying-saucer-pdf-itext5): 9.1.20

Other libraries

  • https://pdfkit.org/
  • wkhtmltopdf

Common issues

WebP image support

