I’m trying to extract text from this PDF https://openreview.net/pdf?id=g90RNzs8wX using pymupdf4llm.to_markdown(pdf_path). Is there a way to fix a font error? Thanks!
Interesting. I see the error, I think on page 26:
[========================================e=RuntimeError('code=4: no font file for digest')
I was running the following command:
md_text = pymupdf4llm.to_markdown("1522_Unifying_Unsupervised_Gra.pdf", page_chunks=False, extract_words=False, show_progress=True)
If I extract that page then it works (see my 1522_Unifying_Unsupervised_Gra-edit.pdf file).
@HaraldLieder What do you think is “wrong” with page 26 here?
1522_Unifying_Unsupervised_Gra-26.pdf (720.9 KB)
1522_Unifying_Unsupervised_Gra-edit.pdf (1.0 MB)
Also, @eamag, welcome to the forum and thanks for your post!
This is caused by an upstream (MuPDF) problem. Recent versions of PyMuPDF4LLM make active use of MuPDF’s advanced detection of “faked” bold text. This is text written with a standard (non-bold) font that is made to appear bold by writing the same text twice with a small displacement.
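To make the idea of “faked” bold concrete, here is a toy sketch. It is not MuPDF’s actual algorithm, which is considerably more involved. The `Span` layout, the `find_fake_bold` helper and the `max_shift` threshold are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical span layout: (text, x, y) of a text run's origin, in points.
Span = Tuple[str, float, float]


def find_fake_bold(spans: List[Span], max_shift: float = 0.5) -> List[str]:
    """Return texts that occur twice with origins at most max_shift apart."""
    found: List[str] = []
    for i, (text, x, y) in enumerate(spans):
        for other_text, ox, oy in spans[i + 1:]:
            same_text = text == other_text
            shifted = (x, y) != (ox, oy)  # an exact overlap is not a displacement
            close = abs(x - ox) <= max_shift and abs(y - oy) <= max_shift
            if same_text and shifted and close and text not in found:
                found.append(text)
    return found


spans = [
    ("Results", 72.0, 100.0),
    ("Results", 72.3, 100.0),  # same text, shifted 0.3 pt: reads as bold
    ("Results", 72.0, 300.0),  # same text far away: an ordinary repeat
    ("Method", 72.0, 120.0),
]
print(find_fake_bold(spans))
```

Only the pair that is shifted by a fraction of a point is reported. The same word further down the page is treated as a normal repetition.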
This algorithm is quite complex and only works for non-Type 3 fonts. The error you report currently happens because of a missing check for text in a Type 3 font.
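If you want to see which pages might be affected, a rough sketch like the following lists pages that reference a Type 3 font. It assumes the tuple layout returned by PyMuPDF’s `Page.get_fonts()` (font type at index 2) and that Type 3 fonts are reported as `"Type3"`. The helper names are made up, and older installations may need `import fitz` instead of `import pymupdf`. A page that references a Type 3 font will not necessarily trigger the error.

```python
from typing import Iterable, List, Sequence


def has_type3(font_entries: Iterable[Sequence]) -> bool:
    """True if any font tuple reports the type "Type3".

    Assumes the layout of Page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    """
    return any(entry[2] == "Type3" for entry in font_entries)


def pages_with_type3(pdf_path: str) -> List[int]:
    """Return 0-based numbers of pages that reference a Type 3 font."""
    import pymupdf  # imported lazily; requires PyMuPDF to be installed

    with pymupdf.open(pdf_path) as doc:
        return [page.number for page in doc if has_type3(page.get_fonts())]
```

For example, `pages_with_type3("1522_Unifying_Unsupervised_Gra.pdf")` would return the candidate pages as 0-based numbers, so page 26 in a viewer would show up as 25.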
A MuPDF bug report has already been submitted.
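Until a fix is released, one possible workaround is to convert the document page by page and skip the pages that fail. This is only a sketch: it assumes the failure surfaces as a `RuntimeError` (as in the output above) and uses the `pages` parameter of `pymupdf4llm.to_markdown`. The helper names `convert_pages` and `to_markdown_skipping_failures` are made up.

```python
from typing import Callable, Dict, List, Tuple


def convert_pages(
    convert_one: Callable[[int], str], page_count: int
) -> Tuple[List[str], Dict[int, str]]:
    """Convert pages one at a time, collecting failures instead of aborting."""
    chunks: List[str] = []
    failed: Dict[int, str] = {}
    for pno in range(page_count):
        try:
            chunks.append(convert_one(pno))
        except RuntimeError as exc:  # e.g. "code=4: no font file for digest"
            failed[pno] = str(exc)
    return chunks, failed


def to_markdown_skipping_failures(pdf_path: str) -> Tuple[str, Dict[int, str]]:
    """Return the Markdown of all convertible pages plus a map of failures."""
    import pymupdf      # imported lazily; requires PyMuPDF
    import pymupdf4llm  # and PyMuPDF4LLM

    with pymupdf.open(pdf_path) as doc:
        chunks, failed = convert_pages(
            lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
            doc.page_count,
        )
    return "\n".join(chunks), failed
```

The second return value maps 0-based page numbers to the error message, so you can see exactly which pages were left out of the Markdown.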