How to fix code=4: no font file for digest?

I’m trying to extract text from this PDF, https://openreview.net/pdf?id=g90RNzs8wX, using pymupdf4llm.to_markdown(pdf_path), but it fails with a font error. Is there a way to fix it? Thanks!

Interesting. I think I see the error on page 26:
[========================================e=RuntimeError('code=4: no font file for digest')

I was running the following command:

md_text = pymupdf4llm.to_markdown("1522_Unifying_Unsupervised_Gra.pdf", page_chunks=False, extract_words=False, show_progress=True)

If I extract that page then it works (see my 1522_Unifying_Unsupervised_Gra-edit.pdf file).

@HaraldLieder What do you think is “wrong” with page 26 here?

1522_Unifying_Unsupervised_Gra-26.pdf (720.9 KB)
1522_Unifying_Unsupervised_Gra-edit.pdf (1.0 MB)

Also @eamag Welcome to the forum and thanks for your post!!! :folded_hands:

This is caused by an upstream (MuPDF) problem. Recent versions of PyMuPDF4LLM make active use of MuPDF’s advanced detection of “faked” bold text. This is text written in a standard (non-bold) font that is made to appear bold by writing the same text twice with a small displacement.
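
To make the idea concrete, here is a toy sketch of what "the same text written twice with a small displacement" could look like as a check. It is not MuPDF's actual algorithm: the `Span` tuple, the `find_fake_bold` function and the `max_shift` threshold are all hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (text, x, y) of a span's origin, in points.
Span = Tuple[str, float, float]


def find_fake_bold(spans: List[Span], max_shift: float = 1.0) -> List[str]:
    """Return texts that occur twice with a small, non-zero displacement."""
    found: List[str] = []
    for i, (text, x, y) in enumerate(spans):
        for other_text, ox, oy in spans[i + 1:]:
            if text != other_text or text in found:
                continue
            # Largest offset between the two copies along either axis.
            shift = max(abs(x - ox), abs(y - oy))
            if 0 < shift <= max_shift:
                found.append(text)
    return found


spans = [
    ("Title", 72.0, 100.0),
    ("Title", 72.3, 100.0),  # same text, shifted slightly: looks bold
    ("body", 72.0, 120.0),
    ("body", 72.0, 300.0),   # same text far away: just a repetition
]
print(find_fake_bold(spans))
```

MuPDF's real detection has to deal with glyph-level positions, fonts and rendering details, so it is considerably more involved than this.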

This algorithm is quite complex and only works for non-Type 3 fonts. The error you report currently occurs because a check for text in a Type 3 font is missing.
A MuPDF bug report has already been submitted.
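
Until the upstream fix is released, one possible workaround is to convert the document page by page and fall back to plain text extraction for any page that fails. This is only a sketch under some assumptions: that the error reaches your code as a RuntimeError, and that plain `page.get_text()` does not go through the same bold detection. The helper names `convert_with_fallback` and `pdf_to_markdown_tolerant` are made up for this example.

```python
from typing import Callable, Iterable, List, Tuple


def convert_with_fallback(
    page_numbers: Iterable[int],
    convert: Callable[[int], str],
    fallback: Callable[[int], str],
) -> Tuple[str, List[int]]:
    """Convert pages one at a time; use `fallback` where `convert` raises."""
    parts: List[str] = []
    failed: List[int] = []
    for pno in page_numbers:
        try:
            parts.append(convert(pno))
        except RuntimeError:
            # Assumed failure mode: "code=4: no font file for digest".
            failed.append(pno)
            parts.append(fallback(pno))
    return "\n\n".join(parts), failed


def pdf_to_markdown_tolerant(pdf_path: str) -> Tuple[str, List[int]]:
    # Imported lazily so the helper above works without PyMuPDF installed.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(pdf_path)
    return convert_with_fallback(
        range(doc.page_count),
        # `pages` takes a list of 0-based page numbers.
        lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
        # Plain text only: Markdown formatting is lost on this page.
        lambda pno: doc[pno].get_text(),
    )
```

Calling `pdf_to_markdown_tolerant("1522_Unifying_Unsupervised_Gra.pdf")` would return the combined text plus the list of 0-based page numbers that needed the fallback, so you can see which pages lost their Markdown formatting.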