Graphic wrongly placed in md file output from pymupdf4llm.to_markdown

Sorry, it is me nitpicking again :grinning_face:

My PDF file has this content on page 61
image

But the resulting markup from pymupdf4llm.to_markdown renders like this :
image

This is IMHO a small bug. Here is the current (wrong) markup and a proposed improved markup:

# Current output #
**Important information**

![](fiasp-epar-product-information_en.pdf-60-0.png)
Pay special attention to these notes as they are important for correct use of the pen.

# Corrected #

![](fiasp-epar-product-information_en.pdf-60-0.png) 
**Important information**\
Pay special attention to these notes as they are important for correct use of the pen.

The corrected markup gives this
image

I note that in the input PDF file the Y coordinate of the text “Important information” is slightly bigger than the Y coordinate of the graphic.

The issue happens the first time on page 61 in the PDF file below.

PS. My use case which requires all this precision is to extract the text and graphics content from a PDF file to a markup file, use an LLM to edit it and then convert the changed markup file back to a PDF which looks as similar as possible to the original.

PPS

I use the demo code below.

import pathlib

import pymupdf4llm

# Convert the PDF to Markdown and write embedded images to disk as well
md_text = pymupdf4llm.to_markdown("/pleaflet/samples/fiasp-epar-product-information_en.pdf", write_images=True, force_text=False, image_size_limit=0)
# Save the Markdown text as UTF-8
pathlib.Path("noutput.md").write_bytes(md_text.encode())

The PDF file I used can be found here: www.ema.europa.eu/en/documents/product-information/fiasp-epar-product-information_en.pdf (not typed as link because this system does not accept it)

Fair point - not sure why it doesn’t find the image first in that line and present the MD in the order you suggested. @HaraldLieder any insights into this one?


When we write the page content, before writing each text line we look for any images (or tables) that end above the text’s top coordinate and write those first. The image in question has this bbox on the page:
Rect(70.89993286132812, 297.0294494628906, 82.99993133544922, 307.37945556640625)

The text line Important information has this bbox:
Rect(99.248779296875, 295.7320861816406, 207.26747131347656, 311.0114440917969)

So y0 = 295.7320861816406 of the text is smaller than y1 = 307.37945556640625 of the image. The image therefore does not end above the text line and is not written before it.
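The comparison can be replayed with the two rectangles quoted above. This is only a sketch of the rule as it is worded in this reply, not the actual pymupdf4llm code: the `Rect` tuple and the function name `image_ends_above` are made up for illustration, and whether the real check uses `<` or `<=` is an assumption.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical stand-in for a bbox; y grows downward on the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def image_ends_above(image: Rect, text: Rect) -> bool:
    """True if the image's bottom edge is at or above the text's top edge."""
    return image.y1 <= text.y0


# Values copied from the two bboxes quoted in this post
image = Rect(70.89993286132812, 297.0294494628906, 82.99993133544922, 307.37945556640625)
text = Rect(99.248779296875, 295.7320861816406, 207.26747131347656, 311.0114440917969)

print(image_ends_above(image, text))  # False: the image is not written before the text
```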

This case only seems to go wrong. The actual problem is your expectation … :smiling_face_with_sunglasses:

Thanks a lot for explaining your algorithm.

My expectation is at a higher level: that the markup renders as close to the PDF as possible.

I note that the algorithm you describe will always ruin the layout of any graphics and text which are aligned on the same horizontal line, in particular when a graphic is used as a bullet point for text. In documents reading from left to right, where a reader of the PDF first sees the graphic and then the text, the described algorithm will result in an “inverted” markup document where the reader first sees the text and then the graphic. In the context of a medical document where the graphic represents a warning related to the following text, this is quite serious.

Perhaps it would be better to look at the Y-axis mid points of the two objects instead of the top and bottom. When these are close, the X coordinate could be considered according to the reading direction.
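To make the proposal concrete, here is a rough sketch of what I mean. It is not pymupdf4llm code; the `Rect` tuple, the function name and especially the `tolerance` default (half the height of the smaller object) are my own assumptions, chosen only to illustrate the idea.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Hypothetical stand-in for a bbox; y grows downward on the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def image_first_by_midpoint(image: Rect, text: Rect,
                            tolerance: Optional[float] = None,
                            left_to_right: bool = True) -> bool:
    """Decide whether the image should be written before the text."""
    image_mid = (image.y0 + image.y1) / 2
    text_mid = (text.y0 + text.y1) / 2
    if tolerance is None:
        # Assumed default: half the height of the smaller object
        tolerance = 0.5 * min(image.y1 - image.y0, text.y1 - text.y0)
    if abs(image_mid - text_mid) <= tolerance:
        # Same visual line: let the reading direction decide
        if left_to_right:
            return image.x0 < text.x0
        return image.x1 > text.x1
    # Different lines: the one higher up on the page comes first
    return image_mid < text_mid


image = Rect(70.89993286132812, 297.0294494628906, 82.99993133544922, 307.37945556640625)
text = Rect(99.248779296875, 295.7320861816406, 207.26747131347656, 311.0114440917969)

print(image_first_by_midpoint(image, text))  # True: the icon would come first
```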

Best regards
Steen

Here is something else that looks strange, even taking the algorithm you explained into consideration.

On page 62 there is a page number at the bottom of the page far below the graphic picture. However, in the markup, the page number is written before the graphic. See below (rendered markup is on the left and the PDF is on the right)

Another example showing how the graphics in the markup (left) renders very differently to the original PDF (right). This is very disturbing to the reading of the document.

An LLM which has to summarize the important points could get these wrong due to this markup IMHO.

I would make the point that there is no way to state an unambiguous / failsafe rule. At least I didn’t come across one yet.

Comparing the top, bottom, or center point coordinates will not always work either.
There is no nonsense that cannot be found in PDFs.

An idea could be to also look at the left-most coordinate: if image.x0 < text.x0, just confirm that the vertical intervals of image and text overlap, and then write the image first.
If image.x1 > text.x1 (and the verticals overlap), then the image should also be written first.
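Roughly sketched, and only as an illustration of the idea rather than the code that will end up in the package (the `Rect` tuple and the function names are placeholders, and keeping the existing "ends above" rule as the first test is an assumption):

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical stand-in for a bbox; y grows downward on the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def verticals_overlap(a: Rect, b: Rect) -> bool:
    """True if the y-intervals of the two rectangles intersect."""
    return a.y0 < b.y1 and b.y0 < a.y1


def write_image_first(image: Rect, text: Rect) -> bool:
    # Existing rule: the image ends above the text line
    if image.y1 <= text.y0:
        return True
    # Proposed addition: same vertical band, image sticks out left or right
    if verticals_overlap(image, text):
        return image.x0 < text.x0 or image.x1 > text.x1
    return False


image = Rect(70.89993286132812, 297.0294494628906, 82.99993133544922, 307.37945556640625)
text = Rect(99.248779296875, 295.7320861816406, 207.26747131347656, 311.0114440917969)

print(write_image_first(image, text))  # True for the icon on page 61
```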

How about this?

I am far from an expert, but yes, PDF is a crazy format which was definitely not made to be interpreted and reformatted outside a printer. :joy:

That sounds like an excellent idea which is better and simpler than my proposal. Perhaps the method to “sort” images and text can be an option.

They are already sorted in all 4 directions and at zillions of places, believe me.


All improvement ideas are prone to be perception-biased.
Things can get very complicated when text in multi-column format is present. How do we make sure that images belong to one of the text columns, as opposed to interrupting the multi-column structure?
Plus, what do we do with images that have text flowing around them?
…


This effect comes from the logic we already discussed. Images that were not picked up during writing of the text lines are collectively written at the end of the page.

My changes that I am testing currently will prevent this, too.
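For illustration, the writing order discussed in this thread can be mimicked with a simplified loop. This is a sketch under assumptions, not the real implementation: `Block` and `write_page` are invented names, only the y-coordinates are considered, and the coordinates of the second text line are made up. It does not try to explain why a particular image on page 62 was not picked up.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical page item; y grows downward on the page."""
    label: str
    y0: float
    y1: float


def write_page(text_lines: List[Block], images: List[Block]) -> List[str]:
    out = []
    pending = sorted(images, key=lambda b: b.y1)
    for line in sorted(text_lines, key=lambda b: b.y0):
        # Before each text line, write images that end above its top edge
        while pending and pending[0].y1 <= line.y0:
            out.append(pending.pop(0).label)
        out.append(line.label)
    # Images never picked up are written collectively at the end of the page
    out.extend(b.label for b in pending)
    return out


heading = Block("Important information", 295.73, 311.01)
icon = Block("icon", 297.03, 307.38)
body = Block("Pay special attention ...", 311.5, 325.0)  # assumed coordinates

print(write_page([heading, body], [icon]))
# ['Important information', 'icon', 'Pay special attention ...']
```

With only the heading on the page, the icon is not picked up by any text line and ends up after it via the final flush.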
