Both the image in docx and pdf will not be converted to base64 encoded content #58

GamerNoTitle · 2024-12-16T11:03:37Z

Hello, I found this repo, and it is an awesome way to convert documents to Markdown. But when I use this tool to convert a docx or pdf file to Markdown, the images in the file are not converted correctly.

(venv) PS F:\Git\GamerNoTitle\Doc2Markdown> python
Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from markitdown import MarkItDown
>>> handler = MarkItDown()
>>> docx_result = handler.convert(r"C:\Users\GamerNoTitle\Desktop\testdoc.docx")
>>> docx_result.text_content
'\n\n![卡通人物\n\n低可信度描述已自动生成](data:image/jpeg;base64...)\n\nThe content above is a image.\n\n'
>>> pdf_result = handler.convert(r"C:\Users\GamerNoTitle\Desktop\testdoc.pdf")
>>> pdf_result.text_content
'The content above is a image.\n\n'

The test file is as follows:

The version of MarkItDown I used is 0.0.1a2, which I installed from pypi.org using pip install markitdown


afourney · 2024-12-16T18:29:38Z

Better handling of images in documents is certainly something I would like to support, but just to be clear, what is your expected behavior here?

Should the images be saved to disk, and referenced in Markdown?
Should they be included inline as base64? (I would argue against this, since it makes the text unsuitable for a variety of downstream applications like indexing, or use in llms)
Should the images be sent through the same pipeline as .jpg etc, where we extract tags and other metadata?
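To make the concern about inline base64 concrete, here is a minimal sketch comparing the length of a Markdown image that embeds its bytes as a data URI with one that only references a file. The helper `as_data_uri`, the 30,000-byte stand-in image, and the path `images/figure.jpg` are all hypothetical and are not part of MarkItDown.

```python
import base64


def as_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Build a data URI that embeds the image bytes as base64 (hypothetical helper)."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


# Stand-in for a real image: 30,000 zero bytes (an assumed size, not real data).
fake_image = bytes(30_000)

# Option 2 from the list above: the image inlined as base64.
inline_md = f"![figure]({as_data_uri(fake_image)})"

# Option 1 from the list above: the image saved elsewhere and only referenced.
linked_md = "![figure](images/figure.jpg)"  # hypothetical relative path

# base64 emits 4 characters for every 3 input bytes, so the payload alone
# is 40,000 characters here, all of it opaque to indexers and LLMs.
print(len(inline_md), len(linked_md))
```

The inline variant grows with the image size, while the linked variant stays a few dozen characters regardless of how large the image is.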

zrh535 · 2024-12-16T19:53:52Z

My preference would be for the first option: save to disk and reference in Markdown. I've seen the issues base64 can cause and would suggest avoiding that one as well. Perhaps there could be a flag to toggle between extraction and the LLM pipeline?
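One way the save-to-disk option could be approximated today is as a post-processing step over the converted text. The sketch below is only an illustration under stated assumptions: `externalize_images` and the `image_N` file naming are hypothetical, it is not a MarkItDown feature, and it only works when the Markdown contains complete `data:image/...;base64,` payloads rather than truncated ones.

```python
import base64
import re
from pathlib import Path

# Matches Markdown images whose target is a base64 data URI:
#   ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<subtype>[A-Za-z0-9.+-]+);base64,"
    r"(?P<payload>[A-Za-z0-9+/=\s]+)\)"
)


def externalize_images(markdown: str, out_dir: str) -> str:
    """Write each inline base64 image to out_dir and link to the file instead.

    Hypothetical helper for illustration; not part of MarkItDown.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counter = 0

    def _save(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        # "jpeg" -> "jpg", "svg+xml" -> "svg"; other subtypes are used as-is.
        subtype = match["subtype"].split("+")[0].lower()
        ext = "jpg" if subtype == "jpeg" else subtype
        path = out / f"image_{counter}.{ext}"
        # b64decode ignores whitespace inside the payload by default.
        path.write_bytes(base64.b64decode(match["payload"]))
        return f"![{match['alt']}]({path.as_posix()})"

    return DATA_URI_IMAGE.sub(_save, markdown)


if __name__ == "__main__":
    sample = "![demo](data:image/png;base64,aGVsbG8=)"
    print(externalize_images(sample, "extracted_images"))
```

The returned text keeps the original alt text and replaces only the image target, so everything outside the image references stays untouched for indexing or LLM use. A flag like the one suggested above could decide whether this step runs at all.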

gagb · 2024-12-17T01:52:58Z

Duplicate of #56

GamerNoTitle · 2024-12-17T05:31:50Z

> Better handling of images in documents is certainly something I would like to support, but just to be clear, what is your expected behavior here?
>
> Should the images be saved to disk, and referenced in Markdown?
>
> Should they be included inline as base64? (I would argue against this, since it makes the text unsuitable for a variety of downstream applications like indexing, or use in llms)
>
> Should the images be sent through the same pipeline as .jpg etc, where we extract tags and other metadata?

I think the images should be encoded as base64. Since this package is used to convert documents to Markdown, the images should be displayed correctly. Saving the images to files and referencing them is also acceptable. The final goal, I think, is for the images to display correctly.
In any case, thank you for all your work on this repo. It is really awesome to have such a tool to convert documents so easily.🎉

MoonS11 · 2024-12-19T01:31:07Z

When reading a DOCX file, how can an image be run through OCR and recognized as text, or converted to an image description?

gagb marked this as a duplicate of #56 Dec 17, 2024

gagb added the duplicate label Dec 17, 2024
