Both the image in docx and pdf will not be converted to base64 encoded content #58

Open
GamerNoTitle opened this issue Dec 16, 2024 · 5 comments
Labels
duplicate This issue or pull request already exists

Comments

@GamerNoTitle

Hello, I found this repo, and it is awesome for converting documents to Markdown. But when I use this tool to convert a docx or pdf file to Markdown, the images in the file are not converted correctly.

(venv) PS F:\Git\GamerNoTitle\Doc2Markdown> python
Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from markitdown import MarkItDown
>>> handler = MarkItDown()
>>> docx_result = handler.convert(r"C:\Users\GamerNoTitle\Desktop\testdoc.docx")
>>> docx_result.text_content
'\n\n![卡通人物\n\n低可信度描述已自动生成](data:image/jpeg;base64...)\n\nThe content above is a image.\n\n'
>>> pdf_result = handler.convert(r"C:\Users\GamerNoTitle\Desktop\testdoc.pdf")
>>> pdf_result.text_content
'The content above is a image.\n\n'

The file is as follows:

The version of MarkItDown I used is 0.0.1a2, installed from pypi.org using pip install markitdown.

@afourney
Member

Better handling of images in documents is certainly something I would like to support, but just to be clear, what is your expected behavior here?

  • Should the images be saved to disk, and referenced in Markdown?
  • Should they be included inline as base64? (I would argue against this, since it makes the text unsuitable for a variety of downstream applications like indexing, or use in LLMs)
  • Should the images be sent through the same pipeline as .jpg etc, where we extract tags and other metadata?
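
To make the concern about the second option concrete, the following minimal sketch measures how much base64 inflates an image. The `base64_overhead` helper and the 300,000-byte placeholder payload are illustrative assumptions, not part of MarkItDown.

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the size ratio of the base64 text to the raw bytes."""
    return len(base64.b64encode(raw)) / len(raw)


# Placeholder payload standing in for a 300,000-byte image.
image_bytes = bytes(300_000)

# base64 emits 4 characters for every 3 input bytes.
print(len(base64.b64encode(image_bytes)))  # 400000
```

Every inlined image therefore adds about a third more than its file size to the Markdown, all of it text that an indexer or LLM has to read without gaining any meaning from it.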

@zrh535

zrh535 commented Dec 16, 2024

My preference would be for the first option: save to disk and reference in Markdown. I've seen the issues base64 can cause and would suggest avoiding that one as well. Perhaps there could be a flag to toggle between extraction and the LLM pipeline?
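
A minimal sketch of the save-to-disk option as a post-processing step, assuming the Markdown text contains complete base64 data URIs (the output shown in the issue is truncated after `base64`, so it would not work on that as-is). The `extract_images` helper, its `save_to_disk` flag, and the `image_N` file naming are hypothetical and not part of MarkItDown. The flag stands in for the suggested toggle, but its alternative branch only keeps the alt text rather than calling an LLM.

```python
import base64
import re
from pathlib import Path

# Matches Markdown images whose target is a base64 data URI.
# Alt text may span several lines, as in the DOCX output above.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<kind>[a-zA-Z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def extract_images(markdown: str, out_dir: str, save_to_disk: bool = True) -> str:
    """Hypothetical helper: replace inline base64 images with file references.

    With save_to_disk=False the image data is dropped and only the alt text is kept.
    """
    out_path = Path(out_dir)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        # Collapse newlines and repeated spaces in the alt text.
        alt = " ".join(match.group("alt").split())
        if not save_to_disk:
            return f"[image: {alt}]"
        counter += 1
        out_path.mkdir(parents=True, exist_ok=True)
        # The MIME subtype is used as the file extension (e.g. "jpeg").
        name = f"image_{counter}.{match.group('kind')}"
        (out_path / name).write_bytes(base64.b64decode(match.group("data")))
        # Assumes the Markdown file is saved next to the image directory.
        return f"![{alt}]({out_path.name}/{name})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

Calling `extract_images(text, "images")` would write files such as `images/image_1.jpeg` and return Markdown that points at them, which keeps the text small enough for indexing while the images remain viewable.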

@gagb
Contributor

gagb commented Dec 17, 2024

Duplicate of #56

@gagb gagb marked this as a duplicate of #56 Dec 17, 2024
@gagb gagb added the duplicate This issue or pull request already exists label Dec 17, 2024
@GamerNoTitle
Author

> Better handling of images in documents is certainly something I would like to support, but just to be clear, what is your expected behavior here?
>
>   • Should the images be saved to disk, and referenced in Markdown?
>   • Should they be included inline as base64? (I would argue against this, since it makes the text unsuitable for a variety of downstream applications like indexing, or use in LLMs)
>   • Should the images be sent through the same pipeline as .jpg etc, where we extract tags and other metadata?

I think the images should be encoded inline as base64: since this package is used to convert documents to Markdown, the images should be displayed correctly. Saving the images to files and referencing them is also acceptable. I think the final goal is to display the images correctly.
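
For reference, a minimal sketch of what inline base64 embedding looks like, assuming the image is already available as a file on disk. The `image_to_markdown` helper is hypothetical and not part of MarkItDown. It only shows the data URI format that Markdown renderers with data URI support can display.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_markdown(path: str, alt: str = "") -> str:
    """Hypothetical helper: inline an image file as a base64 data URI."""
    # Guess the MIME type from the file extension, e.g. "image/png".
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"
```

The result is a single self-contained Markdown string, which is convenient for display but carries the size cost discussed earlier in the thread.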
In any case, thank you for all your work on this repo. It is really awesome to have a tool that converts documents so easily. 🎉

@MoonS11

MoonS11 commented Dec 19, 2024

When reading a DOCX file, how can an image be recognized as text through OCR, or converted to an image description?

5 participants