This repository includes an optional feature that uses the GPT vision model to generate responses based on retrieved content. This feature is useful for answering questions based on the visual content of documents, such as photos and charts.
When this feature is enabled, the following changes are made to the application:
- Search index: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings).
- Data ingestion: In addition to our usual PDF ingestion flow, we also convert each PDF document page to an image, store that image (with the filename rendered on top) in Blob storage, and add the image's embedding to the index.
- Question answering: We search the index using both the text and multimodal embeddings. We send both the text and the image to gpt-4o, and ask it to answer the question based on both kinds of sources (see the sketch after this list).
- Citations: The frontend displays both image sources and text sources, to help users understand how the answer was generated.
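
To make the index and retrieval changes above more concrete, here is a minimal sketch of what the two vector fields and a combined text-plus-image vector query could look like with the `azure-search-documents` Python SDK. The field name `imageEmbedding` comes from this feature; the other names (`embedding`, the profile names, the helper function) and the dimension sizes are illustrative assumptions, not the repository's exact schema or code.

```python
# Minimal sketch, not the repo's actual code: two vector fields (text + image)
# and a query that searches both, assuming azure-search-documents >= 11.4.
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import SearchField, SearchFieldDataType
from azure.search.documents.models import VectorizedQuery

# Text embeddings from text-embedding-ada-002 (1536 dims) and image embeddings
# from the Azure AI Vision multimodal API (1024 dims) live in separate fields.
text_vector_field = SearchField(
    name="embedding",  # assumed name for the existing text-embedding field
    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
    searchable=True,
    vector_search_dimensions=1536,
    vector_search_profile_name="text-vector-profile",
)
image_vector_field = SearchField(
    name="imageEmbedding",  # the new field described above
    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
    searchable=True,
    vector_search_dimensions=1024,
    vector_search_profile_name="image-vector-profile",
)

def hybrid_multimodal_search(
    search_client: SearchClient,
    question: str,
    text_vector: list[float],
    image_vector: list[float],
):
    # Keyword search plus both vector fields in one request; the service fuses
    # the rankings, so results can come from either kind of embedding.
    return search_client.search(
        search_text=question,
        vector_queries=[
            VectorizedQuery(vector=text_vector, k_nearest_neighbors=5, fields="embedding"),
            VectorizedQuery(vector=image_vector, k_nearest_neighbors=5, fields="imageEmbedding"),
        ],
        top=5,
    )
```

In the real index these vector fields sit alongside the existing content fields, and each `vector_search_profile_name` must point at a vector search profile defined on the index.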
For more details on how this feature works, read this blog post or watch this video.
Prerequisites:

- Create an Azure AI Vision account in the Azure portal first, so that you can agree to the Responsible AI terms for that resource. You can delete the account after agreeing.
- The ability to deploy a gpt-4o model in one of the supported regions. If you're not sure, try creating a gpt-4o deployment from your Azure OpenAI deployments page.
- Ensure that you can deploy the resource group in a region where all of the required components are available:
  - Azure OpenAI models
    - gpt-35-turbo
    - text-embedding-ada-002
    - gpt-4o
  - Azure AI Vision
- Enable GPT vision approach:

  First, make sure you do not have integrated vectorization enabled, since that is currently incompatible:

  ```shell
  azd env set USE_FEATURE_INT_VECTORIZATION false
  ```

  Then set the environment variable for enabling vision support:

  ```shell
  azd env set USE_GPT4V true
  ```

  When set, that flag will provision an Azure AI Vision resource and a gpt-4o model, upload image versions of PDFs to Blob storage, upload embeddings of images in a new `imageEmbedding` field (a sketch of how such an embedding can be produced follows these steps), and enable the vision approach in the UI.

- Clean old deployments (optional): Run `azd down --purge` for a fresh setup.

- Start the application: Execute `azd up` to build, provision, deploy, and initiate document preparation.
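
For a sense of what the data-ingestion side of this flag does, the sketch below shows one way an image embedding could be produced with the Azure AI Vision multimodal embeddings API (`retrieval:vectorizeImage`) before being written to the `imageEmbedding` field. The helper function, endpoint wiring, and the exact `api-version`/`model-version` values are assumptions based on the public documentation; the document preparation step run as part of `azd up` handles this for you when `USE_GPT4V` is set.

```python
# Minimal sketch, assuming the Azure AI Vision multimodal embeddings REST API;
# not the repository's actual ingestion code.
import requests

def vectorize_image(vision_endpoint: str, vision_key: str, image_bytes: bytes) -> list[float]:
    # POST the rendered page image to the retrieval:vectorizeImage operation and
    # return the embedding that would be stored in the imageEmbedding field.
    response = requests.post(
        f"{vision_endpoint}/computervision/retrieval:vectorizeImage",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={
            "Ocp-Apim-Subscription-Key": vision_key,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    response.raise_for_status()
    return response.json()["vector"]
```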
Using the feature:

- Access the developer options in the web app and select "Use GPT vision model".
- New sample questions, based on the sample financial document, will show up in the UI.
- Try out a question and see the answer generated by the GPT vision model (a sketch of the underlying model call follows this list).
- Check the 'Thought process' and 'Supporting content' tabs.
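
To make the answering step more concrete (the "send both the text and the image to gpt-4o" part described earlier), here is a minimal sketch of a multimodal chat completion using the `openai` Python SDK against an Azure OpenAI deployment. The helper name, prompt wording, and message layout are illustrative assumptions, not the repository's exact request.

```python
# Minimal sketch, assuming openai >= 1.x with an Azure OpenAI gpt-4o deployment;
# not the repository's actual prompt or request code.
import base64
from openai import AzureOpenAI

def ask_with_image(
    client: AzureOpenAI,
    deployment: str,        # e.g. the name of your gpt-4o deployment
    question: str,
    source_text: str,       # text sources retrieved from the index
    page_image_png: bytes,  # rendered PDF page image retrieved from Blob storage
) -> str:
    # Encode the page image as a data URL so it can be sent alongside the text.
    image_url = "data:image/png;base64," + base64.b64encode(page_image_png).decode("ascii")
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {
                "role": "system",
                "content": "Answer the question using only the provided text and image sources.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"{question}\n\nSources:\n{source_text}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return response.choices[0].message.content
```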