Vision-language models can generate text from multimodal inputs, but their useful context windows are limited: documents often contain more information than can be processed in a single pass of a transformer. This leads to poor responses in the best case and hallucinated information in the worst. Retrieval-Augmented Generation (RAG) is a technique that equips a large language model with a large external knowledge base, feeding the model only the pieces relevant to each query. If you have tried to apply RAG to your own documents with GPT-4-Vision, you may have found it difficult with existing frameworks. Here, we show how to set it up on your documents using The Pipe and ChromaDB in about 40 lines of Python.
With RAG, documents are stored in a database indexed by their vector embeddings. Indexed this way, contextually relevant information can be supplied "on-the-fly" to the LLM, improving response quality for the user's exact query. This guide assumes you already understand what RAG is, how it works, and why it is useful. Here is a quick guide to implementing RAG for GPT-4-Vision on your documents using The Pipe and ChromaDB.
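The retrieval idea itself fits in a few lines. As a minimal sketch (not how ChromaDB works internally), the toy example below stands in a bag-of-words vector for a learned embedding model: each document is "embedded", and the document closest to the query by cosine similarity is the one handed to the LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" for illustration only;
    # real RAG systems use learned embedding models.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Turbulent flows exhibit chaotic velocity fluctuations.",
    "The recipe calls for two cups of flour.",
    "Vector databases store embeddings for fast similarity search.",
]
# "Index" each document by its embedding...
index = [(doc, embed(doc)) for doc in documents]

# ...then retrieve the closest document to the query.
query_vec = embed("tell me about turbulent flows")
best_doc, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
print(best_doc)  # prints the turbulence sentence
```

A production vector database does the same thing at scale, with learned embeddings and approximate nearest-neighbor search instead of a brute-force `max` over every document.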
The following scripts are designed to be run independently. The first script adds new documents to the vector database, and the second script queries the database to retrieve relevant content and generates a response with GPT-4-Vision.
```python
from thepipe_api import thepipe
import chromadb
import json

def add_documents_to_collection(data_source, collection_name):
    # Initialize ChromaDB client
    chroma_client = chromadb.PersistentClient(path="/path/to/save/database")
    collection = chroma_client.get_or_create_collection(name=collection_name)
    # Prepare RAG-ready chunks from a data_source
    messages = thepipe.extract(data_source)
    chunks = thepipe.core.create_chunks_from_messages(messages)
    # Embed the text for each chunk, with the prompt message as metadata
    for i, (chunk, message) in enumerate(zip(chunks, messages)):
        if chunk.text:
            collection.add(
                ids=[data_source + str(i)],
                documents=[chunk.text],
                metadatas=[{"message": json.dumps(message)}]
            )

if __name__ == "__main__":
    data_source = "https://arxiv.org/pdf/0806.1525.pdf"
    collection_name = 'vectordb'
    add_documents_to_collection(data_source, collection_name)
```
```python
from openai import OpenAI
import chromadb
import json

def query_vector_db(collection_name, query):
    # Initialize ChromaDB client
    chroma_client = chromadb.PersistentClient(path="/path/to/save/database")
    collection = chroma_client.get_collection(name=collection_name)
    # Retrieve the metadata of the chunks most relevant to the user query
    retrieved_metadatas = collection.query(query_texts=[query], n_results=4)['metadatas'][0]
    retrieval_messages = [json.loads(md['message']) for md in retrieved_metadatas]
    # Prepare a prompt message for the user query in OpenAI format
    user_message = [{"role": "user", "content": [{"type": "text", "text": query}]}]
    # Generate a response from GPT-4-Vision using the retrieved context
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=retrieval_messages + user_message
    )
    print(response.choices[0].message.content)

if __name__ == "__main__":
    collection_name = 'vectordb'
    query = "What probability distributions do turbulent flows follow?"
    query_vector_db(collection_name, query)
```
For more details, see The Pipe documentation. Happy coding! 🚀