Give a PDF and talk about it

Using LangChain to perform RAG with Ollama embeddings and a FAISS vectorstore

Yeah, maybe many tools already provide this kind of feature, but this is different: this is about knowing what happens behind the scenes. What do they actually do to the PDF files? What do we actually do to the text in them? And how does the LLM know what context it needs?

First things first, what is RAG? RAG stands for Retrieval Augmented Generation.

RAG is like asking a person who doesn’t just rely on memory (the LLM's default) but goes to check a book (a PDF in this case) or notes before answering. First, it finds the right information (retrieval), then it uses that to give us a clear answer (generation). It’s smarter and more targeted.

In this post we are gonna use LangChain to load a PDF file, ask something about it, and basically perform RAG. langchain_community is growing at an awesome pace.

Let's go:

LANGCHAIN PYTHON API REFERENCE

That reference makes it easy to look up what a function or method actually does.

This is the basic flow chart of what we are gonna build:

graph TD
    A[User Input: PDF Document] -->|Preprocess PDF| B[Extracted Text]
    B -->|Embed Text| C[Document Embeddings]
    C -->|Store Embeddings| D[FAISS Vector Database]
    E[User Query] -->|Input Query| F[Vector Database Search]
    D -->|Retrieve Matching Vectors| F
    subgraph ide1 [rag]
        F -->|Find Matching Vectors| G[Relevant Document Embeddings]
        G -->|Convert Vectors to Text| H((("`**Retrieved Text Documents**`")))
        H -->|Provide Context| I[Language Model LLM]
    end
    I -->|Generate Response| J[Final Answer]

What we are gonna do takes 3 steps.

1. Load the PDF file and split it into chunks

Load the PDF file. Since we are using LangChain, langchain_community provides PyPDFLoader. Then split the documents into chunks.

# Step 1: Load the PDF
def load_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    return documents

# Split the text into chunks
def split_text(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    return text_splitter.split_documents(documents)
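
Before moving on, a quick sanity check helps confirm the splitter does what we expect. This is just a sketch with a placeholder path, adjust it to your own PDF:

# Sanity check: how many chunks did we get and what do they look like?
docs = load_pdf("/path/to/your.pdf")  # placeholder path
chunks = split_text(docs)
print(f"Loaded {len(docs)} pages, split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # first 200 characters of the first chunk
print(chunks[0].metadata)            # source file and page number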

2. Do embeddings -> vectorstore

We need to split the PDF text into chunks so it can be translated into a vectorstore. We could say it's the data in a form the LLM can work with. And because we are not big yet, let's use the FAISS vectorstore in memory. In Python, and in this AI world, everything is already wrapped, but if you want to learn more about FAISS there is also a paper on it: faiss paper.

The embedding: since we use Ollama we use OllamaEmbeddings (there is also OpenAIEmbeddings), and the full docs can be read here: ollama embeddings.
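
To get a feel for what an embedding actually is, you can call the embedder directly; it just turns text into a list of floats (the vector length depends on the model you pick):

embeddings = OllamaEmbeddings(model="mistral")
vector = embeddings.embed_query("who is jeffrey morgan?")
print(len(vector))  # dimensionality of the embedding vector
print(vector[:5])   # first few values, just plain floats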

# Step 2: Generate embeddings using OllamaEmbeddings
def create_vectorstore(chunks):
    embeddings = OllamaEmbeddings(model="mistral")  # just use mistral, it runs locally btw
    vectorstore = FAISS.from_documents(chunks, embeddings)
    return vectorstore

We could also use another vectorstore, but for now we use FAISS.
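
One thing to keep in mind: FAISS here lives in memory, so the index disappears when the script ends. If you don't want to re-embed the PDF on every run, LangChain's FAISS wrapper can save the index to disk and load it back, roughly like this (in recent versions load_local needs the allow_dangerous_deserialization flag because loading unpickles data):

# persist the index so we don't re-embed the PDF every run
vectorstore.save_local("faiss_index")

# ...later, load it back with the same embedding model
embeddings = OllamaEmbeddings(model="mistral")
vectorstore = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,  # loading uses pickle under the hood
)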

3. Build the RAG chain

Build the RAG pipeline. Honestly, I used AI to find this code because I am still struggling to read the chains docs; there are many chains, and I am using RetrievalQA.

def build_rag(vectorstore):
    retriever = vectorstore.as_retriever()
    llm = OllamaLLM(model="mistral")  # Specify Mistral model
    qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
    return qa_chain
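
By default the retriever pulls the top few most similar chunks (typically 4) into the prompt. If you want to control how many, as_retriever accepts search_kwargs, something like:

# only pull the 3 most similar chunks into the prompt
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})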

Combine all of that together into a Q&A loop.

# Combine all to main
def main():
    pdf_path = "/Users/prima/Downloads/jefferey-ollama.pdf"  # Path to your PDF file
    documents = load_pdf(pdf_path)
    chunks = split_text(documents)
    vectorstore = create_vectorstore(chunks)
    qa_chain = build_rag(vectorstore)
    
    # Chat with the PDF
    while True:
        query = input("Ask a question about the document: ")
        if query.lower() == "exit":
            break
        response = qa_chain.invoke(query)
        print(f"Answer: {response['result']}")  # invoke returns a dict, the answer is under "result"

Result and Learnings

With this we are able to ask something about the PDF that we give to the system. In this example I created a PDF based on Jeffrey Morgan's LinkedIn profile export: jeffrey Morgan.

ChatGPT (not mentioning Ollama) says Jeffrey Morgan is:

jeffrey is an actor

Ollama Mistral without the PDF (clean Mistral), not using RAG:

jeffrey is an actor

When I ask our RAG Ollama about Jeffrey Morgan, the result is a success:

jeffrey is an engineer

This is the full code, enjoy:

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Step 1: Load the PDF
def load_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    return documents

# Split the text into chunks
def split_text(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    return text_splitter.split_documents(documents)

# Step 2: Generate embeddings using OllamaEmbeddings
def create_vectorstore(chunks):
    embeddings = OllamaEmbeddings(model="mistral")  # just use mistral, it runs locally btw
    vectorstore = FAISS.from_documents(chunks, embeddings)
    return vectorstore

# Step 3: Build the RAG system
def build_rag(vectorstore):
    retriever = vectorstore.as_retriever()
    llm = OllamaLLM(model="mistral")  # Specify Mistral model
    qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
    return qa_chain

# Combine all to main
def main():
    pdf_path = "/Users/prima/Downloads/jefferey-ollama.pdf"  # Path to your PDF file
    documents = load_pdf(pdf_path)
    chunks = split_text(documents)
    vectorstore = create_vectorstore(chunks)
    qa_chain = build_rag(vectorstore)
    
    # Chat with the PDF
    while True:
        query = input("Ask a question about the document: ")
        if query.lower() == "exit":
            break
        response = qa_chain.invoke(query)
        print(f"Answer: {response['result']}")  # invoke returns a dict, the answer is under "result"

if __name__ == "__main__":
    main()

Make sure you have pip and Python, and install all the requirements.
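
If you are starting from scratch, something like pip install langchain langchain-community langchain-ollama faiss-cpu pypdf should cover the imports above (pypdf is what PyPDFLoader uses under the hood, faiss-cpu is the FAISS binding), plus Ollama itself installed with the model pulled via ollama pull mistral.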

LangChain has moved their community modules to langchain_community since they now have enterprise packages and products.
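
In practice that mostly means the import paths changed; older tutorials still show the old ones, roughly like this:

# old style (now deprecated):
# from langchain.document_loaders import PyPDFLoader
# from langchain.vectorstores import FAISS

# current style, as used in the code above:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS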