Build Your Free AI Assistant: RAG with Python & Ollama
I explained the details and meaning of RAG (Retrieval-Augmented Generation) in another article.
If you'd like to run your own local AI assistant or document-querying system, this article shows you how, and the best part is that you won't need to pay for any AI requests.
Modern applications demand robust solutions for accessing and retrieving relevant information from unstructured data like PDFs. Retrieval-Augmented Generation (RAG) is a cutting-edge approach combining AI’s language generation capabilities with vector-based search to provide accurate, context-aware answers.
In this article, I will walk through the development of a Flask application that implements a RAG pipeline using LangChain and Chroma. The application allows users to upload PDF documents, store their embeddings, and query them for information retrieval, all powered by Ollama.
If you’re ready to create a simple RAG application on your computer or server, this article will guide you. In other words, you’ll learn how to build your own local assistant or document-querying system.
What We’ll Build
Our application includes the following features:
- Querying a general-purpose AI for any question. You can send a direct question to the local LLM through a Flask POST endpoint.
- Uploading PDF documents for processing and storage. Your PDFs are saved to local storage and embedded into a local vector database.
- Asking anything about your documents. Answers are retrieved specifically from the content of the uploaded PDFs.
Tools and Libraries
We’ll use the following stack to build this application:
- Flask: A lightweight Python framework for creating REST APIs.
- LangChain: For constructing the retrieval and generation pipeline.
- Chroma: A vector store for storing and querying document embeddings.
- PDFPlumber: For extracting text from PDFs.
- Ollama: Runs open-source large language models locally and generates the AI responses.
Step-by-Step Implementation
1. Setting Up Flask
Flask serves as the backbone of our API. It exposes the endpoints required to handle user queries, PDF uploads, and AI responses.
First we connect to the local LLM, then we return its answer as JSON.
from flask import Flask, request
from langchain_community.llms import Ollama

app = Flask(__name__)

# Cache a single Ollama client so the model is not re-created on every request
cached_llm = Ollama(model="llama3.2:latest")

@app.route("/askAi", methods=["POST"])
def ask_ai():
    print("Post /askAi called")
    json_content = request.json
    query = json_content.get("query")
    print(f"query: {query}")
    response = cached_llm.invoke(query)
    print(response)
    response_answer = {"answer": response}
    return response_answer
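To try the endpoint, send a POST request with a JSON body containing the query field. Here is a small client sketch using the requests library; the URL assumes Flask's default local port.
import requests

# Hypothetical client call; assumes the app is running locally on port 5000
response = requests.post("http://127.0.0.1:5000/askAi", json={"query": "What is RAG?"})
print(response.json()["answer"])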
2. Uploading and Processing PDFs
Using PDFPlumber, we extract text from PDFs. LangChain's RecursiveCharacterTextSplitter then splits the text into manageable chunks, which are embedded and stored in Chroma for efficient querying.
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract the PDF text, then split it into overlapping chunks for embedding
loader = PDFPlumberLoader("example.pdf")
docs = loader.load_and_split()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=80)
chunks = text_splitter.split_documents(docs)
3. Storing Data in Chroma
Chroma serves as our vector database, storing embeddings for the document chunks. These embeddings allow similarity-based search during retrieval.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import FastEmbedEmbeddings

# Embed the chunks and persist them to a local Chroma database on disk
vector_store = Chroma.from_documents(chunks, embedding=FastEmbedEmbeddings(), persist_directory="db")
vector_store.persist()
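The /addPdf endpoint listed later in the overview ties these two steps together: save the uploaded file, extract and split its text, and add the chunks to the vector store. Below is a minimal sketch of what that route could look like; the form field name "file" and the local pdfs/ folder are assumptions for illustration, not fixed by the article.
import os

@app.route("/addPdf", methods=["POST"])
def add_pdf():
    # Assumed multipart form field name; adjust to match your client
    file = request.files["file"]
    os.makedirs("pdfs", exist_ok=True)
    save_path = os.path.join("pdfs", file.filename)
    file.save(save_path)

    # Reuse the loader and splitter from step 2, then index the new chunks
    docs = PDFPlumberLoader(save_path).load_and_split()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=80).split_documents(docs)
    vector_store.add_documents(chunks)
    vector_store.persist()

    return {"status": "ok", "filename": file.filename, "chunks": len(chunks)}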
4. Querying the PDFs
The /askPdf endpoint retrieves context from the vector store and uses a prompt template to generate answers via the AI model.
@app.route("/askPdf", methods=["POST"])
def ask_pdf():
query = request.json["query"]
retriever = vector_store.as_retriever(search_type="similarity_score_threshold")
chain = create_retrieval_chain(retriever, document_chain)
result = chain.invoke({"input": query})
return {"answer": result["answer"]}
Endpoints Overview
The Flask application provides three main endpoints:
- /askAi: Directly queries the AI for general-purpose answers.
- /askPdf: Queries the uploaded PDFs for context-specific answers.
- /addPdf: Uploads and processes a PDF document for storage.
Each endpoint is designed for simplicity and integrates tightly with the LangChain framework for RAG operations.
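As a quick usage sketch (not part of the application itself), a client could upload a document and then query it with the requests library; the file name and question below are placeholders.
import requests

BASE = "http://127.0.0.1:5000"

# Upload a PDF as a multipart form (field name matches the /addPdf sketch above)
with open("example.pdf", "rb") as f:
    requests.post(f"{BASE}/addPdf", files={"file": f})

# Ask a question that is answered from the uploaded document
result = requests.post(f"{BASE}/askPdf", json={"query": "What is this document about?"}).json()
print(result["answer"])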
Running the Application
First, start the Ollama server, then install the required libraries and run the application:
1- ollama serve
2- pip install -r requirements.txt
3- python app.py
The application will then be available at http://127.0.0.1:5000.
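The requirements.txt in step 2 is not shown in full here; a rough sketch using the common PyPI package names for this stack (pin versions as needed) could be:
flask
langchain
langchain-community
langchain-text-splitters
chromadb
pdfplumber
fastembed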
Challenges and Solutions
1. Efficient Text Chunking
Splitting documents without losing context is critical. Using LangChain's RecursiveCharacterTextSplitter, we ensure chunks retain coherence while remaining small enough for embedding.
2. Scalable Vector Storage
Chroma provides a fast, lightweight vector store ideal for handling large datasets while enabling quick queries.
3. Context-Aware Responses
A custom prompt template ensures that the AI model integrates retrieved document content seamlessly into its answers.
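For example, the simple raw_prompt used in the /askPdf code above could be extended into a more explicit instruction; the wording below is only an illustration, not the article's exact template.
from langchain_core.prompts import ChatPromptTemplate

# Retrieved chunks fill {context}; the user's question fills {input}
raw_prompt = ChatPromptTemplate.from_template(
    """You are a helpful assistant. Answer the question using only the context below.
If the answer is not in the context, say that you don't know.

Context:
{context}

Question: {input}"""
)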
Conclusion
By combining AI, vector search, and document processing, this application demonstrates the power of RAG in solving real-world problems. Whether you're a researcher, a developer, or a company employee, this guide equips you with the tools to build a custom document-querying system.
Want to take this further? Clone the GitHub repository and start building today!