Hybrid RAG-SQL Agent for Enterprise Knowledge Systems

Reliable Question Answering over Unstructured and Structured Enterprise Data

Applied Machine Learning
Large Language Models (LLMs)
Retrieval-Augmented Generation
AI Systems Engineering
Vector Databases
SQL Agents
Author

Mulimbika Makina

Published

February 4, 2026

1 Overview

Large Language Models (LLMs) are trained on vast amounts of public text such as websites, books, news articles, and online conversations. This gives them strong general language skills.

In real organizations, the most valuable knowledge lives in internal documents, databases, and systems that change over time. Because this information isn’t part of an LLM’s training data, standalone models quickly fall short. They may sound confident, but their answers are often incomplete, outdated, or simply wrong.

This shows up in practical ways:

  • They can’t reliably answer questions over internal documents
  • They struggle to query enterprise databases
  • They tend to hallucinate when context is missing
  • They don’t naturally respect security and access boundaries

Modern AI systems solve this not by training bigger models, but by connecting models to data carefully.

This project implements a hybrid knowledge access system that combines two proven patterns:

  • Retrieval-Augmented Generation (RAG) for unstructured data like PDFs, reports, and internal documentation
  • A tool-driven SQL agent for structured data stored in PostgreSQL databases

Instead of treating the LLM as a source of truth, the system uses it as a reasoning layer. The model retrieves evidence from private sources, reasons over it, and produces answers that are grounded in real data.

At a high level:

  1. A user asks a natural-language question
  2. The system determines whether the answer requires documents, databases, or both
  3. Relevant information is retrieved from internal sources
  4. The LLM generates a response based only on that retrieved context

The goal of this project is not just to build a working RAG pipeline, but to engineer a reliable, modular, and production-ready hybrid architecture that makes LLMs useful inside real systems.


2 Problem Definition

Organizations don’t store knowledge in one place. It’s spread across:

  • Unstructured documents such as PDFs, manuals, and internal policies
  • Structured databases containing operational and analytical data

LLMs provide a powerful natural-language interface, but on their own they cannot safely operate over private or evolving data. Without external grounding, models hallucinate, return stale information, and offer answers that can’t be verified.

The real challenge is not generation, but controlled access to knowledge.

A practical system must:

  • Ground every answer in retrieved evidence
  • Work across both documents and databases
  • Enforce read-only access and security constraints
  • Scale to large data volumes
  • Keep latency and cost predictable
  • Support debugging and evaluation

This project addresses those requirements by combining:

  • A RAG pipeline for document retrieval and context construction
  • A tool-mediated SQL agent that safely translates questions into validated, read-only queries

The result is a hybrid architecture with RAG and SQL Agent where the LLM reasons, tools retrieve, and data remains protected, producing answers that are explainable, auditable, and grounded.


3 RAG Stream: From Raw Documents to Grounded Answers

This section walks through the Retrieval-Augmented Generation (RAG) stream of the project, explaining how raw documents are transformed into reliable, context-aware answers.

The goal of this stream is to allow an LLM to answer questions using private documents without hallucinating or guessing.


3.1 Document Ingestion & Parsing

The pipeline begins with raw documents parsing, where documents are converted into clean, structured text before any downstream processing is possible. This project uses llama-parse to convert a wide range of document formats into structured markdown, preserving headings and layout information.

3.1.1 Core Features & Parsing Fundamentals

LlamaParse is a GenAI-native document parser designed for converting complex documents into LLM-ready data for Retrieval-Augmented Generation (RAG) systems and it excels at:

  • Complex Document Handling: Financial reports, research papers, scanned PDFs
  • Precise Extraction: Tables, charts, images, and diagrams
  • LLM-Ready Output: Clean markdown, text, or structured JSON

3.1.1.1 Supported File Formats (70+)

Category Formats
Documents PDF, DOC, DOCX, RTF, TXT, EPUB
Spreadsheets XLSX, XLS, CSV, ODS
Presentations PPTX, PPT
Images JPG, PNG, GIF, BMP, TIFF, WEBP, SVG
Web HTML, HTM
Audio MP3, MP4, WAV, WEBM, M4A (≤20MB)

3.1.1.2 Pricing

  • Free Tier: 1,000 pages daily
  • Paid Tier: 7,000 pages/week + $0.003/additional page

3.1.1.3 API Key Setup

LlamaParse requires an API key from LlamaCloud.

Steps to get your API key:

  1. Go to https://cloud.llamaindex.ai/
  2. Sign up or log in
  3. Navigate to API Keys section
  4. Create a new API key (starts with llx-)

3.1.1.4 Architecture

  • LlamaParse provides both synchronous and asynchronous methods:
Sync Method Async Method Description
load_data() aload_data() Parse and return Documents
parse() aparse() Parse and return JobResult
get_images() aget_images() Get extracted images
  • LlamaParse uses a job-based architecture:
Document → Submit Job → Poll Status → Get Results
  • Key Components:

  • LlamaParse Client: Main interface for document parsing

  • Job: Represents a parsing task (can be async)

  • JobResult: Contains parsed content, pages, images, charts, layout

  • Document: LlamaIndex Document object with text and metadata

  • Available pre-optimized configurations (presets) for different document types

Preset Best For Description
fast Quick extraction No OCR, fastest processing
balanced General documents Balance of speed and accuracy
premium Complex documents Best quality, uses advanced models
structured Forms, tables Optimized for structured data
auto Mixed content Automatic mode selection
scientific Research papers LaTeX, equations, citations
invoice Invoices, receipts Financial document extraction
slides Presentations PowerPoint, slide content

3.1.1.5 Parse Modes

  • 1. High-Level Modes (Boolean flags):

  • fast_mode: Skip OCR, fastest processing

  • premium_mode: Best available parser

  • auto_mode: Automatic mode selection

  • 2. Granular Parse Modes (Page-level):

  • parse_page_without_llm: Fast extraction without AI

  • parse_page_with_llm: Uses LLM for each page

  • parse_page_with_lvm: Uses vision model for pages

  • parse_page_with_agent: Agentic reasoning per page

  • parse_page_with_layout_agent: Layout-aware agent


3.2 Chunking the Document

Large documents are split into smaller, semantically meaningful chunks. This is essential because:

  • Embedding models have input limits
  • Smaller chunks improve retrieval accuracy

This project uses Chonkie’s RecursiveChunker, which is character-based, LLM-agnostic and production-safe. Chunks are created with:

  • A fixed size (2048 characters)
  • Minimum length enforcement
  • Stable character offsets for traceability

3.2.1 Why this choice

Chonkie is a production-ready text chunking library designed specifically for RAG applications. It provides:

  • 9 specialized chunkers (RecursiveChunker, TokenChunker, SentenceChunker, TableChunker, CodeChunker, SemanticChunker, LateChunker, NeuralChunker, SlumberChunker)
  • Local processing - your data never leaves your infrastructure
  • High performance - optimized for speed and efficiency
  • Thread-safe - suitable for concurrent processing
  • Flexible embeddings - works with OpenAI, Gemini, Sentence Transformers, and more

3.3 Embedding Generation

In this project, each chunk is converted into a numerical vector (embedding) using Sentence Transformers (local) to capture each chunk’s semantic meaning (other alternatives include OpenAI, Gemini). The same process is later applied to user queries. 768-dimensional embeddings are generated locally and cached to avoid recomputation.

3.3.1 Why this choice

  • No external API dependency
  • Predictable cost (zero per-request fees)
  • Easy to swap models
  • Suitable for private environments

3.4 Vector Storage & Retrieval

All chunk embeddings are stored in a vector database. At query time, the system retrieves the most relevant chunks using similarity search (Top-K document chunks are ranked by semantic similarity). This project uses Qdrant as the vector store, with the following key features:

  • Cosine similarity
  • Metadata filtering (namespace, filename)
  • Payload indexing
  • Async client for scalability

3.4.1 Why this choice

  • Easy to self-host (Open-source) and cloud hosted (with 1GB free cluster).
  • Strong filtering and indexing support
  • Clean Python API

3.5 Context Construction

Retrieved chunks are formatted into a single context block, preserving:

  • Source filename
  • Relevance score
  • Optional heading hierarchy

This context becomes the grounding evidence for the LLM and is used to generate the final answer.

3.5.1 Why this matters

  • Improves answer quality
  • Enables traceability
  • Reduces hallucination risk

3.6 Augmented Prompt Generation

The user’s question and the retrieved context are combined into a single prompt. The prompt explicitly instructs the LLM to use only the provided context and admit when information is missing. A grounded, constrained prompt is then sent to the LLM.


3.7 Answer Generation

The LLM generates the final answer using the augmented prompt. This implementation uses Groq LLM APIs for fast inference, but the architecture is model-agnostic. A clear, grounded answer with optional source citations is returned to the user.


3.8 Artifact & Cache Storage

To improve performance and reproducibility, intermediate artifacts are stored in Cloudflare R2 which is a low-cost object storage service and is S3-compatible. The following artifacts are stored:

  • Original documents
  • Chunks
  • Embeddings
  • Metadata

3.8.1 Why this matters

  • Faster reloads
  • Stateless services
  • Easy reindexing
  • Lower compute cost (with 10 GB free storage)

3.9 Summary

This RAG stream turns raw, private documents into grounded, auditable answers by combining:

  • Robust document parsing
  • Deterministic chunking
  • Local embedding generation
  • Vector-based retrieval
  • Context-aware LLM prompting

Most importantly, the LLM is never treated as a source of truth. It is a reasoning layer operating strictly on retrieved evidence.


4 SQL Agent Stream: Safe Natural Language Access to Databases

While the RAG stream handles unstructured knowledge like documents and manuals, many real-world questions require direct access to structured data stored in relational databases. This stream focuses on enabling safe, reliable, and auditable database querying using natural language, without giving the LLM unrestricted access to the database.


flowchart TD

    %% ======================
    %% USER
    %% ======================
    A[User Question]

    %% ======================
    %% LLM
    %% ======================
    subgraph LLMBOX["<B>LLM (Reasoning Engine)</B>"]
        L1[Understand Question]
        L2[Request Tools]
        L3[Interpret Query Results]
    end

    %% ======================
    %% SQL AGENT
    %% ======================
    subgraph AGENT["<B>SQL Agent Control</B>"]
        P[Agent Control Loop]
        G[Guardrails & Constraints]
    end

    %% ======================
    %% TOOLS
    %% ======================
    subgraph TOOLS["<B>Database Tools</B>"]
        T1[List Tables]
        T2[Describe Tables]
        T3[Sample Data]
        T4[Execute Read-Only SQL]
    end

    %% ======================
    %% DATABASE
    %% ======================
    subgraph DBLAYER[" <B>Protected Database</B>"]
        D[(PostgreSQL Database)]
    end

    %% ======================
    %% RESPONSE
    %% ======================
    R[<B>Final Answer to User</B>]

    %% ======================
    %% FLOW
    %% ======================
    A --> L1
    L1 --> L2
    L2 --> P

    %% Guarded execution
    P --> G
    G --> TOOLS

    TOOLS --> D
    D --> P

    %% Output control
    P --> L3
    L3 --> R

    %% ======================
    %% STYLES (LIGHT GRAY)
    %% ======================
    style LLMBOX fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style AGENT fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style TOOLS fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style DBLAYER fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style A fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style R fill:#f9fafb,stroke:#d1d5db,stroke-width:2px


4.1 User Question

A user submits a natural language question that requires database access. At this point, the system does not assume the question is valid or executable.


4.2 LLM as a Reasoning Engine

The LLM receives the user question along with a strict system prompt that defines its role. In this project, the LLM is not a data source and not allowed to answer directly.

Its responsibilities are limited to:

  • Understanding the user’s intent
  • Planning how to explore the database
  • Deciding which tools to call and in what order
  • Explaining its reasoning for every tool call

This separation ensures that:

  • The LLM cannot hallucinate results
  • All answers must be grounded in actual database queries
  • Every step is explainable and auditable

4.3 Agent Control Loop

The SQL Agent runs inside a controlled reasoning loop, where each iteration performs the following steps:

  1. The LLM proposes tool calls
  2. Tools are executed by the system (not the LLM)
  3. Tool outputs are fed back to the LLM
  4. The loop continues until a final answer is produced

This loop is bounded by a maximum number of iterations and hard failure conditions. This design prevents:

  • Infinite reasoning loops
  • Unbounded database exploration
  • Uncontrolled compute costs

4.4 Guardrails & Constraints

Before any SQL is executed, the system enforces hard safety rules that the LLM cannot override.

These include:

  • Read-only enforcement

    • No INSERT, UPDATE, DELETE, DROP, ALTER, or CREATE
  • Mandatory LIMIT clause

    • Prevents large result sets
  • Schema validation

    • Only known tables and columns are allowed
  • Tool usage enforcement

    • The LLM must call at least one database tool
  • Temporal query guards

    • Prevents ambiguous “current” or “latest” queries without validation

If any rule is violated, execution stops immediately.


4.5 Database Tools

The LLM can request a limited set of explicit tools, each designed for a specific purpose:

  • List Tables

    • Discover available tables
  • Describe Tables

    • Inspect schema and columns
  • Sample Data

    • Preview small subsets of rows
  • Execute Read-Only SQL

    • Run validated, constrained queries

Each tool:

  • Requires a reasoning parameter
  • Runs inside controlled Python code
  • Uses a secure PostgreSQL connection

4.5.1 Why this matters

  • Tools act as the only gateway to the database.

  • The LLM never touches SQL execution directly.


4.6 Protected Database Access

All database operations go through a protected PostgreSQL connection with:

  • Read-only enforcement
  • Cursor-level control
  • Automatic rollback on failure
  • Limited result sizes

The database is treated as a protected system, not an LLM playground.


4.7 Interpreting Query Results

Tool results are returned to the LLM as structured messages then the LLM:

  • Interpret the data
  • Summarize insights
  • Translate technical results into business-friendly language

Crucially, the LLM does not fabricate numbers and if no relevant data is found, it must say so explicitly.


4.8 Final Answer to the User

Once the LLM has:

  • Used at least one database tool
  • Passed all guardrails
  • Interpreted real query results

It produces a final answer that:

  • Is grounded in actual data
  • Is formatted in clear Markdown
  • Avoids exposing raw SQL unless requested
  • Is understandable to non-technical users

This SQL Agent stream complements the RAG stream by enabling safe, explainable access to structured data, while preserving all the reliability guarantees required for real-world systems. Together, the RAG stream and SQL Agent stream form a unified knowledge access layer where:

  • Documents are retrieved, not guessed
  • Databases are queried, not hallucinated
  • LLMs reason, but never act unchecked

5 Unified Hybrid Flow: RAG + SQL Agent

Modern organizations store knowledge in multiple forms. Some of it lives in documents like reports, manuals, and policies. Some of it lives in databases like transaction tables, analytics schemas, and operational systems. Treating these two worlds separately limits what users can ask.

This project implements a unified hybrid flow that allows a single natural-language question to be answered using:

  • Retrieval-Augmented Generation (RAG) for unstructured documents
  • A tool-driven SQL Agent for structured relational data
  • Or both together, when the question requires combined reasoning

The system is designed so the LLM acts as a reasoning layer, while all data access is explicit, controlled, and auditable.


flowchart TD

    %% ======================
    %% USER
    %% ======================
    subgraph USER["<B>User</B>"]
        A[User Asks a Question]
    end

    %% ======================
    %% API GATEWAY
    %% ======================
    subgraph API["<B>API Gateway</B>"]
        B[API Receives the Question]
    end

    %% ======================
    %% ROUTING SERVICE
    %% ======================
    subgraph ROUTING["<B>Routing Service</B>"]
        C[Decides the Best Path]
    end

    %% ======================
    %% EXECUTION PATHS
    %% ======================
    subgraph EXEC["<B>Execution Paths</B>"]
        D[Query Business Database]
        E[Search Knowledge Base]
        F[Combine Sources if Needed]
    end

    %% ======================
    %% DATA & KNOWLEDGE
    %% ======================
    subgraph DATA["<B>Data & Knowledge</B>"]
        G[(Business Database)]
        H[(Knowledge Base)]
        K[(Knowledge + Database)]
    end

    %% ======================
    %% RESPONSE
    %% ======================
    subgraph RESPONSE["Response"]
        I[Generate Clear Answer]
        J[Return Answer to User]
    end

    %% ======================
    %% FLOW
    %% ======================
    A --> B
    B --> C

    C -->|Data Question| D
    C -->|Knowledge Question| E
    C -->|Mixed Question| F

    D --> G
    E --> H
    F --> K

    G --> I
    H --> I
    K --> I

    I --> J

    %% ======================
    %% STYLES (LIGHT GRAY)
    %% ======================
    style USER fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style API fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style ROUTING fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style EXEC fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style DATA fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style RESPONSE fill:#f9fafb,stroke:#d1d5db,stroke-width:2px

5.1 User Asks a Question

The flow begins when a user submits a natural-language question. At this point, the system does not assume where the answer should come from.


5.2 API Gateway Receives the Request

The API Gateway serves as the single entry point for all queries. Its responsibilities include:

  • Accept the user question
  • Attach session and request metadata
  • Forward the question to the routing service

No reasoning or data access happens here.


5.3 Routing Service Decides the Best Path

The Routing Service analyzes the question and determines which execution path is required:

  • Data Question → Requires structured database access → Route to the SQL Agent

  • Knowledge Question → Requires document understanding → Route to the RAG pipeline

  • Mixed Question → Requires both documents and database facts → Route to the hybrid execution path


5.4 Execution Paths

5.4.1 SQL Agent Path (Structured Data)

For data-driven questions:

  • The SQL Agent uses the LLM as a reasoning engine

  • The LLM plans database exploration steps

  • Database tools are executed under strict guardrails:

    • Read-only queries
    • Schema validation
    • Automatic limits
    • Mandatory tool usage
  • The database remains fully protected

The LLM never executes SQL directly, it only reasons about which tool should run next.


5.4.2 RAG Path (Unstructured Knowledge)

For document-driven questions:

  • Relevant documents are parsed and chunked
  • Embeddings are generated using a local embedding model
  • Similar chunks are retrieved from the vector database
  • Retrieved context is combined with the user question
  • The LLM generates an answer grounded strictly in retrieved content

This ensures answers are evidence-based, not hallucinated.


5.4.3 Hybrid Path (Knowledge + Database)

For mixed questions:

  • The system retrieves relevant document context and

  • Executes validated SQL queries via the SQL Agent

  • Both sources are combined into a single augmented context

  • The LLM generates a unified answer that:

    • Explains what the documents say
    • References what the data shows

This allows true cross-source reasoning without compromising safety.


5.5 Data & Knowledge Layer

All execution paths ultimately draw from the same Data & Knowledge layer, which includes:

  • Business Database (PostgreSQL)
  • Knowledge Base (vector store)
  • Combined Knowledge + Database context (for hybrid queries)

This layer is treated as the source of truth, never the LLM.


5.6 Generate a Clear Answer

Once relevant data is retrieved:

  • The LLM interprets results
  • Converts technical outputs into human-readable explanations
  • Avoids fabricating missing information
  • Explicitly states when data is insufficient

The answer is formatted for clarity and trust, not verbosity.


5.7 Return Answer to User

The final response is returned to the user with:

  • Clear reasoning
  • Grounded evidence
  • No hidden assumptions
  • No hallucinated facts

From the user’s perspective, this feels like a single intelligent assistant but internally, it is a carefully orchestrated system.


6 System Overview

6.1 Database Connection view

6.2 Document Retrieval view

6.3 Vector Search view

Contact me to request access to the live demo.

7 Skills & Concepts Demonstrated

  • End-to-end RAG pipeline implementation
  • SQL agent with tool-based database access
  • LLM prompt engineering for grounding and reasoning
  • Vector database integration for retrieval
  • Document parsing and chunking for unstructured data
  • API design for hybrid knowledge access
  • Production-ready architecture for reliability and safety

8 References