1 Overview

Large Language Models (LLMs) are trained on vast amounts of public text such as websites, books, news articles, and online conversations. This gives them strong general language skills.

In real organizations, the most valuable knowledge lives in internal documents, databases, and systems that change over time. Because this information isn’t part of an LLM’s training data, standalone models quickly fall short. They may sound confident, but their answers are often incomplete, outdated, or simply wrong.

This shows up in practical ways:

They can’t reliably answer questions over internal documents
They struggle to query enterprise databases
They tend to hallucinate when context is missing
They don’t naturally respect security and access boundaries

Modern AI systems solve this not by training bigger models, but by connecting models to data carefully.

This project implements a hybrid knowledge access system that combines two proven patterns:

Retrieval-Augmented Generation (RAG) for unstructured data like PDFs, reports, and internal documentation
A tool-driven SQL agent for structured data stored in PostgreSQL databases

Instead of treating the LLM as a source of truth, the system uses it as a reasoning layer. The model retrieves evidence from private sources, reasons over it, and produces answers that are grounded in real data.

At a high level:

A user asks a natural-language question
The system determines whether the answer requires documents, databases, or both
Relevant information is retrieved from internal sources
The LLM generates a response based only on that retrieved context

The goal of this project is not just to build a working RAG pipeline, but to engineer a reliable, modular, and production-ready hybrid architecture that makes LLMs useful inside real systems.

2 Problem Definition

Organizations don’t store knowledge in one place. It’s spread across:

Unstructured documents such as PDFs, manuals, and internal policies
Structured databases containing operational and analytical data

LLMs provide a powerful natural-language interface, but on their own they cannot safely operate over private or evolving data. Without external grounding, models hallucinate, return stale information, and offer answers that can’t be verified.

The real challenge is not generation, but controlled access to knowledge.

A practical system must:

Ground every answer in retrieved evidence
Work across both documents and databases
Enforce read-only access and security constraints
Scale to large data volumes
Keep latency and cost predictable
Support debugging and evaluation

This project addresses those requirements by combining:

A RAG pipeline for document retrieval and context construction
A tool-mediated SQL agent that safely translates questions into validated, read-only queries

The result is a hybrid architecture with RAG and SQL Agent where the LLM reasons, tools retrieve, and data remains protected, producing answers that are explainable, auditable, and grounded.

3 RAG Stream: From Raw Documents to Grounded Answers

This section walks through the Retrieval-Augmented Generation (RAG) stream of the project, explaining how raw documents are transformed into reliable, context-aware answers.

The goal of this stream is to allow an LLM to answer questions using private documents without hallucinating or guessing.

3.1 Document Ingestion & Parsing

The pipeline begins with raw documents parsing, where documents are converted into clean, structured text before any downstream processing is possible. This project uses llama-parse to convert a wide range of document formats into structured markdown, preserving headings and layout information.

3.1.1 Core Features & Parsing Fundamentals

LlamaParse is a GenAI-native document parser designed for converting complex documents into LLM-ready data for Retrieval-Augmented Generation (RAG) systems and it excels at:

Complex Document Handling: Financial reports, research papers, scanned PDFs
Precise Extraction: Tables, charts, images, and diagrams
LLM-Ready Output: Clean markdown, text, or structured JSON

3.1.1.1 Supported File Formats (70+)

Category	Formats
Documents	PDF, DOC, DOCX, RTF, TXT, EPUB
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT
Images	JPG, PNG, GIF, BMP, TIFF, WEBP, SVG
Web	HTML, HTM
Audio	MP3, MP4, WAV, WEBM, M4A (≤20MB)

3.1.1.2 Pricing

Free Tier: 1,000 pages daily
Paid Tier: 7,000 pages/week + $0.003/additional page

3.1.1.3 API Key Setup

LlamaParse requires an API key from LlamaCloud.

Steps to get your API key:

Go to https://cloud.llamaindex.ai/
Sign up or log in
Navigate to API Keys section
Create a new API key (starts with llx-)

3.1.1.4 Architecture

LlamaParse provides both synchronous and asynchronous methods:

Sync Method	Async Method	Description
`load_data()`	`aload_data()`	Parse and return Documents
`parse()`	`aparse()`	Parse and return JobResult
`get_images()`	`aget_images()`	Get extracted images

LlamaParse uses a job-based architecture:

Document → Submit Job → Poll Status → Get Results

Key Components:
LlamaParse Client: Main interface for document parsing
Job: Represents a parsing task (can be async)
JobResult: Contains parsed content, pages, images, charts, layout
Document: LlamaIndex Document object with text and metadata
Available pre-optimized configurations (presets) for different document types

Preset	Best For	Description
`fast`	Quick extraction	No OCR, fastest processing
`balanced`	General documents	Balance of speed and accuracy
`premium`	Complex documents	Best quality, uses advanced models
`structured`	Forms, tables	Optimized for structured data
`auto`	Mixed content	Automatic mode selection
`scientific`	Research papers	LaTeX, equations, citations
`invoice`	Invoices, receipts	Financial document extraction
`slides`	Presentations	PowerPoint, slide content

3.1.1.5 Parse Modes

1. High-Level Modes (Boolean flags):
fast_mode: Skip OCR, fastest processing
premium_mode: Best available parser
auto_mode: Automatic mode selection
2. Granular Parse Modes (Page-level):
parse_page_without_llm: Fast extraction without AI
parse_page_with_llm: Uses LLM for each page
parse_page_with_lvm: Uses vision model for pages
parse_page_with_agent: Agentic reasoning per page
parse_page_with_layout_agent: Layout-aware agent

3.2 Chunking the Document

Large documents are split into smaller, semantically meaningful chunks. This is essential because:

Embedding models have input limits
Smaller chunks improve retrieval accuracy

This project uses Chonkie’s RecursiveChunker, which is character-based, LLM-agnostic and production-safe. Chunks are created with:

A fixed size (2048 characters)
Minimum length enforcement
Stable character offsets for traceability

3.2.1 Why this choice

Chonkie is a production-ready text chunking library designed specifically for RAG applications. It provides:

9 specialized chunkers (RecursiveChunker, TokenChunker, SentenceChunker, TableChunker, CodeChunker, SemanticChunker, LateChunker, NeuralChunker, SlumberChunker)
Local processing - your data never leaves your infrastructure
High performance - optimized for speed and efficiency
Thread-safe - suitable for concurrent processing
Flexible embeddings - works with OpenAI, Gemini, Sentence Transformers, and more

3.3 Embedding Generation

In this project, each chunk is converted into a numerical vector (embedding) using Sentence Transformers (local) to capture each chunk’s semantic meaning (other alternatives include OpenAI, Gemini). The same process is later applied to user queries. 768-dimensional embeddings are generated locally and cached to avoid recomputation.

3.3.1 Why this choice

No external API dependency
Predictable cost (zero per-request fees)
Easy to swap models
Suitable for private environments

3.4 Vector Storage & Retrieval

All chunk embeddings are stored in a vector database. At query time, the system retrieves the most relevant chunks using similarity search (Top-K document chunks are ranked by semantic similarity). This project uses Qdrant as the vector store, with the following key features:

Cosine similarity
Metadata filtering (namespace, filename)
Payload indexing
Async client for scalability

3.4.1 Why this choice

Easy to self-host (Open-source) and cloud hosted (with 1GB free cluster).
Strong filtering and indexing support
Clean Python API

3.5 Context Construction

Retrieved chunks are formatted into a single context block, preserving:

Source filename
Relevance score
Optional heading hierarchy

This context becomes the grounding evidence for the LLM and is used to generate the final answer.

3.5.1 Why this matters

Improves answer quality
Enables traceability
Reduces hallucination risk

3.6 Augmented Prompt Generation

The user’s question and the retrieved context are combined into a single prompt. The prompt explicitly instructs the LLM to use only the provided context and admit when information is missing. A grounded, constrained prompt is then sent to the LLM.

3.7 Answer Generation

The LLM generates the final answer using the augmented prompt. This implementation uses Groq LLM APIs for fast inference, but the architecture is model-agnostic. A clear, grounded answer with optional source citations is returned to the user.

3.8 Artifact & Cache Storage

To improve performance and reproducibility, intermediate artifacts are stored in Cloudflare R2 which is a low-cost object storage service and is S3-compatible. The following artifacts are stored:

Original documents
Chunks
Embeddings
Metadata

3.8.1 Why this matters

Faster reloads
Stateless services
Easy reindexing
Lower compute cost (with 10 GB free storage)

3.9 Summary

This RAG stream turns raw, private documents into grounded, auditable answers by combining:

Robust document parsing
Deterministic chunking
Local embedding generation
Vector-based retrieval
Context-aware LLM prompting

Most importantly, the LLM is never treated as a source of truth. It is a reasoning layer operating strictly on retrieved evidence.

4 SQL Agent Stream: Safe Natural Language Access to Databases

While the RAG stream handles unstructured knowledge like documents and manuals, many real-world questions require direct access to structured data stored in relational databases. This stream focuses on enabling safe, reliable, and auditable database querying using natural language, without giving the LLM unrestricted access to the database.

flowchart TD

    %% ======================
    %% USER
    %% ======================
    A[User Question]

    %% ======================
    %% LLM
    %% ======================
    subgraph LLMBOX["<B>LLM (Reasoning Engine)</B>"]
        L1[Understand Question]
        L2[Request Tools]
        L3[Interpret Query Results]
    end

    %% ======================
    %% SQL AGENT
    %% ======================
    subgraph AGENT["<B>SQL Agent Control</B>"]
        P[Agent Control Loop]
        G[Guardrails & Constraints]
    end

    %% ======================
    %% TOOLS
    %% ======================
    subgraph TOOLS["<B>Database Tools</B>"]
        T1[List Tables]
        T2[Describe Tables]
        T3[Sample Data]
        T4[Execute Read-Only SQL]
    end

    %% ======================
    %% DATABASE
    %% ======================
    subgraph DBLAYER[" <B>Protected Database</B>"]
        D[(PostgreSQL Database)]
    end

    %% ======================
    %% RESPONSE
    %% ======================
    R[<B>Final Answer to User</B>]

    %% ======================
    %% FLOW
    %% ======================
    A --> L1
    L1 --> L2
    L2 --> P

    %% Guarded execution
    P --> G
    G --> TOOLS

    TOOLS --> D
    D --> P

    %% Output control
    P --> L3
    L3 --> R

    %% ======================
    %% STYLES (LIGHT GRAY)
    %% ======================
    style LLMBOX fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style AGENT fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style TOOLS fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style DBLAYER fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style A fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style R fill:#f9fafb,stroke:#d1d5db,stroke-width:2px

4.1 User Question

A user submits a natural language question that requires database access. At this point, the system does not assume the question is valid or executable.

4.2 LLM as a Reasoning Engine

The LLM receives the user question along with a strict system prompt that defines its role. In this project, the LLM is not a data source and not allowed to answer directly.

Its responsibilities are limited to:

Understanding the user’s intent
Planning how to explore the database
Deciding which tools to call and in what order
Explaining its reasoning for every tool call

This separation ensures that:

The LLM cannot hallucinate results
All answers must be grounded in actual database queries
Every step is explainable and auditable

4.3 Agent Control Loop

The SQL Agent runs inside a controlled reasoning loop, where each iteration performs the following steps:

The LLM proposes tool calls
Tools are executed by the system (not the LLM)
Tool outputs are fed back to the LLM
The loop continues until a final answer is produced

This loop is bounded by a maximum number of iterations and hard failure conditions. This design prevents:

Infinite reasoning loops
Unbounded database exploration
Uncontrolled compute costs

4.4 Guardrails & Constraints

Before any SQL is executed, the system enforces hard safety rules that the LLM cannot override.

These include:

Read-only enforcement
- No INSERT, UPDATE, DELETE, DROP, ALTER, or CREATE
Mandatory LIMIT clause
- Prevents large result sets
Schema validation
- Only known tables and columns are allowed
Tool usage enforcement
- The LLM must call at least one database tool
Temporal query guards
- Prevents ambiguous “current” or “latest” queries without validation

If any rule is violated, execution stops immediately.

4.5 Database Tools

The LLM can request a limited set of explicit tools, each designed for a specific purpose:

List Tables
- Discover available tables
Describe Tables
- Inspect schema and columns
Sample Data
- Preview small subsets of rows
Execute Read-Only SQL
- Run validated, constrained queries

Each tool:

Requires a reasoning parameter
Runs inside controlled Python code
Uses a secure PostgreSQL connection

4.5.1 Why this matters

Tools act as the only gateway to the database.
The LLM never touches SQL execution directly.

4.6 Protected Database Access

All database operations go through a protected PostgreSQL connection with:

Read-only enforcement
Cursor-level control
Automatic rollback on failure
Limited result sizes

The database is treated as a protected system, not an LLM playground.

4.7 Interpreting Query Results

Tool results are returned to the LLM as structured messages then the LLM:

Interpret the data
Summarize insights
Translate technical results into business-friendly language

Crucially, the LLM does not fabricate numbers and if no relevant data is found, it must say so explicitly.

4.8 Final Answer to the User

Once the LLM has:

Used at least one database tool
Passed all guardrails
Interpreted real query results

It produces a final answer that:

Is grounded in actual data
Is formatted in clear Markdown
Avoids exposing raw SQL unless requested
Is understandable to non-technical users

This SQL Agent stream complements the RAG stream by enabling safe, explainable access to structured data, while preserving all the reliability guarantees required for real-world systems. Together, the RAG stream and SQL Agent stream form a unified knowledge access layer where:

Documents are retrieved, not guessed
Databases are queried, not hallucinated
LLMs reason, but never act unchecked

5 Unified Hybrid Flow: RAG + SQL Agent

Modern organizations store knowledge in multiple forms. Some of it lives in documents like reports, manuals, and policies. Some of it lives in databases like transaction tables, analytics schemas, and operational systems. Treating these two worlds separately limits what users can ask.

This project implements a unified hybrid flow that allows a single natural-language question to be answered using:

Retrieval-Augmented Generation (RAG) for unstructured documents
A tool-driven SQL Agent for structured relational data
Or both together, when the question requires combined reasoning

The system is designed so the LLM acts as a reasoning layer, while all data access is explicit, controlled, and auditable.

flowchart TD

    %% ======================
    %% USER
    %% ======================
    subgraph USER["<B>User</B>"]
        A[User Asks a Question]
    end

    %% ======================
    %% API GATEWAY
    %% ======================
    subgraph API["<B>API Gateway</B>"]
        B[API Receives the Question]
    end

    %% ======================
    %% ROUTING SERVICE
    %% ======================
    subgraph ROUTING["<B>Routing Service</B>"]
        C[Decides the Best Path]
    end

    %% ======================
    %% EXECUTION PATHS
    %% ======================
    subgraph EXEC["<B>Execution Paths</B>"]
        D[Query Business Database]
        E[Search Knowledge Base]
        F[Combine Sources if Needed]
    end

    %% ======================
    %% DATA & KNOWLEDGE
    %% ======================
    subgraph DATA["<B>Data & Knowledge</B>"]
        G[(Business Database)]
        H[(Knowledge Base)]
        K[(Knowledge + Database)]
    end

    %% ======================
    %% RESPONSE
    %% ======================
    subgraph RESPONSE["Response"]
        I[Generate Clear Answer]
        J[Return Answer to User]
    end

    %% ======================
    %% FLOW
    %% ======================
    A --> B
    B --> C

    C -->|Data Question| D
    C -->|Knowledge Question| E
    C -->|Mixed Question| F

    D --> G
    E --> H
    F --> K

    G --> I
    H --> I
    K --> I

    I --> J

    %% ======================
    %% STYLES (LIGHT GRAY)
    %% ======================
    style USER fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style API fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style ROUTING fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style EXEC fill:#f5f5f5,stroke:#d1d5db,stroke-width:2px
    style DATA fill:#f9fafb,stroke:#d1d5db,stroke-width:2px
    style RESPONSE fill:#f9fafb,stroke:#d1d5db,stroke-width:2px

5.1 User Asks a Question

The flow begins when a user submits a natural-language question. At this point, the system does not assume where the answer should come from.

5.2 API Gateway Receives the Request

The API Gateway serves as the single entry point for all queries. Its responsibilities include:

Accept the user question
Attach session and request metadata
Forward the question to the routing service

No reasoning or data access happens here.

5.3 Routing Service Decides the Best Path

The Routing Service analyzes the question and determines which execution path is required:

Data Question → Requires structured database access → Route to the SQL Agent
Knowledge Question → Requires document understanding → Route to the RAG pipeline
Mixed Question → Requires both documents and database facts → Route to the hybrid execution path

5.4 Execution Paths

5.4.1 SQL Agent Path (Structured Data)

For data-driven questions:

The SQL Agent uses the LLM as a reasoning engine
The LLM plans database exploration steps
Database tools are executed under strict guardrails:
- Read-only queries
- Schema validation
- Automatic limits
- Mandatory tool usage
The database remains fully protected

The LLM never executes SQL directly, it only reasons about which tool should run next.

5.4.2 RAG Path (Unstructured Knowledge)

For document-driven questions:

Relevant documents are parsed and chunked
Embeddings are generated using a local embedding model
Similar chunks are retrieved from the vector database
Retrieved context is combined with the user question
The LLM generates an answer grounded strictly in retrieved content

This ensures answers are evidence-based, not hallucinated.

5.4.3 Hybrid Path (Knowledge + Database)

For mixed questions:

The system retrieves relevant document context and
Executes validated SQL queries via the SQL Agent
Both sources are combined into a single augmented context
The LLM generates a unified answer that:
- Explains what the documents say
- References what the data shows

This allows true cross-source reasoning without compromising safety.

5.5 Data & Knowledge Layer

All execution paths ultimately draw from the same Data & Knowledge layer, which includes:

Business Database (PostgreSQL)
Knowledge Base (vector store)
Combined Knowledge + Database context (for hybrid queries)

This layer is treated as the source of truth, never the LLM.

5.6 Generate a Clear Answer

Once relevant data is retrieved:

The LLM interprets results
Converts technical outputs into human-readable explanations
Avoids fabricating missing information
Explicitly states when data is insufficient

The answer is formatted for clarity and trust, not verbosity.

5.7 Return Answer to User

The final response is returned to the user with:

Clear reasoning
Grounded evidence
No hidden assumptions
No hallucinated facts

From the user’s perspective, this feels like a single intelligent assistant but internally, it is a carefully orchestrated system.

6 System Overview

6.1 Database Connection view

6.2 Document Retrieval view

6.3 Vector Search view

Contact me to request access to the live demo.

7 Skills & Concepts Demonstrated

End-to-end RAG pipeline implementation
SQL agent with tool-based database access
LLM prompt engineering for grounding and reasoning
Vector database integration for retrieval
Document parsing and chunking for unstructured data
API design for hybrid knowledge access
Production-ready architecture for reliability and safety