Quran-RAG | LLM-Lab

Abstract

This project presents Quran-RAG, a domain-specific Retrieval-Augmented Generation (RAG) framework designed to provide accurate, context-grounded, and evidence-backed answers about the Quran, Hadith, and classical and contemporary Islamic scholarship. We build a curated knowledge base from authoritative textual sources, including the Quranic text, major classical commentaries (tafsir), canonical Hadith collections, scholarly publications, and verified news and educational resources. Dense retrieval over this knowledge base is combined with generative models to reduce hallucination and improve factual precision in religiously and historically sensitive contexts.

🌐 Vision

Quran-RAG is designed to be a responsible, transparent, and auditable RAG system for Quranic knowledge. The long-term goal is a modular retrieval ecosystem that supports research, education, and ethically aligned user-facing applications while ensuring every generated response cites and can be traced to reliable primary and secondary sources.

Quran-RAG: Retrieval-Augmented Generation for Quranic Knowledge

Quran-RAG couples dense retrieval over a domain-specific vector database with LLM-based response synthesis to produce answers that are contextually precise and supported by evidence. Grounding answers in retrieved text reduces the risk of hallucinated or unsupported claims on sensitive religious topics and yields outputs that users and researchers can audit and verify.

⚙️ Core Concept

The pipeline follows a retriever–generator pattern:

- The retriever searches a Quranic vector store built from semantically encoded chunks of the curated corpus (Quran, tafsir, hadith, scholarly articles, etc.).
- The generator synthesizes retrieved passages into a coherent, source-backed response. Retrieved passages (and citations) are attached to the output to maximize transparency.
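
The retriever and generator roles described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the names `Passage`, `Answer`, `answer_question`, `toy_retrieve`, and `toy_generate` are hypothetical, the corpus holds placeholder text only, and word overlap stands in for dense retrieval so the example runs without a vector store or an LLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str  # illustrative label, e.g. a verse reference or tafsir title
    text: str


@dataclass
class Answer:
    text: str
    evidence: List[Passage]  # retrieved passages attached for transparency


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[Passage]],
    generate: Callable[[str, List[Passage]], str],
    top_k: int = 3,
) -> Answer:
    """Retrieve evidence first, then generate an answer conditioned on it."""
    passages = retrieve(question, top_k)
    text = generate(question, passages)
    return Answer(text=text, evidence=passages)


# Toy stand-ins (assumptions): placeholder corpus, word-overlap retrieval,
# and a template "generator" in place of a real LLM call.
CORPUS = [
    Passage("source-A", "placeholder passage about fasting"),
    Passage("source-B", "placeholder passage about charity"),
]


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(question: str, top_k: int) -> List[Passage]:
    query = _words(question)
    ranked = sorted(CORPUS, key=lambda p: len(query & _words(p.text)), reverse=True)
    return ranked[:top_k]


def toy_generate(question: str, passages: List[Passage]) -> str:
    cited = ", ".join(p.source for p in passages)
    return f"Answer to {question!r} based on: {cited}"


result = answer_question("What about charity?", toy_retrieve, toy_generate, top_k=1)
print(result.text)
print([p.source for p in result.evidence])
```

Passing `retrieve` and `generate` as callables keeps the two stages swappable, so a real vector search or model client could replace the toy functions without changing `answer_question`.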

🔍 Objectives

- Provide accurate, evidence-backed answers about the Quran, its interpretations, and related Islamic texts.
- Reduce misinformation and ambiguous claims by grounding responses in primary sources and peer-reviewed scholarship.
- Offer an open research platform for evaluating truthful reasoning and context-aware retrieval in religious and historical domains.
- Build tools for educators, students, researchers, and developers that require traceable, auditable outputs.

🧠 Key Features

1. Retrieval-Augmented Knowledge Grounding

The knowledge base contains:
- The Quranic text (segmented and annotated).
- Major classical tafsir (exegesis) and modern commentaries.
- Canonical Hadith collections with narration metadata.
- Scholarly articles, legal texts, and verified educational resources.

The retriever is powered by dense semantic embeddings; the repository includes embedding artifacts created with models such as BGE and E5 variants.
Retrieved passages are returned alongside model outputs so users can inspect the exact evidence.
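
One way to picture a knowledge-base entry that carries its own provenance is a small record type. The sketch below is an assumption about shape only: `SourceType`, `EvidenceRecord`, and every field name are hypothetical, and the sample values are placeholders rather than real references.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class SourceType(Enum):
    QURAN = "quran"
    TAFSIR = "tafsir"
    HADITH = "hadith"
    SCHOLARSHIP = "scholarship"


@dataclass
class EvidenceRecord:
    record_id: str
    source_type: SourceType
    reference: str  # human-readable locator for the passage
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. narration details

    def citation(self) -> str:
        """Render a compact citation string to show next to a model output."""
        label = f"[{self.source_type.value}] {self.reference}"
        extras = ", ".join(f"{k}={v}" for k, v in sorted(self.metadata.items()))
        return f"{label} ({extras})" if extras else label


# Placeholder values only; no real text or references are reproduced here.
record = EvidenceRecord(
    record_id="h-001",
    source_type=SourceType.HADITH,
    reference="Collection X, no. N",
    text="placeholder text",
    metadata={"narrator": "placeholder", "grade": "placeholder"},
)
print(record.citation())
```

Keeping the citation logic on the record means any component that returns a passage can also return a display-ready source string.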

2. Factually Verified Question Answering

Responses include explanations supported by retrieved evidence, cited excerpts, and, when a conversation spans multiple turns, optional hierarchical context.
The system is designed to prefer conservative, well-sourced answers on contested or sensitive interpretive questions.
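
A preference for conservative answers could be enforced in several ways; one simple option is to answer only when enough retrieved passages clear a similarity threshold. The sketch below assumes that approach. The function `gate_answer`, its thresholds, and the abstention message are all hypothetical and are not taken from the project.

```python
from typing import Dict, List, Tuple

ABSTAIN_MESSAGE = "The retrieved sources do not clearly support an answer."


def gate_answer(
    scored_sources: List[Tuple[str, float]],
    min_score: float = 0.6,   # assumed similarity cutoff
    min_sources: int = 2,     # assumed minimum number of supporting passages
) -> Dict[str, object]:
    """Answer only when enough retrieved passages clear the score threshold."""
    strong = [src for src, score in scored_sources if score >= min_score]
    if len(strong) < min_sources:
        return {"mode": "abstain", "message": ABSTAIN_MESSAGE, "sources": strong}
    return {"mode": "answer", "message": None, "sources": strong}


well_supported = gate_answer([("source-A", 0.82), ("source-B", 0.71), ("source-C", 0.40)])
weakly_supported = gate_answer([("source-A", 0.82), ("source-C", 0.40)])
print(well_supported["mode"], weakly_supported["mode"])
```

In practice the thresholds would need tuning against labeled examples, since similarity scores are not comparable across embedding models.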

3. Interactive Web Interface

A modern chat-style interface (React + TypeScript in the repo’s ui/ folder) supports streaming responses, chat history, user feedback, and a reference panel that displays retrieved excerpts and source metadata.

🧩 System Architecture

Quran-RAG has four main components:

1. Data Curation and Processing

- Aggregation from canonical sources and scholarly repositories.
- Text normalization, deduplication, sentence/verse segmentation, and chunking.
- Embedding generation (the repository contains example embeddings and scripts used to build vector indexes).
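
The normalization, deduplication, and chunking steps can be illustrated with standard-library Python. This is a sketch under stated assumptions: NFKC normalization, exact-match deduplication, and fixed-size word windows are illustrative choices, and the function names are hypothetical. Normalization of Arabic script in particular (for example, how diacritics are handled) is a design decision this sketch does not settle.

```python
import re
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(texts: Iterable[str]) -> List[str]:
    """Drop exact duplicates after normalization, keeping first occurrences."""
    seen = set()
    unique = []
    for raw in texts:
        norm = normalize(raw)
        if norm and norm not in seen:
            seen.add(norm)
            unique.append(norm)
    return unique


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


docs = deduplicate(["  Same   text ", "Same text", "Other"])
windows = chunk_words("a b c d e f g", size=4, overlap=2)
print(docs)
print(windows)
```

Overlapping windows reduce the chance that a sentence relevant to a query is split across two chunks, at the cost of some redundancy in the index.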

2. Vector Database Layer

Chunked passages are encoded and stored in a vector store optimized for semantic similarity search (FAISS indexes and other indexed stores are present in the repository).
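
To show what semantic similarity search over stored vectors amounts to, here is a brute-force stand-in written in pure Python. It is not FAISS and does not use the FAISS API; the class `TinyVectorStore` and the three-dimensional toy vectors are hypothetical. It performs an exact inner-product search over unit-length vectors, which is equivalent to ranking by cosine similarity.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class TinyVectorStore:
    """Brute-force store: inner product on unit vectors equals cosine similarity."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._ids: List[str] = []
        self._vectors: List[List[float]] = []

    def add(self, chunk_id: str, vector: Sequence[float]) -> None:
        if len(vector) != self.dim:
            raise ValueError(f"expected dimension {self.dim}, got {len(vector)}")
        self._ids.append(chunk_id)
        self._vectors.append(l2_normalize(vector))

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        """Return the k most similar chunk ids with their cosine scores."""
        q = l2_normalize(query)
        scores = [
            (chunk_id, sum(a * b for a, b in zip(q, vec)))
            for chunk_id, vec in zip(self._ids, self._vectors)
        ]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:k]


store = TinyVectorStore(dim=3)
store.add("chunk-a", [1.0, 0.0, 0.0])
store.add("chunk-b", [0.0, 1.0, 0.0])
store.add("chunk-c", [1.0, 1.0, 0.0])
hits = store.search([1.0, 0.1, 0.0], k=2)
print([chunk_id for chunk_id, _ in hits])
```

A linear scan like this costs time proportional to the number of stored chunks per query, which is why a dedicated index becomes worthwhile as the corpus grows.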

3. Retrieval and Context Assembly

Semantic search returns the top-k relevant passages, which are de-duplicated and assembled into a context window for the generator.
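
The de-duplication and assembly step might look like the following sketch. It assumes a character budget as a rough stand-in for a token budget and treats two passages as duplicates when their lowercased, whitespace-collapsed text matches. `Hit` and `assemble_context` are hypothetical names.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    chunk_id: str
    text: str
    score: float


def assemble_context(hits: List[Hit], max_chars: int = 200) -> List[Hit]:
    """Keep the best-scoring unique passages that fit within a character budget."""
    selected: List[Hit] = []
    seen_texts = set()
    used = 0
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        key = " ".join(hit.text.lower().split())
        if key in seen_texts:
            continue  # same passage text already selected
        if used + len(hit.text) > max_chars:
            continue  # does not fit the remaining budget; try shorter passages
        seen_texts.add(key)
        selected.append(hit)
        used += len(hit.text)
    return selected


candidates = [
    Hit("a", "Passage one", 0.90),
    Hit("b", "passage  ONE", 0.80),  # duplicate of "a" after normalization
    Hit("c", "Passage two", 0.70),
    Hit("d", "x" * 300, 0.95),       # exceeds the budget on its own
]
context = assemble_context(candidates, max_chars=30)
print([hit.chunk_id for hit in context])
```

Skipping an oversized passage rather than stopping lets lower-ranked but shorter passages still fill the window; truncating the long passage instead would be an equally reasonable policy.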

4. Generative Response Synthesis

A local or hosted LLM consumes the assembled context and produces a structured answer that includes citations and optional quoted excerpts.
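
A common way to obtain answers with traceable citations is to number the retrieved passages in the prompt and map the markers in the model's reply back to sources. The sketch below assumes that convention; the prompt wording and the functions `build_prompt` and `cited_sources` are hypothetical, and no model is called.

```python
import re
from typing import List, Tuple

PassageList = List[Tuple[str, str]]  # (source label, passage text)


def build_prompt(question: str, passages: PassageList) -> str:
    """Number each passage so the model can cite it as [n]."""
    lines = [
        "Answer using only the numbered passages below.",
        "Cite passages as [n]. If they are insufficient, say so.",
        "",
    ]
    for n, (source, text) in enumerate(passages, start=1):
        lines.append(f"[{n}] ({source}) {text}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def cited_sources(answer: str, passages: PassageList) -> List[str]:
    """Map [n] markers in an answer back to source labels, ignoring invalid n."""
    found: List[str] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(passages):
            source = passages[n - 1][0]
            if source not in found:
                found.append(source)
    return found


passages = [("source-A", "placeholder one"), ("source-B", "placeholder two")]
prompt = build_prompt("placeholder question?", passages)
# A made-up model reply, used only to exercise the citation parser.
sources = cited_sources("Claim [2]. Another claim [2][1], and a stray [7].", passages)
print(prompt)
print(sources)
```

Discarding out-of-range markers such as `[7]` prevents a reply from appearing to cite evidence that was never retrieved.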

📊 Evaluation Framework

We include an evaluation suite inspired by cognitive benchmarks that measure:

- Factual recall (e.g., verse identification and factual queries).
- Interpretive understanding (comparing generated interpretations with established commentaries).
- Analytical reasoning across documents (e.g., cross-referencing tafsir and hadith).
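
For the factual-recall dimension, a standard retrieval metric such as recall@k is one plausible measurement. The source does not state which metrics the suite uses, so the sketch below is an assumption; the function names and the query and verse identifiers are hypothetical placeholders.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top = set(retrieved[:k])
    return len(top & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query that has gold labels."""
    scores = [recall_at_k(runs[qid], gold[qid], k) for qid in gold]
    return sum(scores) / len(scores)


# Placeholder identifiers: ranked retrieval results and gold labels per query.
runs = {"q1": ["v1", "v9", "v2"], "q2": ["v5", "v6", "v7"]}
gold = {"q1": {"v1", "v2"}, "q2": {"v7", "v8"}}

print(mean_recall_at_k(runs, gold, k=2))  # 0.25
print(mean_recall_at_k(runs, gold, k=3))  # 0.75
```

Interpretive understanding and cross-document reasoning are harder to score automatically and would likely need expert-written references or human review.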

🧭 Impact

Quran-RAG aims to:

- Provide a transparent, auditable AI framework for religious and historical scholarship.
- Support educators, researchers, and the public with reliable, evidence-backed tools.
- Demonstrate how domain-specific retrieval can improve factuality and ethical alignment compared with generic LLM outputs.

Open this project page

You can open this project page directly in your browser:

View: https://llm-lab.qcri.org/quranic/
