What it does
Haiku RAG is an agentic retrieval-augmented generation system for indexing documents and answering questions with citations. It combines LanceDB for vector storage, Pydantic AI for multi-agent orchestration, and Docling for document parsing. The system supports hybrid search (vector plus full-text), multimodal retrieval (embedding both text and figures in a shared vector space), and vision-aware QA when documents contain images. Beyond simple question answering, it provides research agents for iterative planning and synthesis, analysis agents for complex computational tasks via sandboxed Python, and conversational interfaces with multi-turn memory. Indices are local-first via embedded LanceDB, though cloud and object-storage backends are available.
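The description above does not say how the vector and full-text rankings are merged in hybrid search. One widely used option is reciprocal rank fusion (RRF), and the sketch below illustrates that technique in plain Python under that assumption. The function name, the constant `k=60`, and the document ids are all illustrative and are not taken from Haiku RAG.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers (best hit first).
vector_hits = ["doc_a", "doc_b", "doc_c"]
fulltext_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
print(fused)
```

Documents found by both retrievers (`doc_a`, `doc_b`) outrank those found by only one, which is the main reason to combine keyword and semantic results at all.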
Who it's for
Document analysts and researchers who need to extract structured insights from large collections. Teams building conversational document search features. Engineers integrating RAG capabilities into Claude Desktop or other AI assistants without running a separate backend.
Common use cases
- Index PDFs and web documents, then search by keyword or semantic similarity with page-specific citations
- Run multi-turn research workflows using agentic planning: decompose a research question into steps, execute searches, and synthesize results
- Analyze document collections programmatically: count mentions, compute aggregations, compare claims across sources
- Build conversational chatbots over proprietary documents with session memory and visual grounding
- Expose document search tools to Claude via MCP, for use within Claude Desktop or through API calls
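The research workflow in the list above (decompose a question, run searches, synthesize results) can be pictured as a small loop. The sketch below is only a shape for that loop, not Haiku RAG's implementation: `run_research`, `Finding`, the three `toy_*` functions, and the in-memory corpus are all hypothetical stand-ins for what would be agent and index calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    sub_question: str
    sources: list[str]  # citations for passages that matched


def run_research(
    question: str,
    plan: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    synthesize: Callable[[str, list[Finding]], str],
) -> str:
    """Plan sub-questions, search for each, then synthesize one answer."""
    findings = [Finding(sq, search(sq)) for sq in plan(question)]
    return synthesize(question, findings)


# Toy corpus keyed by citation; a real index would hold document chunks.
corpus = {
    "report.pdf, p. 3": "Revenue grew 12% in 2023.",
    "report.pdf, p. 7": "Headcount fell by 40 in 2023.",
}


def toy_plan(question: str) -> list[str]:
    # Stand-in planner: split a compound question on " and ".
    return [part.strip() for part in question.rstrip("?").split(" and ")]


def toy_search(sub_question: str) -> list[str]:
    # Stand-in retriever: return citations whose text shares a word.
    words = {w.lower() for w in sub_question.split()}
    return [
        source
        for source, text in corpus.items()
        if words & {w.strip(".,%").lower() for w in text.split()}
    ]


def toy_synthesize(question: str, findings: list[Finding]) -> str:
    # Stand-in synthesizer: list each sub-question with its citations.
    return "; ".join(
        f"{f.sub_question}: {', '.join(f.sources) or 'no source'}"
        for f in findings
    )


answer = run_research("revenue and headcount?", toy_plan, toy_search, toy_synthesize)
print(answer)
```

Passing the planner, retriever, and synthesizer in as callables keeps the loop itself independent of any particular model or index.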
Setup pitfalls
- Requires Python 3.12 or newer; environments on Python 3.11 or older will not work
- Needs an embedding provider configured (Ollama, OpenAI, VoyageAI, LM Studio, or vLLM); indexing will fail if none is available
- Reads and writes to the filesystem for document cache and LanceDB indices; requires appropriate permissions and disk space for large document collections
- Makes network calls to fetch remote documents and to reach embedding APIs; it carries a high-risk classification and should be sandboxed in security-sensitive environments
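Several of the pitfalls above (Python version, filesystem permissions, disk space) can be checked before indexing starts. The following is a minimal preflight sketch using only the standard library. The function name `preflight`, the 1 GB free-space threshold, and the idea of a single data directory are assumptions for illustration; they are not part of Haiku RAG, and the check does not cover the embedding provider.

```python
import os
import shutil
import sys
import tempfile


def preflight(data_dir, min_free_bytes=1_000_000_000, min_python=(3, 12)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Interpreter version check (3.12 is the minimum stated above).
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer is required")

    # The directory that will hold the document cache and index files.
    if not os.path.isdir(data_dir):
        problems.append(f"{data_dir} does not exist or is not a directory")
        return problems

    if not os.access(data_dir, os.R_OK | os.W_OK):
        problems.append(f"{data_dir} is not readable and writable")

    # The free-space threshold is an arbitrary illustrative default.
    free = shutil.disk_usage(data_dir).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free in {data_dir}")

    return problems


# Example: check the system temp directory as a stand-in data directory.
print(preflight(tempfile.gettempdir()))
```

Returning a list of problems instead of raising on the first one lets a setup script report everything that needs fixing in a single run.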