Enterprise-Level Custom Chatbot Architecture

This document updates the enterprise-level custom chatbot architecture for an organization with 20,000+ employees. It incorporates Retrieval-Augmented Generation (RAG) with hybrid search (vector-based semantic search combined with raw-text keyword search) and Optical Character Recognition (OCR) for document processing. In response to feedback, it adds detail on the Data Ingestion Pipeline and DevOps while retaining the recommended technologies: NGINX as the API gateway, Groq for LLM inference, Jina Embeddings, and Pinecone as the vector database. The architecture is modular, scalable, and secure. It is designed to process user queries, apply NLP, search SharePoint documents, handle smart prompts, and fit within a robust overall system design. Monitoring and error handling are also covered for completeness.

Data Ingestion Pipeline Flow

+---------------------------------------------+
|               Apache Airflow                |
|   (Orchestrates and schedules pipeline)     |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|           SharePoint (Data Source)          |
|   (Documents: PDFs, Images, Word, etc.)     |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|       Azure Document Intelligence (OCR)     |
|   (Extracts text, tables, structured data)  |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|        Data Cleaning (Python)               |
|   (Removes noise: headers, footers, etc.)   |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|         Chunking (Python/NLTK)              |
|   (Splits text into semantic chunks)        |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+       +---------------------------------------------+
|         Jina Embeddings                     | ----> |       Pinecone (Vector Database)            |
|   (Generates dense vector embeddings)       |       |   (Stores embeddings for semantic search)   |
+---------------------------------------------+       +---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|       PostgreSQL (Relational Database)      |
|   (Stores raw text, metadata, tables)       |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|       Chatbot Retrieval (Hybrid Search)     |
|   (Combines keyword and semantic search)    |
+---------------------------------------------+
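
The cleaning and chunking stages in the flow above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the function names, the 60% repetition threshold, the character budget, and the one-sentence overlap are all assumptions, and a simple regex stands in for NLTK's sentence tokenizer so that the example runs without extra downloads.

```python
import re
from collections import Counter


def remove_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on most pages (likely headers or footers).

    `pages` is a list of per-page text strings, e.g. from the OCR step.
    The threshold is a hypothetical default, not a tuned value.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    noise = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [ln.strip() for ln in page.splitlines()
                if ln.strip() and ln.strip() not in noise]
        cleaned.append(" ".join(kept))
    return " ".join(c for c in cleaned if c)


def chunk_sentences(text, max_chars=200, overlap=1):
    """Greedily pack sentences into chunks, starting a new chunk when
    adding a sentence would exceed max_chars. The last `overlap`
    sentences are repeated at the start of the next chunk."""
    # Regex split is a stand-in for nltk.sent_tokenize.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


pages = [
    "ACME Corp Confidential\nVacation policy. Employees get 20 days.\nPage footer",
    "ACME Corp Confidential\nCarryover is capped at 5 days. Ask HR for exceptions.\nPage footer",
]
text = remove_repeated_lines(pages)
chunks = chunk_sentences(text, max_chars=45, overlap=1)
for chunk in chunks:
    print(chunk)
```

Overlapping chunks by a sentence is one common way to keep context that straddles a chunk boundary retrievable. In a real pipeline each chunk would then be passed to the embedding step and stored alongside its source metadata.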

Overview

The architecture is a cloud-native, RAG-based chatbot designed for large enterprises, integrating with Microsoft’s ecosystem (e.g., SharePoint, Teams). It addresses the limitations of traditional chatbots by using hybrid search (vector-based semantic search + raw text keyword search) to retrieve accurate, context-aware information from diverse internal knowledge bases. OCR enhances document processing by extracting text from scanned documents, making them searchable. The updated architecture incorporates:

NGINX as API Gateway with load balancing.
Groq for low-latency LLM inference (Llama3, Mixtral).
Jina Embeddings for vector generation, with alternatives like Sentence Transformers.
Pinecone for vector storage, supporting hybrid search.
OCR using cloud-based services (e.g., Azure Document Intelligence) for digitizing documents.
Enhanced Data Ingestion Pipeline and DevOps details for developer implementation.
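
To make the hybrid search idea concrete, here is a minimal sketch of one way to merge keyword and semantic results: Reciprocal Rank Fusion (RRF). The source does not specify a fusion method, so RRF is an assumption chosen for illustration. The function name, the document IDs, and the constant k=60 (a commonly used default) are likewise hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results: keyword search over raw text, and semantic
# search over the vector store, each returning IDs best-first.
keyword_hits = ["doc-7", "doc-2", "doc-9"]
semantic_hits = ["doc-2", "doc-4", "doc-7"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['doc-2', 'doc-7', 'doc-4', 'doc-9']
```

RRF uses only rank positions, so it avoids having to normalize keyword scores and vector similarities onto a common scale. A weighted score blend is an alternative when the two score distributions are well understood.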

The updated technology stack, by layer:

Layer	Tech Tools	Role
UI (Web/Teams)	SPFx, React, HTML/CSS	User Chat Interface
API Gateway	NGINX	Handle requests, load balancing
LLM Inference	Groq (Llama3, Mixtral)	Generate replies
Data Connector	Graph API, SharePoint API	Extract enterprise data
Vector Store	Pinecone (Jina Embeddings)	Store embeddings, hybrid search
Memory Management	LangChain, MongoDB or Redis	Track chat history
Data Ingestion	Apache Airflow, Azure Document Intelligence (OCR), Python, NLTK	Process and index data
DevOps / Deploy	Docker, Kubernetes, Azure DevOps, Terraform, ArgoCD	Build, deploy, scale, infrastructure as code
Security	OAuth2, SSO, Permissions	Protect data and users

Table of Contents

Overview