This document updates the enterprise-level custom chatbot architecture for an organization with 20,000+ employees. The design uses Retrieval-Augmented Generation (RAG) with hybrid search (vector-based semantic search combined with raw-text keyword search) and Optical Character Recognition (OCR) for document processing. This revision expands the Data Ingestion Pipeline and DevOps sections in response to feedback, while retaining the recommended technologies: NGINX as the API gateway, Groq for LLM inference, Jina Embeddings, and Pinecone as the vector database. The architecture is modular, scalable, and secure: it processes user queries, applies NLP, searches SharePoint documents, handles smart prompts, and fits into the wider system design. Monitoring and error handling are covered as additional concepts.

Data Ingestion Pipeline Flow

+---------------------------------------------+
|               Apache Airflow                |
|   (Orchestrates and schedules pipeline)     |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|           SharePoint (Data Source)          |
|   (Documents: PDFs, Images, Word, etc.)     |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|       Azure Document Intelligence (OCR)     |
|   (Extracts text, tables, structured data)  |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|        Data Cleaning (Python)               |
|   (Removes noise: headers, footers, etc.)   |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|         Chunking (Python/NLTK)              |
|   (Splits text into semantic chunks)        |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+       +---------------------------------------------+
|         Jina Embeddings                     | ----> |       Pinecone (Vector Database)            |
|   (Generates dense vector embeddings)       |       |   (Stores embeddings for semantic search)   |
+---------------------------------------------+       +---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|       PostgreSQL (Relational Database)      |
|   (Stores raw text, metadata, tables)       |
+---------------------------------------------+
                   |
                   v
+---------------------------------------------+
|       Chatbot Retrieval (Hybrid Search)     |
|   (Combines keyword and semantic search)    |
+---------------------------------------------+
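
The cleaning and chunking stages in the flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual implementation: the function names (clean_pages, chunk_sentences), the 60% repetition threshold, and the max_chars and overlap parameters are hypothetical, and a simple regular expression stands in for an NLTK sentence tokenizer so the example runs with the standard library alone.

```python
import math
import re
from collections import Counter

# Matches bare page markers such as "3" or "Page 3 of 12" (assumed footer format).
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def clean_pages(pages: list[str]) -> str:
    """Remove page numbers and lines repeated on most pages (likely headers/footers)."""
    # Count on how many pages each distinct line appears.
    per_page = [{ln.strip() for ln in page.splitlines() if ln.strip()} for page in pages]
    counts = Counter(ln for lines in per_page for ln in lines)
    # Hypothetical rule: a line seen on at least 60% of pages (and at least 2) is noise.
    threshold = max(2, math.ceil(0.6 * len(pages)))

    kept = []
    for page in pages:
        for ln in page.splitlines():
            s = ln.strip()
            if not s or PAGE_NUMBER.match(s) or counts[s] >= threshold:
                continue
            kept.append(s)
    return " ".join(kept)


def chunk_sentences(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars, with sentence overlap."""
    # Naive sentence split; NLTK's tokenizer would replace this in a real pipeline.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last `overlap` sentences into the next chunk for context.
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk returned by chunk_sentences would then be passed to the embedding step and stored alongside its raw text and metadata. Splitting on sentence boundaries rather than fixed character offsets keeps each chunk readable on its own, which tends to help both keyword and semantic retrieval.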

Table of Contents

  1. Overview
  2. Architecture Components
  3. Additional Concepts
  4. System Design and Integration
  5. Resources for Further Learning
  6. Conclusion

Overview

The architecture is a cloud-native, RAG-based chatbot designed for large enterprises that integrates with Microsoft’s ecosystem (e.g., SharePoint, Teams). It addresses the limitations of traditional chatbots by using hybrid search (vector-based semantic search plus raw-text keyword search) to retrieve accurate, context-aware information from diverse internal knowledge bases. OCR extends document processing by extracting text from scanned documents so that they become searchable.

The updated architecture incorporates the layers and technologies in the following table:

Layer             | Tech Tools                                                      | Role
------------------+-----------------------------------------------------------------+---------------------------------------------
UI (Web/Teams)    | SPFx, React, HTML/CSS                                           | User chat interface
API Gateway       | NGINX                                                           | Handle requests, load balancing
LLM Inference     | Groq (Llama3, Mixtral)                                          | Generate replies
Data Connector    | Graph API, SharePoint API                                       | Extract enterprise data
Vector Store      | Pinecone (Jina Embeddings)                                      | Store embeddings, hybrid search
Memory Management | LangChain / MongoDB/Redis                                       | Track chat history
Data Ingestion    | Apache Airflow, Azure Document Intelligence (OCR), Python, NLTK | Process and index data
DevOps / Deploy   | Docker, Kubernetes, Azure DevOps, Terraform, ArgoCD             | Build, deploy, scale, infrastructure as code
Security          | OAuth2, SSO, Permissions                                        | Protect data and users
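
The hybrid search described in the Overview has to merge two ranked result lists: one from keyword search over the raw text and one from semantic search over the embeddings. This document does not specify how the two are combined, so the sketch below uses reciprocal rank fusion (RRF) as one common option; it is an assumption, not the architecture's confirmed method. The function name, the constant k = 60, and the document IDs are hypothetical, and the two input lists are hard-coded where a real system would query PostgreSQL and Pinecone.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Hypothetical results: best match first in each list.
keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from a keyword query on raw text
semantic_hits = ["doc_b", "doc_d", "doc_a"]  # e.g. from a vector similarity query

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

RRF works on ranks rather than raw scores, so it avoids having to normalize keyword relevance scores and vector similarities onto a common scale. A weighted score combination is an alternative if one search type should dominate.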