This document updates the enterprise-level custom chatbot architecture for an organization with 20,000+ employees, incorporating Retrieval-Augmented Generation (RAG) with hybrid search (vector-based semantic search combined with raw text keyword search) and Optical Character Recognition (OCR) for document processing. It addresses the user's feedback to provide more details on the Data Ingestion Pipeline and DevOps, while retaining the recommended technologies: NGINX as API Gateway, Groq for LLM Inference, Jina Embeddings, and Pinecone as the vector database. The architecture is modular, scalable, and secure, designed to process user queries, leverage NLP, search SharePoint documents, handle smart prompts, and integrate within a robust system design. Additional concepts like monitoring and error handling ensure completeness.
+---------------------------------------------+
| Apache Airflow |
| (Orchestrates and schedules pipeline) |
+---------------------------------------------+
|
v
+---------------------------------------------+
| SharePoint (Data Source) |
| (Documents: PDFs, Images, Word, etc.) |
+---------------------------------------------+
|
v
+---------------------------------------------+
| Azure Document Intelligence (OCR) |
| (Extracts text, tables, structured data) |
+---------------------------------------------+
|
v
+---------------------------------------------+
| Data Cleaning (Python) |
| (Removes noise: headers, footers, etc.) |
+---------------------------------------------+
|
v
+---------------------------------------------+
| Chunking (Python/NLTK) |
| (Splits text into semantic chunks) |
+---------------------------------------------+
|
v
+---------------------------------------------+ +---------------------------------------------+
| Jina Embeddings | ----> | Pinecone (Vector Database) |
| (Generates dense vector embeddings) | | (Stores embeddings for semantic search) |
+---------------------------------------------+ +---------------------------------------------+
|
v
+---------------------------------------------+
| PostgreSQL (Relational Database) |
| (Stores raw text, metadata, tables) |
+---------------------------------------------+
|
v
+---------------------------------------------+
| Chatbot Retrieval (Hybrid Search) |
| (Combines keyword and semantic search) |
+---------------------------------------------+
The architecture is a cloud-native, RAG-based chatbot designed for large enterprises, integrating with Microsoft’s ecosystem (e.g., SharePoint, Teams). It addresses the limitations of traditional chatbots by using hybrid search (vector-based semantic search + raw text keyword search) to retrieve accurate, context-aware information from diverse internal knowledge bases. OCR enhances document processing by extracting text from scanned documents, making them searchable. The updated architecture incorporates:
The updated table is:
Layer | Tech Tools | Role |
---|---|---|
UI (Web/Teams) | SPFx, React, HTML/CSS | User Chat Interface |
API Gateway | NGINX | Handle requests, load balancing |
LLM Inference | Groq (Llama3, Mixtral) | Generate replies |
Data Connector | Graph API, SharePoint API | Extract enterprise data |
Vector Store | Pinecone (Jina Embeddings) | Store embeddings, hybrid search |
Memory Management | LangChain / MongoDB/Redis | Track chat history |
Data Ingestion | Apache Airflow, Azure Document Intelligence (OCR), Python, NLTK | Process and index data |
DevOps / Deploy | Docker, Kubernetes, Azure DevOps, Terraform, ArgoCD | Build, deploy, scale, infrastructure as code |
Security | OAuth2, SSO, Permissions | Protect data + users |