Ashok 1 - What is an LLM? Artificial intelligence (AI) is the broad field; machine learning (ML) is a subset of AI, and deep learning is in turn a subset of ML. Large language models (LLMs) are a type of deep learning model built on neural networks, specifically the transformer architecture. Some popular machine learning algorithms include linear regression, logistic regression, decision trees, support vector machines, and k-means clustering.
Ashok 2 - Explain RAG RAG stands for Retrieval-Augmented Generation. It’s a technique that combines the power of a language model with relevant external knowledge. Instead of relying only on what the model was trained on, RAG retrieves real-time or domain-specific documents—like PDFs, support articles, or internal knowledge bases—and feeds that context into the LLM before generating a response. This makes outputs more accurate, up-to-date, and grounded in facts. It’s especially useful in enterprise use cases like customer support, internal search, or knowledge assistants.
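As a toy illustration of the retrieve-then-generate flow described above, here is a minimal sketch that uses keyword overlap as a stand-in for real embedding search (the documents, function names, and scoring are all invented for illustration):

```python
# Minimal RAG sketch: retrieve relevant snippets, then build a grounded prompt.
# DOCS is a toy knowledge base; a real system would use a vector store.

DOCS = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Score each document by keyword overlap with the query
    (a crude stand-in for embedding similarity search)."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Feed the retrieved context to the LLM alongside the user query."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}"

prompt = build_prompt("How long do refunds take?",
                      retrieve("How long do refunds take?", DOCS))
```

The grounded prompt, not the raw query, is what gets sent to the model, which is why RAG answers stay tied to the documents.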
RAG vs Fine-Tuning vs Prompt Engineering: Optimizing AI Models
**3. ⁉️ Products at Uniphore**
• I’d examine how the user initiates conversations — are they onboarded correctly?
• Sometimes, latency isn’t just technical — users might ask irrelevant or malformed queries that cause unnecessary processing.
• Once a query is submitted, it typically passes through a multi-layered system:
• RAG (Retrieval-Augmented Generation)
• Multiple AI agents working together
• LLM call with: User query, System prompt, Retrieved knowledge base content
• Caching (if enabled)
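The layered flow above could be sketched, very roughly, as a cache-first pipeline (every function here is a hypothetical stand-in, not an actual product stack):

```python
# Cache -> retrieval -> LLM pipeline sketch. All implementations are dummies.

CACHE: dict[str, str] = {}

def retrieve(query: str) -> str:
    """Stand-in for the RAG / knowledge-base lookup."""
    return f"[docs for: {query}]"

def call_llm(system_prompt: str, query: str, context: str) -> str:
    """Stand-in for the actual model call with system prompt + retrieved context."""
    return f"answer to '{query}' using {context}"

def answer(query: str) -> str:
    """Route a query through caching (if enabled), then RAG, then the LLM."""
    if query in CACHE:
        return CACHE[query]                 # cache hit: skip retrieval and the LLM
    context = retrieve(query)               # RAG layer
    response = call_llm("Be concise.", query, context)
    CACHE[query] = response
    return response

first = answer("refund status")
```

A repeated query returns straight from the cache, skipping the two slowest layers entirely.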
⚙️ 2. Prioritize Areas Based on Speed, Quality & Cost
“Latency is a trade-off across speed, quality, and compute cost, so I’d prioritize optimizations based on the type of queries causing delay.”
• Start by identifying where the latency occurs:
• Is it only when RAG is used?
• Is it related to specific data sources (e.g., external APIs)?
• Is the LLM call disproportionately slow?
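One way to answer these questions is to time each layer separately and see which one dominates. A minimal sketch (stage names and sleeps are illustrative stand-ins for real calls):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Wrap each layer to see which one dominates end-to-end latency.
with timed("retrieval"):
    time.sleep(0.01)   # stand-in for the RAG lookup
with timed("llm_call"):
    time.sleep(0.02)   # stand-in for the model call

slowest = max(timings, key=timings.get)
```

In production this instrumentation would feed a tracing tool rather than a dict, but the principle is the same: measure per layer before optimizing.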
⸻
🔍 3. Optimize by Layer
🧠 a. LLM & Prompt Optimization
• Review prompt length and structure — trim redundancy, remove unnecessary few-shot examples.
• Use lightweight fallback models for simpler queries (e.g., Claude Haiku, OpenAI GPT-3.5-turbo).
• Stream responses token-by-token to reduce perceived latency.
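A simple routing heuristic for the lightweight-fallback idea might look like this (the 15-word threshold and the "lightweight"/"primary" labels are assumptions, not a production policy; real routing would use intent classification):

```python
def pick_model(query: str) -> str:
    """Route short, simple queries to a lightweight fallback model
    (e.g. Claude Haiku or GPT-3.5-turbo) and longer, more complex
    queries to the primary model. The word-count cutoff is illustrative."""
    return "lightweight" if len(query.split()) < 15 else "primary"

short_route = pick_model("What are your support hours?")
long_route = pick_model(" ".join(["word"] * 30))
```

The win is that the cheap model handles the bulk of easy traffic, so the expensive model's latency only applies where its quality is actually needed.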
🧩 b. Agentic AI Orchestration
• If multiple agents (e.g., knowledge agent + sentiment agent) run in sequence, parallelize them to reduce total hops.
• Avoid unnecessary agent chaining — route only critical agents based on intent.
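The parallelization idea can be sketched with `asyncio.gather`, assuming the agents are independent of each other (both agents here are dummies):

```python
import asyncio

async def knowledge_agent(query: str) -> str:
    await asyncio.sleep(0.01)               # stand-in for a real retrieval agent
    return f"facts({query})"

async def sentiment_agent(query: str) -> str:
    await asyncio.sleep(0.01)               # stand-in for a real sentiment agent
    return "neutral"

async def handle(query: str) -> list[str]:
    # Run independent agents concurrently instead of chaining them in sequence:
    # total wait is max(agent times), not their sum.
    return await asyncio.gather(knowledge_agent(query), sentiment_agent(query))

facts, mood = asyncio.run(handle("refund status"))
```

This only works when neither agent needs the other's output; dependent agents still have to run in order, which is why trimming unnecessary chaining matters too.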
📚 c. RAG & Knowledge Layer
• Investigate whether latency comes from retrieval quality:
• Are embeddings poorly created?
• Is the source data unstructured or outdated?
• Is enrichment (metadata, ranking) done properly?
• Improve data pipelines before retrieval.
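As one example of enrichment paying off upstream, chunks can carry metadata such as a last-updated date so stale sources are filtered out before they ever reach the retriever (the chunk data and cutoff here are invented):

```python
from datetime import date

# Chunks enriched with metadata at ingestion time, not at query time.
chunks = [
    {"text": "Old pricing table", "updated": date(2021, 1, 1), "source": "wiki"},
    {"text": "Current refund policy", "updated": date(2024, 6, 1), "source": "kb"},
]

def fresh_chunks(chunks: list[dict], cutoff: date) -> list[dict]:
    """Drop outdated chunks so the retriever searches a smaller, cleaner set."""
    return [c for c in chunks if c["updated"] >= cutoff]

kept = fresh_chunks(chunks, date(2023, 1, 1))
```

A smaller, fresher index tends to improve both retrieval latency and answer quality, which is the point of fixing the data pipeline before tuning retrieval itself.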
⸻
🎨 4. UX-Level Improvements for Graceful Handling
“In parallel, while the AI/ML teams work on technical tuning, I’d enhance the user experience to minimize frustration.”
• Show typing indicators, progress bars, or chat animations to reassure users.
• Use streaming output so users see responses as they’re being generated.
• Set smart fallbacks on timeouts — e.g., "Still working on it. Would you like to escalate this or raise a support ticket?"
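Streaming can be approximated with a simple generator that yields tokens as they arrive; here a whitespace split stands in for real token-by-token streaming from a model API:

```python
def stream_tokens(text: str):
    """Yield the response piece by piece so the UI can render each part
    immediately instead of waiting for the full reply."""
    for token in text.split():
        yield token

# Simulate a UI appending tokens as they stream in.
shown = []
for tok in stream_tokens("Your refund was processed yesterday"):
    shown.append(tok)

partial = " ".join(shown[:2])   # what the user sees after two tokens
```

Even when total generation time is unchanged, the user sees the first words almost instantly, which is what reduces *perceived* latency.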
⸻
🎯 Wrap-Up