<aside>
To build a fully autonomous AI agent capable of handling multimodal user interactions (text, voice, image) inside Telegram, with memory and access to external tools like email, calendar, contacts, and web search.
Most chatbots are limited to one input type (text) and lack context, memory, or connection to external tools. Real-life productivity assistants need to understand multiple input types, store short-term memory, and execute real-world tasks autonomously.
Set up a Telegram bot using BotFather and connect it to n8n with Telegram Trigger.
Create a Switch node to classify incoming messages into 3 types: text, voice, image.
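The Switch node's routing rule can be sketched as plain logic; the field names below mirror Telegram's Bot API message object (`text`, `voice`, `photo`), which is what the trigger delivers:

```python
def classify_message(message: dict) -> str:
    """Route a Telegram message to one of three branches,
    mirroring the Switch node's rules (Telegram Bot API field names)."""
    if "voice" in message:
        return "voice"
    if "photo" in message:  # Telegram sends photos as a list of sizes
        return "image"
    if "text" in message:
        return "text"
    return "unsupported"

classify_message({"text": "hello"})               # -> "text"
classify_message({"voice": {"file_id": "abc"}})   # -> "voice"
```

Checking `voice` and `photo` before `text` matters, since a photo message can also carry a text caption.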
For image input, create a branch that downloads the image, fixes the file extension, and sends it to OpenAI Vision for analysis.
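The extension fix can be sketched as a small helper that renames the downloaded binary to match its MIME type before it is handed to the Vision model (a minimal sketch; the default of `.jpg` is an assumption, since Telegram photos are typically JPEG):

```python
import mimetypes

def fix_extension(file_name: str, mime_type: str) -> str:
    """Give the downloaded binary an extension matching its MIME type,
    so the Vision API receives a correctly named image file."""
    ext = mimetypes.guess_extension(mime_type) or ".jpg"  # assumed fallback
    base = file_name.rsplit(".", 1)[0]
    return base + ext

fix_extension("file_12", "image/png")  # -> "file_12.png"
```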
For voice input, download the audio file and use OpenAI Whisper to transcribe it to text.
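The download step follows the Bot API convention: `getFile` returns a `file_path`, and the binary is served from `/file/bot<token>/<file_path>`. A sketch of the URL construction, with the Whisper call shown as a commented, hypothetical `openai`-client usage (requires the `openai` package and an API key):

```python
TELEGRAM_API = "https://api.telegram.org"

def file_download_url(bot_token: str, file_path: str) -> str:
    """Build the download URL for a file_path returned by getFile."""
    return f"{TELEGRAM_API}/file/bot{bot_token}/{file_path}"

# Hypothetical transcription call (openai-python v1 style):
# from openai import OpenAI
# client = OpenAI()
# with open("voice.oga", "rb") as f:
#     text = client.audio.transcriptions.create(model="whisper-1", file=f).text

file_download_url("123:ABC", "voice/file_7.oga")
# -> "https://api.telegram.org/file/bot123:ABC/voice/file_7.oga"
```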
For text input, capture and normalize it using a Set node to structure the prompt.
Pass all text into the AI Agent node (LangChain Agent) using an OpenAI GPT model with Window Buffer Memory.
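Window Buffer Memory keeps only the last few exchanges per session. A minimal sketch of the idea, keyed by chat id the same way the memory-persistence step keys sessions:

```python
from collections import defaultdict, deque

class WindowBufferMemory:
    """Per-session window memory sketch: retain only the last `k`
    exchanges for each Telegram chat id (the session key)."""
    def __init__(self, k: int = 5):
        self.k = k
        self.sessions = defaultdict(lambda: deque(maxlen=self.k))

    def add(self, chat_id: int, user_msg: str, ai_msg: str) -> None:
        self.sessions[chat_id].append((user_msg, ai_msg))

    def context(self, chat_id: int) -> list:
        return list(self.sessions[chat_id])

mem = WindowBufferMemory(k=2)
mem.add(42, "hi", "hello!")
mem.add(42, "what's on my calendar?", "Two meetings today.")
mem.add(42, "thanks", "You're welcome.")
len(mem.context(42))  # -> 2 (oldest exchange evicted)
```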
Define a powerful system prompt that explains the agent's tools and expected response formats.
Connect tools to the AI Agent: Get Emails, Send Email, Get Calendar, Set Calendar, Contacts (Airtable), Google Search (SerpAPI).
Use LangChain Tools node types to expose all tools in the agent's environment.
Set up the email tools using Gmail API with OAuth for reading and sending messages.
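Under the hood, Gmail's `users.messages.send` endpoint expects the RFC 2822 message base64url-encoded under a `raw` key; a sketch of building that payload (the actual send, with OAuth, is handled by the n8n Gmail node):

```python
import base64
from email.message import EmailMessage

def build_gmail_payload(to: str, subject: str, body: str) -> dict:
    """Build the {"raw": ...} body that Gmail's messages.send expects."""
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    return {"raw": raw}

payload = build_gmail_payload("a@example.com", "Hello", "Sent by the agent.")
# payload["raw"] decodes back to a valid MIME message
```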
Configure Google Calendar tools to check and schedule meetings based on LLM intent.
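The scheduling side boils down to turning the LLM's intent into an event body in the shape Google Calendar's `events.insert` expects (summary plus `start`/`end` objects with `dateTime` and `timeZone`):

```python
def build_event(summary: str, start_iso: str, end_iso: str,
                tz: str = "UTC") -> dict:
    """Event body in the shape Google Calendar's events.insert expects."""
    return {
        "summary": summary,
        "start": {"dateTime": start_iso, "timeZone": tz},
        "end": {"dateTime": end_iso, "timeZone": tz},
    }

event = build_event("Sync with Alex", "2024-06-01T10:00:00",
                    "2024-06-01T10:30:00")
```

The ISO 8601 date strings would come from the agent's tool-call arguments.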
Connect Airtable as a contact database and allow AI to perform filtered searches by name.
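A filtered name search maps to Airtable's `filterByFormula` parameter; a sketch of building a case-insensitive partial match (the `{Name}` field name is an assumption about the base's schema):

```python
def name_filter(name: str) -> str:
    """Airtable filterByFormula for a case-insensitive partial match
    on a {Name} field (field name assumed)."""
    escaped = name.replace("'", "\\'")  # guard against quotes in input
    return f"SEARCH(LOWER('{escaped}'), LOWER({{Name}}))"

name_filter("alex")  # -> "SEARCH(LOWER('alex'), LOWER({Name}))"
```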
Connect SerpAPI for live web search via Google, returning snippets and URLs.
Add conditional logic (If) to decide whether the AI response should be sent as voice or text.
If audio response is needed, use OpenAI TTS to convert the reply into an mp3.
Send voice reply through Telegram using sendAudio.
If no audio is required, send a text message using sendMessage.
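The whole response branch can be sketched as one decision: pick the Bot API method and payload based on whether an audio reply is needed (URL shape follows the Bot API convention; the mp3 itself would come from the TTS step and go in the multipart body):

```python
TELEGRAM_API = "https://api.telegram.org"

def build_reply(bot_token: str, chat_id: int, reply_text: str,
                as_voice: bool) -> tuple:
    """Choose sendAudio for a generated mp3 reply, sendMessage otherwise."""
    if as_voice:
        url = f"{TELEGRAM_API}/bot{bot_token}/sendAudio"
        payload = {"chat_id": chat_id}  # mp3 bytes go in the multipart body
    else:
        url = f"{TELEGRAM_API}/bot{bot_token}/sendMessage"
        payload = {"chat_id": chat_id, "text": reply_text}
    return url, payload

url, payload = build_reply("123:ABC", 42, "Done!", as_voice=False)
# url ends with "/sendMessage"; payload carries the text
```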
Group sticky notes visually to separate Voice Chat, Image Chat, Agent Core, and Response Handler blocks.
Test multimodal prompts: image captioning, audio commands, contact queries, and email actions.
Enable logging and memory persistence by session (chat.id as memory key).
This project showcases a fully modular and extensible AI assistant capable of handling dynamic human interaction. It simulates real-world use cases like managing a calendar, sending emails, understanding images, and replying by voice. Perfect for showcasing advanced AI x automation architecture.

Download & Explore the Workflow
View GitHub Repository: