Overview
A Retrieval-Augmented Generation (RAG) agent that retrieves context from company reports, peer disclosures, and the Greenhouse Gas Protocol, then generates grounded sustainability insights using Gemini 2.5 Flash.
Technical Solution
The AI agent is made up of three major components:
- Document Ingestion: PDFs and CSVs are converted into embeddings and stored in a Pinecone vector database using deterministic IDs to prevent duplication.
  - Pinecone was selected because semantic search enables fast, accurate retrieval across multiple documents at once.
- AI Agent: Gemini 2.5 Flash is used with prompt-engineering techniques such as ReAct, chain-of-thought prompting, few-shot examples, and the CO-STAR framework.
  - Retrieved text and variables are injected dynamically into the prompt to keep the context relevant.
- Web Application: A Streamlit application serves as the web interface, streaming responses to reduce perceived latency and exposing a temperature slider to balance creativity against deterministic output.
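The deterministic-ID scheme used during ingestion can be sketched as follows. The function name and the exact recipe (SHA-256 over source file, chunk index, and chunk text) are illustrative assumptions, not the project's actual implementation:

```python
import hashlib

def chunk_id(source: str, chunk_index: int, text: str) -> str:
    # Hash the chunk's provenance and content so the same chunk always
    # maps to the same ID; re-ingesting a file then overwrites existing
    # vectors in the index instead of creating duplicates.
    payload = f"{source}|{chunk_index}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]
```

With IDs derived this way, an upsert into Pinecone is idempotent: re-processing the same document replaces its vectors rather than appending copies.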
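Dynamic prompt injection can be sketched with a simple template. The section headings below follow the CO-STAR spirit but are hypothetical; the real prompt's wording and structure are not shown here:

```python
# Hypothetical template; section names and wording are illustrative,
# not the project's actual prompt.
PROMPT_TEMPLATE = """# CONTEXT
{context}

# OBJECTIVE
Answer the user's sustainability question using only the context above.
If the context is insufficient, say so rather than guessing.

# QUESTION
{question}
"""

def build_prompt(question: str, retrieved_chunks: list) -> str:
    # Join the top-k retrieved chunks and inject them, together with the
    # user's question, into the template at request time.
    context = "\n\n---\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Because the context is rebuilt per query, the prompt always reflects the chunks most relevant to the current question.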

Agent Workflow
Architecture decisions
- Choice of Pinecone: As previously mentioned, vector search was preferred over passing full documents to the LLM because it is faster and more cost-effective.
- Streaming responses: Streaming was implemented to reduce perceived latency and improve the overall user experience.
- Temperature control: The temperature parameter was exposed on the UI to provide deterministic results for calculation-heavy queries, tackling a common weakness of LLMs in numerical reasoning.
- Top-k Context Retrieval: Retrieval was configured with top_k = 20 to balance relevance and efficiency: limiting context to the 20 most semantically relevant chunks reduces latency and prevents the model from being overloaded with irrelevant information, while still ensuring high-quality answers.
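The top-k selection itself can be sketched without the Pinecone client (which performs the equivalent nearest-neighbor search server-side); the in-memory data layout here is an assumption for illustration:

```python
import heapq
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, index, k=20):
    # index: list of (chunk_text, embedding) pairs.
    # Keep only the k most similar chunks to bound prompt size and cost.
    scored = ((cosine(query_vec, emb), text) for text, emb in index)
    return [text for _, text in heapq.nlargest(k, scored)]
```

Pinecone's query call with top_k=20 returns the same kind of ranked shortlist, just computed inside the managed index.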
Key observations about the data
Based on the datasets provided:
- Boiler operations and cement manufacturing together accounted for over 90% of the company’s Scope 1 emissions.
- Approximately 40% of Scope 3 emissions, which represent the majority of total emissions, were derived from proxy or industry average sources, underscoring the need for cautious interpretation.
- The eight suppliers contributed relatively evenly to Scope 3 emissions, each accounting for between 10% and 13%.
PDF and CSV integration into the LLM context
PDFs and CSVs were converted into embeddings and stored in Pinecone. The agent retrieves only the top_k most relevant chunks as context. This approach has three advantages: (1) it handles multiple documents, (2) it reduces latency, and (3) it lowers token cost.
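The chunking step that precedes embedding can be sketched as a fixed-size splitter with overlap; the window and overlap sizes are illustrative assumptions, not the project's actual settings:

```python
def split_into_chunks(text: str, size: int = 800, overlap: int = 100) -> list:
    # Slide a fixed-size window over the extracted text; the overlap keeps
    # sentences that straddle a chunk boundary retrievable from either side.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each resulting chunk is embedded and upserted individually, so retrieval can surface just the passages that match a query instead of whole documents.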