Updated approach note has been moved here


Discord Community: https://discord.gg/wtyuDSyx


Project EKΛ is India’s most ambitious open-source AI initiative to date—an effort to build a sovereign, multilingual, and massively scalable Foundation Model at 120 billion parameters. Our goal is to not just participate but lead globally, by targeting top-5 performance across standard benchmarks in the 120B category, competing with the best of global SOTA models. The model is designed for both breadth and depth: linguistically inclusive for Indian languages and architecturally optimised for reasoning, instruction-following, and downstream customisation. At the heart of this initiative lies a vision to empower India’s AI ecosystem with a robust, open-source base that can drive real-world innovation across science, work economy, agriculture, education, and defence.

Guided by the Project Charter outlined at eka.soket.ai, our development principles include open-sourcing all model weights, training code, and datasets; building in public with transparent updates; maintaining high energy-efficiency through optimised training; and curating high-quality, culturally representative data. We commit to producing a Foundation Model that’s safe, inclusive, and strategically relevant, built in collaboration with research institutes, developers, and government partners.

The core outcomes of Project EKΛ include: (1) a state-of-the-art 120B multilingual Foundation Model, (2) domain-specialised instruction-tuned models for generated tasks, (3) publicly available high-quality training datasets across modalities, and (4) modular training pipelines, all hosted under COOM. These resources will enable researchers and startups alike to build derivative AI models for use cases ranging from local governance to scientific discovery, and from courtroom assistance to smart agriculture.

Our data strategy is three-pronged: (1) Pretraining data will prioritise Indian languages and regional knowledge, including resources like government websites, textbooks, and court records. In parallel, a second leg of pretraining will cover global data in English and other languages, including scientific literature, programming code, and international corpora. (2) Post-training data will include Supervised Fine-Tuning (SFT) datasets for various domain-specific tasks—currently under planning—and will likely incorporate complex reasoning tasks such as Chain-of-Thought (CoT), though the precise design and scope of reasoning training are still under discussion. (3) Evaluation data will be a new contribution to the ecosystem, creating Indic-specific benchmark datasets for domains like law, agriculture, and education where current evaluations fall short.

Our training strategy will begin with a series of architecture experiments to validate data alignment and model convergence. This bag-of-experiments approach will span from smaller 1B to 7B models, followed by a mid-scale ~30B model, and finally scaling to the full 120B Sparse Mixture of Experts (MoE) architecture. For baseline reliability, we will initially adopt the DeepSeek training recipe and incrementally adapt it for our needs. Training code under COOM will be built on Megatron-LM and optimised with algorithmic refinements—kernel fusion, CUDA-level improvements, and integration of optimisers like Muon. We will also experiment with progressive context window expansion: starting with small windows and large batch sizes, then gradually increasing sequence length while reducing batch size. Post-training strategy remains undecided; one potential pathway is to fuse SFT into pretraining and proceed directly to CoT and RLHF.

Our evaluation strategy includes an initial pipeline to benchmark on existing global datasets. However, we recognise that most current evaluations lack relevance to Indian languages, cultural context, and specific domains (e.g., legal interpretation or NEP-aligned educational QA). We are launching efforts to study and curate new evaluation datasets in partnership with domain experts. For rapid scaling, synthetic evaluations will be bootstrapped using outputs from larger models, with human evaluations layered in to validate performance, fairness, and factual grounding.

To ensure safety and human alignment, we will adopt a multi-tiered red-teaming approach. This includes adversarial testing through prompt injections, jailbreaking attempts, and edge-case scenarios to identify potential failure modes. These will be conducted both automatically and via expert-led manual audits. Identified vulnerabilities will inform fine-tuning, particularly around ethical guardrails, refusal behaviour, and instruction following. Alignment strategies will further be reinforced using human preference modelling and reward-tuned objectives. All findings—successful or otherwise—will be openly documented to guide safe and responsible model deployment.

Finally, our domain adaptation strategy envisions a suite of smaller, distilled models tailored to specific sectors. These domain-specific agents will be built via LoRA fine-tuning on the base model and optimised for defence applications (e.g., real-time multilingual ops), education (e.g., AI tutors in regional languages), legal systems (e.g., judgement summarisation), and agriculture (e.g., pest advisory bots). Further details will be shared as these downstream verticals are scoped with partner institutions.