EXPLAINER | AIKosha: Building block for a made-in-India LLM

The market potential is huge given that global AI firms have not been able to fully cater to the country’s linguistic diversity.

A recent EY India survey found that 74% of financial firms have initiated GenAI proof-of-concept (PoC) projects, with 11% already running in production.

Last week, the government launched AIKosha, a national datasets platform. This marks the start of an effort to make India-specific data in multiple Indian languages easily available so that startups can build indigenous large language models, explains Jatin Grover.

What is AIKosha?

AIKosha is a platform that provides a repository of India-specific anonymised and non-personal datasets, models and use cases that are key to building large language models (LLMs) and AI applications. The datasets and models come from government institutions such as the Indian Council of Medical Research and Bhashini, as well as from verified private entities such as the AI startups Sarvam and Ola Krutrim, which have listed their Indic models on the platform. AI firms can use these multilingual language models for applications such as translation. Currently, the platform has 315 datasets and 84 models across 13 sectors from 12 organisations. Available datasets include 2011 Population Census village-level geometries, the Aviation Grievance dataset and other data from the AirSewa platform, BhasaAnuvaad Speech Translation, and daily soil moisture data, among others. Each dataset, model, or other resource in AIKosha is governed by one of three permission settings (open, restricted, or private) that define usage rights. Restricted datasets are discoverable by all users on the platform but require explicit approval from the data owners before they can be downloaded.

Why is this platform needed?

The Indian datasets platform is one of the seven pillars of the Rs 10,372-crore IndiaAI mission. The government has earmarked around Rs 200 crore for it. The government is enabling the creation of India-origin foundational models, and training them will require Indic datasets across various languages that reflect Indian culture. Along with compute, training on the right datasets is what makes models intelligent and is key to providing end-user services. Currently, global models such as OpenAI's GPT models, Gemini, and Grok lack training on Indian datasets and languages.

With the government stepping in to make these datasets available, the development of indigenous LLMs would be expedited, as it is otherwise difficult for AI companies to access reliable datasets across various Indian languages. Government departments are the biggest holders of anonymised and non-personal data, and a common repository would therefore help Indian AI companies.

How can the datasets be accessed?

To access the datasets, users need to register on the AIKosha IndiaAI platform. They can register as individuals or as organisations, providing information such as entity type, sector, organisation name, website link, and registered address. IndiaAI says every user must first register as an explorer on the platform.

Explorers can view and download open datasets directly without raising any requests. For restricted datasets, users must submit a request stating the reason for the download. Users can also become contributors of data after receiving contributor rights, subject to approval from the organisation admin. The decision on whether to provide datasets free of cost will be taken by the respective data owners. The government has said role-based and permission-based access controls have been implemented to regulate access to artefacts. Additionally, encryption is applied to both data at rest and data in transit.
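For readers who think in code, the access rules described above can be sketched roughly as follows. This is an illustration only, not AIKosha's actual implementation or API: the class and function names, the dataset names, and the rule that a private dataset is visible only to its owner are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    """The three permission settings named in the article."""
    OPEN = "open"
    RESTRICTED = "restricted"
    PRIVATE = "private"


@dataclass
class Dataset:
    # Hypothetical record; field names are not taken from AIKosha.
    name: str
    visibility: Visibility
    owner: str
    approved_users: set = field(default_factory=set)


def can_discover(dataset: Dataset, user: str) -> bool:
    # Article: restricted datasets are discoverable by all users.
    # Assumption: private datasets are visible only to their owner.
    if dataset.visibility is Visibility.PRIVATE:
        return user == dataset.owner
    return True


def can_download(dataset: Dataset, user: str) -> bool:
    if user == dataset.owner:
        return True
    if dataset.visibility is Visibility.OPEN:
        # Article: open datasets can be downloaded without a request.
        return True
    if dataset.visibility is Visibility.RESTRICTED:
        # Article: the data owner must explicitly approve the download.
        return user in dataset.approved_users
    return False


def request_access(dataset: Dataset, user: str, reason: str) -> dict:
    # Article: a request for a restricted dataset must state a reason.
    if dataset.visibility is not Visibility.RESTRICTED:
        raise ValueError("requests apply only to restricted datasets")
    if not reason.strip():
        raise ValueError("a reason for the download is required")
    return {"dataset": dataset.name, "user": user,
            "reason": reason, "status": "pending"}


def approve(dataset: Dataset, request: dict) -> None:
    # The owner's approval unlocks the download for that user.
    dataset.approved_users.add(request["user"])
    request["status"] = "approved"
```

In this sketch, a newly registered explorer can see a restricted dataset in search results, but `can_download` returns `False` for it until the owner calls `approve` on the explorer's request.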

Progress on India-specific LLMs & compute

The government is also focusing on developing its own foundational AI models within the next eight to ten months. It has received 67 proposals from startups, academic institutions, and private enterprises interested in building such models. A technical committee is evaluating these proposals, and selected projects will receive government funding. Of the 67 proposals, 22 are for large language models (LLMs) and 45 are for smaller domain-specific models in sectors such as healthcare, education, and agriculture. The selection criteria will include the technical credentials of the teams involved, the intended purpose of the models, and the expected impact of their deployment. Initially, the government will select three to five mature proposals to move forward. The AI compute platform, which provides GPU access to startups, is currently live with 14,000 GPUs.

Market potential of made-in-India LLMs

The market potential is huge given that global AI firms have not been able to fully cater to the country’s linguistic diversity. One big opportunity lies in home-grown voice-based AI models, as opposed to text-only models. Voice models would enhance accessibility, enabling large sections of the population to interact with AI in their native languages through speech rather than text. While global tech firms have introduced voice capabilities in their AI assistants, their models are primarily optimised for English and a handful of other languages.

“The internet will become more voice-enabled, and there will be many people who will prefer accessing services using voice commands,” Abhishek Singh, additional secretary, ministry of electronics and IT (MeitY), told FE recently. Any AI model trained on Indian datasets and designed specifically for the country’s linguistic and cultural nuances would outperform existing global models in this domain, he said.

This article was first uploaded on March 11, 2025, at 3:00 am.