By Amith Singhee

Should India build its own large language models (LLMs) or adapt existing ones? This debate is more than academic—it strikes at the heart of India’s digital future. Some argue that homegrown LLMs are vital to serve India’s unique linguistic and cultural diversity, while others see greater value in leveraging established models to save time and resources. But framing this as an either-or question misses the point entirely.

The real issue is strategic autonomy. Just as India’s decision to develop indigenous supercomputers bolstered innovation and reduced reliance on foreign technology, mastering LLMs is not a luxury—it is a necessity. In an increasingly digital world, self-reliance in critical technologies is not optional; it is the foundation of sovereignty and progress.

Get the ball rolling

Numerous applications for LLMs will emerge, and we must leverage internationally available technologies and methodologies to build our own models rather than wait for proprietary ones. Existing large models based on the Generative Pre-trained Transformer (GPT) architecture and techniques such as self-supervised learning are readily accessible and can be put to work to advance our efforts.

These advanced models can be used to train smaller, specialised models, combined with efforts in Indian data curation and synthetic data generation such as InstructLab. The resulting models can be integrated into applications across the country to help them comprehend Indian languages and cultural nuances. Many notable initiatives already exist in this domain: BharatGen, the first government-funded multimodal LLM project, focused on creating efficient and inclusive artificial intelligence (AI) systems for Indian languages; Sarvam 1, developed by the Indian generative AI startup Sarvam; and Dhenu 1.0, an LLM concentrating on agricultural solutions. These Indic LLM efforts span startups, research groups and non-profits, all creating LLMs for Indian language use cases that would serve a population of over 1.4 billion, with 22 official languages and innumerable regional dialects. Recently, many key stakeholders in India's AI ecosystem came together as members of the AI Alliance to develop models, data tools and responsible AI solutions in the open.

As AI continues to evolve, its profound impact on nations and communities will establish it as a vital national digital resource. Imagine a future where advanced AI powered by large models becomes ubiquitous for use cases that smaller language models cannot address at high quality. If these large models are owned and served by other countries, and remain optimised for global languages, a significant portion of India's population will be unable to fully leverage this foundational technology. A population that depends on foreign models across consumer, government, and industry use cases would rely on other nations for access to a strategic digital resource. To ensure long-term resilience and self-reliance, India must develop the capability to train its own large models when needed, reducing dependencies and enhancing our digital sovereignty.

Where is the data source?

Insufficient availability of data for training LLMs on Indian culture, languages, and nuances is a challenge, but one that can be addressed with the right investment and time. By engaging local communities, we can synthesise data or source it from regions across the country. India has an abundant, cost-effective talent pool, making it feasible to generate and curate a significant volume of data through such initiatives. Efforts like the 10 Trillion Token project from People+ai are already working toward these objectives.

While the largest models extend to hundreds of billions or even trillions of parameters, and efforts to build smaller models (under 10 billion parameters) are already underway, a notable opportunity exists in the 20-70 billion parameter range. This range is a practical starting point for undertaking essential data programmes and identifying business use cases that can justify the required investment. Once proficiency is achieved at this scale, scaling up to hundreds of billions of parameters becomes significantly more manageable.

Our cultural and linguistic diversity makes it increasingly important to address the challenges associated with data and artificial intelligence. By leveraging AI, we can bridge language and cultural barriers, unlocking significant value in the process.

Tap the talent 

India has a large pool of computer scientists and engineers developing LLMs, both in multinational corporations and in startups, ensuring the availability of the necessary skills and capabilities. They are well-versed in the fundamental principles of artificial intelligence, proficient in high-performance and parallel programming, and able to draw on the latest libraries and tools to develop effective strategies for scaling up our capabilities.

Quick rundown

The development of LLMs is crucial for national pride and for establishing strategic self-reliance. More importantly, we have much to gain from the profound positive impact that AI can have on communities, societies, and nations. A phased approach to scaling up LLMs is essential; meanwhile, we should identify the value that will justify the substantial investments required. We must take proactive steps on this journey and leverage the government's ambitious initiatives, such as Digital India and the IndiaAI Mission, to support India becoming a global leader in the digital and AI ecosystem.

The author is Director, IBM Research India and CTO, IBM India & South Asia

Disclaimer: Views expressed are personal and do not reflect the official position or policy of FinancialExpress.com. Reproducing this content without permission is prohibited.