India's Multilingual GenAI Language Models 💬 🇮🇳

Gyan AI’s Paramanu: A Family of Novel Efficient Indic Generative Foundation Language Models

We present Gyan AI Paramanu (“atom”), a family of novel language models for Indian languages.

It is a collection of auto-regressive monolingual, bilingual, and multilingual Indic language models pretrained from scratch on a single GPU for 10 Indian languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu), with sizes ranging from 13.29M to 367.5M parameters.

The models are pretrained with a context size of 1024 on a single GPU. The models are small, fast, efficient, and powerful. We have also developed an efficient, advanced Indic tokenizer that can tokenize even unseen languages.
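The tokenizer's internals are not described here; as an illustration, one common mechanism that lets a tokenizer handle unseen languages is byte-level fallback, where any word outside the learned vocabulary is decomposed into its raw UTF-8 bytes. A toy sketch of that idea (not the actual Paramanu tokenizer; the vocabulary entries below are made up):

```python
# Toy byte-fallback tokenizer sketch (illustrative only, NOT the actual
# Paramanu tokenizer): known vocabulary entries are emitted whole, and any
# unseen word falls back to one token per UTF-8 byte, so no input -- in
# any script -- is ever out of vocabulary.

KNOWN_SUBWORDS = {"नमस्ते", "भारत"}  # hypothetical learned vocabulary entries

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        if word in KNOWN_SUBWORDS:
            tokens.append(word)
        else:
            # Byte fallback: emit one token per UTF-8 byte of the word.
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens

print(tokenize("नमस्ते"))   # known word stays a single token
print(tokenize("hello"))   # unseen word decomposes into byte tokens
```

Production tokenizers learn subword merges rather than whole words, but the fallback principle is the same: coverage of any script comes for free once raw bytes are in the vocabulary.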

In order to avoid the “curse of multilinguality” in our multilingual mParamanu model, we pretrained on comparable corpora, grouping languages typologically by shared script. We performed human evaluation of our pretrained models for open-ended text generation on grammar, coherence, creativity, and factuality metrics for Bangla, Hindi, and Sanskrit.

Our Bangla, Hindi, and Sanskrit models outperformed GPT-3.5-Turbo (ChatGPT), Bloom 7B, LLaMA-2 7B, OPT 6.7B, GPT-J 6B, GPT-Neo 1.3B, and GPT2-XL large language models (LLMs) by a large margin despite being 20 to 66 times smaller than standard 7B LLMs.

To run inference on our pretrained models, a CPU is sufficient; no GPU is needed.
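The models are auto-regressive decoders, so inference is just a loop of cheap forward passes; for small models this runs comfortably on a CPU. A generic greedy decoding sketch with the models' 1024-token context window, using a stand-in next-token function (the real model weights and inference API are not shown in this document):

```python
import numpy as np

CONTEXT_SIZE = 1024  # the models' pretraining context length

def dummy_next_token_logits(context: list[int], vocab_size: int = 50) -> np.ndarray:
    # Stand-in for the real forward pass; for a ~100M-parameter model a
    # single forward pass is cheap enough for interactive CPU inference.
    rng = np.random.default_rng(sum(context) % (2**32))
    return rng.standard_normal(vocab_size)

def generate(prompt: list[int], max_new_tokens: int, eos_id: int = 0) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Autoregressive step: condition only on the last CONTEXT_SIZE tokens.
        window = tokens[-CONTEXT_SIZE:]
        logits = dummy_next_token_logits(window)
        next_id = int(np.argmax(logits))  # greedy decoding
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens

out = generate([5, 7, 11], max_new_tokens=20)
print(out)
```

Swapping `dummy_next_token_logits` for a real model's forward pass (and `argmax` for temperature or top-k sampling) gives the standard text-generation loop.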

We also instruction-tuned our pretrained Bangla, Hindi, Marathi, Tamil, and Telugu models on 23k instructions in the respective languages.
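The exact prompt template used for instruction tuning is not specified in this document; as a hedged sketch, instruction-tuning pipelines typically serialize each (instruction, optional input, response) record into a single training string. The template and field names below are assumptions, not the Paramanu format:

```python
# Hypothetical instruction-record formatter; the actual Paramanu
# instruction-tuning template is not specified in this document.
def format_example(instruction: str, inp: str, response: str) -> str:
    prompt = f"### Instruction:\n{instruction}\n"
    if inp:  # the input field is optional
        prompt += f"### Input:\n{inp}\n"
    prompt += f"### Response:\n{response}"
    return prompt

example = format_example(
    "নিচের বাক্যটি ইংরেজিতে অনুবাদ করুন",  # Bangla: "Translate the sentence below into English"
    "আমি বই পড়তে ভালোবাসি",              # Bangla: "I love reading books"
    "I love reading books.",
)
print(example)
```

At fine-tuning time the loss is usually computed only on the response portion, so the model learns to complete the template rather than reproduce the instruction.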

Our pretrained and instruction-tuned models, the first of their kind, are the most powerful and efficient small generative language models yet developed for Indic languages.

The results of our research lead to the conclusion that high-quality generative language models are possible without massive compute budgets or enormous parameter counts.

Our models: Paramanu-Assamese, Paramanu-Bangla, Paramanu-Hindi, Paramanu-Tamil, Paramanu-Telugu, Paramanu-Konkani-Maithili, Paramanu-Odia, Paramanu-Sanskrit, and multilingual mParamanu.

Bangla Evaluation

| Model | MMLU | HellaSwag | ARC |
|---|---|---|---|
| Bloom 7B | 28.2 | 32.8 | 29.2 |
| Bloomz 7B | 25.9 | 31.5 | 28.2 |
| Paramanu-Bangla 108.5M | 31.7 | 33.45 | 32.5 |

Hindi Evaluation

| Model | MMLU | HellaSwag | ARC |
|---|---|---|---|
| Bloom 7B | 27.5 | 36.4 | 29.2 |
| Bloomz 7B | 25.9 | 34.0 | 28.2 |
| Open Hathi 7B | 32.27 | 25.59 | 38.48 |
| Airavata 7B | 34.96 | 25.37 | 44.96 |
| Paramanu-Hindi 367.5M | 38.47 | 37.65 | 41.7 |

Zero-shot XNLI and XStoryCloze for Hindi

| XNLI | XStoryCloze |
|---|---|
| 33.49 | 52.42 |

Tamil Evaluation

| Model | MMLU | HellaSwag | ARC |
|---|---|---|---|
| Bloom 7B | 26.6 | 29.4 | 24.2 |
| Bloomz 7B | 26.7 | 29.5 | 25.6 |
| Paramanu-Tamil 207M | 30.70 | 32.42 | 33.8 |

Telugu Evaluation

| Model | MMLU | HellaSwag | ARC |
|---|---|---|---|
| Bloom 7B | 26.2 | 29.2 | 24.3 |
| Bloomz 7B | 25.7 | 30.7 | 25.8 |
| Paramanu-Telugu 208M | 30.23 | 32.2 | 32.9 |

Generative AI technology for multilingual India and the world.

German Engineering in India