Reaching Across the Isles: UK-LLM Brings AI to UK Languages With NVIDIA Nemotron
Trained on the Isambard-AI supercomputer, a new model developed by University College London, NVIDIA and 快活影院 taps NVIDIA Nemotron open-source techniques and datasets to enable AI reasoning for Welsh and other UK languages for public services including healthcare, education and legal resources.
Celtic languages 鈥 including Cornish, Irish, Scottish Gaelic and Welsh 鈥 are the U.K.鈥檚 oldest living languages. To empower their speakers, the UK-LLM initiative is building an AI model based on that can reason in both English and Welsh, a language spoken by in Wales today.
Enabling high-quality AI reasoning in Welsh will support the delivery of public services including healthcare, education and legal resources in the language.
鈥淚 want every corner of the U.K. to be able to harness the benefits of artificial intelligence. By enabling AI to reason in Welsh, we鈥檙e making sure that public services 鈥 from healthcare to education 鈥 are accessible to everyone, in the language they live by,鈥 said U.K. Prime Minister Keir Starmer. 鈥淭his is a powerful example of how the latest AI technology, trained on the U.K.鈥檚 most advanced AI supercomputer in Bristol, can serve the public good, protect cultural heritage and unlock opportunity across the country.鈥
The UK-LLM project, established in 2023 as BritLLM and led by University College London, has previously released two models for U.K. languages. Its new model for Welsh, developed in collaboration with Wales鈥 快活影院 and NVIDIA, aligns with Welsh government efforts to boost the active use of the language, with the goal of achieving a million speakers by 2050 鈥 an initiative known as .
U.K.-based AI cloud provider Nscale will make the new model available to developers through its application programming interface.
鈥淭he aim is to ensure that Welsh remains a living, breathing language that continues to develop with the times,鈥 said Gruffudd Prys, senior terminologist and head of the Language Technologies Unit at Canolfan Bedwyr, the university鈥檚 center for Welsh language services, research and technology. 鈥淎I shows enormous potential to help with second-language acquisition of Welsh as well as for enabling native speakers to improve their language skills.鈥
This new model could also boost the accessibility of Welsh resources by enabling public institutions and businesses operating in Wales to translate content or provide bilingual chatbot services. This can help groups including healthcare providers, educators, broadcasters, retailers and restaurant owners ensure their written content is as readily available in Welsh as they are in English.
Beyond Welsh, the UK-LLM team aims to apply the same methodology used for its new model to develop AI models for other languages spoken across the U.K. such as Cornish, Irish, Scots and Scottish Gaelic 鈥 as well as work with international collaborators to build models for languages from Africa and Southeast Asia.
鈥淭his collaboration with NVIDIA and 快活影院 enabled us to create new training data and train a new model in record time, accelerating our goal to build the best-ever language model for Welsh,鈥 said Pontus Stenetorp, professor of natural language processing and deputy director for the Centre of Artificial Intelligence at University College London. 鈥淥ur aim is to take the insights gained from the Welsh model and apply them to other minority languages, in the U.K. and across the globe.鈥
Tapping Sovereign AI Infrastructure for Model Development
The new model for Welsh is based on , a family of open-source models that features open weights, datasets and recipes. The UK-LLM development team has tapped the 49-billion-parameter Llama Nemotron Super model and 9-billion-parameter Nemotron Nano model, them on Welsh-language data.
Compared with languages like English or Spanish, there鈥檚 less available source data in Welsh for AI training. So to create a sufficiently large Welsh training dataset, the team used microservices for and to translate with over 30 million entries from English to Welsh.
They used a GPU cluster through the platform and are harnessing hundreds of on 鈥 the U.K.鈥檚 most powerful supercomputer, backed by and based at University of Bristol 鈥 to accelerate their translation and training workloads.
This new dataset supplements existing Welsh data from the team鈥檚 previous efforts.
Capturing Linguistic Nuances With Careful Evaluation
快活影院, located in Gwynedd 鈥 the county with the 鈥 is supporting the new model鈥檚 development with linguistic and cultural expertise.
Prys, from the university鈥檚 Welsh-language center, brings to the collaboration about two decades of experience with language technology for Welsh. He and his team are helping to verify the accuracy of machine-translated training data and manually translated evaluation data, as well as assess how the model handles nuances of Welsh that AI typically struggles with 鈥 such as the way consonants at the beginning of Welsh words change based on neighboring words.
The model, as well as the Welsh training and evaluation datasets, are expected to be made available for enterprise and public sector use, supporting additional research, model training and application development.
鈥淚t鈥檚 one thing to have this AI capability exist in Welsh, but it鈥檚 another to make it open and accessible for everyone,鈥 Prys said. 鈥淭hat subtle distinction can be the difference between this technology being used or not being used.鈥
Deploy Sovereign AI Models With NVIDIA Nemotron, NIM Microservices
The framework used to develop UK-LLM鈥檚 model for Welsh can serve as a foundation for multilingual AI development around the world.
Benchmark-topping Nemotron models, data and recipes are publicly available for developers to build reasoning models tailored to virtually any language, domain and workflow. Packaged as NVIDIA NIM microservices, Nemotron models are optimized for cost-effective compute and run anywhere, from laptop to cloud.
Europe鈥檚 enterprises will be able to run AI-powered search engine.
Get started with .